Spot the bug: autolinking URLs in text

by Jason on September 12, 2011 · 0 comments

Here’s some PHP code from last year that recently reared its ugly head.  Basically, it was designed to take a line of contact info from a business directory and automatically convert any URLs or email addresses to clickable links by identifying the patterns and wrapping them in the appropriate HTML:


function directory_linkify($contact) {
    $rx_url = "#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#";

    if (preg_match_all($rx_url, $contact, $matches) > 0) {
        foreach ($matches[0] as $match) {
            $contact = str_replace($match, "<a href = 'http://" . str_replace('http://', '', $match) . "' target = '_blank'>" . str_replace('http://', '', $match) . "</a>", $contact);
        }
    }

    $rx_email = "#[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})#";

    if (preg_match_all($rx_email, $contact, $matches) > 0) {
        foreach ($matches[0] as $match) {
            $contact = str_replace($match, "<a href = 'mailto:" . $match . "'>" . $match . "</a>", $contact);
        }
    }

    return $contact;
}

It worked great until someone added two URLs to the same line, so given this:

Bob Smith, 1-800-888-8888, www.bobsmith.com, www.bobsmith.com/very_important_page.html

I was seeing this result:

Bob Smith, 1-800-888-8888, <a href="http://www.bobsmith.com">www.bobsmith.com</a>, <a href="http://www.bobsmith.com">www.bobsmith.com</a>/very_important_page.html

WTF? The second URL wasn’t catching the path.  But other tests with full path URLs were passing elsewhere.

At this point, and it doesn’t help that the bug appeared late at night when I should have been going to bed, the undisciplined whack-a-mole debugger in me came out in full force.  Maybe the order of the URLs mattered, maybe the regex I was using was faulty, maybe I needed to end that second URL with something different, maybe, maybe…

Of course, as with most bugs, it was pretty simple once I stepped back and actually looked at what was happening.  Let’s look at that foreach loop and assume the regex is finding URLs just fine, since it’s been working for a year already.

We’ve got two URLs in the $matches array: www.bobsmith.com and www.bobsmith.com/very_important_page.html. So we loop.

In step one, we change www.bobsmith.com to <a href = 'http://www.bobsmith.com'>www.bobsmith.com</a>.

And in step two, we change www.bobsmith.com/very_important_page.html to <a href = 'http://www.bobsmith.com/very_important_page.html'>www.bobsmith.com/very_important_page.html</a>.

Except we don’t!  str_replace swaps every occurrence of its search string, so step one also rewrote the start of the second URL, leaving <a href = 'http://www.bobsmith.com'>www.bobsmith.com</a>/very_important_page.html.  That means the str_replace call in the second round of the loop never finds anything to replace.
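The failure is easy to reproduce outside PHP. Here is a minimal sketch in Ruby, where String#gsub with a plain string argument stands in for str_replace. The naive_linkify name and the trimmed-down link markup are made up for illustration.

```ruby
# Ruby stand-in for the PHP loop: gsub with a string pattern replaces
# every occurrence of that string, much like str_replace does.
def naive_linkify(text, matches)
  matches.each do |match|
    text = text.gsub(match, "<a href='http://#{match}'>#{match}</a>")
  end
  text
end

line = "www.bobsmith.com, www.bobsmith.com/very_important_page.html"
matches = ["www.bobsmith.com", "www.bobsmith.com/very_important_page.html"]

# The first pass links both occurrences of the shorter URL, so the second
# pass can no longer find the longer one and the path is left dangling.
puts naive_linkify(line, matches)
```

Swapping the order of the two matches happens to hide the problem for this input, which is exactly the kind of accidental fix that whack-a-mole debugging tends to find.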

This is a great case where you’ve got a simple process that happens to involve a complicated  distraction like a regex. When you’re debugging, it’s always a great idea to step through the function with a pencil and paper, and ideally with a second set of eyes, especially for relatively self-contained methods like this (which are the kinds you should strive to write anyway – hmm, maybe I should have split the email and URL detection out into separate methods…)

Anyway, in case you’re curious or need such a thing, here’s the replacement code that links properly. I doubt it’s perfect (I’m doing more PHP lately but this was from a while ago) so let me know if you spot any other bugs or opportunities for improvement!


function directory_linkify($contact) {
    $rx_url = "#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#";

    if (preg_match_all($rx_url, $contact, $matches) > 0) {
        $index = 0;
        foreach ($matches[0] as $match) {
            // Find this match at or after the end of the previous replacement.
            $index = strpos($contact, $match, $index);
            if ($index === false) {
                break;
            }
            $url = str_replace('http://', '', $match);
            $link = "<a href = 'http://" . $url . "' target = '_blank'>" . $url . "</a>";
            // Replace only this one occurrence, by position.
            $contact = substr_replace($contact, $link, $index, strlen($match));
            // Skip past the inserted link so a repeated URL isn't found inside it.
            $index += strlen($link);
        }
    }

    // and then similar changes for email...

    return $contact;
}
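Another way to avoid the problem, different from the position-tracking approach above, is to do the replacement in a single pass with a callback, so each match is rewritten exactly once and replaced text is never rescanned. PHP offers preg_replace_callback for this. The sketch below shows the idea in Ruby, with a deliberately simplified URL pattern that is an assumption for illustration and not the regex from this post.

```ruby
# Simplified pattern for illustration only: it stops at whitespace, commas
# and angle brackets, and does not trim trailing punctuation.
URL_PATTERN = %r{\b(?:https?://|www\.)[^\s<>,]+}i

def linkify_single_pass(text)
  # gsub with a block visits each match once, left to right, and never
  # rescans text it has already replaced.
  text.gsub(URL_PATTERN) do |match|
    href = match.match?(%r{\Ahttps?://}i) ? match : "http://#{match}"
    "<a href='#{href}' target='_blank'>#{match}</a>"
  end
end

puts linkify_single_pass(
  "Bob Smith, 1-800-888-8888, www.bobsmith.com, www.bobsmith.com/very_important_page.html"
)
```

Because the callback receives one match at a time, repeated URLs and URLs that share a prefix are handled without any index bookkeeping.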

Beware of unused variables

by Jason on September 5, 2011 · 0 comments

When you’re making changes to a system, for the love of all that’s programmable, please be thorough.

If you’re changing a function, make sure you’ve cleaned up any variables that aren’t being used anymore (your IDE might be able to spot these for you, in which case there’s not a lot you can use for an excuse.)

If you’re changing a class and it’s internal framework stuff that won’t be used by anyone outside, make sure you remove functions if they’re not being called anymore.

If you’ve refactored global configuration variables or settings, make sure you’ve deleted all trace of them if they’re really gone.

If you don’t do these things, the next person to come along is going to totally miss her estimate on account of not realizing how many land mines need to be stepped around, and each one means more time added to the project, either in a big chunk right away by dealing with what you should have done already (and the testing and QA overhead that goes with it,) or in little increments every time anything around your old work needs to be touched (“what’s this? Oh, I don’t think it’s used, but I’m not sure…”)

Ask me how I know this…

And do all of these things right away.  Because if you wait, even a little bit, you’re going to find yourself in the same position as the new guy.

Ask me how I know that (in a smaller voice, at least in recent memory.)

The “I thought we might need to bring it back” defence doesn’t work in a source control environment. Every line of code you ever submitted is still there in the repository. So why not keep the really embarrassing stuff out of the head?

(And don’t get me started on how documentation lags the code… Yeah, it’s been a fun long weekend refactoring.)

I had a head slapper a while back where a Rails site I was working on wasn’t updating the created_at and updated_at fields properly in any of the tables.

(For those who don’t know, it’s a handy feature in ActiveRecord where these datetime columns get automatically refreshed on creates and saves – well, somewhat handy, until you over-rely on the updated_at field and then other business logic causes it to change when you don’t expect it, but I digress…)

Anyway, at the time I was working on a 3.0 beta rev, so I figured it was just something that was going to sort itself out, and I added some quick before_save and before_create handlers to the record to keep things in sync so I could work on the real problems (there were time and budget constraints and I was the only one who ever looked at those fields,) but I found it was still happening quite some time later on the official 3.0.x branch.

It turns out the problem was that these tables hadn’t been created through a standard Rails migration with a t.timestamps call; they’d been made up through a series of SQL scripts that had various explicit calls to create table.

And in there, the created_at field was defined as datetime not null default '1900-01-01 00:00:00'.

Gotcha.

So here’s the lesson: if created_at has a default, Rails/ActiveRecord will use the default, just like it does for any other column in the db.  You want these fields to be nullable without a default, so if this is happening to you, run a migration like this:

change_column :table_name, :created_at, :datetime, :null => true, :default => nil

Then you can get rid of embarrassing workarounds that shouldn’t have been there in the first place, especially since they now reflect a severe lack of understanding of how ActiveRecord works, at least compared to your newfound knowledge :)

I just answered this on Stack Overflow, so I thought I’d share: when you use a rich text editor in a web page, such as nicEdit, if you type a URL or an email address into the text box in IE, it’ll automatically turn it into a link.  On Firefox, it doesn’t.  So what’s the deal, and can you turn it off?

tl;dr: It’s a Microsoft design decision, but as of IE9 you can disable it through JavaScript.

Here’s the setup: IE instantiates an MSHTML rich text editor to do its rich text stuff.  There’s a setting in there, called IDM_AUTOURLDETECT_MODE, that determines if URL auto-linking is on or off.  It’s an ActiveX control, so if you’re putting it into a desktop app, you can use IOleCommandTarget::Exec and set it to whatever you want.  If you’re building a web page, however, you’re out of luck, or at least you used to be.

There are some settings that you can tweak through JavaScript, but until recently this wasn’t one of them.  document.execCommand takes a string parameter to specify the command you’re accessing, but there wasn’t a mapping in the back end to turn that into a value that the editor knew what to do with.

As of IE9, Microsoft has enabled a mapping for this called AutoUrlDetect, so now, if you’re lucky enough to have an IE user visit with IE9, you can call document.execCommand("AutoUrlDetect", false, false) and autolinking will be disabled.

For other versions of IE, life continues to suck.  There aren’t any really good ways I can see to get around this without nuking all links in the text box, some of which might have been legitimately placed there.

In the grand scheme of things, it’s a minor annoyance, but one that I’m glad there’s a fix for as well as documentation around.  There’s always something that a client will notice happening in one browser and not others, and for whatever reason this might be flagged as a show stopper in your project, because of course you can do anything if you put your mind to it, right?  At least now the problem is known and is on the (slow) way out, and Microsoft has info on the original problem and the fix in case my word isn’t good enough :)

Oh, and in case you want to go the other way: if you try the opposite of that code in Firefox on a machine with IE9 installed (so document.execCommand("AutoUrlDetect", false, true);), it doesn’t seem to work, which is probably just as well, since I’ve no idea what the equivalent would be on non-Windows machines.  All in all, this seems like the kind of thing that JavaScript should control.

ActiveRecord count vs size

by Jason on June 3, 2011 · 0 comments

Picture the following fake Rails code:

@companies = Company.all
[some stuff]
@companies.reject! {|company| company.staff_count < 10}

And assume there are 50 companies in the database, with 10 of them having a staff count less than 10.

What’s the value of @companies.count?

As I found this morning, it’s 50, not 40.

Calling count on an ActiveRecord result set will trigger a new SQL query to get the count, so it has nothing to do with the current state of the object.

@companies.size, on the other hand, will return the correct number.

And yes, I’m aware of the code smell in that example (and the underlying actual code it was based on.)  A scope would probably make more sense instead of the reject call, but even then count would be going to the database to check its numbers.

This is an example of where it’s best not to assume anything, but also to check your development logs for the SQL calls your page is generating.  There are opportunities for optimization that can come out of that, for sure, but there are also times when you’ll get a hint that your code is doing something you didn’t quite expect it to.
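The difference can be sketched without Rails at all. FakeRelation below is a made-up stand-in that mimics only the two behaviours described here: count goes back to the "database", while size looks at what is already loaded. It is not how ActiveRecord is implemented.

```ruby
Company = Struct.new(:name, :staff_count)

# Hypothetical stand-in for an ActiveRecord result set.
class FakeRelation
  def initialize(table)
    @table = table        # pretend this is the database table
    @records = table.dup  # rows already loaded into memory
  end

  def reject!(&block)
    @records.reject!(&block)
    self
  end

  # Like a fresh SELECT COUNT(*): ignores what happened to @records.
  def count
    @table.length
  end

  # Counts the rows currently held in memory.
  def size
    @records.length
  end
end

table = (1..50).map { |i| Company.new("Company #{i}", i <= 10 ? 5 : 25) }
companies = FakeRelation.new(table)
companies.reject! { |company| company.staff_count < 10 }

puts companies.count  # 50
puts companies.size   # 40
```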

Does anyone know of a good Rails logfile analyzer?

There are times where you do something once, and it feels like an ugly hack, but then you keep doing it on other projects because you just can’t figure out a better way, and then you talk to a few more programmers, and it turns out that what you’re doing is pretty much the best approach.

I don’t think all design patterns are born this way, but that’s how this one came about.  Here’s the smash and grab pattern.

Basically, when you have to interface with an external system, whether it’s a data API or more of a platform like a Facebook or Salesforce app, you want to get what you need and get out as fast as you can.  Don’t try to modify everything you’re doing to suit the platform, just take the data that you require and port it over to your core framework as early as possible.

For a basic API, we’re talking about putting the data in a format that’s consistent with the rest of your application.  For a more intensive app, you might be looking at a thin iframe layer or something.

It’s an extension of the adapter design pattern, basically, but adding a “take what you need and get out” mentality underlines the point that this interface is not the basis of your application (well, it likely shouldn’t be anyway) – your core business logic, other integration points, and future plans shouldn’t be relying on one interface to set the tone for everything you do.
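As a minimal sketch of the idea, assume an external API hands back a hash per record. The adapter name, the payload keys, and the Contact struct are all invented for illustration and do not describe any real service.

```ruby
require "time"

# Internal representation used by the rest of the (hypothetical) application.
Contact = Struct.new(:full_name, :email, :signed_up_at)

# Grab only the fields the application needs and convert them right away,
# so nothing past this module ever sees the vendor's naming or formats.
module ExternalCrmAdapter
  def self.to_contact(payload)
    Contact.new(
      "#{payload.fetch('FirstName')} #{payload.fetch('LastName')}".strip,
      payload.fetch("EmailAddr").downcase,
      Time.iso8601(payload.fetch("CreatedDate"))
    )
  end
end

payload = {
  "FirstName" => "Bob", "LastName" => "Smith",
  "EmailAddr" => "Bob@Example.com", "CreatedDate" => "2011-05-01T12:00:00Z",
  "VendorOnlyField" => "ignored"
}

contact = ExternalCrmAdapter.to_contact(payload)
puts contact.full_name
puts contact.email
```

If the vendor renames a field or is swapped out entirely, the change stays inside the adapter.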

And for the record, I’m not a design pattern guru.  They influence my work and I’m sure I could explain my code using the basic styles, but I don’t approach each problem with a checklist or try to make them fit.  They’re handy tools for describing core concepts, but in earlier years I found myself trying to make sure every class had at least one design pattern represented, and just like how the smash and grab reminds me that I shouldn’t rely too much on any one data source, I also need to remember not to rely on patterns for everything I consciously do.

The web’s made it easy to get design pattern information, but I’m still glad I have my Gang of Four book.  For an intro to the core concepts, you could do a lot worse than Head First Design Patterns. [affiliate links]

C#, List.Find() and Predicates

by Jason on May 24, 2011 · 10 comments

This post was originally posted on my personal site in October 2007, but it’s a better fit here and since it gets a bunch of visits every day I figured it’d be better to move it than kill it.  Enjoy, updates at the bottom.

I’m a huge (ab)user of the .NET generic collection classes, but I hate cluttering code with foreach() loops every time that I need to find an item in a collection. Enter the Find() method, which takes a predicate and does the work for you so you can keep focusing on the stuff that’s actually interesting:

List<SomeObject> myObjects = new List<SomeObject>();
/* .. load objects up somehow .. */
SomeObject desiredObject =
    myObjects.Find(delegate(SomeObject o) { return o.Id == desiredId; });

Update, post-migration: In the comments on the original version, Vinit correctly pointed out that in later .NET versions you could just use

SomeObject desiredObject = myObjects.Find( o => o.Id == desiredId);

Which is a lot cleaner, and I never did get around to writing about lambda abuse, but seriously, it’s cool stuff.  The best way to learn how to refactor old C# code, in my opinion, is to get a copy of Resharper, which will make all kinds of suggestions.  Your old code will get cleaner, and you’ll be a faster programmer – obviously don’t go around refactoring everything blindly in a release that’s just supposed to have a spelling mistake fixed in it, but make the changes when test windows present themselves, and then just code in your old style and let Resharper help you get up to speed.

That’s not an affiliate link, I get no money from them, but I won’t use Visual Studio without that tool because consciously choosing to write without it is basically admitting that mediocrity is just find fine with you. (updated to preserve the epic typo LeroyGaman pointed out while clarifying the overall sentence… Yeeha!)

Quick tips for demos and mocks

by Jason on May 17, 2011 · 0 comments

A few quick ideas that came out of some conversations at Rails Pub Night:

iPhone web demos need icons

Nobody wants to wait for you to load up your example code.  Make a folder of links so you can bring up the sample quickly instead of opening Safari, typing in the URL while juggling something else you’re holding (yes, last night it was beer,) and then waiting for the wireless to crap out.  Launching from the home screen gets you right to that delightful no signal stage.  Which you then deal with by…

Have video backups of everything cool

If you don’t have wireless access, your web-based demo might suck, so make a quick screencam movie of anything you need to demo and keep it on your desktop, USB key, and mobile device.  This is going to sound 30% stupid, but if you demo your app via a video playing on an iPad 2, it will seem much cooler than if it was running on a computer where people could actually use it.

Use those same videos for mockup presentations

Giving someone a link and waiting for feedback can result in… interesting distractions, especially if it’s an early mock where some of the content is placeholder.  Adding an audio track lets you walk someone through a feature while also being able to explain what’s not done.  Before you send something like this out though, be sure to find out if your intended recipient has a sound card and/or headphones.

Use GoToMeeting for more structured presentations

You can do a more interactive presentation than you can with the screencam technique if you use a desktop conferencing solution like GoToMeeting, which I think is around $50/month.  This means you have to actually schedule the call, but allows for a much more in-depth discussion.

That about does it for now.  And yes, these are notes to myself as much as they are to you or to your programming team.

Generating unique IDs for web forms

by Jason on May 16, 2011 · 2 comments

I spent some time Friday repairing a ridiculously bad PHP form that could be (and may still be) fodder for four or five blog posts, primarily around security, but somewhere during the refactoring I had to examine an approach to generating a unique ID that was required to track something.

In general, there are three schools of thought on generating a unique ID:

1) Use a GUID

Globally Unique Identifiers. These typically look like this: 835af40d-e275-45fb-beb9-a98cdc0726bd.  They were popularized (at least, when I started using them) with Microsoft’s COM platform as unique keys.  The problem is, they’re not always unique: the PHP page on them has examples using a bunch of random numbers, which is pretty darned unique, but not guaranteed, and the page I pulled that example GUID from says “the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%.”  So unique, except for when they aren’t.  Which is a hell of an obscure edge case to chase.  Plus they look ugly, especially if you want to put them in a URL.

2) Use a database key

MySQL has auto-increment columns, so every time you add a row, there can be a field with a numeric key that the DB generates for you based on what else is in the table, your increment settings, and so on.  The catch here is that it’s not hard to guess the next number in the sequence, so if your ID is something that will trigger a database pull, say on a multi-step application form, you want to think twice about the ramifications of someone being able to guess other object IDs.

3) Some made-up piece of crap

Typically this is based on timestamps, because it’s the easiest thing to work with, but it’s only unique for keys that aren’t generated at the exact same time.  In my case, I pulled “magic” timestamp-based random number code from the form I was working with and threw it into a simple PHP script I could call from the command line.  Just hitting up and enter (my shell’s setting for repeating the last command) at a decent pace I was able to get the same “unique” ID several times in a row.

The three keys to a unique ID

In my mind, a unique ID should be unique (duh,) hard to guess, and reasonably type-able.  These factors are all on a sliding scale depending on your needs, and all get more expensive as you get closer to perfection, both on their own and in how they impact the other two factors.

For me, uniqueness is the biggest deal.  I don’t want to be chasing duplicate key issues for those one in a billion cases that happen way too often (i.e. more than once) for my liking, and more importantly, I don’t want to be chasing the ones I only think might have happened, because they take away from identifying the real problem.

Hard to guess directly impacts easy to type.  For simple applications, as long as there are more than a thousand possibilities per actual ID, I’m happy, and if I need more than that I’m liable to tie it into an actual authentication system.  For public applications, you’re vulnerable to brute force attacks by bots, but that requires a different overall strategy anyway.

For easy to type, sometimes that’s because the ID shows up in the URL, and sometimes it’s simply to help with troubleshooting.  I’ve made mistakes looking through logs where a 30+ character key was only one letter different than another and I didn’t realize it.  So call it easy to read.  Of course, if you make it too short, it’s easy to guess.  One way to expand the number space is to use letters and numbers for numeric keys and encode your IDs into a more compressed (yet reversible) format, like how TinyURL does it.
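A minimal sketch of that kind of reversible encoding, using base 36 (digits plus lowercase letters). Base 36 is an assumption picked because Ruby supports it out of the box; it is not a claim about what TinyURL actually uses, and encode_id and decode_id are made-up names.

```ruby
# Integer#to_s(36) and String#to_i(36) are core Ruby methods.
def encode_id(number)
  number.to_s(36)
end

def decode_id(text)
  text.to_i(36)
end

id = 1_305_500_000
short = encode_id(id)
puts short
puts decode_id(short) == id  # true
```

The encoding only shortens the ID. It does nothing to make the next one harder to guess.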

Here’s what I did

My solution isn’t perfect, it’s not what I would do if I was building a form from scratch, but it solved my problem of making a “unique-r” key.  As I mentioned, the code used a timestamp-based system to make an ID, but multiple hits at the same time would cause duplicates.  All I did was append a unique sequence number to the end, through the magic of concatenating numbers as if they were strings.

This is a handy facility to have around, I’ve found: create a table in MySQL with only one column that’s an auto-increment primary key.  Now, when you insert a row with ID=null and then query the last inserted ID, you get a number that’s pretty much guaranteed (subject to your DB architecture, but if it’s not you’ve got deeper problems) to be unique.

I took that ID, and appended it along with a dot to the original numeric key.  The dot was important to differentiate it from an ID that happened to end with that sequence number, and it wasn’t being stored as an integer anyway so I could get away with it.

Oh, I also multiplied the result by a salt factor just to increase the working set a little, but that wasn’t really necessary for my purposes.
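The timestamp-plus-sequence idea can be sketched like this, leaving out the salt step. SequenceTable is a hypothetical in-memory stand-in for the one-column auto-increment table, and the Unix-timestamp format is an assumption, since the form's original ID code isn't shown here.

```ruby
# Stand-in for the MySQL table: a real version would INSERT a row and
# read back the last inserted ID instead of bumping a counter.
class SequenceTable
  def initialize
    @last_id = 0
    @lock = Mutex.new
  end

  def next_id
    @lock.synchronize { @last_id += 1 }
  end
end

# Join the timestamp and the sequence number with a dot, as strings.
def unique_id(sequence, now = Time.now)
  "#{now.to_i}.#{sequence.next_id}"
end

seq = SequenceTable.new
moment = Time.at(1_305_500_000)

# Two requests in the same second still get different IDs.
puts unique_id(seq, moment)  # 1305500000.1
puts unique_id(seq, moment)  # 1305500000.2
```

Without the dot, timestamp 13055 with sequence 11 and timestamp 130551 with sequence 1 would both come out as 1305511, which is why the separator matters.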

Again, your choices will vary based on your needs, but I’d suggest you ask yourself the worst case scenario for a duplicate or a correct guess, both from a customer impact and a developer productivity perspective.

Photo by fazen.

Since this is the second time this year I’ve had to fix this, I thought I should write a few things down this time.  If you’ve got an ASP.NET site that just sends out the occasional email, you might have some code that looks like this:

// Requires: using System.Net.Mail;
public void Send(string from, string to, string subject, string body)
{
    var msgMail = new MailMessage(from, to, subject, body);
    msgMail.IsBodyHtml = true;

    var server = new SmtpClient("localhost");
    // Hand the message to the local IIS SMTP pickup directory.
    server.DeliveryMethod = SmtpDeliveryMethod.PickupDirectoryFromIis;
    server.UseDefaultCredentials = true;
    server.Send(msgMail);
}

(Simplified, not for production, yadda yadda yadda)

And maybe it even works. Until you switch to or do your initial deploy to a new Windows 2008 server.  And you get the dreaded “cannot get IIS pickup directory” error.

Now, I regularly call myself the world’s most dangerous system administrator.  It’s not what I do.  I’m not especially good at it, but sometimes I need to get in there and do stuff so I can get paid for my actual work.  The rest of this post is a starter checklist on things to look into if you’re getting this error.  If you have further questions, I probably won’t be able to help you, so I apologize in advance.  If you manage to find answers for your further questions, it’d be awesome if you could share them in the comments so other people can gain more value – one of the reasons I’m writing this up on a Saturday when I’d love to be with my family is because it took me an hour to solve this, even though I’d solved the problem once before, and not too long ago.  There aren’t a lot of good solution guides online, so let’s work together to build one!

Also, securing your SMTP server is a whole other post, and the balance between sending directly from your web server and using a dedicated service, deliverability, and other email issues are separate from this post.  Boom.  Now let’s get on with it.

1) Is SMTP installed?

Your web server might not even have the SMTP service installed, so get into the server manager and make sure that’s in the feature list.  If it’s not, you’ll need to add it.

Make sure SMTP is installed on your server

2) Is SMTP configured?

Just having SMTP installed isn’t enough.  For starters, you want to make sure the service is set to start automatically when your server boots up (I once made some nice coin troubleshooting a client setup where mail used to send but wasn’t anymore – they’d rebooted and the mail service stayed down, was all.)  Go into services and make sure that the service is in there, it’s already started, and has a startup type of automatic:

Is SMTP running?

After that, you’ll want to play with the SMTP server settings – I don’t know a lot about these, to be honest, so I’m not going to get into them, but you need to know that SMTP is still part of IIS 6.0, so you’ll find it in the IIS 6.0 Manager.  One quick tip though, because it might get missed – some receiving servers require a fully qualified domain name, and your server might have a name like “Server15” or something, which isn’t fully qualified, so in the SMTP properties, under the Delivery tab, click the Advanced button and put a real hostname in the fully qualified domain name field.

3) Is the firewall open?

You might get hit with some aggressive firewall rules, so now’s a good time to troubleshoot your SMTP setup in general, which you can do by telnetting to localhost on port 25 and manually sending yourself a sample mail.  Here’s an approximation of the flow to do that:

telnet localhost 25
HELO somehost.com
MAIL FROM:<yourname@somedomain.com>
RCPT TO:<the_to_address@somedomain.com>
DATA
Subject: the subject of the mail

Blah blah blah message body
.
QUIT

The blank line after the Subject header separates the headers from the body, and the period on a line by itself (enter, period, enter) finishes the mail.

If you can’t connect, or the mail doesn’t arrive, there’s something else wrong that you’ll need to fix, because it doesn’t matter if the pickup directory’s available.

4) Does IIS have access to the metabase?

Here’s where we get to the actual problem – chances are, the process that runs your website doesn’t have access to the IIS metabase that stores the name of the pickup directory, and this is the general cause of your error.

You’re going to need to download the IIS 6.0 resource kit.  There’s a script you can use called metaacl.vbs but it didn’t work for me.  Download the kit from here, install it, and then run Metabase Explorer (search for it, but mine was in Program Files (x86)\IIS Resources\Metabase Explorer)

You want to add read permissions to SmtpSvc for your IIS process (I added the IIS_IUSRS group, check your settings to see what you’re running as.)

SMTP settings in Metabase Explorer

5) Does IIS have access to the pickup directory?

This is just a bonus step, because you might get a permissions error the next time you try to send mail from your website.  You need to add write permissions to the actual pickup directory for your IIS user (in my case, I added a permission for IIS_IUSRS to c:\inetpub\mailroot)

That’s more or less what it took to get things running for me, but as with any server troubleshooting, it’s possible that I clicked some magic checkbox somewhere along the way that’s also key to the process, so if you have questions, post ‘em in the comments, and if you know the answer to one of those questions, because as I said, I likely don’t, then post that as well and I’ll do my best to incorporate that data into the core post.