Toronto Programmer Jason Doucette


Spot the bug: autolinking URLs in text

September 12, 2011 By Jason

Here’s some PHP code from last year that recently reared its ugly head.  Basically, it was designed to take a line of contact info from a business directory and automatically convert any URLs or email addresses to clickable links by identifying the patterns and wrapping them in the appropriate HTML:


function directory_linkify($contact) {
    $rx_url = "#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#";

    if (preg_match_all($rx_url, $contact, $matches) > 0) {
        foreach ($matches[0] as $match) {
            $contact = str_replace($match, "<a href = 'http://" . str_replace('http://', '', $match) . "' target = '_blank'>" . str_replace('http://', '', $match) . "</a>", $contact);
        }
    }

    $rx_email = "#[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})#";

    if (preg_match_all($rx_email, $contact, $matches) > 0) {
        foreach ($matches[0] as $match) {
            $contact = str_replace($match, "<a href = 'mailto:" . $match . "'>" . $match . "</a>", $contact);
        }
    }

    return $contact;
}

It worked great until someone added two URLs to the same line, so given this:

Bob Smith, 1-800-888-8888, www.bobsmith.com, www.bobsmith.com/very_important_page.html

I was seeing this result:

Bob Smith, 1-800-888-8888, <a href = 'http://www.bobsmith.com' target = '_blank'>www.bobsmith.com</a>, <a href = 'http://www.bobsmith.com' target = '_blank'>www.bobsmith.com</a>/very_important_page.html

WTF? The second URL wasn’t catching the path.  But other tests with full path URLs were passing elsewhere.

At this point, and it doesn’t help that the bug appeared late at night when I should have been going to bed, the undisciplined whack-a-mole debugger in me came out in full force.  Maybe the order of the URLs mattered, maybe the regex I was using was faulty, maybe I needed to end that second URL with something different, maybe, maybe…

Of course, as with most bugs, it was pretty simple once I stepped back and actually looked at what was happening.  Let’s look at that foreach loop and assume the regex is finding URLs just fine, since it’s been working for a year already.

We’ve got two URLs in $matches[0]: www.bobsmith.com and www.bobsmith.com/very_important_page.html. So we loop.

In step one, we change www.bobsmith.com to <a href = 'http://www.bobsmith.com' target = '_blank'>www.bobsmith.com</a>.

And in step two, we change www.bobsmith.com/very_important_page.html to <a href = 'http://www.bobsmith.com/very_important_page.html' target = '_blank'>www.bobsmith.com/very_important_page.html</a>.

Except we don’t! str_replace replaces every occurrence of its search string, so step one also rewrote the start of the second URL, leaving <a href = 'http://www.bobsmith.com' target = '_blank'>www.bobsmith.com</a>/very_important_page.html. The full URL no longer appears anywhere in the string, which means the str_replace call in the second round of the loop never finds anything to replace.
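The two loop steps can be reproduced in isolation. This is only a minimal sketch: the hard-coded $matches array and the shortened <a> wrapper are stand-ins I've assumed for illustration, not the real output of directory_linkify.

```php
<?php
// Stand-ins for the contact line and the two URLs preg_match_all would find.
$contact = "www.bobsmith.com, www.bobsmith.com/very_important_page.html";
$matches = array("www.bobsmith.com", "www.bobsmith.com/very_important_page.html");

// Step one: str_replace swaps EVERY occurrence of the short URL,
// including the copy that forms the start of the longer one.
// The optional fourth argument receives the number of replacements made.
$contact = str_replace($matches[0], "<a>" . $matches[0] . "</a>", $contact, $count);

// Step two: the longer URL is no longer one contiguous piece of text,
// so searching for it comes back empty-handed.
$found = strpos($contact, $matches[1]);

var_dump($count); // int(2)
var_dump($found); // bool(false)
```

The count of 2 is the giveaway: one call touched both URLs, even though only the first was meant to change.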

This is a great case where you’ve got a simple process that happens to involve a complicated distraction like a regex. When you’re debugging, it’s always a great idea to step through the function with a pencil and paper, and ideally with a second set of eyes, especially for relatively self-contained functions like this (which are the kind you should strive to write anyway – hmm, maybe I should have split the email and URL detection out into separate functions…).

Anyway, in case you’re curious or need such a thing, here’s the replacement code that links properly. I doubt it’s perfect (I’m doing more PHP lately but this was from a while ago) so let me know if you spot any other bugs or opportunities for improvement!


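function directory_linkify($contact) {
    $rx_url = "#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#";

    if (preg_match_all($rx_url, $contact, $matches) > 0) {
        // Offset into $contact where the search for the next match begins.
        $index = 0;

        foreach ($matches[0] as $match) {
            // Matches come back in order, so each one sits at or after $index.
            $index = strpos($contact, $match, $index);

            // Only http:// is stripped here, so an https:// match would get a doubled scheme in the href.
            $link = "<a href = 'http://" . str_replace('http://', '', $match) . "' target = '_blank'>" . str_replace('http://', '', $match) . "</a>";

            // Replace this one occurrence only, not every copy of the text.
            $contact = substr_replace($contact, $link, $index, strlen($match));

            // Move past the inserted link so the next search cannot land inside its href.
            $index += strlen($link);
        }
    }

    // and then similar changes for email...

    return $contact;
}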
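For the email half, one possible alternative to tracking offsets by hand is preg_replace_callback, which replaces each match where it was found, so one match can never disturb another. This is a sketch of a different technique from the strpos approach above, not the code that was deployed. The function name directory_linkify_emails is hypothetical, the pattern is the same $rx_email as before, and the closure assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: links every email address in $contact in a single pass.
function directory_linkify_emails($contact) {
    // Same email pattern as the original function.
    $rx_email = "#[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})#";

    // The callback is called once per match and returns that match's
    // replacement; $m[0] holds the full matched address.
    return preg_replace_callback($rx_email, function ($m) {
        return "<a href = 'mailto:" . $m[0] . "'>" . $m[0] . "</a>";
    }, $contact);
}

// One address is a prefix of the other, the same shape as the URL bug.
echo directory_linkify_emails("bob@example.com, bob@example.com.au");
```

Keeping this in its own function also goes some way toward the split between URL and email detection mentioned earlier.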

Filed Under: PHP, Spot the bug Tagged With: autolinking, bugs, debugging, php, php code, php script, urls
