Don’t Do Regular Expressions, Use The DOM

I’m as guilty of this as anyone – I have a lump of HTML that I need to extract information from. So, I write a quick regular expression, knowing full well that regular expressions aren’t the right tool for the job. But I do it anyway.

This time, I decided to try doing things a better way.

Here’s the problem I’m trying to solve. In o2 (here’s a feature preview for you!), we’re experimenting with the idea of having post tags inline with the post content, instead of in a separate text field, as in P2. So, when a user saves a post with “#foo” in it, this needs to be extracted and saved as the tag “foo”.

With a regular expression, extraction seems pretty easy at first:

$tags = array();
preg_match_all( '/#[\w-]+/', $content, $tags );

That works on a simple text string, but things start to get complicated pretty quickly. What happens when you enter a URL, like http://pento.net/#foo? Or even worse, enter the URL in a tag like <a href="http://pento.net/#foo">...</a>? In both of these cases, “#foo” clearly isn’t meant to be a tag, so your regular expression quickly becomes a mess. Eventually, it gets to the point where you can’t even guarantee it’ll work in all cases.
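To make the failure concrete, here’s a minimal sketch that runs the naive pattern over both kinds of input. The sample string is made up for illustration; only “#baz” is meant to be a tag.

```php
<?php
// Hypothetical sample content: one URL fragment in plain text, one inside
// an href attribute, and one intended tag at the end.
$content = 'See http://pento.net/#foo or <a href="http://pento.net/#bar">this link</a> #baz';

$matches = array();
preg_match_all( '/#[\w-]+/', $content, $matches );

// $matches[0] holds every full match, including the two URL fragments.
print_r( $matches[0] );
```

All three fragments come back, “#foo”, “#bar” and “#baz”, even though only the last one was intended as a tag.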

Enter DOM parsing.

We’re all pretty familiar with dealing with the DOM, thanks to JavaScript, but it remains a less popular choice on the server side. PHP has various built-in libraries to help, and there are plenty of wrappers for the PHP libs, as well as independent implementations, some of which are listed here. There are pros and cons to each option, but so far nothing has appeared with the ubiquity of jQuery.

For this exercise, we’ll use PHP’s native DOM extension.

To begin with, let’s create a function to extract the tags from a new post, and save them.

function process_tags( $new, $old, $post ) {
    if ( 'publish' !== $new )
        return;

    $tags = find_tags( $post->post_content );

    wp_set_post_tags( $post->ID, $tags, false );
}
add_action( 'transition_post_status', 'process_tags', 12, 3 );

So far, this is all pretty straightforward. Our find_tags() function is where all the magic happens.

function find_tags( $content ) {
    $tags = array();

    $dom = new DOMDocument;
    $dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

    $xpath = new DOMXPath( $dom );
    $textNodes = $xpath->query( '//text()' );

    foreach ( $textNodes as $textNode ) {
        $parent = $textNode;
        while ( $parent ) {
            if ( ! empty( $parent->tagName ) && in_array( strtolower( $parent->tagName ), array( 'pre', 'code', 'a' ) ) ) {
                continue 2;
            }
            $parent = $parent->parentNode;
        }

        $matches = array();
        if ( preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $textNode->nodeValue, $matches ) ) {
            $tags = array_merge( $tags, $matches[1] );
        }
    }

    return $tags;
}

The easiest way to explain how this works is to walk through it, so let’s do that now. We’ll feed find_tags() some basic HTML:

<p>#foo <a href="http://pento.net/?a=b&amp;c=d#bar">#baz</a> text</p>

Line 5: We load our HTML into the DOM. The <?xml encoding="UTF-8"> is to force DOMDocument to treat our HTML as being encoded as UTF-8 – by default it assumes ISO-8859-1 (latin1).
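As a rough check of the encoding trick, the sketch below loads a made-up UTF-8 string with the prefix and reads the text back. It assumes the DOM extension is available, and it only exercises the prefixed case.

```php
<?php
// Hypothetical UTF-8 sample; "\xC3\xA9" is the UTF-8 byte sequence for "é".
$content = "<p>caf\xC3\xA9 #foo</p>";

$dom = new DOMDocument;
$dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

// Read the paragraph text back out of the DOM and compare it, byte for
// byte, with what went in.
$text      = $dom->getElementsByTagName( 'p' )->item( 0 )->textContent;
$roundTrip = ( $text === "caf\xC3\xA9 #foo" );

var_dump( $roundTrip );
```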

Line 7-8: DOMDocument supports XPath selectors, which saves us so much hassle. If you’re not familiar with XPath, it’s kind of like jQuery selectors, but for XML. So, with the //text() selector, we grab a list (a DOMNodeList, which we can loop over like an array) of all the text nodes in the HTML: “#foo ”, “#baz” and “ text”. This fixes one of our big problems, detecting if something is inside of an HTML tag – the DOM library does all of the heavy lifting for us.
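To see the //text() query on its own, here’s a small sketch that loads the sample HTML from above and collects the value of each text node. The $values array is just a helper for this illustration.

```php
<?php
$content = '<p>#foo <a href="http://pento.net/?a=b&amp;c=d#bar">#baz</a> text</p>';

$dom = new DOMDocument;
$dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

$xpath = new DOMXPath( $dom );

// Collect the value of every text node, in document order.
$values = array();
foreach ( $xpath->query( '//text()' ) as $textNode ) {
    $values[] = $textNode->nodeValue;
}

var_export( $values );
```

Note that the “#bar” fragment never shows up: it lives in the href attribute, not in a text node.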

Line 10: Now we need to check each text node, to see if it contains a tag.

Line 11-17: But before we do that, we need to make sure we’re not inside a tag we don’t care about. In this example, we assume that anything inside a <pre>, <code> or <a> tag isn’t a post tag, so we can safely ignore it. This loop walks up through the text node’s parents, to make sure it’s not inside one of these tags. This eliminates the “#baz” text node, which is inside an <a> tag.
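As an aside, the same exclusion can probably be pushed into the XPath query itself instead of walking the parents in PHP. This is a sketch of that alternative, not what find_tags() does; it relies on the XPath ancestor axis, and on the HTML parser lowercasing element names.

```php
<?php
$content = '<p>#foo <a href="http://pento.net/?a=b&amp;c=d#bar">#baz</a> text</p>';

$dom = new DOMDocument;
$dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

$xpath = new DOMXPath( $dom );

// Alternative to the while loop: keep only the text nodes that have no
// <pre>, <code> or <a> element anywhere among their ancestors.
$query = '//text()[not(ancestor::pre or ancestor::code or ancestor::a)]';

$kept = array();
foreach ( $xpath->query( $query ) as $textNode ) {
    $kept[] = $textNode->nodeValue;
}

var_export( $kept );
```

For the sample HTML this leaves “#foo ” and “ text”, the same two nodes that survive the loop. The loop version is easier to extend if the list of ignored tags ever needs to come from a variable.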

Line 19-22: Finally, we check the text node for tags, finding the “#foo” tag.
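The pattern in this last step is easy to try in isolation. The sketch below runs it over a few made-up text node values; the strings are only examples, not output from o2.

```php
<?php
// Hypothetical text node values, as they might come out of the DOM.
$samples = array(
    '#foo ',
    'see #bar-baz and a#b',
    ' text',
);

$tags = array();
foreach ( $samples as $value ) {
    $matches = array();
    if ( preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $value, $matches ) ) {
        // $matches[1] holds the capture group, i.e. the tag without "#".
        $tags = array_merge( $tags, $matches[1] );
    }
}

var_export( $tags );
```

This finds “foo” and “bar-baz”. The “#b” in “a#b” is skipped, because the “#” has to sit at the start of the text or right after whitespace.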

The code is significantly longer than a regular expression, but it has a couple of clear advantages:

The function operates exactly as you expect: it only finds tags where you want it to.
The regular expression to find tags remains simple: it doesn’t have to care about the hundreds of edge cases you might encounter.

So there you have it. DOM parsing in PHP isn’t a land of monsters. It’s actually pretty easy to wrap your head around, and it lets you write code that does exactly what you want it to do.

For an amusing postscript: While writing this post, I ran into a problem with an HTML minification plugin removing the blank lines in the code blocks, because it was just blindly removing all blank lines. If it had used a DOM parser instead, it would’ve been able to remove blank lines from everywhere except inside <pre> or <code> tags.

UPDATE (2013-12-19): Fixed a few bugs in the sample code. Props mdawaffe.

4 comments

Derek Springer says:

December 19, 2013 at 3:26 pm

Have you done any tests to determine the relative efficiencies of the two methods? I’d be curious to know if there’s significant overhead related to parsing the content into a DOM then doing an XPath search vs just the regex.
  Gary says:
  
  December 19, 2013 at 3:53 pm
  Let’s find out! Here’s a test that runs each variation 1000 times, so we can get some timing data.
  
  https://gist.github.com/pento/8034553
  
  And, here are the results as run on my laptop:
```
$ php regex-vs-dom.php 
preg_match_all:        0.0018429756164551
DOM (new document):    0.073803901672363
DOM (cached document): 0.0078930854797363
```
  So, preg_match_all is clearly faster, but it seems that caching the DOMDocument gives significant performance improvements. Of course, these numbers are still so low that they’re basically negligible when run in a WordPress plugin.
    Gary says:
    
    December 19, 2013 at 7:13 pm
    Well, it turns out I made a mistake in the regular expression I used, so it wasn’t finding tags correctly. I’ve updated the Gist, and re-run the test:
    
    $ php regex-vs-dom.php
    preg_match_all:        0.071609973907471
    DOM (new document):    0.14637613296509
    DOM (cached document): 0.074976205825806
    
    So, the slowest part of the DOM process is creating the document – if you reuse that document a lot, it can catch up with the preg_match_all method.
marcgutt says:

March 26, 2015 at 9:30 am

With regex you will have many problems if the source is similar to this:

var foo = ' Link

First: Run this in your browser and you will see three links. Of course this code is not valid, but it works and that is the main problem for your regex.

Now try to regex match all links. But do not forget to ignore the script-Tag and HTML Comments. This will drive you crazy ^^
