Don’t Do Regular Expressions, Use The DOM

I’m as guilty of this as anyone – I have a lump of HTML that I need to extract information from. So, I write a quick regular expression, knowing full well that they’re not appropriate for the job. But I do it anyway.

This time, I decided to try doing things a better way.

Here’s the problem I’m trying to solve. In o2, (here’s a feature preview for you!) we’re experimenting with the idea of having post tags inline with the post content, instead of as a separate text field, like in P2. So, when a user saves a post with “#foo” in it, this needs to be extracted and saved as a tag “foo”.

With a regular expression, extraction seems pretty easy at first:

$tags = array();
preg_match_all( '/#[\w-]+/', $content, $tags );

That works on a simple text string, but things start to get complicated pretty quickly. What happens when you enter a URL, like http://pento.net/#foo? Or even worse, enter the URL in a tag like <a href="http://pento.net/#foo">...</a>? In both of these cases, “#foo” clearly isn’t meant to be a tag, so your regular expression quickly becomes a mess. Eventually, it gets to the point where you can’t even guarantee it’ll work in all cases.
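To make the failure concrete, here is a small sketch that runs the same pattern over some made-up content. The sample string is an assumption for illustration; it contains one genuine tag, one bare URL with a fragment, and one link.

```php
<?php
// Hypothetical sample content: one real tag (#foo) and two URL fragments.
$content = 'Fixed in #foo, see http://pento.net/#bar or <a href="http://pento.net/#baz">the docs</a>';

$matches = array();
preg_match_all( '/#[\w-]+/', $content, $matches );

// $matches[0] holds every full match. The pattern has no idea that
// "#bar" and "#baz" are parts of URLs, so they come back as tags too.
$naive_tags = $matches[0];
print_r( $naive_tags );
```

Only the first of the three matches is something the user meant as a tag.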

Enter DOM parsing.

We’re all pretty familiar with dealing with the DOM, thanks to JavaScript, but it remains a less popular choice on the server side. PHP has various built in libraries to help, and there are plenty of wrappers for the PHP libs, as well as independent implementations, some of which are listed here. There are pros and cons to each option, but so far nothing has appeared with the ubiquity of jQuery.

For this exercise, we’ll use PHP’s native DOM extension.

To begin with, let’s create a function to extract the tags from a new post, and save them.

function process_tags( $new, $old, $post ) {
	if ( 'publish' !== $new )
		return;

	$tags = find_tags( $post->post_content );

	wp_set_post_tags( $post->ID, $tags, false );
}
add_action( 'transition_post_status', 'process_tags', 12, 3 );

So far, this is all pretty straightforward. Our find_tags() function is where all the magic happens.

function find_tags( $content ) {
	$tags = array();

	$dom = new DOMDocument;
	$dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

	$xpath = new DOMXPath( $dom );
	$textNodes = $xpath->query( '//text()' );

	foreach ( $textNodes as $textNode ) {
		$parent = $textNode;
		while ( $parent ) {
			if ( ! empty( $parent->tagName ) && in_array( strtolower( $parent->tagName ), array( 'pre', 'code', 'a' ) ) ) {
				continue 2;
			}
			$parent = $parent->parentNode;
		}

		$matches = array();
		if ( preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $textNode->nodeValue, $matches ) ) {
			$tags = array_merge( $tags, $matches[1] );
		}
	}

	return $tags;
}

The easiest way to explain how this works is to walk through it, so let’s do that now. We’ll feed find_tags() some basic HTML:

<p>#foo <a href="http://pento.net/?a=b&amp;c=d#bar">#baz</a> text</p>

Line 5: We load our HTML into the DOM. The <?xml encoding="UTF-8"> is to force DOMDocument to treat our HTML as being encoded as UTF-8 – by default it assumes ISO-8859-1 (latin1).

Lines 7-8: DOMDocument supports XPath selectors, which saves us so much hassle. If you’re not familiar with XPath, it’s kind of like jQuery selectors, but for XML. So, with the //text() selector, we grab a list (a DOMNodeList) of all the text nodes in the HTML: “#foo ”, “#baz” and “ text”. This fixes one of our big problems, detecting if something is inside an HTML tag – the DOM library does all of the heavy lifting for us.

Line 10: Now we need to check each text node, to see if it contains a tag.

Lines 11-17: But before we do that, we need to make sure we’re not inside a tag we don’t care about. In this example, we assume that anything inside a <pre>, <code> or <a> tag isn’t a post tag, so we can safely ignore it. This loop walks up through the text node’s parents, to make sure it’s not inside one of these tags. This eliminates the “#baz” text node, which is inside an <a> tag.

Lines 19-22: Finally, we check the text node for tags, finding the “#foo” tag.
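As an aside, the parent-walking loop can also be folded into the XPath query itself. The following is a sketch of that variation, not the code used in o2: the function name find_tags_xpath() is made up for this example, and the libxml_use_internal_errors() calls are an addition to keep parser warnings quiet on messy real-world markup.

```php
<?php
// Hypothetical variant of find_tags(): the XPath predicate skips any text
// node that has a <pre>, <code> or <a> ancestor, so no manual loop is needed.
function find_tags_xpath( $content ) {
	$tags = array();

	// Collect parser warnings quietly instead of printing them.
	$previous = libxml_use_internal_errors( true );

	$dom = new DOMDocument;
	$dom->loadHTML( '<?xml encoding="UTF-8">' . $content );

	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	$xpath = new DOMXPath( $dom );
	$textNodes = $xpath->query( '//text()[not(ancestor::pre or ancestor::code or ancestor::a)]' );

	foreach ( $textNodes as $textNode ) {
		$matches = array();
		if ( preg_match_all( '/(?:^|\s)#([\w-]+)\b/', $textNode->nodeValue, $matches ) ) {
			$tags = array_merge( $tags, $matches[1] );
		}
	}

	return $tags;
}

// The same sample HTML as in the walkthrough.
$tags = find_tags_xpath( '<p>#foo <a href="http://pento.net/?a=b&amp;c=d#bar">#baz</a> text</p>' );
print_r( $tags );
```

The trade-off is readability: the explicit loop is easier to follow if you’re new to XPath, while the predicate keeps all of the “where are we in the document” logic in one place.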


The code is significantly longer than a regular expression, but it has a couple of clear advantages:

  • The function operates exactly as you expect: it only finds tags where you want it to.
  • The regular expression to find tags remains simple, because it doesn’t have to care about the hundreds of edge cases you might encounter.

So there you have it. DOM parsing in PHP isn’t a land of monsters. It’s actually pretty easy to wrap your head around, and it lets you write code that does exactly what you want it to do.

For an amusing postscript: While writing this post, I ran into a problem with an HTML minification plugin removing the blank lines in the code blocks, because it was just blindly removing all blank lines. By using a DOM parser instead, it would’ve been able to remove blank lines from everywhere except inside <pre> or <code> tags.
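To illustrate the idea, here is a rough sketch of what a DOM-aware version of that clean-up might look like. It is not the plugin’s code; strip_blank_lines() is a made-up name, and the sketch only deals with blank lines inside text nodes.

```php
<?php
// Hypothetical helper: remove blank lines from text nodes, unless the text
// sits inside a <pre> or <code> element.
function strip_blank_lines( $html ) {
	$previous = libxml_use_internal_errors( true );

	$dom = new DOMDocument;
	$dom->loadHTML( '<?xml encoding="UTF-8">' . $html );

	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	$xpath = new DOMXPath( $dom );
	$textNodes = $xpath->query( '//text()[not(ancestor::pre or ancestor::code)]' );

	foreach ( $textNodes as $textNode ) {
		// Collapse "newline, optional whitespace, newline" into one newline.
		$textNode->nodeValue = preg_replace( '/\n\s*\n/', "\n", $textNode->nodeValue );
	}

	// loadHTML() wraps fragments in <html><body>, so serialise only the
	// children of <body> to get the fragment back.
	$out = '';
	foreach ( $dom->getElementsByTagName( 'body' )->item( 0 )->childNodes as $child ) {
		$out .= $dom->saveHTML( $child );
	}

	return $out;
}

$minified = strip_blank_lines( "<p>line one\n\nline two</p><pre>a\n\nb</pre>" );
echo $minified;
```

The blank line inside the paragraph goes away, while the one inside <pre> is left alone.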

UPDATE (2013-12-19): Fixed a few bugs in the sample code. Props mdawaffe.

How To Backup Before Automatic Updates

When I originally wrote Automatic Updater, I had an action trigger just before the update, so site owners could easily take a backup snapshot, in case of a catastrophic failure. Now that we have Automatic Updates in WordPress Core, however, there’s no such action, and with good reason.

Backups Are Slow

We’re not talking 10-20 seconds slow. We’re talking minutes, or even hours, if the backup includes everything you’ve ever uploaded. Some hosts kill processes that take too long, so WordPress might never get to update. Alternatively, it’s possible the backup could still be running when WordPress tries to run the update again, which could potentially cause conflicts.

So with this in mind, how should you do backups before an automatic update runs?

Backup Incrementally

If your backup software doesn’t know how to make an incremental backup, you need to get rid of it, and buy better backup software. I’m naturally biased towards VaultPress, which my employer makes, but there are plenty of good options available. Incremental backups are faster, so it’s easier for you to take backups on a more regular basis – even multiple times a day!

Schedule Your Backups

Instead of running your backup exactly when the update runs, you can schedule a cron job to run 5 minutes before (adjust as needed for the time your backups take). It’s a little tricky to determine if an update is going to run, so here’s a gist you’re welcome to use for your own purposes.
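As a rough illustration of the scheduling idea (this is not the gist mentioned above), here is a sketch that works out when a backup should start. It makes a big assumption that the post doesn’t confirm: that an update would run around the next wp_version_check cron event. The helper name backup_time_before() and the my_pre_update_backup hook are made up for this example; wp_next_scheduled() and wp_schedule_single_event() are standard WordPress cron functions.

```php
<?php
// Hypothetical helper: the timestamp at which a backup should start so that
// it begins $lead_minutes before the next update check. Returns false if
// that moment has already passed.
function backup_time_before( $next_check, $now, $lead_minutes = 5 ) {
	$backup_at = $next_check - ( $lead_minutes * 60 );

	return ( $backup_at > $now ) ? $backup_at : false;
}

// Only runs inside WordPress. Assumption: the next 'wp_version_check' cron
// event is a reasonable stand-in for "when the update might happen".
if ( function_exists( 'wp_next_scheduled' ) ) {
	$next_check = wp_next_scheduled( 'wp_version_check' );

	if ( $next_check ) {
		$backup_at = backup_time_before( $next_check, time() );

		// 'my_pre_update_backup' is a made-up hook; your backup code
		// would be attached to it with add_action().
		if ( $backup_at && ! wp_next_scheduled( 'my_pre_update_backup' ) ) {
			wp_schedule_single_event( $backup_at, 'my_pre_update_backup' );
		}
	}
}
```

Adjust $lead_minutes to match how long your backups actually take.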

Have fun, stay safe, and remember to test your backups!

Automatic Updates

There are few people more excited than I about the recent WordPress 3.7 release – it’s amazing to see Automatic Updates land in WordPress Core. Thanks to the hard work of Dion and Nacin, and the excellent testing and input of thousands of developers, we’ve shipped a great feature. With 3.7.1 being automatically rolled out as I type this, it’s truly amazing to see it all come to life on a grand scale.

So, with that now live on millions of sites, what’s next for my old Automatic Updater plugin? Well, it still has some life in it yet. I’ve just released a version 1.0, which strips out all of the old update code in favour of the shiny new Core code, as well as adding a few new features. To match its new and evolved role, I’ve renamed it to Advanced Automatic Updates – it gives you access to all of the advanced options that the Core Automatic Updates feature provides.

So, whether you’re a long time Automatic Updater user who wants to continue having a UI for setting up your preferred update options, or you’re a new user who wants to tweak the “under the hood” options of WordPress’ Automatic Updates, you should go ahead and download Advanced Automatic Updates now!

Advanced Automatic Updates can also be found on GitHub – pull requests accepted!

A WordPress Adventure

I like to think of working at Automattic as a Choose Your Own Adventure career. Over the past couple of years, I’ve worked on a wide range of projects, from VideoPress, to the WordPress iOS App, through to Jetpack Likes, Two Step Authentication and most recently o2, the upcoming successor to P2. If I (or any of my colleagues) feel it’s time to mix things up, it’s as simple as deciding to work on something different. So, it’s hardly a surprise that I’m moving to a new project, except this time we’re trying an experiment.

It’s no secret that WordPress.com is the largest WordPress install in the world – behind all the custom plugins and themes is a single copy of WordPress. WordPress is the core of all of our day-to-day work, and we contribute on a regular basis – over 40 Automatticians contributed to the latest WordPress 3.6 release alone! But, when your day-to-day work involves working on things other than the WordPress Core project, it can be hard to allocate time to do some core work.

This is where the experiment comes in: for the entire WordPress 3.7 development and release cycle, I’m dropping all of my usual work, instead working full time on WordPress Core.

So, what will I be working on? Well, you may recall during Matt’s State of the Word talk, he mentioned that we’d be introducing Automatic Updates for minor WordPress releases. In what is clearly a massive coincidence, over the past year I’ve been experimenting with WordPress Automatic Updates, under the guise of my Automatic Updater plugin. (In other words, Dion and I will be copy/pasting some code we’ve already written, then taking the next few months off. :-D) If we can’t get away with that, however, there’s always plenty of work to be done on WordPress Core. We currently have around 3200 open tickets to get through, so “pick a ticket and fix it” contributions are a good thing to do!

As for the experimental part, this is the first instance of what we’d like to become an ongoing thing – every WordPress release cycle, a few Automatticians can drop their usual work, instead devoting time to contributing back to the WordPress Core project. The wider WordPress community has been one of the many factors contributing to Automattic’s success, so it’s only right that we return the love.

So, that’s about it. Time to get on with writing some WordPress core code.

Fitness Trackers: What I Want

There’s a whole lot of fitness trackers on the market these days, and they all seem to offer different functionality. So, here’s a summary of the things I’m after in a tracker, and who comes closest:

Pedometer

Pretty much everything has a pedometer these days, and they’re all pretty accurate. It’s a useful metric for keeping track of my base level of activity during the day. Of course, because I use a treaddesk, I’m also after something that measures walking while my arms aren’t necessarily moving, which most wrist-based trackers struggle with.

Sleep Tracker

I like to keep track of my sleep, and sleep quality. No-one functions on poor sleep, so it’s a good metric to watch.

Connectivity

The device should wirelessly sync to your computer or phone, for transferring your data. Bluetooth is ideal here, for compatibility with the maximum number of devices. If you don’t manage to sync for several days, the device should be capable of storing your data for syncing later.

Battery

Any tracking device is useless if it spends all of its life in a charging cradle. The battery should last at least 5 days, and it should recharge quickly. Bonus features would be a micro-USB connection (instead of a custom cradle), and a notification when the battery is getting low.

Style

Of course this is a personal thing, but if you’re going to be wearing a device 24×7, it needs to look good.

Data

Having a nice web interface or mobile app is a necessity for quickly seeing overviews of the data you’re tracking. Just as important, though, is the ability to export the raw data (either through an export function or an API), for more in-depth personal analysis.

Bonus: Bio-Data

While not strictly necessary for me, a device that sits on your wrist 24×7 should be able to monitor basic bio-information. Temperature, heart rate, perspiration, and blood-oxygen levels come to mind.

Bonus: Bio Alarm Clock

It’s kind of a buzzword name, but the concept is cool – along with the sleep tracker, the device can also monitor when you’re at the lightest stage of your sleep cycle (near your morning alarm) and decide to wake you a little earlier, so you feel more refreshed when you wake.

Devices compared: Fitbit One, Fitbit Flex, Nike Fuelband, Jawbone Up, Basis, Amiigo

Pedometer
Treaddesk
Sleep Tracker

Connectivity
  • Fitbit One: Bluetooth
  • Fitbit Flex: Bluetooth
  • Nike Fuelband: Bluetooth
  • Jawbone Up: Plug
  • Basis: Bluetooth
  • Amiigo: Bluetooth

Battery
  • Fitbit One: 5-7 days, Custom Cable
  • Fitbit Flex: 5-7 days, Custom Cable
  • Nike Fuelband: 1-4 days, Custom Cable
  • Jawbone Up: 10 days, Custom Cable
  • Basis: 4 days, Custom Cable
  • Amiigo: 6 days, Custom Unit

Style
  • Fitbit One: Small clip on, fits in pocket
  • Fitbit Flex: Plain wrist band, black or slate
  • Nike Fuelband: Plain wrist band, solid black, translucent black or white
  • Jawbone Up: Small wrist band, 8 colour options
  • Basis: Large watch, with changeable bands
  • Amiigo: Plain wrist band, black or white with various highlights

Data: Web, Mobile, API, Export (premium) | Mobile, API | Mobile, API, Export | Web, Mobile | Mobile

Bonus: Bio-Data
Bonus: Bio Alarm

Conclusion

So, it’s fairly clear that no-one has properly solved the fitness tracker problem yet. The wrist band seems to be the more popular method, and Amiigo’s wristband + detachable clip shows interesting promise, though it hasn’t been released yet.

What’s next, then? Battery life is settling at around a week or so, and I’d expect the whole next generation to land around there. Extra sensors are a must have, and Amiigo is paving the way for recognising more than just walking – they claim to be able to recognise all sorts of gym activities, too.

I’d really like to see devices switch over to micro-USB for charging. I don’t like cables at the best of times, so being able to reduce the number of cables I have lying around is always a plus. There have been watches released in the past that are capable of charging from the movement of your arms, which is also cute. If I’m going to be using the energy, it should at least go somewhere useful.

Finally, I think there’s much more interesting data analysis to be done – the companies behind these devices are collecting massive amounts of data on people’s exercise habits, and I’m certain there are useful patterns to be picked out, which can lead to more efficient methods of exercise to fit in with our increasingly sedentary and time-poor society.

I’m looking forward to the future of fitness tracking devices with great interest!

Why I won’t be going back to PAX Australia

First off, a disclaimer. I have been, and remain, a massive Penny Arcade fan. I read the comic, I watch the series, and I thoroughly enjoyed what they did with Strip Search. This article is about why PAX, the Penny Arcade Expo, isn’t for me.

Like many folks, I was excited when PAX Australia was announced – I didn’t hesitate to buy a 3 day pass. I’d seen the schedule for previous PAX-es (PAX-en? PAX-ii?), and was happy to assume PAX Australia would be of similar calibre. The good news is, I was right to make that assumption. The PAX Australia schedule is great.

So, what’s the problem?

It certainly wasn’t the attendees. Everyone was there to have a good time, a bunch of people had put excellent work into their cosplay outfits, and for a crazy novelty, queues were polite and orderly. The Enforcers (event staff), though all volunteers, were doing a great job of helping everyone out. The expo part was a decent size, with a huge variety of exhibitors showing off cool things.

The problem was the size. PAX managed to be both way too big and way too small, all at the same time. I was following along with the #PAXAus Twitter stream while I was travelling on the train, and I should’ve realised something was up with these early tweets.

“Well, that’s just the early-bird queue”, I thought to myself. I was okay with missing the opening keynote, I figured I’d still get there in plenty of time to see the next session.

How wrong I was.

First off, there was the entrance queue – about 500m long when I joined, and a solid 30 minute meander through the Melbourne Showgrounds carpark, which several people wittier than I commented on:

“At least it’s more orderly than the mobs at Big Day Out”, I reasoned to myself. “There seems to be plenty of room to move when we get inside.” And there was plenty of room to move – between the queues for each session. I joined the queue for the first Q&A session about 60 minutes before it was due to start, deciding I’d given myself plenty of time.

(It was unfortunate that, either through the PAX organisers not notifying them, or underestimates of how much extra capacity they’d need, the Optus mobile towers had died under the weight of 10,000 people looking to the internet to entertain them while they queued.)

30 minutes after the session started, I was still 100 people or so from the front of the queue when we were informed that the session was full, and this was now the queue for the next session – an hour and a half later. Now, I like the guys at Rooster Teeth, but I didn’t feel like queuing 90 minutes (plus the time to actually get inside) to see them, so I decided to see what was going on elsewhere.

By now it was about midday, and I went to check out the exhibition hall. As with all exhibitions, the crowds were thick and slow moving, so I took my time looking around at the various displays and wares. I noted a few booths with small crowds were demoing games with Oculus Rift headsets, and I figured I’d be able to check them out later. At one end, there was a great game of Johann Sebastian Joust going on, which I enjoyed immensely.

The Indie Showcase had a bunch of cool games, and there were some great booths by some of the bigger companies, too. There was only one company I noticed pushing the “No Booth Babes” rule.

Having gotten a feel for the exhibition hall, I saw that the next session I was interested in was starting in about 45 minutes, so I went there to queue up. I needn’t have bothered: the queue was already twice as long as would fit in the room.

Mildly annoyed, I checked the schedule. The next session was over 90 minutes away – and here was where I made a fatal error – I thought that would be enough time for me to check out some of the other areas. Oh, how wrong I was. I went to join the queue about an hour before it started, and it was already significantly longer than would fit in the theatre. Having learned my lesson the last two times, skipping the session and grabbing some lunch sounded like the best option.

Yeah… not so much. It’s now 2pm, I’m hungry, and more than a little annoyed at my experience so far. I’m really excited about the Oculus Rift, so going to see a demo of it seemed like a good option. I found a booth with 4 demo units, that looked like the most capable of moving people through – until I joined that queue. After not moving for 10 minutes, I asked a couple of guys towards the front, “Hey, how long have you folks been waiting here?” One of them looked at his watch. “Oh, since about 12:30 or so.”

With the prospect of yet another 90 minute wait ahead of me, I decided to call it a day, and an expo. I’d been at PAX for a total of 4 hours, spent the vast majority of that in one queue or another, and really hadn’t seen anything I couldn’t catch on YouTube. It was an unfortunate waste of a 3 day pass to barely spend a morning there, but I’m okay with chalking that up to a learning experience.

So, how do we fix this?

In short, I don’t know. I’m not an event planner, and I don’t know how to make these things run smoothly. But I go to music festivals where the food queues are less than 10 minutes, the bathroom queues don’t exist, and it’s possible to see everything you want to see, so I know it must be possible.

UPDATE: Some of the feedback I’ve received on this article is that it sounds like I’m totally down on PAX – nothing could be further from the truth. PAX, as imagined by the Penny Arcade crew, is the gaming convention I’ve always wanted; it’s just that the implementation left much to be desired. No two people experience the same PAX: some folks are there for the expo, others for the panels, others for the gaming (whether it be card games, board games, role playing, Warhammer, console or PC), and that’s part of the magic. For me, I go to conventions to hear people speak, and that just isn’t possible to do at PAX Australia.

Did you go to PAX Australia? How did you deal with the queues? What about other PAX-es, or similar expos and conferences?