Partitioning the WordPress Comments Table

WordPress sites can get big. Really big. When you’re looking at a site of Cheezburger, Engadget or Techcrunch proportions, you get hundreds of comments per post, on dozens of posts per day, which adds up to millions of comments per year.

In order to keep your site running in top condition, you don’t want to be running queries against tables with lots of rarely accessed rows, which is what happens with most comments – after the post drops off the front page, readership drops, so the comments are viewed much less frequently. So, what we want to do is remove these old comments from the primary comment table, but keep them handy, for when people read the archives.

Enter partitioning.

The idea of MySQL partitioning is that it splits tables up into multiple logical tablespaces, based on your criteria. Running a query on a single partition of a large table is much faster than running it across the entire table, even with appropriate indexes.

In the case of the WordPress comments table, splitting it up by the `comment_post_ID` seems to be the most appropriate . This should keep the partitions to a reasonable size, and ensure that there’s minimal cross-over between partitions.

First off, we need to add the `comment_post_ID` column to the Primary Key. This can be a slow process if you already have a massive `wp_comments` table, so you may need to schedule some downtime to handle this. Alternatively, there many methods for making schema changes with no downtime, such as judicious use of Replication, Facebook’s Online Schema Change Tool, or the currently-in-development mk-online-schema-change, for Maatkit.

ALTER TABLE wp_comments DROP PRIMARY KEY, ADD PRIMARY KEY (comment_ID, comment_post_ID);

Now that we’ve altered this index, we can define the partitions. For this example, we’ll say we want the comments for 1000 posts per partition. This query can take a long time to run, if you already have many comments in your system.

ALTER TABLE wp_comments PARTITION BY RANGE(comment_post_ID) (
    PARTITION p0 VALUES LESS THAN (1000),
    PARTITION p1 VALUES LESS THAN (2000),
    PARTITION p2 VALUES LESS THAN (3000),
    PARTITION p3 VALUES LESS THAN (4000),
    PARTITION p4 VALUES LESS THAN (5000),
    PARTITION p5 VALUES LESS THAN (6000),
    PARTITION p6 VALUES LESS THAN MAXVALUE
);

When you’re approaching the next partition divider value, adding a new partition is simple. For example, you’d run this query around post 6000.

ALTER TABLE wp_comments REORGANIZE PARTITION p6 INTO (
    PARTITION p6 VALUES LESS THAN (7000),
    PARTITION p7 VALUES LESS THAN MAXVALUE
);

Naturally, this process is most useful for very large WordPress sites. If you’re starting a new site with big plans, however, you may just want to factor this into your architecture.

UPDATE: Changed the partition definition to better reflect how WordPress uses the wp_comments table, per Giuseppe’s comments.

8 comments

Querying partitioned tables is beneficial only if you are using the partitioning column in your SELECT statements. If the tables are accessed by comment ID, partitions are just overhead. While reading your post, several questions came to mind:
* What benefit are you after for this exercise?
* Could you share some benchmarks comparing the before and after status?
* Why you didn’t partition by comment_ID instead of date?

Note, also, that MySQL 5.5 allows you to partition by date without using functions.

Cheers

Giuseppe

@Giuseppe:

You are correct, I hadn’t thought about that. Partitioning by comment_post_ID would be a much more appropriate option – keeps things in more appropriate groupings that reflect the majority of the SELECT statements run on wp_comments. I’ll tweak the post to reflect that, and I’ll put together some benchmarks for a future post.

How do solutions like Disqus and Intense Debate measure up when dealing with large number of comments?

Pingback: Log Buffer #218, A Carnival of the Vanities for DBAs | The Pythian Blog

Partitioning can speed up a query compared to an index that does not cover the query. However, queries on covering indexes are much faster than on non-covering indexes and also faster than queries on partitions. It’s a matter of counting up how many rows are accessed sequentially (fewest in covering indexes, more in partitions) versus how many rows are accessed at random (as with non-covering indexes). Random rows are much slower than sequential rows — say 2 orders of magnitude — which helps partitions beat non-covering indexes, but not against covering indexes.

Therefore, I wonder what you have in mind when you say:

“Running a query on a single partition of a large table is much faster than running it across the entire table, even with appropriate indexes.”

Are you only comparing with non-covering indexes, or do you have examples where partitions beat out covering indexes?

Thanks,

Martin

Interesting, will this also work for taxonomy and/or post tables?

can you please explain what these codes mean and where should I execute them…

Table partitioning is the best way to enhance My SQL performance with huge databases.If your table contain an incredible number of information then this is strongly suggested you should use dividing and also Partitioning allows you to store parts of your table in their own sensible area. With dividing, you want to split up your rows depending on how you accessibility them.

Comments are closed.