A few words (and gripes) on Wordpress Indexes

Let me just start off with this note. Wordpress is a fantastically documented framework. The Codex is one of the most beautifully styled and (mostly) cleanly written details I've seen. I love it. Also, Wordpress is great for what it's designed to do. Blogging (I personally prefer using Harp but that's just me).

And before I really get into this, let me just re-iterate that the thoughts on this blogpost and any of my others are mine and not the company I work for, etc. etc.

A default install will, with a good caching plugin like w3tc or super cache, scale out fairly well for a blog. Because the blogs content, though generated by PHP and the database underneath, can be cached since it doesn't really change much. For logged in users you get some changes neccesary, but hey, it's do-able.

Unfortunately, a lot of people use wordpress for things it's not made for, like e-commerce. I've had the unfortunate pleasure of dealing with a plugin that causes your site to change into a Group-On clone. A nice thought, except for the fact that this particular plugin decided to do things "the wordpress way". What does this mean?

It mean's that everything. Products, Customers, Purchases, coupons, you name it, is stuffed into the wp_posts table and the extra information is placed into wp_postmeta. This fits into Wordpress's EAV model. Long and skinny and easily extendable. Sounds great for something where people want to put whatever information they want about their products onto a page with minimal effort.

Except for this simple fact.

The wp_postmeta table is designed to hold anything, so the meta_value column is a LONGTEXT field. What's the problem with that you might ask? When someone does a query like this:

SELECT * from wp_posts wp 
INNER JOIN wp_postmeta wpm ON wp.ID = wpm.post_id 
WHERE meta_value = 'X' 
GROUP BY wp.ID ORDER BY wp.post_date_gmt

This seems like a perfectly harmless query. Doing an explain reveals this:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wp
         type: index
possible_keys: PRIMARY,post_name,type_status_date,post_parent,post_author
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 31
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: wpm
         type: ref
possible_keys: post_id
          key: post_id
      key_len: 8
          ref: shortcodes.wp.ID
         rows: 2
        Extra: Using where

Also doesn't seem that bad right? Yes, it's using a temporary and a filesort but the JOIN itself is fast right? Yes the join is fast, PRIMARY key is great. The number of rows in this small test database result in a query that takes 0.09s on my machine. But if I switch to a much larger database?

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wp
         type: index
possible_keys: PRIMARY,post_name,post_parent,post_author
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 168203
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: wpm
         type: ref
possible_keys: post_id
          key: post_id
      key_len: 8
          ref: testdb.wp.ID
         rows: 2
        Extra: Using where
2 rows in set (0.00 sec)

mysql> SELECT * from wp_posts wp      INNER JOIN wp_postmeta wpm ON wp.ID = wpm.post_id      WHERE meta_value = 'X'      GROUP BY wp.ID ORDER BY wp.    post_date_gmt\G 
Empty set (34.33 sec)

That's right kids, 34.33 seconds. On a database with 168,203 rows in it's posts table. That seems like a lot, but when you have plugins that stuff everything into that table, it's really not and is more common than you think.

So why is it slow? The problem lies in the LONGTEXT, because in order to do our query we're making a temporary table of our join, and we're sorting that. And it's the ORDER BY and GROUP BY that are taking a toll on us. If I run the same query without those two, it takes 2.78s to execute. Still an awful time for a database to spend on a single query, but much much better than 34.33 seconds.

The temporary table is what really kills us, if you were to do an explain you'd see that the filesort and temporary extra's are gone. Because we have to pull down all the LONGTEXT fields associated with ALL of posts that have meta information we get clogged up on our disk I/O. Since the LONGTEXT can store up to 4GB (last time I checked the mySQL documentation anyway), we're potentially pulling down a lot of information per row, and that's not going to fit into RAM, especially not with that many rows being acted on by our query.

Really, the solution here is to rewrite your query in some clever way and also to apply partial indexes to your meta_value column so that we can get some speed. Also, if your queries are doing other JOINs and WHEREs that you'll want to try to make an index that fully covers what you're querying against. I found a great slideshow detailing some methods on making indexes work correctly and I suggest it to anyone trying to really understand what they can do to make their database a little more optimized.

Finally, my other biggest gripe is something that flabbergasted me yesterday when I realized it.

Wordpress does not have an index on the autoload column of it's wp_options table.

wp_options is the table designed to hold your options (well named right?), so plugins store setup information, wordpress stores how you want your dates to be shown, and there's tons more. Unfortunately, plugin developers often put a lot of things into the options table because it's so easy to use. The autoload column decides whether or not the option should be pulled down from the database when Wordpress boots up. So I see a query like this on start up a lot:

SELECT option_name, option_value FROM wp_options WHERE autoload='yes'

On a clients webpage, this query took 40+ seconds. While this was due primarily to being clogged up on queries like the one mentioned above, it showed up in the slow query log and I investigated it. After fooling around, I ended up adding an index to the autoload column to help it. Why? Because without the index the full wp_options table has to be scanned, for this client that's 15681 rows (see why I don't like Wordpress being used as a platform to do things it shouldn't?), when I add an index, the number of rows scanned is reduced to 229. That's a huge boost.

Really, I was confused why autoload isn't indexed by default. A quick perusal of wp-includes/options.php in a wordpress install will show that the the function wp_load_alloptions calls into the database with this query, and searching for the use of that function will turn up that it's used on every page load since, as I said previously, it's called on every page request. This seems like an oversight to me, granted that I doubt the core wordpress developers felt that people would abuse the options table like it has been in this particular case, but it's still something I would expect from people as smart as them. I wouldn't consider indexing that column to be a pre-optimization either (an argument I imagine they might make).

Wordpress is a great blogging platform. It would be great if, when people tried to use it for other areas, that they don't blindly follow the "wordpress way", and instead focus on good normalized schema design. Just my 2 cents.