Randomly selecting records in MySQL… slowly

Sample image vetting page.

A sample of our quick and (until now) painless image vetting page.


Sometimes it’s easy to become tunnel-visioned, and sometimes it takes a really long time to realize that you’ve done it.

This happened to me today. Well, the realization part did — the process started about three months ago.

We have a giant database at hort.net that contains lots of information about lots of images. We want to identify all of the plant images, so I wrote a little web page that displays fifty images at random from within the database, lets you click on the plant photos, and then sends all of your picks to the database when you’re done. It worked great when we only had 100,000 images to search through in the beginning. The page would load in about a second, fifty photos would be inspected and submitted, rinse, repeat.

Unfortunately, the image vetting process is a lot slower than the database population. We’re at a point now where the database has 1,000,000 photos, and that vetting page became very slow. Usually the page would just give us an error, but even running the database query by hand was taking 3-4 minutes. That’s just not acceptable.

Here’s what we started with as a query:

SELECT photo.id, url FROM photo LEFT JOIN vet ON photo.id=vet.id WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0 ORDER BY RAND() LIMIT 0,50

That’s the one that worked great in the beginning, but failed with more records. After a little digging, I discovered that the problem was the RAND() function in MySQL. When you use ORDER BY RAND(), MySQL actually retrieves every row and then generates a random value for all of them. That means that I was running through the random number generator 1,000,000 times each time that the page was loaded, even though I was only displaying fifty.

Since the same table contained data for vetted and unvetted images, we couldn’t just randomly select fifty numbers from 1 .. COUNT(photo.id). Continuing down the same path, the fix was to increment a counter and prepend it to our search results.

SELECT photo.id, url FROM photo JOIN (SELECT @rownum := @rownum + 1 as row_number, photo.id FROM photo LEFT JOIN vet ON photo.id=vet.id  CROSS JOIN (SELECT @rownum := 0) r WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0) b;

@rownum was now set to the number of matching database entries for the query and each row that was retrieved was numbered, but we still didn’t have a way to randomly select fifty entries that wasn’t slow. Just generating a random number is pretty fast, so telling it to generate fifty random numbers from 1 .. @rownum and then extracting only those entries is faster. I wrote a little script on the calling program to generate a lengthy subquery that was appended into MySQL’s IN function, and this is what we ended up with:

SELECT photo.id, url FROM photo JOIN (SELECT @rownum := @rownum + 1 as row_number, photo.id FROM photo LEFT JOIN vet ON photo.id=vet.id  CROSS JOIN (SELECT @rownum := 0) r WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0 ORDER BY photo.id) b ON (photo.id=b.id AND b.row_number IN (CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum))) LIMIT 0,50

Sometimes this was fast(-ish). We’d see two minute load times every once in a while, but often it was around seven seconds. But when the database was under load, things were usually annoyingly slow. I kicked this around in my head for a couple of days, tried different things, and then it hit me (this is where the recognition of tunnel vision occurs, by the way).

I had become fixated on the idea that I needed to randomly select from the entire database, but in reality, I didn’t. Entries were added to the database fairly randomly already (more like chunks of 50 similar plants at a time), but I certainly didn’t need to choose from 1,000,000 plants. Randomly selecting from the first 10000 entries would probably work too. And there you have it. Query times were consistently shortened to under one second, and everything was right with the world again:

SELECT photo.id, url FROM photo JOIN (SELECT @rownum := @rownum + 1 as row_number, photo.id FROM photo LEFT JOIN vet ON photo.id=vet.id  CROSS JOIN (SELECT @rownum := 0) r WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0 ORDER BY photo.id LIMIT 10000) b ON (photo.id=b.id AND b.row_number IN (CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum))) LIMIT 0,50

Detecting if a remote connection is mobile in Apache

hort.net-mobile

As many of you have heard, Google is requiring that sites pass their ‘mobile usability’ test if they want to show up in mobile search rankings. After April 22nd, Google will start rolling out changes to their listings, and those who don’t pass the test stand to lose a lot of traffic.

I was pretty concerned until I realized that our biggest problem is all of the small menus around our content. That stuff works fine on a desktop, but fat fingers (compared to a mouse pointer) can’t discern between tightly spaced links.

So… Why not just make all of the menus disappear for mobile users?

It seemed like the easiest way would be to create an environment variable from within Apache. This would let me either use Server Side Includes (SSI) or make programmatic changes in scripts.

The trick, of course, was figuring out which visitors were mobile and which ones weren’t. Setting an environment variable wouldn’t be of much use if I was identifying the wrong visitors.

I ended up finding a great Apache regular expression at detectmobilebrowers.com. Although the regular expression had what I needed, they were using it to rewrite mobile visitors to another URL. I just wanted to set a variable, not redirect people. Here’s what I came up with instead.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (android|bb\d+|meego).+mobile|avantgo|bada\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge\ |maemo|midp|mmp|mobile.+firefox|netfront|opera\ m(ob|in)i|palm(\ os)?|phone|p(ixi|re)\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\.(browser|link)|vodafone|wap|windows\ ce|xda|xiino [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a\ wa|abac|ac(er|oo|s\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\-m|r\ |s\ )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\-(n|u)|c55\/|capi|ccwa|cdm\-|cell|chtm|cldc|cmd\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\-s|devi|dica|dmob|do(c|p)o|ds(12|\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\-|_)|g1\ u|g560|gene|gf\-5|g\-mo|go(\.w|od)|gr(ad|un)|haie|hcit|hd\-(m|p|t)|hei\-|hi(pt|ta)|hp(\ i|ip)|hs\-c|ht(c(\-|\ |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\-(20|go|ma)|i230|iac(\ |\-|\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt(\ |\/)|klon|kpt\ |kwc\-|kyo(c|k)|le(no|xi)|lg(\ g|\/(k|l|u)|50|54|\-[a-w])|libw|lynx|m1\-w|m3ga|m50\/|ma(te|ui|xo)|mc(01|21|ca)|m\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\-|\ |o|v)|zz)|mt(50|p1|v\ )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\-2|po(ck|rt|se)|prox|psio|pt\-g|qa\-a|qc(07|12|21|32|60|\-[2-7]|i\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\-|oo|p\-)|sdk\/|se(c(\-|0|1)|47|mc|nd|ri)|sgh\-|shar|sie(\-|m)|sk\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\-|v\-|v\ )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\-|tdg\-|tel(i|m)|tim\-|t\-mo|to(pl|sh)|ts(70|m\-|m3|m5)|tx\-9|up(\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\-|\ )|webc|whit|wi(g\ |nc|nw)|wmlb|wonu|x700|yas\-|your|zeto|zte\-) [NC]
RewriteRule ^ - [E=MOBILE_TMP:1]
RequestHeader add MOBILEUSER %{MOBILE_TMP}e env=MOBILE_TMP

As you can see, the first three lines came from detectmobilebrowsers.com. The last two lines set the environment variable in a two-step process.

The mod_rewrite line that starts with RewriteRule sets a variable named MOBILE_TMP to ‘1’ if any of the regular expressions match. Unfortunately, Apache won’t pass that environment variable on to CGI scripts. mod_header, however, can pass variables that it sets.

The last line uses this feature in mod_header to set another variable named HTTP_MOBILEUSER to the value of MOBILE_TMP if MOBILE_TMP is set. Note that the variable is called HTTP_MOBILEUSER, not MOBILEUSER. The RequestHeader directive will prepend the string ‘HTTP_’ to whatever environment variable you pick.

Most of our pages at hort.net use SSI, so it’s very easy for me to tell the system to exclude blocks of code for mobile users now. Here’s a sample SSI snippet that only includes the files if it’s a mobile user. If not, they’re excluded.

<!--#if expr="${HTTP_MOBILEUSER} != /1/" -->
   <!--#include virtual="/include/root-top.html"-->
   <!--#include virtual="/include/vertical-spacer8.html"-->
   <!--#include virtual="/include/root-left.html"-->
<!--#endif -->

Another example sets the background color of a page to green for mobile users, but white for desktop users:

<!--#if expr="${HTTP_MOBILEUSER} != /1/" -->
<body bgcolor="#FFFFFF" vlink="#668F66" link="#000000">
<!--#else -->
<body bgcolor="#004400" vlink="#668F66" link="#000000">
<!--#endif -->

Of course, it would be ideal to use CSS for these, but when you have 800,000 legacy pages dating back nearly twenty years this is much simpler!

Sometimes our Perl scripts also had stylistic changes for mobile users too. In this case, we choose a different image for mobile visitors:

if ($ENV{'HTTP_MOBILEUSER'} == 1) {
   $imagedir = "/mnt/WWW/hort.net/images/gallery/mobile";
} else {
   $imagedir = "/mnt/WWW/hort.net/images/gallery/desktop";
}

Or:

  $imagedir = sprintf(qq(/mnt/WWW/hort.net/images/gallery/%s),
     ($ENV{'HTTP_MOBILEUSER'} == 1) ? "mobile" : "desktop"
  );

Making these changes ended up being easier than I thought, and for that I’m grateful. There are still some tweaks to make to hort.net, and we still need to add some mobile navigation options to our pages. But for now we’ll continue to show up in the search rankings, our traffic won’t disappear, and most importantly, people will be able to use our pages on mobile devices.