Randomly selecting records in MySQL… slowly

Sample image vetting page.

A sample of our quick and (until now) painless image vetting page.


Sometimes it’s easy to become tunnel-visioned, and sometimes it takes a really long time to realize that you’ve done it.

This happened to me today. Well, the realization part did — the process started about three months ago.

We have a giant database at hort.net that contains lots of information about lots of images. We want to identify all of the plant images, so I wrote a little web page that displays fifty images at random from within the database, lets you click on the plant photos, and then sends all of your picks to the database when you’re done. It worked great when we only had 100,000 images to search through in the beginning. The page would load in about a second, fifty photos would be inspected and submitted, rinse, repeat.

Unfortunately, the image vetting process is a lot slower than the database population. We’re at a point now where the database has 1,000,000 photos, and that vetting page became very slow. Usually the page would just give us an error, but even running the database query by hand was taking 3-4 minutes. That’s just not acceptable.

Here’s what we started with as a query:

SELECT photo.id, url FROM photo LEFT JOIN vet ON photo.id=vet.id WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0 ORDER BY RAND() LIMIT 0,50

That’s the one that worked great in the beginning, but failed with more records. After a little digging, I discovered that the problem was the RAND() function in MySQL. When you use ORDER BY RAND(), MySQL actually retrieves every row and then generates a random value for all of them. That means that I was running through the random number generator 1,000,000 times each time that the page was loaded, even though I was only displaying fifty.

Since the same table contained data for vetted and unvetted images, we couldn’t just randomly select fifty numbers from 1 .. COUNT(photo.id). Continuing down the same path, the fix was to increment a counter and prepend it to our search results.

SELECT photo.id, url FROM photo JOIN (SELECT @rownum := @rownum + 1 as row_number, photo.id FROM photo LEFT JOIN vet ON photo.id=vet.id  CROSS JOIN (SELECT @rownum := 0) r WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0) b;

@rownum was now set to the number of matching database entries for the query and each row that was retrieved was numbered, but we still didn’t have a way to randomly select fifty entries that wasn’t slow. Just generating a random number is pretty fast, so telling it to generate fifty random numbers from 1 .. @rownum and then extracting only those entries is faster. I wrote a little script on the calling program to generate a lengthy subquery that was appended into MySQL’s IN function, and this is what we ended up with:

SELECT photo.id, url FROM photo JOIN (SELECT @rownum := @rownum + 1 as row_number, photo.id FROM photo LEFT JOIN vet ON photo.id=vet.id  CROSS JOIN (SELECT @rownum := 0) r WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0 ORDER BY photo.id) b ON (photo.id=b.id AND b.row_number IN (CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum))) LIMIT 0,50

Sometimes this was fast(-ish). We’d see two minute load times every once in a while, but often it was around seven seconds. But when the database was under load, things were usually annoyingly slow. I kicked this around in my head for a couple of days, tried different things, and then it hit me (this is where the recognition of tunnel vision occurs, by the way).

I had become fixated on the idea that I needed to randomly select from the entire database, but in reality, I didn’t. Entries were added to the database fairly randomly already (more like chunks of 50 similar plants at a time), but I certainly didn’t need to choose from 1,000,000 plants. Randomly selecting from the first 10000 entries would probably work too. And there you have it. Query times were consistently shortened to under one second, and everything was right with the world again:

SELECT photo.id, url FROM photo JOIN (SELECT @rownum := @rownum + 1 as row_number, photo.id FROM photo LEFT JOIN vet ON photo.id=vet.id  CROSS JOIN (SELECT @rownum := 0) r WHERE (vetting = 'none' OR vetting IS NULL) AND upload_date IS NOT NULL AND server > 0 ORDER BY photo.id LIMIT 10000) b ON (photo.id=b.id AND b.row_number IN (CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum), CEIL(RAND()*@rownum))) LIMIT 0,50

Leave a Reply

Your email address will not be published. Required fields are marked *