the five latest nodes satisfying some condition. MySQL (and I believe any RDBMS) will pick the whole result set, sort it, and then limit it. On NowPublic homepage this meant tens, sometimes hundreds of thousands nodes to sort. We have a very nice box to serve database queries, but not this nice. So we implemented a cutoff -- surely anything older than some time is not front page worthy. Now it just sorts the nodes that gets submitted in a short timespan:
SELECT ... WHERE changed > $cutoff_time
Next problem was cutting off nodes belonging to certain terms. This is tricky. Very tricky. Let's say we wanted precalculate the timestamp when the tenth oldest node belonging to a certain tid was changed. You would think executing
INSERT INTO term_cutoff SELECT tid, MIN(changed) FROM node INNER JOIN term_node t USING(nid) WHERE changed IN (SELECT n1.changed FROM node n1 INNER JOIN term_node tn1 USING(nid) WHERE tn1.tid = t.tid ORDER BY changed DESC LIMIT 10) from a cronscript would do the trick. After all, the subquery SELECTs the ten newest and the outer SELECTs the oldest of these. MySQL will tell you that
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery. End of story. This can't be circumvented. So, should I go to my boss and tell him, sorry, the monster box just does not cut the mustard? There must be a way.
And there is. Let's retrieve to PHP every distinct tid from term_node, grab the ten latest and store the results into a table. Nice lil' cronscript would this be and surely would finish quick: there are ~200K distinct tids in term_node and on average there are more than 10 nodes belonging to each, so we are talking of loading two million records into PHP (Note: the problem is that some terms have tens of thousands of nodes and sorting these take a long time). And this won't get any better as the site becomes more and more successful. Still, we do not need to run it too often as it takes quite a long while before the cutoff time becomes so obsolete that the performance benefit from cuting off the query becomes neglible. So, this is much better than getting myself fired, but still not something I like.
Time to call in heavy cavalry:
DROP PROCEDURE IF EXISTS calculate_term_cutoff;
CREATE PROCEDURE calculate_term_cutoff()
DECLARE done INT DEFAULT 0;
DECLARE t INT;
DECLARE cur1 CURSOR FOR SELECT DISTINCT(tid) FROM term_node;
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = 1;
FETCH cur1 INTO t;
IF NOT done THEN
INSERT INTO term_cutoff1 (tid, timestamp) SELECT t1.tid, n1.changed FROM node n1 INNER JOIN term_node t1 USING(nid) where t1.tid = t ORDER BY changed DESC LIMIT 10;
UNTIL done END REPEAT;
CREATE TABLE term_cutoff2 AS SELECT tid, MIN(timestamp) FROM term_cutoff1 GROUP BY tid;
SET @old_count = (SELECT COUNT(*) FROM term_cutoff);
SET @new_count = (SELECT COUNT(*) FROM term_cutoff2);
IF @new_count > @old_count THEN
RENAME TABLE term_cutoff TO term_cutoff_old, term_cutoff2 TO term_cutoff;
ALTER TABLE term_cutoff ADD PRIMARY KEY(tid);
DROP TABLE term_cutoff_old;
I wanted to iterate over it in MySQL. Iterate? They call that cursors in SQL. The cursor skeleton is copied from the MySQL manual. We take special care that if something goes kerplunk then we do not put in a bogus term_cutoff table in place -- we could lose some serious performance benefit. The only other interesting is the double
RENAME TABLE -- it's atomic, so the swap happens instantly.
To use the table, I have
$cutoff = db_result(db_query('SELECT cutoff FROM term_cutoff WHERE tid = %d', $tid));
$result = db_query_range("SELECT .... WHERE tid = %d AND n.changed > %d ORDER BY changed DESC", $tid, $cutoff, 0, 5);
This little trick made the home page load on our dev server in about 900-1000ms instead 1300-1400ms...
Commenting on this Story is closed.