Drupal has a number of queries with unfortunate scalability profiles.
URL alias counting (one instance in core)
The biggest offender in Drupal 5 and Drupal 6 is the query counting the number of URL aliases: SELECT COUNT(<em>) FROM url_alias. This query dates back to when nearly every Drupal site ran on MyISAM, which is important because MyISAM keeps an exact count of the number of rows in every table, making SELECT COUNT (</em>) FROM [table] an O(1) (read: fast, constant-time) operation.
But InnoDB, the engine of choice for high-scale Drupal sites, does not keep an exact row count for tables because its multiversion concurrency control (MVCC) makes such a count difficult and inefficient. But, MySQL with InnoDB still faithfully runs the query, but by counting every row in the table, an O(n) operation, meaning it is proportionally slow to number of URL aliases on the site.
Such counting is particularly unfortunate because the URL alias system only cares whether the number is zero or the number is greater than zero. In Pressflow 5 and Pressflow 6, we replace this query with SELECT pid FROM url_alias LIMIT 1 (or equivalent), giving us just the information we need (“Is there at least one alias?”) in a way that runs in O(1) on both MyISAM and InnoDB.
Use of LOWER() for case-insensitivity (many places in core)
Drupal 5, 6, and 7 all currently use LOWER() on both sides of some queries to create database-agnostic, case-insensitive string matches. The uses of LOWER() in Drupal on the a table column prior to comparison in a query automatically degrades queries to O(n) with respect to the number of users on the site. The most users a site has, the more time login and other frequent user operations take.
The reason Drupal 5 and 6 use LOWER() is because PostgreSQL’s LIKE operation performs case-sensitive comparisons. (And PostgreSQL’s ILIKE operation is not cross-platform.) By using LOWER(), the same query can run on MySQL and PostgreSQL without modification.
But Pressflow 5 and Pressflow 6 only explicitly support MySQL, so they can take advantage of MySQL’s case-insensitive collations and seamlessly drop the LOWER(). Dropping LOWER() results in user lookups happening in O(log(n)) time, which is very fast for even the largest sites.
Commenting on this Blog post is closed.
Comments
So, in the case of the COUNT() example, is there a reason for it not to be in Drupal core?
It fixes a non-bug, non-security issue, which has made it a low priority for inclusion in existing versions of Drupal.
But I certainly tried: http://drupal.org/node/260400
I’m not sure why it’d count() the aliases either…Sounds weird and why someone wouldn’t think about a counter cache. On insert/delete just increment/deincrement a row somewhere.
I guess that’s assuming a count is even needed. It looks like it’s not, which really just points out one of the many flaws with Drupal’s design (and contributed elements). Disorganization, no planning, no database normalization, it’s not object oriented, it runs into race conditions…etc. the list goes on. It’s a great effort to make it scale, and lord knows I’ve been faced with trying to make Drupal work in an impossible situation, being hit by what I call the “Yahoo effect” — kinda like being featured on the homepage of Digg, only it’s Yahoo… Nevertheless, this is good to point out this issue as the url_aliases really steer the big awkard Drupal ship.
Thanks for this post, while perhaps not the biggest issue with Drupal, certainly something I wish I knew before!