Scalability

The CAP theorem is like physics to airplanes: every database must design around it

Back in 2000, Eric Brewer introduced the CAP theorem, an explanation of inherent tradeoffs in distributed database design. In short: you can’t have it all. (Okay, so there’s some debate about that, but alternative theories generally introduce other caveats.)

On Twitter, I recently critiqued a presentation by Bryan Fink on the Riak system for claiming that Riak is “influenced” by CAP. This sparked a short conversation with Justin Sheehy, also from the project. 140 characters isn’t enough to explain my objection in depth, so I’m taking it here.

While I give Riak credit for having a great architecture and pushing innovation in the NoSQL (non-relational database) space, it can no more claim to be “influenced” by CAP than an airplane design can claim influence from physics. Like physics to an airplane, CAP lays out the rules for distributed databases. With that reality in mind, a distributed database designed without regard for CAP is like an airplane designed without regard for physics. So, claiming unique influence from CAP is tantamount to claiming competing systems have a dangerous disconnect with reality. Or, to carry on the analogy, it’s like Boeing making a claim that their plane designs are uniquely influenced by physics.

But we all know Airbus designs their planes with physics in mind, too, even if they pick different tradeoffs compared to Boeing. And traditional databases were influenced by CAP and its ancestors, like BASE and Bayou from Xerox PARC. CAP says “pick two.” And they did: generally C and P. This traditional — and inflexible — design of picking only one point on the CAP triangle for a database system doesn’t indicate lack of influence.

What Riak actually does is quite novel: it allows operation at more than one point on the triangle of CAP tradeoffs. This is valuable because applications value different parts of CAP for different types of data or operations on data.

For example, a banking application may value availability for viewing bank balances. Lots of transactions happen asynchronously in the real world, so a slightly outdated balance is probably better than refusing any access if there’s a net split between data centers.

In contrast, transferring from one account to another of the same person at the same bank (say, checking to savings) generally happens synchronously. A bank would rather enforce consistency above availability. If there’s a net split, they’d rather disable transfers than have one go awry or, worse, invite fraud.

A system like Riak allows making these compromises within a single system. Something like MySQL NDB, which always enforces consistency, would either unnecessarily take down balance viewing during a net split or require use of a second storage system to provide the desired account-viewing functionality.

Making Drupal and Pressflow more mundane

Drupal and Pressflow have too much magic in them, and not the good kind. On the recent Facebook webcast introducing HipHop PHP, their PHP-to-C++ converter, they broke down PHP language features into two categories: magic and mundane. The distinction is how well each capability of PHP, a dynamic language, translates to a static language like C++. “Mundane” features translate well to C++ and get a big performance boost in HipHop PHP. “Magic” features are either unsupported, like eval(), or run about as fast as today’s PHP+APC, like call_user_func_array().

Mundane

  • If/else control blocks
  • Normal function calls
  • Array operations
  • …and most other common operations

Magic

  • eval()
  • call_user_func_array()
  • Code causing side-effects that depends on conditions like function existence
  • Includes within function bodies
  • Other PHP-isms that make Java and C++ developers cringe

How Drupal and Pressflow can run better (or at all) on HipHop PHP

Prelinking

Currently, we invoke hooks using “magic” (though still HipHop-supported) calls to call_user_func_array(). We don’t have to do that; we could “prelink” hook invocations by generating plain PHP for the set of enabled modules, letting HipHop link the function calls during compilation.
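As a rough sketch, the generated PHP for one hook might look like the following. The wrapper name, the hook, and the module names (“foo” and “bar”) are all hypothetical, and the merging mirrors what module_invoke_all() does today:

// Hypothetical generated wrapper for hook_example, assuming only the modules
// "foo" and "bar" are enabled and implement it. Direct calls let HipHop
// resolve the functions at compile time instead of dispatching through
// call_user_func_array().
function _prelinked_invoke_all_example($arg1 = NULL, $arg2 = NULL) {
  $return = array();
  $results = array();
  $results[] = foo_example($arg1, $arg2);
  $results[] = bar_example($arg1, $arg2);
  foreach ($results as $result) {
    if (isset($result) && is_array($result)) {
      $return = array_merge_recursive($return, $result);
    }
    elseif (isset($result)) {
      $return[] = $result;
    }
  }
  return $return;
}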

This sort of “prelinking” also cleans up profiling results, making it easier to trace function calls through hooks in tools like KCacheGrind.

Compatibility break? Nope, it should be possible to replace the guts of module_invoke_all() with appropriate branching and calls to the generated PHP.

Including files statically

Drupal 6 introduced an optimization to dynamically load files based on which menu path a user is visiting. This won’t fly in HipHop; it’s simply not supported. Fortunately, it’s easy to work around: we can either drop the feature (shared hosters without APC are already booing me) or, as in the prelinking example, generate a big, static includes file (itself included on HipHop-based systems) that pulls in all possible page callback handlers based on the hook_menu() entries. Sites that include the static includes file would skip the dynamic includes at runtime.
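A minimal sketch of such a generated file, with hypothetical example paths:

<?php
// Hypothetical generated file: statically include every file referenced by a
// "file" entry in hook_menu() items, so no dynamic includes happen at runtime
// on HipHop. One line per page callback file across all enabled modules.
require_once './modules/node/node.pages.inc';
require_once './modules/user/user.pages.inc';
require_once './modules/system/system.admin.inc';
// ...and so on for every file a menu item can load dynamically today.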

Compatibility break? None, assuming we take the approach I describe above.

Death to eval()

Like dynamic includes, eval() is unsupported on HipHop. Drupal has already relegated core use of eval() to an isolated module, which is great for security. eval() is pretty bad in general: PHP+APC doesn’t support opcode caching for it, so serious code can’t run in eval() sanely. Unfortunately, using the PHP module to control block display remains quite popular.

We have a few options here:

  • Drop the feature (ouch!)
  • Provide a richer interface for controlling block display, including support for modules to hook in and provide their own extended options
  • Pump out the PHP to functions in a real file, include that, and call those functions to control block display (sketched below)

Compatibility break? Yes, on all but the third option (writing out a PHP file).
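A minimal sketch of that third option, assuming a hypothetical generated file keyed by block ID:

<?php
// Hypothetical generated file: each PHP visibility snippet stored for a block
// becomes a real function that HipHop can compile, so eval() is never needed.
function _generated_block_visibility_7() {
  // Snippet originally entered by an administrator for block 7.
  return arg(0) == 'node' && is_numeric(arg(1));
}

function _generated_block_visibility_12() {
  global $user;
  return $user->uid != 0;
}

Block rendering would then include this file and call the matching function instead of passing the stored snippet through eval().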

Migrate performance-intensive code to C++

I’m looking at you, drupal_render().

This opportunity is exciting. Without the cruft of Zend’s extension framework, we can migrate performance-critical code paths in core to C++ and make use of STL and Boost, two of the most respected libraries in terms of predictable memory usage and algorithm running time.

Compatibility break? There’s no reason to have one, but keeping C++ and PHP behaviors consistent will be a serious challenge.

The takeaway

  • Use real, file-based PHP, avoiding dynamic language features.
  • Profile the system to find the biggest wins versus development cost for migrating core functionality to C++.

I’ll be presenting the “Ultimate PHP Stack” for large-scale applications at PHP TEK-X. Zend PHP, Quercus, and HipHop PHP (source code release pending) will all be contenders.

Anticipage: scalable pagination, especially for ACLs

Pagination is one of the hardest problems for web applications supporting access-control lists (ACLs). Drupal and Pressflow support ACLs through the node access system.

Problems with traditional pagination

  • Because pagination uses row offsets into the results, browsing listings where newly published items get added to the beginning of the results creates “page drift.” Page drift is where a user already browsing through paginated results sees, for example, items E, D, and C on page one, waits awhile, clicks to the next page, and sees items C, B, and A. Going back to page one again shows F (newly published), E, and D. Item C “drifted” to page two while the user was reading page one. If new items are published frequently enough, pagination can become unusable due to this drifting effect.
  • Even if content and ordering are fully indexed, jumping n rows into the results remains inefficient; it scales linearly with depth into pagination.
  • Paginating sets where the content and ordering are not fully indexed is even worse, often to the point of being unusable.
  • The design is optimized around visiting arbitrary page offsets, which does not reflect user needs. Users only need to make relative jumps in pagination of up to 10 pages (or so) in either direction or to start from the end of the results. (If users are navigating results by hopping to arbitrary pages to drill down to what they need, there are other flaws in the system.)

“Anticipage”

With a combination of paginating by inequality and, optionally, optimistic permission review, a site can paginate content with the following benefits:

  • No page drift
  • Stable pagination URLs that will generally include the same items, regardless of how much new content has been published to the beginning or end of the content listing
  • If the ordering is indexed, logarithmic time to finding the first item on a page, regardless of how many pages the user is into browsing
  • Minimal computation of JOINs, an especially big benefit for sites using JOINs for ACLs

The general strategy is to amortize the cost of pagination as the user browses through pages.

Paginating by inequality

The path to achieving fast pagination first involves a fresh strategy for sorting and slicing content. A “pagination key” must be selected for the intended set of content that:

  • Includes the column(s) desired for sorting. For a Drupal site, this might be the “created” column on the “node” table.
  • Is a superkey (unique for all rows in the table but not necessarily minimally unique). Sorting by the columns in a superkey is inherently deterministic, and because a superkey is unique, it lets us use WHERE criteria on the deterministically sorted set to deterministically define pages. An existing set of sort columns for a listing can always be converted to a superkey by appending a primary key to the end.

For a Drupal site, a qualifying pagination key could be (created, nid) on the “node” table. This key allows us to deterministically sort the rows in the node table and slice the results into pages. Really, everyone should use such pagination keys regardless of pagination strategy in order to have a deterministic sort order.

Having selected (created, nid) as the key, the base query providing our entire listing would look something like this:

SELECT * FROM node ORDER BY created DESC, nid DESC;

Traditionally, a site would then fetch the second page of 10 items in MySQL using a query like this:

SELECT * FROM node ORDER BY created DESC, nid DESC LIMIT 10, 10;

But because we’re ordering by a pagination key (as defined above), we can simply run the base query for the first page and note the attributes of the final item on the page. In this example, the final node on the first page has a creation timestamp of “1230768000” and a node ID of “987.” We can then embed this data in the GET criteria of the link to the second page, resulting in running a query like this for rendering the second page:

SELECT * FROM node WHERE created <= 1230768000 AND (created <> 1230768000 OR nid < 987) ORDER BY created DESC, nid DESC LIMIT 10;

We’re asking for the same sorting order but adding a WHERE condition carefully constructed to start our results right after the content on the first page. (Note: this query could also be dissected into a UNION if the database does not properly optimize the use of the index.) This strategy allows the database to fully employ indexes on the data to find, in logarithmic time, the first item on any page. Note how page drift becomes impossible when pagination happens using keys instead of offsets.
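In Drupal 6 terms, fetching that next page might look like this sketch; the helper name is hypothetical, and the WHERE clause is the one shown above:

// Hypothetical helper: fetch the page that starts immediately after the last
// item of the previous page, using the (created, nid) pagination key passed
// in from the GET parameters. Returns a result to walk with db_fetch_object().
function anticipage_fetch_page($last_created, $last_nid, $per_page = 10) {
  return db_query_range(
    'SELECT * FROM {node}
     WHERE created <= %d AND (created <> %d OR nid < %d)
     ORDER BY created DESC, nid DESC',
    $last_created, $last_created, $last_nid,
    0, $per_page);
}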

Should a system choose to support moving more than one page in either direction, it would either have to:

  • Read a sufficient depth into the results in nearby pages to obtain the necessary WHERE attributes. This is a bit inefficient but consistent with the rest of the approach.
  • Adopt a hybrid strategy by using a traditional-style query (a LIMIT that skips records) with WHERE conditions beginning the set on the adjacent page. For example, if a user were currently on page 9, the direct link to page 11 would load a page that runs the query for page 10 but starts its listing 10 items later (“LIMIT 10, 10”). Naturally, this becomes less efficient as we allow users to hop greater distances, but the running time, at worst, converges on how the traditional pagination approach works.

This inequality pagination strategy is already a huge win for pagination queries using expensive joins. If everything can be calculated in the database, this is about as good as it gets without denormalization or alternatives to relational databases. Unless, of course, we have a site where an optimistic permissions strategy works well:

An iterative, optimistic permissions strategy

One feature of ACLs is that they’re hard to generically and flexibly define in fixed schemas. Sometimes, it’s easiest to allow callback functions in the application that don’t have to fit into rigid ACL architectures. And for listings where a very large proportion of items are displayable to a very large proportion of users, it can be non-optimal to use a pessimistic permissions strategy where the database vets every item before sending it to the application.

Inequality-based pagination fits well with an optimistic, iterative pagination strategy:

  1. Fetch an initial batch of rows for a page without regard to permissions. The initial batch of rows need not be equivalent to the number intended for display on a page; the system could be optimized to expect approximately 20% of records it fetches to be non-displayable to most users.
  2. Test whether each item is displayable to the current user.
  3. Render and output the displayable items.
  4. Fetch more items if the quota intended for display on the page (say, 10 items) isn’t met. Each subsequent batch from the database may increase in size as the algorithm realizes that it’s finding a low proportion of displayable content.
  5. Repeat until the quota for the page is filled.
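A minimal sketch of this loop; anticipage_fetch_batch() is a hypothetical stand-in for an inequality-based query like the ones above, and node_access() serves as the per-item displayability check:

// Hypothetical sketch: fill a page of $quota displayable items, fetching
// optimistically sized batches and filtering each row through an access check.
function anticipage_optimistic_page($account, $quota = 10) {
  $display = array();
  $cursor = NULL;                          // (created, nid) of the last row seen.
  $batch_size = (int) ceil($quota * 1.2);  // Expect roughly 20% to be hidden.
  while (count($display) < $quota) {
    $rows = anticipage_fetch_batch($cursor, $batch_size);  // Inequality query.
    if (empty($rows)) {
      break;  // Ran out of content before filling the page.
    }
    foreach ($rows as $row) {
      $cursor = array($row->created, $row->nid);
      if (count($display) < $quota && node_access('view', $row, $account)) {
        $display[] = $row;
      }
    }
    // Finding a low proportion of displayable content? Fetch bigger batches.
    $batch_size *= 2;
  }
  return $display;
}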

This strategy works well when a low percentage of items evenly distributed through result sets are locked away from general displayability. Fortunately, that case is quite common for large, public sites with:

  • Publishing workflows that exclude small quantities of content during the editorial process
  • Small quantities of content that need to be hidden, as Wikipedia does for legally troublesome revisions
  • Small numbers of internal documents, like documentation intended for editors

Intelligent memcached and APC interaction across a cluster

Anyone experienced with high-performance, scalable PHP development is familiar with APC and memcached. But used alone, they each have serious limitations:

APC

  • Advantages
    • Low latency
    • No need to serialize/unserialize items
    • Scales perfectly with more web servers
  • Disadvantages
    • No enforced consistency across multiple web servers
    • Cache is not shared; each web server must generate each item

memcached

  • Advantages
    • Consistent across multiple web servers
    • Cache is shared across all web servers; items only need to be generated once
  • Disadvantages
    • High latency
    • Requires serializing/unserializing items
    • Easily shards data across multiple web servers, but is still a big, shared cache

Combining the two

Traditionally, application developers simply think about consistency needs. If consistency is unnecessary (or the scope of the application is one web server), APC is great. Otherwise, memcached is the choice. There is, however, a third, hybrid option: use memcached as a coordination system for invalidation with APC as the main item cache. This functions as a loose L1/L2 cache structure. To borrow terminology from multimaster replication systems, memcached stores “tombstone” records.

The “extremely fresh” check for the APC item (see below) allows throttling hits to memcached. Even a one-second tolerance for cache incoherency massively limits the amount of traffic to the shared memcached pool.

Reading

The algorithm below may not be perfect, but I’ll revise it as I continue work on an implementation. A simplified code sketch follows the steps.

  1. Attempt to load the item from APC:
    1. On an APC hit, check if the item is extremely fresh or recently verified as fresh against memcached. (For perfect cache coherency, the answer is always “not fresh.”)
      1. If fresh, return the item.
      2. If not fresh, check if there is a tombstone record in memcached:
        1. If there is no tombstone (or the local item post-dates the tombstone):
          1. Update the freshness timestamp on the local item.
          2. Return the local item.
        2. Otherwise, treat as an APC miss.
    2. On an APC miss, attempt to load the item from memcached:
      1. On a memcache hit:
        1. Store the item into APC.
        2. Return the item.
      2. On a soft memcache miss (the item is available but due for replacement), attempt to take out a semaphore in APC:
        1. If the APC semaphore was successful, attempt to take out a semaphore in memcached:
          1. If the memcached semaphore was successful:
            1. Write the semaphore to APC.
            2. Rebuild the cache item and write it (see below).
            3. Release the semaphore in memcached. (The semaphore in APC should clear itself very quickly.)
          2. If the memcached semaphore was unsuccessful:
            1. Copy the memcached rebuild semaphore to APC. Store this very briefly (a second or so); it is only to prevent hammering memcached for semaphore checks.
            2. Return the slightly stale item from memcache.
        2. If the APC semaphore was unsuccessful:
          1. Return the slightly stale item.
      3. On a hard memcache miss (no item available at all):
        1. Is a stampede to generate the item acceptable?
          1. If yes:
            1. Generate the item real-time.
            2. Store to the cache.
          2. If no:
            1. Use the APC/memcache semaphore system (see above) to lock regeneration of the item.
            2. If the current request cannot grab the semaphore, fail as elegantly as possible.

Writing/invalidating

  1. Write to/delete from memcached.
  2. Write to/delete from APC.
  3. Set the tombstone record in memcached. This record should persist long enough for all web servers to notice that their local cache needs to be updated.
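And a matching sketch of the write path, under the same assumptions:

// Simplified write path: update both caches, then leave a tombstone so other
// web servers notice that their local APC copies are stale.
function cache_set_coordinated($key, $data, $memcache, $tombstone_ttl = 300) {
  $memcache->set($key, $data);
  apc_store($key, array('data' => $data, 'created' => time(), 'verified' => time()));
  // The tombstone must persist long enough for every web server to check in.
  $memcache->set($key . ':tombstone', time(), $tombstone_ttl);
}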

Benchmarks, hot off the Pressflow

Independent benchmarks by Josh Koenig from Chapter Three show a 3000x throughput increase and a 40x load reduction moving from plain Drupal to Pressflow + Varnish. His testing was performed on a small Amazon EC2 instance, demonstrating how Pressflow can deliver internet-scale performance on modest, inexpensive hardware. Pressflow is able to deliver this class of performance because it’s optimized to support Varnish and other enterprise-grade web infrastructure tools in ways that standard Drupal cannot.

With Pressflow’s API compatibility with Drupal, Josh’s move from Drupal to Pressflow on his project didn’t require any coding or extensive testing. He just replaced Drupal core with Pressflow. (It’s no harder than a minor Drupal update.)

For single-server setups in the Amazon EC2 cloud, Josh’s Project Mercury AMI provides a click-and-run, configured setup with the Pressflow + Varnish stack. For more complex setups, Four Kitchens provides infrastructure consulting services on the Pressflow system.

Need to scale Drupal on EC2? Check out Chapter Three's Mercury project

Josh Koenig from Chapter Three has made pre-release EC2 AMIs (pre-packaged virtual machine images) for Mercury, a project to combine Four Kitchens’ Drupal-derived, high-performance Pressflow with Varnish, Cache Router, and memcached. Initial results show it easily saturating an EC2 instance’s network pipe. Mercury instances directly update their Pressflow releases from the Four Kitchens Bazaar server.

Mercury is an exciting project for anyone who needs to run a high-traffic, Drupal-based site without having to configure a bunch of caching systems.

Giving schema back its good name

For modern applications, the word “schema” has become synonymous with the tables, columns, constraints, indexes, and foreign keys in a relational database management system. A typical relational schema affects physical concerns (like record layout on disk) and logical concerns (like the cascading deletion of records in related tables).

Schemas have gotten a bad name because current RDBMS tools give them these rotten attributes:

  • Changes broadly lock data, requiring downtime or putting excessive, short-term load on the system.
  • The data integrity constraints don’t quite support application integrity constraints. For example, there’s no standard way in SQL to require a column to contain only valid URLs.
  • Coordinating schema changes with application code changes is difficult, largely because schema changes lack convenient coupling with code.

But, instead of solving these problems, there’s been a rebellion against schemas in next-generation object and document databases. In the case of CouchDB, for example, a document isn’t guaranteed to be anything more than valid JSON. That puts a lot of burden on the application, and it makes sharing a database between two applications unsafe and unreliable. It’s also caused some architects from the RDBMS side to falsely assume distributed document/object database architecture is inherently at odds with integrity enforcement.

What’s needed is a fresh perspective on schemas for the document database era. Mostly, we need to fix the errors and omissions of RDBMS schemas.

Current approaches

Let’s say an application is moving a table from schema A to schema B.

The current relational approach

In current relational systems, a DBA runs code on the database server that changes every row from schema A to schema B. Other administrators simultaneously deploy code to support the updated schema. The application is probably down for maintenance in the process.

The current document approach

In current document systems, code gets written to do the following:

  • Correctly read data in schema A or B.
  • Write data in schema B.
  • And if the developers ever want to remove support for reading A from the code, an updater that walks through schema A data and updates it to schema B.

The document-based application lacks the coordination and downtime issues of the relational application, but its code gets increasingly unwieldy. Where does the handling of older data schemas happen? I’ve seen applications implement it anywhere from data access objects/functions (ideally) all the way up to the view/theme layer. The older schema support is often commented well but has no clear path to removal. No one wants to embark on a project to analyze all documents to determine the necessity of old schema support code, let alone risk application breakage by doing such cleanup.

The result is ever-increasing code complexity and little sense of distinct data versions.

What needs to happen

The document approach clearly has the technical edge, but the architecture and burdens are misplaced. So, we should augment today’s document database design with built-in schema management features modeled after how applications cope with schema changes.

Document databases need the following features added:

Basic data types

Document databases are wonderfully flexible for their model of making everything “just a document,” but supporting schemas in applications with more than one data type requires some awareness of type. An application could embed the type into each document, but then typing becomes inconsistent, and the schema update and validation functions below require lots of conditionals. It’s much simpler for the document database to actually understand type.

Schema versioning for each document (or any other non-global set)

One of the problems with current RDBMS schema systems is that they perform updates globally. They have to do this because all data in a table is assumed to be in the same structure dictated by the schema. Document databases go to the other extreme, supporting no schema. Supporting schema versioning at a non-global level would allow updates to happen lazily and without introspection and guesswork.

Similar to type, an application could embed the schema versions in the documents, but then versioning would not be understood by the database, impairing the efficiency and clarity of validation and update code.

Combined with type support, a document database implementing this feature would know the type and schema version for any document it stores, opening a wealth of possibilities described below.

Validation functions

For each type and schema version, this is a function that throws an exception if the document fails validation. “Validation” is however the application defines it, and the validation function runs in a Turing-complete language. The degenerate case (a validation function that simply returns) allows the current, schema-less behavior. New documents written to the database are required to validate (before any updates) against a minimum schema version, if not the most recent.
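As a sketch, a validator for a hypothetical “article” type at schema version 2 might look like this in PHP (the type, its fields, and how the function gets registered are all invented for illustration):

// Hypothetical validator for an "article" type at schema version 2. The
// database would reject any document of that type/version for which this
// function throws.
function article_validate_v2(array $document) {
  if (empty($document['title']) || !is_string($document['title'])) {
    throw new Exception('An article requires a non-empty string title.');
  }
  if (!isset($document['created']) || !is_int($document['created'])) {
    throw new Exception('An article requires an integer creation timestamp.');
  }
  // The kind of application-level constraint SQL cannot express directly.
  if (isset($document['url']) && !filter_var($document['url'], FILTER_VALIDATE_URL)) {
    throw new Exception('The url field must contain a valid URL.');
  }
}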

Update functions

For each type and schema version except schema version zero, there is a function that accepts a document from schema version N-1 and returns a document in schema version N. The database then runs the returned document against the validator for schema version N. By running a series of updates, data can be upgraded from even schema version zero to the latest schema.
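Continuing the hypothetical “article” example, an update function from schema version 1 to version 2 might look like this:

// Hypothetical updater for the "article" type: accepts a version 1 document
// and returns a version 2 document (renaming "name" to "title" and adding a
// default creation timestamp). The database then runs the version 2 validator
// against the result.
function article_update_to_v2(array $document) {
  $document['title'] = isset($document['name']) ? $document['name'] : '';
  unset($document['name']);
  if (!isset($document['created'])) {
    $document['created'] = time();
  }
  return $document;
}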

Referential overlays

Current relational systems implement cascading changes the same way they implement schema changes: update/delete all tuples as necessary at the same time. If type Q references type P with a CASCADE DELETE, then related instances of Q are immediately deleted when something of type P is deleted.

These sorts of cascading changes are very useful, and document databases can support them using what I call “referential overlays.” Instead of deleting Q items at the same time P items are deleted — which requires either an expensive traversal of all Q tuples or maintaining an index of Q-based references to P tuples — we can simply note the deletions of P and apply the changes at read or write time.

The referential overlay logs are easiest to implement in a centralized (though replicated) way. The centralization limits throughput, but the logs are lightweight and much faster than applying the actual cascading effects, as RDBMS tools do currently.

A Drupal-focused use of referential overlays would be the system of nodes and taxonomy terms. If a taxonomy term is deleted, we currently remove the term from all nodes having it. We do this immediately, even if 10,000 nodes have the term. With a referential overlay, we would instead note the deletion of the term and remove it as nodes are loaded. And, of course, we wouldn’t save nodes with references to now-deleted terms. The result is the appearance of a system-wide cascading change without the immediate overhead.
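A small sketch of how that read-time overlay might be applied to a loaded node; both function names are hypothetical:

// Hypothetical read-time application of a referential overlay: strip any
// taxonomy term IDs whose deletion was logged after this node was last
// written. overlay_terms_deleted_since() stands in for a lookup against the
// centralized overlay log.
function apply_term_overlay(array $node) {
  $deleted = overlay_terms_deleted_since($node['changed']);
  $node['taxonomy'] = array_diff($node['taxonomy'], $deleted);
  return $node;
}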

The gardener

The lazy-updating architecture above obviously results in a data store with lots of invalid references and outdated documents that have to be corrected at read-time. For many reasons, we don’t want these to continuously accumulate. Fortunately, there’s a happy middle-ground between the global locking updates of current RDBMS tools and accumulating cruft forever. I call it the gardener.

A real-world gardener prunes plants and pulls weeds. She works progressively across a large field, slowly moving through the rows to keep things tidy. Even though weeds aren’t getting picked in every row every day, things stay under control because the attention is frequent enough.

The “gardener” in the architecture here works the same way with the documents. Before starting a round through the whole document set, the gardener notes the current schema versions and position in the referential overlay log. The gardener then moves through the documents, one by one, updating each one to the latest schema version and applying referential overlays (which may result in deletion). The gardener always updates to the latest schema version and overlay, even if either has changed since starting the gardening round.

Also like a real-world gardener, this approach is horizontally scalable and can, of course, be scheduled to work at convenient times rather than interfering with competing work. Even across many servers, the gardeners can work in relative independence.

After a complete round through the system, the database can delete schema validators and updaters for versions older than the one the gardener started the round with. (We can do this because we know the oldest schema version in use among all documents is the one the gardener started the round with.) Similarly, we can delete referential overlay records that pre-date the beginning of the gardener’s round.
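Putting the round and the pruning together, a sketch in which every helper on the $db handle is invented for illustration:

// Hypothetical sketch of one gardening round over the whole document set.
function gardener_round($db) {
  // Note where we start so we know what is safe to prune afterward.
  $starting_versions = $db->currentSchemaVersions();
  $starting_overlay_position = $db->overlayLogPosition();

  foreach ($db->allDocuments() as $id => $document) {
    // Always bring the document to the *latest* schema and overlay state,
    // even if either has advanced since the round began.
    $document = $db->updateToLatestSchema($document);
    $document = $db->applyReferentialOverlays($document);
    if ($document === NULL) {
      $db->delete($id);  // An overlay (e.g. a cascading delete) removed it.
    }
    else {
      $db->save($id, $document);
    }
  }

  // Nothing older than the round's starting point remains in the store.
  $db->pruneValidatorsAndUpdatersBefore($starting_versions);
  $db->pruneOverlayLogBefore($starting_overlay_position);
}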

Given the size of the data set, the speed of the gardener, and the number of changes being applied, we can establish an upper bound on the time between requesting a schema change (or performing a cascading change) and knowing the change is completely reflected in all documents. It can be comforting to know nothing in the database is so old that it might cause a problem when actually used.

Now, the application software can be unaware of the gardening and even the time from schema change to complete deployment because they’re handled transparently. The application software will never encounter obsolete references or documents in old schema formats because the document database will never deliver a document without updating, validating, and applying the referential overlays.

Other notes and thoughts

Coordinating code and schema

Schema updates are currently decoupled from code largely because running schema updates is an event:

  • Database writes will be blocked.
  • The servers will experience almost complete load saturation during the update.
  • For the above reasons, downtime might be scheduled.

Because the architecture proposed here is non-blocking, it’s possible to have the application trigger any schema update as new code starts running without lots of planning and downtime. For a web application, the schema update could run on the first page request with updated code.

Snapshotting

Because a schema update request is a lightweight event and records have (relatively) lazy updates, it’s possible to combine snapshots with schema changes at low overhead. Let’s say a system is running with copy-on-write (COW) snapshots like LVM. Snapshotting load is directly proportional to the amount of data being written. COW-based snapshots therefore work much better with a lazy update strategy than an RDBMS that might rewrite gigabytes of rows during one schema change.

The same is true for implementing transactional DDL. There simply isn’t much data that needs to be retained to allow DDL rollbacks.

Failed updates

Above, I proposed treating the request time as the point of atomic schema changeover. It’s possible that some documents might fail post-update validation, or the updater itself might throw exceptions for some documents.

One useful feature of some RDBMS tools is transactional DDL (like in PostgreSQL). Any row failing an update reverts the entire update.

It’s possible to efficiently get a similar effect within the proposed architecture two different ways:

  • If the space is available, walk through all records, update and validate them, and store the updated versions in a way invisible to the application. As records get created or updated by the still-live application, write documents with the current schema version and then run the update, creating and storing the updated record at the same time. If the update successfully runs on all records, including records being written during the update, atomically switch to using the updated records (or inform the DBA that the atomic switch is now possible and safe).
  • If there’s less space, do the same as above, but discard the updated records after they pass validation. Post-switch, records will be updated on read or by the gardener.

Either way, administrators can get confirmation of update success without the normal locks and downtime.

Where some RDBMS approaches maintain validity

There are some cases where this architecture may be insufficient, like when an application needs to look up references between documents rather than merely ensure documents do not list invalid ones. This is a case where maintaining an index remains valid. But the architecture proposed here is more flexible than the RDBMS one of combining indexes, foreign keys, and cascading effects in one system. We could, for example, have an index with lazy updates or one maintained entirely outside the document database itself.

A Drupal-based case of this would be taxonomy listing pages, which display any node using a taxonomy term. Often, these listings need to be fast but not perfectly up-to-date. They could, say, lag 10 minutes behind the most updated data. It might also make sense to manage these indexes in a system like Hadoop or Solr.

Pressflow 6 now offers direct downloads

For quite some time, Four Kitchens has provided Pressflow releases to its large-scale clients and anyone interested enough to request a copy. We provided limited access to copies so that we could understand what organizations expected of Pressflow and how they wanted to use it, and so that we could keep all users updated with the latest security, bug-fix, and feature releases.

Recently, we’ve seen increasing request volume, and we’d like to give new users faster access to downloads. So, we’re making Pressflow available for direct download. No signup. No hassles. We even give you the copy-and-paste commands to download and extract Pressflow on a remote Linux machine. (You can still sign up for the updates mailing list, which we highly recommend.)

In the near future, we’ll be posting more documentation on our website that was previously only available to clients and partners of Four Kitchens. We’ll also be opening up our Pressflow issue tracker (for Pressflow-specific issues that do not apply to Drupal).

Pressflow 5 will remain available by request only.

Improvements to the Materialized View API

[Image: An eye-catching graphic, largely irrelevant to this blog post.]

Introduction

The Materialized View API (related posts) provides resources for pre-aggregation and indexing of data for use in complex queries. It does this by managing denormalized tables based on data living elsewhere in the database (and possibly elsewhere). As such, materialized views (MVs) must be populated and updated using large amounts of data. As users change data on the site, MVs must be intelligently updated to avoid complete (read: very slow) rebuilds. Part of performing these intelligent updates is calculating how user changes to data affect MVs in use. Until now, these updates had limitations in scalability and capability.

Update and deletion propagation

In the first iteration of the Materialized View API (MV API), which is currently deployed to Drupal.org, update and deletion propagation were rather naïve: the hooks used to trap changes (hook_nodeapi and hook_comment) simply called the updaters for both the entity itself and anything related. For example, hook_comment() called both the updaters for the comment itself and the node parent of the comment:

function materialized_view_comment($a1, $op) {
  $comment = (array) $a1;
  switch ($op) {
    case 'insert':
    case 'update':
    case 'publish':
    case 'unpublish':
    case 'delete':
      MVJobQueue::update('comment', $comment['cid']);
      // Also "update" the node for the comment.
      MVJobQueue::update('node', $comment['nid']);
  }
}

Calling updaters for related entities is important for aggregation-based data sources, like one that, for a given node, determines the later of when the node was changed and when the latest comment to the node was posted. A change to either the node or a comment related to the node may change the aggregated value:

class MVLastNodeActivityTimestamp extends MVColumn {
  public function getValue($entity_type, $entity_id) {
    $timestamp = db_result(db_query('SELECT MAX(c.timestamp) FROM {comments} c
      WHERE c.nid = %d', $entity_id));
    if (!$timestamp) {
      $timestamp = db_result(db_query('SELECT n.changed FROM {node} n
        WHERE n.nid = %d', $entity_id));
    }
    return $timestamp;
  }
  [...]
}

The design of building propagation into the change-capture hooks proved sufficient for the initial MV API uses, which were forum-centric. The design was sufficient because update propagation was highly predictable: nodes to themselves and comments to themselves and their parent nodes.

But a limitation quickly became apparent: this would not scale to more entity-entity relationships and introducing more MV-supported entity types.

Here’s why:

  • Update notifications quickly became noisy: MVs based purely on node data would be updated whenever comments for the node changed, even if the node-based MV didn’t rely on comment data.
  • Mapping change propagation created misplaced burdens. It’s impossible for a change-capture hook to predict all the possible relationships MV data sources might introduce. For example, if we wanted an MV based on the number of replies to a comment, we would have to trigger updates for every parent comment walking up the tree. Do we update hook_comment yet again?

The solution was to put the change-propagation burden on the data sources, with the default change-propagation algorithm being “changes to X require updating rows related to X in the MVs.”

The default covers the standard entity attribute (e.g. published status for a node) data sources while allowing aggregated sources to become much smarter.

The default change mapper in the MVColumn abstract class:

abstract class MVColumn {
  [...]
  public function getChangeMapping($entity_type, $entity_id) {
    $changed = array();
    $changed[$entity_type] = array($entity_id);
    return $changed;
  }
  [...]
}

But a data source like MVLastNodeActivityTimestamp — which provides, for each node, the later of the last comment posting and the node’s change timestamp — has more complex change-propagation logic. (This code admittedly assumes that comments will post-date the last node changes.)

MVLastNodeActivityTimestamp’s change propagation logic:

class MVLastNodeActivityTimestamp extends MVColumn {
  [...]
  public function getChangeMapping($entity_type, $entity_id) {
    $changed = array();
 
    if ($entity_type == 'node') {
      // A change to a node only affects its own value.
      $changed['node'] = array($entity_id);
    }
    else if ($entity_type == 'comment') {
      $comment = MVEntityCache::get('comment', $entity_id, '_comment_load');
 
      // A change to a comment affects the value of the node it's attached to.
      $changed['node'] = array($comment['nid']);
    }
 
    return $changed;
  }
  [...]
}

getChangeMapping() effectively says:

  • This data source changes whenever a node changes or a comment changes.
  • A node change affects the value of this data source for that node.
  • A comment change affects the value of this data source for the parent node.

Now when an entity changes, the Materialized View API walks through data sources in use on any MV and establishes the unique set of entities needing updating. If a node-based MV doesn’t use any data based on comments, comment changes won’t trigger any changes in that MV. (See the new update() method for class MaterializedView.)

But this caused a problem (already solved in the code above): while hook_comment() gets passed any comments being deleted, it’s not possible for a data source to later load the comments and look up the related nodes to calculate propagation. The solution for this also became a useful overall optimization, the entity cache.

The entity cache

The disconnection between change-capture hooks and data sources used to result in excessive object loading. For example, changing the timestamp on a comment would pass the comment to hook_comment(), but a data source relying on the timestamp for the comment would load the comment fresh from the DB while updating MVs at the end of the page request (when MV updates currently occur).

Now, change-capture hooks populate the entity cache, allowing most data sources to use statically cached entity data. The entity cache also transparently loads entities in the background, keeping the data source code clean.

Of course, the entity cache was originally created to solve the change propagation for deleted items problem. It solves that problem by caching deleted items in the change-capture hooks. MV data sources are then able to load basic data for deleted items despite running after the items disappear from the database.
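As a sketch, the change-capture hook can seed the cache before the row disappears. The MVEntityCache::set() method here is assumed as the counterpart to the MVEntityCache::get() call shown earlier:

function materialized_view_comment($a1, $op) {
  $comment = (array) $a1;
  // Seed the entity cache (assumed set() method) so data sources can still
  // read this comment at the end of the request, even after a deletion.
  MVEntityCache::set('comment', $comment['cid'], $comment);
  switch ($op) {
    case 'insert':
    case 'update':
    case 'publish':
    case 'unpublish':
    case 'delete':
      MVJobQueue::update('comment', $comment['cid']);
      MVJobQueue::update('node', $comment['nid']);
  }
}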

Challenges ahead

Change propagation can be expensive: for modifying a taxonomy term, it’s O(n), where n is the number of nodes with the term. Eventually, change propagation will have to be batched and handled offline, which raises the next issue.

It’s now more complex to queue MV updates to happen offline (read: during cron). The data necessary to calculate propagations lives in a static cache that disappears at the end of each page request. The only truly scalable option now is to have a persistent entity cache. That way, change propagation can happen offline, especially for large sets.

Some sets are so large that the most reasonable option may be to trigger reindexing for affected MVs. Changing a taxonomy term will fall under this category until change propagation can be batched.

The end scalability goal is to have the real-time overhead for running MVs be very small and linearly proportional to the number of entity changes being requested by the user, but the challenges above will need to be solved to reach this goal.

Opportunities

This new architecture opens the door for an explosion of MV data sources and supported entity types. Particularly, MV should expose every CCK field (or Field API field in D7) as an MV data source.

Work on the long-awaited automatic “Views to MV” conversion system is now underway. It will be possible to automatically convert many common Views to run on MV-based data sets, dramatically improving scalability for converted Views without requiring any external system (like Solr, Lucene, or CouchDB).

