Databases

The CAP theorem is like physics to airplanes: every database must be designed around it

Back in 2000, Eric Brewer introduced the CAP theorem, an explanation of inherent tradeoffs in distributed database design. In short: you can’t have it all. (Okay, so there’s some debate about that, but alternative theories generally introduce other caveats.)

On Twitter, I recently critiqued a presentation by Bryan Fink on the Riak system for claiming that Riak is “influenced” by CAP. This sparked a short conversation with Justin Sheehy, also from the project. 140 characters isn’t enough to explain my objection in depth, so I’m taking it here.

While I give Riak credit for having a great architecture and pushing innovation in the NoSQL (non-relational database) space, it can no more claim to be “influenced” by CAP than an airplane design can claim influence from physics. Like physics to an airplane, CAP lays out the rules for distributed databases. With that reality in mind, a distributed database designed without regard for CAP is like an airplane designed without regard for physics. So, claiming unique influence from CAP is tantamount to claiming competing systems have a dangerous disconnect with reality. Or, to carry on the analogy, it’s like Boeing making a claim that their plane designs are uniquely influenced by physics.

But we all know Airbus designs their planes with physics in mind, too, even if they pick different tradeoffs compared to Boeing. And traditional databases were influenced by CAP and its ancestors, like BASE and Bayou from Xerox PARC. CAP says “pick two.” And they did: generally C and P. This traditional — and inflexible — design of picking only one point on the CAP triangle for a database system doesn’t indicate lack of influence.

What Riak actually does is quite novel: it allows operation at more than one point on the triangle of CAP tradeoffs. This is valuable because applications value different parts of CAP for different types of data or operations on data.

For example, a banking application may value availability for viewing bank balances. Lots of transactions happen asynchronously in the real world, so a slightly outdated balance is probably better than refusing any access if there’s a net split between data centers.

In contrast, transferring from one account to another of the same person at the same bank (say, checking to savings) generally happens synchronously. A bank would rather enforce consistency above availability. If there’s a net split, they’d rather disable transfers than have one go awry or, worse, invite fraud.

A system like Riak allows making these compromises within a single system. Something like MySQL NDB, which always enforces consistency, would either unnecessarily take down balance viewing during a net split or require use of a second storage system to provide the desired account-viewing functionality.

Anticipage: scalable pagination, especially for ACLs

Pagination is one of the hardest problems for web applications supporting access-control lists (ACLs). Drupal and Pressflow support ACLs through the node access system.

Problems with traditional pagination

  • Because pagination uses row offsets into the results, browsing listings where newly published items get added to the beginning of the results creates “page drift.” Page drift is where a user already browsing through paginated results sees, for example, items E, D, and C on page one, waits awhile, clicks to the next page, and sees items C, B, and A. Going back to page one again shows F (newly published), E, and D. Item C “drifted” to page two while the user was reading page one. If new items are published frequently enough, pagination can become unusable due to this drifting effect.
  • Even if content and ordering are fully indexed, jumping n rows into the results remains inefficient; it scales linearly with depth into pagination.
  • Paginating sets where the content and ordering are not fully indexed is even worse, often to the point of being unusable.
  • The design is optimized around visiting arbitrary page offsets, which does not reflect user needs. Users only need to make relative jumps in pagination of up to 10 pages (or so) in either direction or to start from the end of the results. (If users are navigating results by hopping to arbitrary pages to drill down to what they need, there are other flaws in the system.)

“Anticipage”

With a combination of paginating by inequality and, optionally, optimistic permission review, a site can paginate content with the following benefits:

  • No page drift
  • Stable pagination URLs that will generally include the same items, regardless of how much new content has been published to the beginning or end of the content listing
  • If the ordering is indexed, logarithmic time to finding the first item on a page, regardless of how many pages the user is into browsing
  • Minimal computation of JOINs, an especially big benefit for sites using JOINs for ACLs

The general strategy is to amortize the cost of pagination as the user browses through pages.

Paginating by inequality

The path to achieving fast pagination first involves a fresh strategy for sorting and slicing content. A “pagination key” must be selected for the intended set of content that:

  • Includes the column(s) desired for sorting. For a Drupal site, this might be the “created” column on the “node” table.
  • Is a superkey (unique for all rows in the table but not necessarily minimally unique). Sorting by the columns in a superkey is inherently deterministic. And because a superkey is also unique, it allows us to use WHERE criteria on the deterministically sorted set to deterministically define pages. An existing set of sort columns for a listing can always be converted to a superkey by appending a primary key to the end.

For a Drupal site, a qualifying pagination key could be (created, nid) on the “node” table. This key allows us to deterministically sort the rows in the node table and slice the results into pages. Really, everyone should use such pagination keys regardless of pagination strategy in order to have a deterministic sort order.

Having selected (created, nid) as the key, the base query providing our entire listing would look something like this:

SELECT * FROM node ORDER BY created DESC, nid DESC;

Traditionally, a site would then paginate the second page of 10 items in MySQL using a query like this:

SELECT * FROM node ORDER BY created DESC, nid DESC LIMIT 10, 10;

But because we’re ordering by a pagination key (as defined above), we can simply run the base query for the first page and note the attributes of the final item on the page. In this example, the final node on the first page has a creation timestamp of “1230768000” and a node ID of “987.” We can then embed this data in the GET criteria of the link to the second page, resulting in running a query like this for rendering the second page:

SELECT * FROM node WHERE created <= 1230768000 AND (created <> 1230768000 OR nid < 987) ORDER BY created DESC, nid DESC LIMIT 10;

We’re asking for the same sorting order but adding a WHERE condition carefully constructed to start our results right after the content on the first page. (Note: this query could also be dissected into a UNION if the database does not properly optimize the use of the index.) This strategy allows the database to fully employ indexes on the data to find, in logarithmic time, the first item on any page. Note how page drift becomes impossible when pagination happens using keys instead of offsets.
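
As a rough illustration, here is how the next-page query might be built in Drupal 6-style PHP. The function name and parameter handling are hypothetical; only the WHERE construction mirrors the query above, with the last item’s “created” and “nid” values carried in from the GET parameters on the pager link.

// A minimal sketch (hypothetical helper, Drupal 6-style API): run the query
// for the page that starts immediately after the last item already shown.
function anticipage_fetch_page($last_created, $last_nid, $per_page = 10) {
  // Same ordering as the base query; the WHERE clause starts the results
  // right after the item identified by ($last_created, $last_nid).
  $result = db_query_range(
    'SELECT * FROM {node}
     WHERE created <= %d AND (created <> %d OR nid < %d)
     ORDER BY created DESC, nid DESC',
    $last_created, $last_created, $last_nid, 0, $per_page);

  $nodes = array();
  while ($row = db_fetch_object($result)) {
    $nodes[] = $row;
  }
  return $nodes;
}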

Should a system choose to support moving more than one page in either direction, it would either have to:

  • Read a sufficient depth into the results in nearby pages to obtain the necessary WHERE attributes. This is a bit inefficient but consistent with the rest of the approach.
  • Adopt a hybrid strategy by using a traditional-style query (a LIMIT that skips records) with WHERE conditions beginning the set on the adjacent page. For example, if a user were currently on page 9, the direct link to page 11 would load a page that runs the query for page 10 but starts its listing 10 items later (“LIMIT 10, 10”). Naturally, this becomes less efficient as we allow users to hop greater distances, but the worst-case running time converges on that of the traditional pagination approach.

This inequality pagination strategy is already a huge win for pagination queries using expensive joins. If everything can be calculated in the database, this is about as good as it gets without denormalization or alternatives to relational databases. Unless, of course, we have a site where an optimistic permissions strategy works well:

An iterative, optimistic permissions strategy

One feature of ACLs is that they’re hard to generically and flexibly define in fixed schemas. Sometimes, it’s easiest to allow callback functions in the application that don’t have to fit into rigid ACL architectures. And for listings where a very large proportion of items are displayable to a very large proportion of users, it can be non-optimal to use a pessimistic permissions strategy where the database vets every item before sending it to the application.

Inequality-based pagination fits well with an optimistic, iterative pagination strategy:

  1. Fetch an initial batch of rows for a page without regard to permissions. The initial batch of rows need not be equivalent to the number intended for display on a page; the system could be optimized to expect approximately 20% of records it fetches to be non-displayable to most users.
  2. Test whether each item is displayable to the current user.
  3. Render and output the displayable items.
  4. Fetch more items if the quota intended for display on the page (say, 10 items) isn’t met. Each subsequent batch from the database may increase in size as the algorithm realizes that it’s finding a low proportion of displayable content.
  5. Repeat until the quota for the page is filled.
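
A minimal sketch of that loop, assuming a hypothetical fetch_next_batch() that runs the inequality-based query above from a remembered cursor, and using a node_access()-style check to stand in for whatever permission callback the application defines:

// Sketch of the optimistic, iterative fill loop. fetch_next_batch() and the
// batch-growth factor are hypothetical.
function anticipage_fill_page($account, $per_page = 10) {
  $displayable = array();
  $batch_size = 12;  // Expect roughly 20% of fetched rows to be hidden.

  while (count($displayable) < $per_page) {
    // Fetch the next batch in pagination-key order, without regard to permissions.
    $batch = fetch_next_batch($batch_size);
    if (empty($batch)) {
      break;  // No more content at all.
    }
    foreach ($batch as $node) {
      if (node_access('view', $node, $account)) {
        $displayable[] = $node;
        if (count($displayable) == $per_page) {
          break;
        }
      }
    }
    // Finding a low proportion of displayable items? Grow the next batch.
    $batch_size *= 2;
  }
  return $displayable;
}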

This strategy works well when a low percentage of items evenly distributed through result sets are locked away from general displayability. Fortunately, that case is quite common for large, public sites with:

  • Publishing workflows that exclude small quantities of content during the editorial process
  • Small quantities of content that need to be hidden, as Wikipedia does with legally troublesome revisions
  • Small numbers of internal documents, like documentation intended for editors

Giving schema back its good name

For modern applications, the word “schema” has become synonymous with the tables, columns, constraints, indexes, and foreign keys in a relational database management system. A typical relational schema affects physical concerns (like record layout on disk) and logical concerns (like the cascading deletion of records in related tables).

Schemas have gotten a bad name because current RDBMS tools give them these rotten attributes:

  • Changes broadly lock data, requiring downtime or putting excessive, short-term load on the system.
  • The data integrity constraints don’t quite support application integrity constraints. For example, there’s no standard way in SQL to require a column to contain only valid URLs.
  • Coordinating schema changes with application code changes is difficult, largely because schema changes lack convenient coupling with code.

But, instead of solving these problems, there’s been a rebellion against schemas in next-generation object and document databases. In the case of CouchDB, for example, a document isn’t guaranteed to be anything more than valid JSON. That puts a lot of burden on the application, and it makes sharing a database between two applications unsafe and unreliable. It’s also caused some architects from the RDBMS side to falsely assume distributed document/object database architecture is inherently at odds with integrity enforcement.

What’s needed is a fresh perspective on schemas for the document database era. Mostly, we need to fix the errors and omissions of RDBMS schemas.

Current approaches

Let’s say an application is moving a table from schema A to schema B.

The current relational approach

In current relational systems, a DBA runs code on the database server that changes every row from schema A to schema B. Other administrators simultaneously deploy code to support the updated schema. The application is probably down for maintenance in the process.

The current document approach

In current document systems, code gets written to do the following:

  • Correctly read data in schema A or B.
  • Write data in schema B.
  • If the developers ever want to remove support for reading A from the code, write an updater that walks through schema A data and converts it to schema B.

The document-based application lacks the coordination and downtime issues of the relational application, but its code gets increasingly unwieldy. Where does the handling of older data schemas happen? I’ve seen applications implement it anywhere from data access objects/functions (ideally) all the way to the view/theme layer. The older schema support is often commented well but has no clear path to removal. No one wants to embark on a project to analyze all documents to determine the necessity of old schema support code, let alone risk application breakage by doing such cleanup.

The result is ever-increasing code complexity and little sense of distinct data versions.

What needs to happen

The document approach clearly has the technical edge, but the architecture and burdens are misplaced. So, we should augment today’s document database design with built-in schema management features modeled after how applications cope with schema changes.

Document databases need the following features added:

Basic data types

Document databases are wonderfully flexible for their model of making everything “just a document,” but supporting schemas in applications with more than one data type requires some awareness of type. An application could embed the type into each document, but then typing becomes inconsistent, and the schema update and validation functions below require lots of conditionals. It’s much simpler for the document database to actually understand type.

Schema versioning for each document (or any other non-global set)

One of the problems with current RDBMS schema systems is that they perform updates globally. They have to do this because all data in a table is assumed to be in the same structure dictated by the schema. Document databases go to the other extreme, supporting no schema. Supporting schema versioning at a non-global level would allow updates to happen lazily and without introspection and guesswork.

Similar to type, an application could embed the schema versions in the documents, but then versioning would not be understood by the database, impairing the efficiency and clarity of validation and update code.

Combined with type support, a document database implementing this feature would know the type and schema version for any document it stores, opening a wealth of possibilities described below.

Validation functions

For each type and schema version, this is a function that throws an exception if the document fails validation. “Validation” is however the application defines it, and the validation function runs in a Turing-complete language. The degenerate case (a validation function that simply returns) allows the current, schema-less behavior. New documents written to the database are required to validate (before any updates) against a minimum schema version, if not the most recent.
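
To make this concrete, here is what registering such a validator might look like. The register_validator() API and the “article” type are purely hypothetical; the point is that validation is arbitrary application logic, including checks (like URL validity) that SQL schemas cannot express.

// Purely hypothetical API: register a validator for the "article" type at
// schema version 2. The database runs it before accepting any write.
function article_validate_v2(array $doc) {
  if (empty($doc['title']) || !is_string($doc['title'])) {
    throw new Exception('Articles must have a string title.');
  }
  // The kind of check SQL schemas cannot express: a field limited to valid URLs.
  if (isset($doc['url']) && !filter_var($doc['url'], FILTER_VALIDATE_URL)) {
    throw new Exception('The url field must contain a valid URL.');
  }
}
register_validator('article', 2, 'article_validate_v2');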

Update functions

For each type and schema version except schema version zero, there is a function that accepts a document from schema version N-1 and returns a document in schema version N. The database then runs the returned document against the validator for schema version N. By running a series of updates, data can be upgraded from even schema version zero to the latest schema.
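
Continuing the hypothetical “article” example, an updater might look like this; register_updater() is, again, an assumed API, and the field change is made up:

// Purely hypothetical API: an updater that accepts an "article" document at
// schema version 1 and returns it at schema version 2. The database would
// then validate the result against the version 2 validator before storing it.
function article_update_to_v2(array $doc) {
  // In this made-up example, version 2 splits the old "author" string into
  // structured first/last fields.
  $parts = explode(' ', $doc['author'], 2);
  $doc['author'] = array(
    'first' => $parts[0],
    'last' => isset($parts[1]) ? $parts[1] : '',
  );
  return $doc;
}
register_updater('article', 2, 'article_update_to_v2');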

Referential overlays

Current relational systems implement cascading changes the same way they implement schema changes: update/delete all tuples as necessary at the same time. If type Q references type P with a CASCADE DELETE, then related instances of Q are immediately deleted when something of type P is deleted.

These sorts of cascading changes are very useful, and document databases can support them using what I call “referential overlays.” Instead of deleting Q items at the same time P items are deleted — which requires either an expensive traversal of all Q tuples or maintaining an index of Q-based references to P tuples — we can simply note the deletions of P and apply the changes at read or write time.

The referential overlay logs are easiest to implement in a centralized (though replicated) way. The centralization limits throughput, but the logs are lightweight and much faster than applying the actual cascading effects, as RDBMS tools do currently.

A Drupal-focused use of referential overlays would be the system of nodes and taxonomy terms. If a taxonomy term is deleted, we currently remove the term from all nodes having it. We do this immediately, even if 10,000 nodes have the term. With a referential overlay, we would instead note the deletion of the term and remove it as nodes are loaded. And, of course, we wouldn’t save nodes with references to now-deleted terms. The result is the appearance of a system-wide cascading change without the immediate overhead.
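
A sketch of how that read-time application of the overlay might look for the node/term case; deleted_term_ids() is a hypothetical accessor for the overlay log:

// Hypothetical sketch: apply the referential overlay log while loading a node.
// deleted_term_ids() stands in for reading the (replicated) log of term deletions.
function overlay_filter_node_terms($node) {
  $deleted = deleted_term_ids();  // e.g. array(14 => TRUE, 379 => TRUE)
  foreach ($node->taxonomy as $tid => $term) {
    if (isset($deleted[$tid])) {
      // The term was deleted after this node was last saved; hide it at read
      // time and let a later save (or cleanup pass) make the removal permanent.
      unset($node->taxonomy[$tid]);
    }
  }
  return $node;
}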

The gardener

The lazy-updating architecture above obviously results in a data store with lots of invalid references and outdated documents that have to be corrected at read-time. For many reasons, we don’t want these to continuously accumulate. Fortunately, there’s a happy middle-ground between the global locking updates of current RDBMS tools and accumulating cruft forever. I call it the gardener.

A real-world gardener prunes plants and pulls weeds. She works progressively across a large field, slowly moving through the rows to keep things tidy. Even though weeds aren’t getting picked in every row every day, things stay under control because the attention is frequent enough.

The “gardener” in the architecture here works the same way with the documents. Before starting a round through the whole document set, the gardener notes the current schema versions and position in the referential overlay log. The gardener then moves through the documents, one by one, updating each one to the latest schema version and applying referential overlays (which may result in deletion). The gardener always updates to the latest schema version and overlay, even if either has changed since starting the gardening round.
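
A rough sketch of one gardening round under these assumptions (every function here is hypothetical):

// Hypothetical sketch of one gardening round.
function gardener_round() {
  $start_versions = current_schema_versions();  // Latest version per type.
  $start_position = overlay_log_position();     // Where the overlay log ends now.

  foreach (all_document_ids() as $id) {
    $doc = load_document($id);
    // Always bring the document to the *current* latest schema and overlay
    // state, even if either has advanced since the round started.
    $doc = apply_updates_to_latest($doc);
    $doc = apply_referential_overlays($doc);
    if ($doc === NULL) {
      delete_document($id);  // An overlay cascaded into a deletion.
    }
    else {
      save_document($doc);
    }
  }

  // Every document now post-dates the versions noted at the start of the round.
  prune_validators_and_updaters_before($start_versions);
  prune_overlay_log_before($start_position);
}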

Also like a real-world gardener, this approach is horizontally scalable and can, of course, be scheduled to work at convenient times rather than interfering with competing work. Even across many servers, the gardeners can work in relative independence.

After a complete round through the system, the database can delete schema validators and updaters for versions older than the one the gardener started the round with. (We can do this because we know the oldest schema version in use among all documents is the one the gardener started the round with.) Similarly, we can delete referential overlay records that pre-date the beginning of the gardener’s round.

Given the size of the data set, the speed of the gardener, and the number of changes being applied, we can establish an upper bound on the time between requesting a schema change (or performing a cascading change) and knowing the change is completely reflected in all documents. It can be comforting to know nothing in the database is so old that it might cause a problem when actually used.

Now, the application software can be unaware of the gardening and even the time from schema change to complete deployment because they’re handled transparently. The application software will never encounter obsolete references or documents in old schema formats because the document database will never deliver a document without updating it, validating it, and applying the referential overlays.

Other notes and thoughts

Coordinating code and schema

Schema updates are currently decoupled from code largely because running schema updates is an event:

  • Database writes will be blocked.
  • The servers will experience almost complete load saturation during the update.
  • For the above reasons, downtime might be scheduled.

Because the architecture proposed here is non-blocking, it’s possible to have the application trigger any schema update as new code starts running without lots of planning and downtime. For a web application, the schema update could run on the first page request with updated code.
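
In a web application, that trigger could be as simple as the following, run early in the request; the API calls are hypothetical:

// Hypothetical: on the first request served by newly deployed code, record
// that "article" documents should now target schema version 3. Because
// updates are lazy, this call only bumps the target version; no data is
// rewritten here.
function myapp_init() {
  if (document_db_schema_version('article') < 3) {
    document_db_set_schema_version('article', 3);
  }
}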

Snapshotting

Because a schema update request is a lightweight event and records update (relatively) lazily, it’s possible to combine snapshots with schema changes at low overhead. Let’s say a system is running with copy-on-write (COW) snapshots like LVM. Snapshotting load is directly proportional to the amount of data being written. COW-based snapshots therefore work much better with a lazy update strategy than with an RDBMS that might rewrite gigabytes of rows during one schema change.

The same is true for implementing transactional DDL. There simply isn’t much data that needs to be retained to allow DDL rollbacks.

Failed updates

Above, I proposed treating the request time as the point of atomic schema changeover. It’s possible that some documents might fail post-update validation, or the updater itself might throw exceptions for some documents.

One useful feature of some RDBMS tools is transactional DDL (like in PostgreSQL). Any row failing an update reverts the entire update.

It’s possible to efficiently get a similar effect within the proposed architecture in two different ways:

  • If the space is available, walk through all records, update and validate them, and store the updated versions in a way invisible to the application. As records get created or updated by the still-live application, write documents with the current schema version and then run the update, creating and storing the updated record at the same time. If the update successfully runs on all records, including records being written during the update, atomically switch to using the updated records (or inform the DBA that the atomic switch is now possible and safe).
  • If there’s less space, do the same as above, but discard the updated records after they pass validation. Post-switch, records will be updated on read or by the gardener.

Either way, administrators can get confirmation of update success without the normal locks and downtime.
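
A sketch of the lower-space variant (the second bullet above), with hypothetical helpers:

// Hypothetical sketch: walk every document, run the update and validation,
// but discard the results. Nothing switches over until every document passes.
function verify_schema_update($type, $new_version) {
  foreach (all_document_ids($type) as $id) {
    $updated = apply_updates_to_version(load_document($id), $new_version);
    validate_document($type, $new_version, $updated);  // Throws on failure.
    // $updated is discarded; after the switch, reads or the gardener redo it.
  }
  // Every document passed, so the atomic switch is now safe.
  document_db_set_schema_version($type, $new_version);
}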

Where some RDBMS approaches maintain validity

There are some cases where this architecture may be insufficient, like when an application needs to look up references between documents rather than merely ensure documents do not list invalid ones. This is a case where an index has validity. But the architecture proposed here is more flexible than the RDBMS one of combining indexes, foreign keys, and cascading effect in one system. We could, for example, have an index with lazy updates or one maintained entirely outside the document database itself.

A Drupal-based case of this would be taxonomy listing pages, which display any node using a taxonomy term. Often, these listings need to be fast but not perfectly up-to-date. They could, say, lag 10 minutes behind the most updated data. It might also make sense to manage these indexes in a system like Hadoop or Solr.

Improvements to the Materialized View API

Introduction

The Materialized View API (related posts) provides resources for pre-aggregation and indexing of data for use in complex queries. It does this by managing denormalized tables based on data living elsewhere in the database (and possibly elsewhere). As such, materialized views (MVs) must be populated and updated using large amounts of data. As users change data on the site, MVs must be intelligently updated to avoid complete (read: very slow) rebuilds. Part of performing these intelligent updates is calculating how user changes to data affect MVs in use. Until now, these updates had limitations in scalability and capability.

Update and deletion propagation

In the first iteration of the Materialized View API (MV API), which is currently deployed to Drupal.org, update and deletion propagation were rather naïve: the hooks used to trap changes (hook_nodeapi and hook_comment) simply called the updaters for both the entity itself and anything related. For example, hook_comment() called the updaters for both the comment itself and the node parent of the comment:

function materialized_view_comment($a1, $op) {
  $comment = (array) $a1;
  switch ($op) {
    case 'insert':
    case 'update':
    case 'publish':
    case 'unpublish':
    case 'delete':
      MVJobQueue::update('comment', $comment['cid']);
      // Also "update" the node for the comment.
      MVJobQueue::update('node', $comment['nid']);
  }
}

Calling updaters for related entities is important for aggregation-based data sources, like one that, for a given node, determines the later of when the node was changed and when the latest comment to the node was posted. A change to either the node or a comment related to the node may change the aggregated value:

class MVLastNodeActivityTimestamp extends MVColumn {
  public function getValue($entity_type, $entity_id) {
    $timestamp = db_result(db_query('SELECT MAX(c.timestamp) FROM {comments} c
      WHERE c.nid = %d', $entity_id));
    if (!$timestamp) {
      $timestamp = db_result(db_query('SELECT n.changed FROM {node} n
        WHERE n.nid = %d', $entity_id));
    }
    return $timestamp;
  }
  [...]
}

The design of building propagation into the change-capture hooks proved sufficient for the initial MV API uses, which were forum-centric. The design was sufficient because update propagation was highly predictable: nodes to themselves and comments to themselves and their parent nodes.

But a limitation quickly became apparent: this approach would not scale as more entity-entity relationships and more MV-supported entity types were introduced.

Here’s why:

  • Update notifications quickly became noisy: MVs based purely on node data would be updated whenever comments for the node changed, even if the node-based MV didn’t rely on comment data.
  • Mapping change propagation created misplaced burdens. It’s impossible for a change-capture hook to predict all the possible relationships MV data sources might introduce. For example, if we wanted an MV based on the number of replies to a comment, we would have to trigger updates for every parent comment walking up the tree. Do we update hook_comment yet again?

The solution was to put the change-propagation burden on the data sources, with the default change-propagation algorithm being “changes to X require updating rows related to X in the MVs.”

The default covers the standard entity attribute (e.g. published status for a node) data sources while allowing aggregated sources to become much smarter.

The default change mapper in the MVColumn abstract class:

abstract class MVColumn {
  [...]
  public function getChangeMapping($entity_type, $entity_id) {
    $changed = array();
    $changed[$entity_type] = array($entity_id);
    return $changed;
  }
  [...]
}

But a data source like MVLastNodeActivityTimestamp — which provides, for each node, the later of the last comment posting and the node change timestamp — has more complex change-propagation logic. (This code admittedly assumes that comments will post-date the last node changes.)

MVLastNodeActivityTimestamp’s change propagation logic:

class MVLastNodeActivityTimestamp extends MVColumn {
  [...]
  public function getChangeMapping($entity_type, $entity_id) {
    $changed = array();
 
    if ($entity_type == 'node') {
      // A change to a node only affects its own value.
      $changed['node'] = array($entity_id);
    }
    else if ($entity_type == 'comment') {
      $comment = MVEntityCache::get('comment', $entity_id, '_comment_load');
 
      // A change to a comment affects the value of the node it's attached to.
      $changed['node'] = array($comment['nid']);
    }
 
    return $changed;
  }
  [...]
}

getChangeMapping() effectively says:

  • This data source changes whenever a node changes or a comment changes.
  • A node change affects the value of this data source for that node.
  • A comment change affects the value of this data source for the parent node.

Now when an entity changes, the Materialized View API walks through data sources in use on any MV and establishes the unique set of entities needing updating. If a node-based MV doesn’t use any data based on comments, comment changes won’t trigger any changes in that MV. (See the new update() method for class MaterializedView.)
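
Roughly, that propagation step looks like the following simplified sketch (not the actual MaterializedView::update() implementation):

// Simplified sketch: ask every data source in use which entities a change
// affects, then queue updates only for the resulting unique set.
function mv_propagate_change($entity_type, $entity_id, array $data_sources) {
  $needs_update = array();
  foreach ($data_sources as $source) {
    foreach ($source->getChangeMapping($entity_type, $entity_id) as $type => $ids) {
      foreach ($ids as $id) {
        $needs_update[$type][$id] = TRUE;  // Dedupe per entity.
      }
    }
  }
  foreach ($needs_update as $type => $ids) {
    foreach (array_keys($ids) as $id) {
      MVJobQueue::update($type, $id);
    }
  }
}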

But this caused a problem (already solved in the code above): while hook_comment() gets passed any comments being deleted, it’s not possible for a data source to later load the comments and look up the related nodes to calculate propagation. The solution for this also became a useful overall optimization, the entity cache.

The entity cache

The disconnection between change-capture hooks and data sources used to result in excessive object loading. For example, changing the timestamp on a comment would pass the comment to hook_comment(), but a data source relying on the timestamp for the comment would load the comment fresh from the DB while updating MVs at the end of the page request (when MV updates currently occur).

Now, change-capture hooks populate the entity cache, allowing most data sources to use statically cached entity data. The entity cache also transparently loads entities in the background, keeping the data source code clean.

Of course, the entity cache was originally created to solve the change propagation for deleted items problem. It solves that problem by caching deleted items in the change-capture hooks. MV data sources are then able to load basic data for deleted items despite running after the items disappear from the database.
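
For example, the comment change-capture hook can seed the cache before the row disappears; this is an abridged sketch in which MVEntityCache::set() is an assumed method and the MVJobQueue calls from the earlier listing are elided:

// Sketch: seed the entity cache in the change-capture hook so data sources
// can still read basic comment data after a deletion removes it from the
// database. MVEntityCache::set() is assumed here.
function materialized_view_comment($a1, $op) {
  $comment = (array) $a1;
  MVEntityCache::set('comment', $comment['cid'], $comment);
  // ... MVJobQueue::update() calls as shown earlier ...
}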

Challenges ahead

Change propagation can be expensive: for modifying a taxonomy term, it’s O(n), where n is the number of nodes with the term. Eventually, change propagation will have to be batched and handled offline, which raises the next issue.

It’s now more complex to queue MV updates to happen offline (read: during cron). The data necessary to calculate propagations lives in a static cache that disappears at the end of each page request. The only truly scalable option now is to have a persistent entity cache. That way, change propagation can happen offline, especially for large sets.

Some sets are so large that the most reasonable option may be to trigger reindexing for affected MVs. Changing a taxonomy term will fall under this category until change propagation can be batched.

The end scalability goal is for the real-time overhead of running MVs to be very small and linearly proportional to the number of entity changes being requested by the user, but the challenges above will need to be solved to reach this goal.

Opportunities

This new architecture opens the door for an explosion of MV data sources and supported entity types. Particularly, MV should expose every CCK field (or Field API field in D7) as an MV data source.

The long-awaited automatic “Views to MV” conversion system work is now underway. It will be possible to automatically convert many common Views to run on MV-based data sets, dramatically improving scalability for converted Views without requiring any external system (like Solr, Lucene, or CouchDB).

David's Epic Presentation Megapost