Improvements to the Materialized View API

[img_assist|nid=196|title=An eye-catching graphic, largely irrelevant to this blog post.|desc=|link=none|align=right|width=200]

Introduction

The Materialized View API (related posts) provides resources for pre-aggregation and indexing of data for use in complex queries. It does this by managing denormalized tables based on data living elsewhere in the database (and possibly elsewhere). As such, materialized views (MVs) must be populated and updated using large amounts of data. As users change data on the site, MVs must be intelligently updated to avoid complete (read: very slow) rebuilds. Part of performing these intelligent updates is calculating how user changes to data affect MVs in use. Until now, these updates had limitations in scalability and capability.

Update and deletion propagation

In the first iteration of the Materialized View API (MV API), which is currently deployed to Drupal.org, update and deletion propagation were rather naïve: the hooks used to trap changes (hook_nodeapi and hook_comment) simply called the updater for both the entity itself and the updater for anything related. For example, hook_comment() called both the updaters for the comment itself and the node parent of the comment:

function materialized_view_comment($a1, $op) {
  $comment = (array) $a1;
  switch ($op) {
    case 'insert':
    case 'update':
    case 'publish':
    case 'unpublish':
    case 'delete':
      MVJobQueue::update('comment', $comment['cid']);
      // Also "update" the node for the comment.
      MVJobQueue::update('node', $comment['nid']);
  }
}

Calling updaters for related entities is important for aggregation-based data sources, like one that, for a given node, determines the later of when the node was changed and when the latest comment to the node was posted. A change to either the node or a comment related to the node may change the aggregated value:

class MVLastNodeActivityTimestamp extends MVColumn {
  public function getValue($entity_type, $entity_id) {
    $timestamp = db_result(db_query('SELECT MAX(c.timestamp) FROM {comments} c
      WHERE c.nid = %d', $entity_id));
    if (!$timestamp) {
      $timestamp = db_result(db_query('SELECT n.changed FROM {node} n
        WHERE n.nid = %d', $entity_id));
    }
    return $timestamp;
  }
  [...]
}

The design of building propagation into the change-capture hooks proved sufficient for the initial MV API uses, which were forum-centric. The design was sufficient because update propagation was highly predictable: nodes to themselves and comments to themselves and their parent nodes.

But a limitation quickly became apparent: this would not scale to more entity-entity relationships and introducing more MV-supported entity types.

Here’s why:

  • Update notifications quickly became noisy: MVs based on purely node data would be updated whenever comments for the node changed, even if the node-based MV didn’t rely on comment data.
  • Mapping change propagation created misplaced burdens. It’s impossible for a change-capture hook to predict all the possible relationships MV data sources might introduce. For example, if we wanted an MV based on the number of replies to a comment, we would have to trigger updates for every parent comment walking up the tree. Do we update hook_comment yet again?

The solution was to put the change-propagation burden on the data sources, with the default change-propagation algorithm being “changes to X require updating rows related to X in the MVs.”

The default covers the standard entity attribute (e.g. published status for a node) data sources while allowing aggregated sources to become much smarter.

The default change mapper in the MVColumn abstract class:

abstract class MVColumn {
  [...]
  public function getChangeMapping($entity_type, $entity_id) {
    $changed = array();
    $changed[$entity_type] = array($entity_id);
    return $changed;
  }
  [...]
}

But for data sources like MVLastNodeActivityTimestamp — which provides a data sources for nodes which is the later of the last comment posting and the node change timestamp — has more complex change-propagation logic. (This code admittedly assumes that comments will post-date the last node changes.)

MVLastNodeActivityTimestamp’s change propagation logic:

class MVLastNodeActivityTimestamp extends MVColumn {
  [...]
  public function getChangeMapping($entity_type, $entity_id) {
    $changed = array();
 
    if ($entity_type == 'node') {
      // A change to a node only affects its own value.
      $changed['node'] = array($entity_id);
    }
    else if ($entity_type == 'comment') {
      $comment = MVEntityCache::get('comment', $entity_id, '_comment_load');
 
      // A change to a comment affects the value of the node it's attached to.
      $changed['node'] = array($comment['nid']);
    }
 
    return $changed;
  }
  [...]
}

getChangeMapping() effectively says:

  • This data source changes whenever a node changes or a comment changes.
  • A node change affects the value of this data source for that node.
  • A comment change affect the value of this data source for the parent node.

Now when an entity changes, the Materialized View API walks through data sources in use on any MV and establishes the unique set of entities needing updating. If a node-based MV doesn’t use any data based on comments, comment changes won’t trigger any changes in that MV. (See the new update() method for class MaterializedView.)

But this caused a problem (already solved in the code above): while hook_comment() gets passed any comments being deleted, it’s not possible for a data source to later load the comments and look up the related nodes to calculate propagation. The solution for this also became a useful overall optimization, the entity cache.

The entity cache

The disconnection between change-capture hooks and data sources used to result in excessive object loading. For example, changing the timestamp on a comment would pass the comment to hook_comment(), but a data source relying on the timestamp for the comment would load the comment fresh from the DB while updating MVs at the end of the page request (when MV updates currently occur).

Now, change-capture hooks populate the entity cache, allowing most data sources to use statically cached entity data. The entity cache also transparently loads entities in the background, keeping the data source code clean.

Of course, the entity cache was originally created to solve the change propagation for deleted items problem. It solves that problem by caching deleted items in the change-capture hooks. MV data sources are then able to load basic data for deleted items despite running after the items disappear from the database.

Challenges ahead

Change propagation can be expensive: for modifying a taxonomy term, it’s O(n), where n is the number of nodes with the term. Eventually, change propagation will have to be batched and handled offline, which raises the next issue.

It’s now more complex to queue MV updates to happen offline (read: during cron). The data necessary to calculate propagations lives in a static cache that disappears at the end of each page request. The only truly scalable option now is to have a persistent entity cache. That way, change propagation can happen offline, especially for large sets.

Some sets are so large that the most reasonable option may be to trigger reindexing for affected MVs. Changing a taxonomy term will fall under this category until change propagation can be batched.

The end scalability goal is to have the real-time overhead for running MVs be very small and linearly proportional to the number of entity changes being requested by the user, but the challenges above will need implemented solutions to reach this goal.

Opportunities

This new architecture opens the door for an explosion of MV data sources and supported entity types. Particularly, MV should expose every CCK field (or Field API field in D7) as an MV data source.

The long-awaited automatic “Views to MV” conversion system work is now underway. It will be possible to automatically convert many common Views to run on MV-based data sets, dramatically improving scalability for converted Views without requiring any external system (like Solr, Lucene, or CouchDB).

Commenting on this Blog post is closed.

Comments

Many thanks for this update. In Drupal7, I would think that the non-persistent entity cache is stored in the drupal_static() system.

You dropped a bombshell at the end by mentioning progress with Views integration. Great stuff.

The main thing holding back View integration was smarter change propagation calculation. Views can be based on many entity types, and I couldn’t provide scalable support for additional ones without this code landing first.