Real results from the Materialized View API

David Strauss | February 23, 2009 | Materialized Views, Drupal, Performance, Scalability

I’ve faced a lot of skepticism (rightfully so) over my Materialized View module, which I’m pushing for inclusion in Drupal 7 as a solution to the overhead of table-per-field storage in Field API, as well as many other scalability issues.

I could have responded with contrived benchmarks, but I wanted real results. With Drupal 7 far from release and large projects on Drupal 7 even farther from release, I decided to rewrite MV for Drupal 6 and install it on Drupal.org to replace some of the worst queries.

I used MV-based tables to rewrite #1 and #4 of Drupal.org’s slowest, most common queries. This required the creation (and indexing) of two materialized view tables. The rewritten queries went live on Saturday at around 19:00 UTC.

The drop in load is visible at 19:00 on Cacti graphs from Drupal.org’s DB2:

Temporary disk tables, one of the worst causes of scalability issues, showed a small drop:

The execution plans of SELECT queries also improved. “Read next,” an indicator of table-scan behavior, dropped significantly following the MV switch-over:

And a commenter below requested the standard “load average” stats from the server:

If anything, the results are pessimistic because MV indexed data (a one-time process) from 19:00 UTC through Sunday morning. You can see the effect of MV indexing on load by comparing Sat 12:00-19:00 and Sun 04:00 - 05:00, where MV was not indexing, to surrounding times, where MV was indexing.

This indexing load is visible on the “Volatile Queries” graph:

Weekday vs. Weekday (scale change)

The most interesting graphs, in my opinion, are some Friday versus Monday ones:

Commenting on this Blog post is closed.

Comments

Maybe I’m missing it, but are there any stats for actual server load?

In response to your request, I’ve added conventional load and CPU average graphs to the post.

Brilliant, David! Looks good so far. Once we have a week over week average to look at, we can start to feel real comfortable. :-)

As one of the people who’s been very skeptical of Materialized Views, though, I want to clarify where my concerns lie. Turning a given select query into a materialized view will absolutely, positively result in a performance boost. That’s basic logic: if you have a complex set of joins, it’s going to be faster to blast data to a flattened table on insert, and select from the denormalized table. Drupal.org and fastcompany.com are proof.

My concern lies in the unpredictable nature of the “long tail” of Drupal sites: will they be sorting nodes by created date and filtering based on username? Will they be filtering by taxonomy and sorting by VotingAPI score? Will they be filtering on two random CCK fields, and sorting a third-party module’s custom fields? The answers to those questions must be known before a materialized view can be created, and the answer must be consistent across all of a site’s content listings unless they want to maintain many separate materialized views. Until those questions can be answered, and answered conclusively, a materialized view is useless. Even worse, maintaining the Materialized view can make all writes slower without speeding up any reads.

For high-traffic sites that have a large pool of extremely knowledgeable volunteers — like Drupal.org — building custom materialized views is just one of the step in the site optimization process. For smaller bespoke sites, it’s far from straightforward. And that’s where the move to per-field-table-storage in Drupal 7 (which you mentioned specifically) concerns me. By definition, the introduction of numerous tables that must be joined ensures that carefully constructed materialized views will be necessary for acceptable performance. In other words, you’ve argued hard for a change in Drupal core that makes your own performance tweak of choice a must for acceptable site performance, even though few sites have the in-house resources necessary to build them effectively.

That means one of two things: either you will figure out how to turn your advice as a performance optimization consultant into a PHP algorithm and bundle it with Drupal 7… or anyone implementing a Drupal site with custom fields will have to come to you (or another consulting shop like Lullabot) when they need to scale. This worries me. For the vast majority of people who use Drupal — the long tail who build sites with it and even contribute modules — it will just be an additional barrier to entry.

Hi, Jeff, thanks for posting. It’s probably no surprise, but I disagree with your argument. Particularly, I think it relies on the (false) premise that per-table field storage leads to a mountain of unscalable queries that only a performance sherpa could lead users out of.

On the contrary, for the purpose of querying, we can consolidate data at least as well as we do now by using Barry’s storage code or materialized views. Implying that only an expert could return SELECT performance of the system to the current levels is misleading.

These are the possibilities I’m aware of (which are not mutually exclusive):

  • Consolidating data as we do now (in per-bundle tables) is adequate for querying, and Barry’s code or MVs can provide the same facilities (despite some extra updating overhead) without any even the site user’s awareness, let alone custom expert work.
  • Consolidating data as we do now is adequate, but the performance overhead of running Barry’s code or MVs is too great to use as the primary solution.
  • Consolidating data as we do now is adequate, but we choose not to use Barry’s code or MVs because we object to storing the data in multiple places.
  • Consolidating data as we do now is inadequate for performance reasons, and we shouldn’t be treating it as an acceptable solution.
  • Consolidating data as we do now is inadequate because it shuffles the data around in a way we don’t like, and we therefore shouldn’t consider it as a model for the new system.

This possibility does not exist, despite what you seem to suggest in your post:

  • Consolidating data as we do now (in per-bundle tables) is adequate, and there’s no way Barry’s code or MVs can provide similar capabilities to run scalable SELECT queries without expert help.

Given the range of actual possibilities, here’s where I think this debate should be:

  • Is it better to refactor canonical storage or have redundancy?
  • What is the cost of saving to multiple tables instead of one when saving a single object?
  • What is the overall (caching considered) increased cost of reading from multiple tables to load a single object?
  • How does the solution affect code complexity, both in core and for other developers?
  • What are the solution’s characteristics for deployment and change on large and small sites?

David,

A couple of points. First, an editing error chopped off the first line of my post, which was a well-deserved congratulations on the work optimizing Drupal.org. The site handles staggering levels of traffic given the hardware that it’s running on, and your work has been invaluable in making that happen!

Now, onto the code winking and the nerdfight portion of my post. ;)

I think it relies on the (false) premise that per-table field storage leads to a mountain of unscalable queries that only a performance sherpa could lead users out of.

I’m actually relying on your own analysis of Drupal, David. You’ve said many times in the past that Drupal’s scattering of important data across disparate tables (node, node_revisions, node_content_[node-type] and so on) is the cause of unacceptable scaling problems on high traffic sites. Per-table field storage multiplies that problem by turning every single field on every single node into another one of those unindexable silos of data.

From the moment that the decision was announced, numerous individuals in the Drupal community (myself included) have expressed grave concerns about the performance implications. You’ve said that Materialized Views would solve all of the problems, and that they would be easy — even transparent — for site builders to use.

You’re a huge proponent of Materialized Views — I get it! And you’ve demonstrated great results optimizing specific queries with them on large, high-traffic sites with key hot-spot queries! That’s great. But the decision to move to per-field tables for Fields in D7 means that your optimization technique of choice will be required for a much larger number of sites to achieve acceptable performance. And right now, Materialized Views for Drupal do not exist in a form that is usable by anyone save a handful of experienced developers.

* Is it better to refactor canonical storage or have redundancy?

An excellent question. Clearly, we have different opinions on this matter. My position has been, and is, that native storage of custom fields in per-entity tables — the way we do currently in CCK for Drupal 6 — is the better alternative. There are trade-offs associated with it, too (to avoid error-prone mass-migration code, for example, we would have to force people to choose field cardinality at creation rather than after data has been populated.)

* What is the overall (caching considered) increased cost of reading from multiple tables to load a single object?

I’m less concerned about single-object-loads than the impact of complex queries like Views. As you hint, object caching means that loading a given object is easy to optimize no matter how many tables the underlying data is stored in. But as you’ve said many times in the past, that is not where Drupal’s worst bottlenecks lie. Comple cross-table queries, like those often built against CCK tables with Views, are what kills us. That’s where Materialized Views can help, but per-field table storage makes the problem much, much worse and isn’t required for MV. Let’s keep the ‘somewhat efficient’ storage model and optionally layer MV on top of it.

* How does the solution affect code complexity, both in core and for other developers?

If Materialized Views become a requirement for scaling a Drupal site — and they are not, currently — the impact will be considerable.

* What are the solution’s characteristics for deployment and change on large and small sites?

That’s an excellent question. Right now, because Materialized Views for Drupal does not exist in any state that can be tested or experimented with, there’s no answer to that question. It is as awesome or as terrible as one imagines.

Per-entity native storage would not preclude the use of Materialized Views! It would just ensure that we don’t step backwards from today’s performance while promising that MVs will solve all of our problems.

@Jeff,

Now that you have nicely stated your objection, may I suggest that your help is needed to (hopefully) find a happy resolution. I say ‘hopefully’ because we don’t yet know where this MV adventure will end. I think everyone wants it to end up in the hands of small Drupal sites, like you do.

I think you know the Views API as well as anyone after Earl. David has basically begged for someone to do the Views integration for its fields/filters/sorts. Would you be able to design such an integration? Would you be available to implement it? Baby steps like these are the only way to get to the outcome we all want (auto created MV based on site specific Views).

As best as I can tell, this would entail porting views to Drupal 7, implementing CCK integration for it, then implementing Oracle 9i’s dynamic query rewrites to database-agnostic PHP.

Frankly, I would be much easier to simply implement canonical per-entity storage and code to migrate fields when cardinality changes.

As best as I can tell, this would entail porting views to Drupal 7…

The current MV module is for Drupal 6 and is publicly downloadable from my recent blog post and the Four Kitchens Bazaar repository.

…implementing CCK integration for it…

This would be a single class extending MVColumn. I’m willing to write this part.

…then implementing Oracle 9i’s dynamic query rewrites to database-agnostic PHP.

The final step is replace the query generated by Views with the MV-based one. I don’t think any complex transformation is necessary.

Frankly, I would be much easier to simply implement canonical per-entity storage and code to migrate fields when cardinality changes.

This is not a solution to the scalability problem MV is designed to solve.

The field storage debate is over because we now have implementations for all the possibilities anyone is arguing for:

  • Field API in core uses per-field tables.
  • Per-bundle storage module provides hybrid per-field and per-bundle storage.
  • Materialized Views provides denormalized storage across field and non-field tables for sites where that makes sense.

I suppose there is still room to discuss whether per-bundle storage should be in core and on by default, but in any case the code is working.

So let’s stop having the debate. It’s over, and everyone is happy.