"CVS Instructions" tab now available for all Drupal.org projects

[img_assist|nid=145|title=”CVS Instructions” tab on the Author Taxonomy module|desc=|link=popup|align=right|width=248|height=300]

Drupal’s CVS is now more user-friendly!

As part of the Documentation Sprint at Drupalcon DC 2009, web chef David Strauss built a “CVS Instructions” tab for Drupal.org. The tab provides concise, step-by-step instructions on how to check out, commit, patch, tag, and branch any module or theme. A simple drop-down box at the top of the page allows the user to select the version of the module or theme they want to work with, and the instructions are updated to display exact, copy-and-pastable commands.

Here’s an example of the CVS Instructions tab for the 6.x-1.x branch of Author Taxonomy.

This tiny feature represents a huge step forward in Drupal’s approach to opening its doors to contributors of all skill levels. This is especially important for those designers among us, many of whom cannot contribute themes — something Drupal sorely lacks — because they do not understand Drupal’s arcane CVS or command-line interaction in general. At Drupalcon DC, we designers pledged to learn and use the developer-oriented tools used by the Drupal community — namely IRC and CVS. Simple steps like the “CVS Instructions” tab move mountains for those who would otherwise give up and not contribute anything at all.

The hidden costs of proprietary software: #1 optimizing around licensing

Articles abound about the “hidden costs” of using free, open-source software. Many of them are sponsored by companies with a stake in their own proprietary solutions — and they’re responding to the threat of increasing enthusiasm about free alternatives. Some of the claims are legitimate; others are FUD.

Here at Four Kitchens, we’re on the opposite side. We advocate using free software like Drupal (and our own free-software derivative, Pressflow) whenever possible. When it’s not immediately possible, it’s a hard decision between writing a free solution and going proprietary. We enjoy the freedom of free software for many reasons, especially because it doesn’t feel like we’re fighting the company behind the software in order to get the most out of it.

The problem with proprietary licensing

Proprietary software usually requires licensing fees; this is hardly a “hidden cost.” The hidden cost is identifying the optimal licenses and developing around them. Technical people suddenly get involved in the business decision of which licenses the organization needs to buy. Business people constrain the technical team by deciding which licenses they will approve. Both are operating outside their areas of expertise, something that should be minimized.

Whether it be differences between Oracle editions, Lotus’s processor value unit formulas, or what restrictions IBM puts on its IFL cards, they’re all technically arbitrary restrictions designed to maximize price discrimination. In other words, the vendors want to make different customers each pay as much as possible for the same basic product.

Companies are even up-front that these licensing schemes are wholly for setting price tiers:

The attractively priced IFL processor enables you to purchase additional processing capacity exclusively for Linux workloads, without affecting the MSU rating or the IBM System z™ model designation. This means that an IFL will not increase charges for System z software running on general purpose (standard) processors in the server.

For readers unfamiliar with the System z IFL scheme, IFLs are identical except for the microcode restrictions IBM installs to prevent running an unapproved “workload” on a restricted IFL. So, instead of simply buying the number of processors the server needs, systems architects have to choose how many Java cards, Linux cards, and general-purpose cards will go into their mainframe, despite them all being physically identical.

The cost to the team is engineering around these arbitrary restrictions, restrictions that may be vaguely defined, as in some editions of Microsoft SQL Server (.doc), where “…performance degrades when more than five Transact-SQL batches are executed concurrently.”

As a developer and systems architect, I don’t like designing systems around restrictions artificially imposed by sales and marketing teams, especially when they’re vague. I’d much rather design around real restrictions and spend the rest of the time building great software. Don’t systems architects have enough to worry about?

What makes Pressflow scale: #1 faster core queries

Drupal has a number of queries with unfortunate scalability profiles.

URL alias counting (one instance in core)

The biggest offender in Drupal 5 and Drupal 6 is the query counting the number of URL aliases: SELECT COUNT(*) FROM url_alias. This query dates back to when nearly every Drupal site ran on MyISAM, which is important because MyISAM keeps an exact count of the number of rows in every table, making SELECT COUNT(*) FROM [table] an O(1) (read: fast, constant-time) operation.

InnoDB, the engine of choice for high-scale Drupal sites, does not keep an exact row count for tables because its multiversion concurrency control (MVCC) makes such a count difficult and inefficient. MySQL with InnoDB still faithfully runs the query, but it does so by counting every row in the table, an O(n) operation whose cost grows in proportion to the number of URL aliases on the site.

Such counting is particularly unfortunate because the URL alias system only cares whether the number is zero or the number is greater than zero. In Pressflow 5 and Pressflow 6, we replace this query with SELECT pid FROM url_alias LIMIT 1 (or equivalent), giving us just the information we need (“Is there at least one alias?”) in a way that runs in O(1) on both MyISAM and InnoDB.
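The difference is easy to sketch with SQLite standing in for the database (the table and column names mirror Drupal’s url_alias schema, but this is an illustrative sketch, not Pressflow code):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE url_alias (pid INTEGER PRIMARY KEY, src TEXT, dst TEXT)")
db.executemany(
    "INSERT INTO url_alias (src, dst) VALUES (?, ?)",
    [("node/%d" % i, "alias-%d" % i) for i in range(1000)],
)

# Drupal-style: on MVCC engines like InnoDB, this touches every row, O(n).
(count,) = db.execute("SELECT COUNT(*) FROM url_alias").fetchone()

# Pressflow-style: stops after the first row, O(1) on any engine.
has_alias = db.execute("SELECT pid FROM url_alias LIMIT 1").fetchone() is not None

print(count, has_alias)  # 1000 True
```

Both queries answer the only question the alias system asks (“Is there at least one alias?”), but the second one never pays for rows it doesn’t need.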

Use of LOWER() for case-insensitivity (many places in core)

Drupal 5, 6, and 7 all currently use LOWER() on both sides of some queries to create database-agnostic, case-insensitive string matches. Applying LOWER() to a table column prior to comparison automatically degrades those queries to O(n) with respect to the number of users on the site. The more users a site has, the more time login and other frequent user operations take.

The reason Drupal 5 and 6 use LOWER() is because PostgreSQL’s LIKE operation performs case-sensitive comparisons. (And PostgreSQL’s ILIKE operation is not cross-platform.) By using LOWER(), the same query can run on MySQL and PostgreSQL without modification.

But Pressflow 5 and Pressflow 6 only explicitly support MySQL, so they can take advantage of MySQL’s case-insensitive collations and seamlessly drop the LOWER(). Dropping LOWER() results in user lookups happening in O(log(n)) time, which is very fast for even the largest sites.
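SQLite can illustrate why the index stops being used. Here the NOCASE collation stands in for MySQL’s case-insensitive collations, and the users table is a hypothetical stand-in for Drupal’s, but the query-plan behavior is the same idea:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE INDEX users_name ON users (name COLLATE NOCASE)")

# Drupal-style: wrapping the column in LOWER() hides it from the index,
# so the planner falls back to scanning the whole table.
lower_plan = str(db.execute(
    "EXPLAIN QUERY PLAN SELECT uid FROM users WHERE LOWER(name) = LOWER(?)",
    ("Alice",)).fetchall())

# Pressflow-style: a case-insensitive collation keeps the index usable,
# so the planner does an indexed search instead.
collate_plan = str(db.execute(
    "EXPLAIN QUERY PLAN SELECT uid FROM users WHERE name = ? COLLATE NOCASE",
    ("Alice",)).fetchall())

print("SCAN" in lower_plan, "SEARCH" in collate_plan)  # True True
```

The same lookup goes from a full scan (O(n)) to an index search (O(log n)) just by moving the case-insensitivity from the query into the collation.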

Real results from the Materialized View API

I’ve faced a lot of skepticism (rightfully so) over my Materialized View module, which I’m pushing for inclusion in Drupal 7 as a solution to the overhead of table-per-field storage in Field API, as well as many other scalability issues.

I could have responded with contrived benchmarks, but I wanted real results. With Drupal 7 far from release and large projects on Drupal 7 even farther from release, I decided to rewrite MV for Drupal 6 and install it on Drupal.org to replace some of the worst queries.

I used MV-based tables to rewrite #1 and #4 of Drupal.org’s slowest, most common queries. This required the creation (and indexing) of two materialized view tables. The rewritten queries went live on Saturday at around 19:00 UTC.

The drop in load is visible at 19:00 on Cacti graphs from Drupal.org’s DB2:



Temporary disk tables, one of the worst causes of scalability issues, showed a small drop:


The execution plans of SELECT queries also improved. “Read next,” an indicator of table-scan behavior, dropped significantly following the MV switch-over:


And a commenter below requested the standard “load average” stats from the server:



If anything, the results are pessimistic because MV indexed data (a one-time process) from 19:00 UTC through Sunday morning. You can see the effect of MV indexing on load by comparing Sat 12:00-19:00 and Sun 04:00-05:00, where MV was not indexing, to the surrounding times, where MV was indexing.

This indexing load is visible on the “Volatile Queries” graph:


Weekday vs. Weekday (scale change)

The most interesting graphs, in my opinion, are some Friday versus Monday ones:



David Strauss elected as a Permanent Member of the Drupal Association

Congratulations to David Strauss, Four Kitchens co-founder and Drupal scalability guru, who was elected yesterday as a Permanent Member of the Drupal Association.

David’s goals as a member focus largely on improvements to infrastructure, community-building, and reaching out to other open-source projects. Details can be found in his application:

What are the primary goals you would like to work on?

I would like to advance the infrastructure for development and sprints by working with the community to drive development and deployment of next-generation (read: not CVS) tools, both for issue tracking and version control. I would like to participate in discussions surrounding the membership software for Drupal Association membership, including CiviCRM (the current tool) and alternatives. I would like to work with major free culture and free software organizations to establish partnerships.

What strategy will you employ in order to accomplish said goals?

I would like to host a series of official online meetings to discuss options for development infrastructure and prepare an Association-endorsed roadmap. For membership management, I would use my experience implementing and managing CiviCRM and other non-profit-focused tools to inform and guide the Association’s decisions. For partnership building, I would draw on my relationships with other projects and developers to start productive discussions about shared goals and their projects’ stake in Drupal’s success.

Read more »

Congratulations are also in order for the other elected Permanent Members and the new Board Members, all of whom are donating huge amounts of time towards promoting and supporting the Drupal project.

Developer preview of Materialized Views

I’ve posted a developer preview of Materialized Views for Drupal 6. I’d like ambitious Drupal developers to try it out so I can get feedback on the developer experience.

From sites/all/modules:
bzr branch bzr://vcs.fourkitchens.com/drupal/modules/materialized_view/6 materialized_view (all on one line)

To get them running:

  1. Install all three MV-related modules.
  2. Run cron.

This will create the mv_forum_topic materialized view, which is populated and indexed for fast Forum module topic listings.
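The idea behind such a table can be sketched in a few lines. This is SQLite with made-up table and column names for illustration, not the module’s actual schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE comments (cid INTEGER PRIMARY KEY AUTOINCREMENT, nid INTEGER);
-- The "materialized view": listing data precomputed and indexed for reads.
CREATE TABLE mv_topic (nid INTEGER PRIMARY KEY, title TEXT, comment_count INTEGER);
CREATE INDEX mv_topic_count ON mv_topic (comment_count);
""")

def add_comment(nid):
    # Each write also updates the materialized row, so topic listings
    # never need a join or an aggregate at read time.
    db.execute("INSERT INTO comments (nid) VALUES (?)", (nid,))
    db.execute(
        "UPDATE mv_topic SET comment_count = comment_count + 1 WHERE nid = ?",
        (nid,))

db.execute("INSERT INTO node VALUES (1, 'First topic')")
db.execute("INSERT INTO mv_topic VALUES (1, 'First topic', 0)")
add_comment(1)
add_comment(1)

print(db.execute("SELECT title, comment_count FROM mv_topic").fetchall())
# [('First topic', 2)]
```

The trade is the classic one: slightly more work on every write in exchange for listing queries that read straight off an indexed table.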

Specific developer experience areas I’d like feedback on:

  • Creating new materialized views.
  • Creating new data sources.

If you create or modify MVs, run cron to generate the tables and index the data.

A Bazaar branch of Drupal HEAD with all history

I originally created a Bazaar branch of Drupal HEAD with hourly snapshots of upstream updates (commits to CVS HEAD) to streamline work on patches to Drupal HEAD.

This snapshot method had a few advantages over using Launchpad’s Drupal branch:

  • Because of the shallow history, branching from the Four Kitchens server was relatively fast compared to branching from Launchpad.
  • The Four Kitchens server has more reserve bandwidth and capacity, making initial branching and updates faster.
  • Launchpad goes down more often than the Four Kitchens server.

The snapshot-based branch also had a big disadvantage: it didn’t keep a true commit-by-commit history that mapped to CVS HEAD commits, making it hard to understand ongoing changes. After some discussions with chx, I’ve decided to combine the best of both worlds: a mirror of Launchpad’s Drupal HEAD branch on the Four Kitchens server with instructions (provided below) to avoid the trouble of pulling thousands of revisions every time you branch.

Why not branch directly from Launchpad? Aside from occasional Launchpad reliability issues, downloading all Drupal revisions takes 2m37s from Launchpad versus 1m15s from Four Kitchens (both tested from my home connection).

I will also continue to maintain the old snapshot-based branch until Drupal 7 is released.

Method 0: Plain, old branching

  • Pros
    • Runs with older Bazaar versions
    • Performs all post-branch operations locally with no network access
    • Easy to set up
  • Cons
    • Downloads all upstream revisions each time you branch (takes 1m25s)
    • Stores all upstream revisions each time you branch (51M per branch)

Run this: bzr branch bzr://vcs.fourkitchens.com/drupal/7-all-history

Method 1: Stacked branches

  • Pros
    • Fast initial branching: only downloads basic branch data (takes 42s)
    • Most space efficient: only stores basic branch data (6.6M total)
    • Easy to set up
  • Cons
    • Requires Bazaar 1.6 or later
    • Requires internet access to perform most history-based operations

Run this: bzr branch --stacked bzr://vcs.fourkitchens.com/drupal/7-all-history

Method 2: Shared branch storage

  • Pros
    • Runs with older Bazaar versions
    • Downloads upstream revisions once for all branches
    • Stores upstream revisions once for all branches (51M total)
    • Performs all post-branch operations locally with no network access
  • Cons
    • Still stores upstream revisions once
    • Still downloads all upstream revisions once
    • You have to run a few more commands to create your first local branch

From a new directory that will be a parent of your future branch directories, run this:
bzr init-repository .

From within the parent directory you just created, run this to create your local branch:
bzr branch bzr://vcs.fourkitchens.com/drupal/7-all-history

Using your branch

Presumably, you’ve gone through all this to work on patches for Drupal HEAD. I’ve updated my earlier instructions for how to use your new branch.

Enforcing branch commit atomicity (or, why the git staging area is bad)

With CVS, one of the only repository-wide atomic operations is tagging a local checkout. And not all that long ago, Subversion introduced mainstream users of free, open-source version control systems to full-scale atomicity. Or, at least the ability to be atomic.

Subversion’s approach to atomicity is rooted in its centralization and hybrid branch/directory model. Because Subversion makes it hard to merge from other repositories, there’s a strong incentive to combine many projects and branches into one repository. Subversion therefore offers directory-level checkouts as well as convenient, repository-wide checkouts for developers working on multiple projects. To fit this project and branching model, Subversion performs most operations at the level of the current directory and below. (The only alternative would be performing the operations checkout-wide, which would cause behavior confusingly dependent on the choice of checkout root.)

Subversion’s model operates atomically if users run the commands from the root of a project or branch, but the this-directory-and-below model encourages bad development behavior. For example, when working on a Drupal project, it’s easy to commit just the changes for one module or theme, which creates a revision in the repository that may never have existed as a working copy and may not work. Administrators can mitigate the problem with repository-side continuous integration (CI), but even CI still doesn’t guarantee true project coherence and atomicity.

Bazaar, on the other hand, performs its operations (at least by default) on the entire branch, encouraging real atomic commits. This default, branch-level behavior tends to annoy developers used to separating a project’s changes into different directories of a Subversion checkout and checking in by directory. This Subversion-based workflow works reasonably well in practice, and for systems like Bazaar and git to enforce high levels of atomicity and remain usable, they must provide convenient tools to separate the changes intended for each commit.

Bazaar and git take different approaches to providing such tools. Bazaar has shelve and unshelve. git has the staging area.

The most obvious way they differ is in workflow. Bazaar’s commands are optional and remove changes from the upcoming commit. git’s commands are mandatory and add changes to the upcoming commit. Here, git’s choice seems very sensible. Encouraging manual approval of each change in each commit reduces mistakes. On the other hand, Bazaar provides a convenient uncommit command that allows reversal of erroneous commits (which are often only obvious when seeing the file list as the commit is happening). All considered, I slightly prefer git’s workflow here.

Where git fails is in the theoretical foundations of the staging area. The staging area encourages the same bad behavior as in Subversion, just with more surgical control of what gets committed. Committing in git with only some changes added to the staging area still results in an “atomic” revision that may never have existed as a working copy and may not work.

Along these lines, one of the most atomicity-busting aspects of git’s staging area is that it doesn’t just mark code that needs to go into the next commit; it actually saves the hunk into the staging index. So, a developer could add code to the staging area, then modify her working copy, and end up with a commit containing code that’s neither in her working copy nor in her stash. The code lives only in the commit just made, silently filed away for someone else to get in their next merge:

This command [add] can be performed multiple times before a commit. It only adds the content of the specified file(s) at the time the add command is run; if you want subsequent changes included in the next commit, then you must run git add again to add the new content to the index.

In contrast, shelving a change in Bazaar reverts the change in the working copy. (It does save the change for later restoration with unshelve.) Because shelved changes are not in the working copy, Bazaar encourages the ultimate in atomicity: what a developer commits represents an atomic snapshot of the entire branch as represented by her working copy. And if tests pass when she commits, the same tests will pass if another developer pulls the same revision of the same branch.
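The snapshot-at-add behavior is easy to reproduce. This sketch drives git from Python purely for illustration and assumes git is installed on the system:

```python
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    # Run a git command inside the throwaway repository.
    return subprocess.run(("git",) + args, cwd=repo, check=True,
                          capture_output=True, text=True).stdout

git("init", "-q")
git("config", "user.email", "dev@example.com")
git("config", "user.name", "Dev")

path = os.path.join(repo, "file.txt")
with open(path, "w") as f:
    f.write("version 1\n")
git("add", "file.txt")        # the index snapshots "version 1" right now

with open(path, "w") as f:
    f.write("version 2\n")    # the working copy moves on
git("commit", "-q", "-m", "snapshot demo")

committed = git("show", "HEAD:file.txt").strip()
working = open(path).read().strip()
print(committed, "/", working)  # version 1 / version 2
```

The commit contains “version 1,” which at commit time existed nowhere in the working copy: exactly the revision-that-never-existed problem described above.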

DrupalCon DC swag is here!

Here’s a taste of what you’ll be getting at DrupalCon DC.

[img_assist|nid=105|title=DrupalCon DC swag|desc=|link=none|align=center|width=600|height=452]

The stickers and button were designed by Four Kitchens. Development Seed very graciously printed and paid for the buttons. Thanks, Development Seed!

The sticker in the lower left is based on a DrupalCon T-shirt design we submitted. (If you like it, vote for it! And our other designs, too…)

The button is a more polished version of a DrupalCon “campaign button” design we released last month under the GPLv3.

[img_assist|nid=106|title=Todd’s bag o’ buttons|desc=|link=none|align=center|width=600|height=400]

Here’s my bag wearing a series of hip buttons and a very classy Oktoberfest scarf. It’s probably the best scarf you will ever see — and you’ll see it at DrupalCon DC!

Creating common branch ancestry is a hard problem

One of the key features of distributed version control systems (DVCS) is support for divergent development (branching) and then merging. Most DVCS tools, including git and Bazaar, include rather elegant support for such workflows by embedding metadata about common ancestry into branches. In this post, I’ll be focusing on Bazaar.

“Common ancestry” means an identical revision shared by two branches. The most recent common ancestor typically indicates the point of branching, unless there has been a more recent merge. A successful merge between two branches establishes a new, more recent common ancestor. Identifying a common ancestor is a required step for performing automatic three-way merges to integrate changes from a foreign, divergent branch.

Typically, the branching metadata allows Bazaar to automatically determine the most recent common ancestor to use as the base revision for three-way merging. The problem comes when you need to merge two branches that do not share any common ancestors. (For the purpose of this post, I am not counting revision zero, the universal common ancestor, as a common ancestor. It is generally useless for merging.)

Creating common ancestry
If you try to merge two branches without common ancestry and without any revision identifiers, Bazaar will complain and do nothing. You can specify a revision range (like -r5..-1, which would be everything from the fifth revision to the latest on the foreign branch) and then apply the merge. Bazaar will then merge the changes in the specified revision range into your local branch and establish the last merged revision as the latest common ancestor, making future merges a breeze. Unfortunately, finishing this initial merge is where dragons lie. But before I can get into the difficulty of creating common ancestry by merging two unrelated branches, I have to briefly discuss how Bazaar handles files.

How Bazaar tracks files
Bazaar maps file paths to globally unique file IDs, and two branches without prior common ancestry will have different file IDs for the same paths, even if the files are really the same. Every time a file is added to a Bazaar branch, it gets a unique ID. As files are moved and renamed, they keep their unique IDs. So, if two people download Drupal, extract it, and independently “bzr init”, “bzr add”, and “bzr commit”, their respective README.txt files (for example) will have different file IDs.

Creating common ancestry (continued)
These globally unique file IDs cause trouble when merging from unrelated branches. When Bazaar merges two branches (related or not), two files with the same path but different file IDs create a conflict even if they contain identical content. Merging two unrelated branches with lots of shared files creates a mess of conflict files and conflict directories, and there is currently no convenient way to resolve these conflicts.

A concrete example with Drupal
A developer downloads Drupal and puts the project under version control in a fresh Bazaar branch. She discovers that Four Kitchens maintains a Bazaar branch of stable Drupal releases and decides she would like to use the Four Kitchens branch to automate installation of minor Drupal updates. She attempts to merge in the revision range from the Four Kitchens branch that would perform the upgrade. Because the file IDs in the Four Kitchens branch differ from hers, every Drupal core file and directory has a conflict despite having identical content for the base merge revision. After a bit of tedious conflict cleanup, she commits the merge and goes back to enjoying Bazaar’s generally elegant architecture.

I thought this blog post was going to give me a solution!
Nope, there’s not one out there yet. I’m currently looking at writing a custom merge handler (subclassed from the standard merge3 in Bazaar) that would intelligently handle merges where file paths do represent the same files, regardless of file IDs. Unfortunately, the file ID/path conflict is low-level in Bazaar and occurs before reaching the most modular part of merge conflict resolution.

Thanks to Robert Collins on the Bazaar project for walking me through the Bazaar internals necessary for me to explain this issue and, hopefully, solve it.