Update from the Drupal 7 Contributed Modules Sprint

[img_assist|nid=213|title=|desc=The Vancouver Planetarium. Photo by Qole Pejorian.|link=none|align=right|width=300|height=222]

chx and I gathered last week in Vancouver’s West End for a two-person performance sprint during the final code slush days, allowing us to finish several key improvements to Drupal’s database layer. Right afterward, many more people joined us for another sprint to port key modules to Drupal 7. People worked in-person, voluntarily over IRC, and involuntarily over IRC (lost passport).

I can say — without reservation — that our work was successful. We kicked off the weekend with Drupal 6 versions of Coder and Views. (Though there had been a touch of prior work on the Views port to Drupal 7’s new database layer.)

We ended the weekend with usable releases of both modules. The Coder release is already posted to its project page. Views work is ongoing, and I’m posting fresh tarballs throughout the day on the Four Kitchens server as work continues.

Help testing is great! Please post bugs for Coder back to its project and bugs for the Views port to the #D7CX sprint’s “Views for Drupal 7” Launchpad project.

We owe thanks to NowPublic for hosting the sprint in their downtown Vancouver offices. We collaborated over Bazaar version-control branches on Canonical’s free (as in beer and speech) Launchpad service.

Alternatives to rebasing in Bazaar

A discussion recently arose on the Bazaar mailing list asking, “Why isn’t rebase support in core?” Rebase support is currently packaged as a plugin. This plugin is widely distributed, even in the standard Mac OS X installation bundle.

There are boring reasons that rebase support isn’t in core, like the lack of strong test coverage. More interesting are questions about the necessity of rebasing in typical workflows.

What is rebasing, and why should I care?

In large projects, there’s a mainline branch representing the current, global, coordinated development. In Drupal’s case, this is CVS HEAD. This mainline might not always be in perfect condition, but there’s a general sense that the mainline is not a sandbox for untested changes. Many changes are small enough that the developers simply work on and test a patch, but this workflow is inadequate for larger development projects like Fields in Core. Such large features require their own branch for development, a feature branch.

A feature branch allows development of a feature in isolation from the mainline but with the eventual intent of merging the changes back into the mainline. Because feature branches are created to foster long-term, divergent development from the mainline, it’s common for both feature development and mainline development to happen in parallel. This parallel development creates a problem: How do developers on the feature branch prepare for the eventual re-integration of their feature code into the mainline?

There are a few options:

  • Don’t sync changes. This option makes merging the feature back into the mainline painful. This option also defeats the purpose of developing and testing the feature in isolation because merging two tested (but divergent) branches often results in one broken (but converged) branch.
  • Merge the feature into the mainline before making any changes to the mainline and then re-branch for more feature work after making mainline changes. Merging an untested or incomplete feature into the mainline makes this option unattractive and impractical. This option is so silly, I only included it for completeness.
  • Periodically update the feature branch from the mainline. This is ideal because the feature branch continually answers the question “What if we merged this feature into the mainline?” and is ready for quick merging into the mainline without any disruption to mainline work.

The third option is the only practical one. But how should it work? What should the feature branch history look like after synching from the mainline?

Back to rebasing…

Rebasing integrates the updates to the mainline as ancestors to the changes on the feature branch. The commit history is reorganized (read: rebased) as if the feature branch were freshly created from the mainline and all work were done on top of that. There are many theoretical objections to rebasing, and I won’t rehash them here. There’s general consensus that rebasing is sort of icky.

I find that many rebase users use the tool because they’re not aware of better workflows. I’ll address each (supposed) reason to use rebase in its own section.

“I want to keep my feature branch updated from the mainline.”

The better choice is to run bzr merge [mainline] on the feature branch. This command will update the common ancestry between the feature and mainline branches so that the feature branch includes the latest changes from the mainline and is ready for smooth merging back into the mainline.

“I want to view only the revisions that make up the feature I’ve been working on.”

With a rebase, it’s reasonably clear which revisions constitute the feature work: they’re the top ones. But rebasing is not the best choice for reviewing this list. Run bzr missing --mine-only [mainline] from the feature branch, and Bazaar will output all the feature branch’s unique revisions without mangling the actual history (the way rebasing does).

“I want a human-readable summary of how merging the feature into the mainline will affect the code.”

For background, a rebase user would run a diff from the oldest feature-specific commit to the latest commit, but there’s a better way. Instead, run bzr diff --old=[mainline], and Bazaar will provide the net diff for merging the feature into the mainline. Now, don’t use this diff for anything but human review; you should still use bzr merge from the mainline to integrate the feature branch’s changes and preserve all history.

Creating a merge directive with bzr send provides an identical human-readable diff to the method above, but a merge directive also includes all the binary data Bazaar needs to perform a history-preserving merge.

“I want to maintain a patch set on top of the mainline.”

Rebasing commits is an ugly way to do this because you don’t retain your own history of work on each patch or the history of how rebasing has changed each patch. Bazaar has a plug-in called “Looms” that provides direct support for a much better patch set workflow. I’m a touch skeptical of Looms’ stability, so I just do what Looms does under the hood: maintain multiple branches, each derived (branched) from the one below. Each branch represents a patch. This method retains full, original history, including any changes I’ve made to the patches. When the mainline updates, I simply merge the mainline changes up through my patches.

“I want to clean up my commit history prior to submitting my changes to the mainline.”

Rebasing may group the feature commits, but it doesn’t make them coherent or pretty. It’s more effective to do the following:

  1. bzr merge [mainline]
  2. Use bzr diff --old=[mainline] on the feature branch to create a net diff.
  3. Get a fresh branch from the mainline.
  4. Apply the net diff as a patch.
  5. Shelve all changes.
  6. Work through unshelving the changes and committing them to create a coherent, pretty history.
  7. Create a merge directive using bzr send.
  8. Submit the merge directive.

“[Your reason here]”

I’d like to hear from users of any distributed version-control system why they use “rebase” in their workflows, even if their reason is one I’ve discussed above.

No-brainer shared branch storage on your workstation

This post is a follow-up to my post about the new Drupal HEAD Bazaar branch hosted at Four Kitchens. This post is written specifically for users of Unix-like systems. You can do the same thing on Windows, but you’ll have to adapt the directions a bit.

For me, my projects are code-based and, hence, mostly a collection of version-controlled branches. So, my “Projects” directory (located at ~/Projects) contains a smattering of branches organized into multi-level directory structures.

In my last post, I noted that shared branch storage is the key to fast branching, good offline access, and efficient disk usage. I even covered a quick way to get it rolling for Drupal branches. But it’s one thing to know how and another to know “best practices,” so I’ll share how I do things.

If you’re like me, you want minimal hassle and maximum flexibility for organizing all of your branches. Fortunately, Bazaar doesn’t just look at the next directory up to find a viable shared branch storage database.

Here’s what happens when you create a branch:

  1. Bazaar checks if the new branch is created using the --standalone option. If so, it doesn’t even worry about the possibility of shared storage, and it stores the branch right there. (I’ve included this option for completeness, and you should only worry about it if you need branch permission control, which is typically not an issue on a workstation.)
  2. Bazaar traverses up the directory tree looking for a viable shared storage location. If one is available, it stores the data there, taking advantage of shared revision history between branches.
  3. If all else fails (and this is the default case), Bazaar stores the data right there with the branch.

Based on rule two, all you need is shared storage somewhere in a writable parent directory. I like to put my branches in ~/Projects, so I create shared storage in ~/Projects by running bzr init-repository ~/Projects. Then, I’m set for any branches created in or below ~/Projects.

If you don’t like to centralize your branches one place, you can run bzr init-repository ~ and have shared storage for any branch, anywhere in your home directory. For those used to Subversion and CVS spewing directories everywhere, I’ll explicitly note that this only creates one directory at the specified shared-storage location.

Once you’ve set up shared storage, you don’t have to think about using it; Bazaar will manage it automatically and transparently. If you’re curious whether it’s working, you can run bzr info from within any branch and look for “shared repository” in the output.

If you want to move your branch
If your branch uses shared storage, it references revisions in the shared storage repository. You can’t move it to another computer using normal filesystem commands and still have it work. If you’d like to make the branch completely independent of its shared storage so you can move it anywhere, run bzr reconfigure --standalone from a directory within the branch. It will copy its revisions to local storage.

A better option for moving branches is to use Bazaar’s own bzr branch command to clone the branch and then delete the original. Then, you don’t have to worry about the possibility of shared storage.

Theory note: This is Bazaar’s version of “cheap branching.” Creating additional branches with shared storage only takes the additional disk space required to create the branch’s new working tree (and even that additional space is only necessary if you choose to create a working tree for the new branch).

A Bazaar branch of Drupal HEAD with all history

I originally created a Bazaar branch of Drupal HEAD with hourly snapshots of upstream updates (commits to CVS HEAD) to streamline work on patches to Drupal HEAD.

This snapshot method had a few advantages over using Launchpad’s Drupal branch:

  • Because of the shallow history, branching from the Four Kitchens server was relatively fast compared to branching from Launchpad.
  • The Four Kitchens server has more reserve bandwidth and capacity, making initial branching and updates faster.
  • Launchpad goes down more often than the Four Kitchens server.

The snapshot-based branch also had a big disadvantage: it didn’t keep a true commit-by-commit history that mapped to CVS HEAD commits, making it hard to understand ongoing changes. After some discussions with chx, I’ve decided to combine the best of both worlds: a mirror of Launchpad’s Drupal HEAD branch on the Four Kitchens server with instructions (provided below) to avoid the trouble of pulling thousands of revisions every time you branch.

Why not branch directly from Launchpad? Aside from occasional Launchpad reliability issues, downloading all Drupal revisions takes takes 2m37s from Launchpad versus 1m15s from Four Kitchens (both tested from my home connection).

I will also continue to maintain the old snapshot-based branch until Drupal 7 is released.

Method 0: Plain, old branching

  • Pros
    • Runs with older Bazaar versions
    • Performs all post-branch operations locally with no network access
    • Easy to set up
  • Cons
    • Downloads all upstream revisions each time you branch (takes 1m25s)
    • Stores all upstream revisions each time you branch (takes 51M each branch)

Run this: bzr branch bzr://

Method 1: Stacked branches

  • Pros
    • Fast initial branching: only downloads basic branch data (takes 42s)
    • Most space efficient: only stores basic branch data (takes 6.6M total)
    • Easy to set up
  • Cons
    • Requires Bazaar 1.6 or later
    • Requires internet access to perform most history-based operations

Run this: bzr branch --stacked bzr://

Method 2: Shared branch storage

  • Pros
    • Runs with older Bazaar versions
    • Downloads upstream revisions once for all branches
    • Stores upstream revisions once for all branches (takes 51M once)
    • Performs all post-branch operations locally with no network access
  • Cons
    • Still stores upstream revisions once
    • Still downloads all upstream revisions once
    • You have to run a few more commands to create your first local branch

From a new directory that will be a parent of your future branch directories, run this:
bzr init-repository .

From within the parent directory you just created, run this to create your local branch:
bzr branch bzr://

Using your branch

Presumably, you’ve gone through all this to work on patches for Drupal HEAD. I’ve updated my earlier instructions for how to use your new branch.

Enforcing branch commit atomicity (or, why the git staging area is bad)

With CVS, one of the only repository-wide atomic operations is tagging a local checkout. And not all that long ago, Subversion introduced mainstream users of free, open-source version control systems to full-scale atomicity. Or, at least the ability to be atomic.

Subversion’s approach to atomicity is rooted in its centralization and hybrid branch/directory model. Because Subversion makes it hard to merge from other repositories, there’s a strong incentive to combine many projects and branches into one repository. Subversion therefore offers directory-level checkouts as well as convenient, repository-wide checkouts for developers working on multiple projects. To fit this project and branching model, Subversion performs most operations at the level of the current directory and below. (The only alternative would be performing the operations checkout-wide, which would cause behavior confusingly dependent on the choice of checkout root.)

Subversion’s model operates atomically if users run the commands from the root of a project or branch, but the this-directory-and-below model encourages bad development behavior. For example, when working on a Drupal project, it’s easy to commit just the changes for one module or theme, which creates a revision in the repository that may never have existed as a working copy and may not work. Administrators can mitigate the problem with repository-side continuous integration (CI), but even CI still doesn’t guarantee true project coherence and atomicity.

Bazaar, on the other hand, performs its operations (at least by default) on the entire branch, encouraging real atomic commits. This default, branch-level behavior tends to annoy developers used to separating a project’s changes into different directories of a Subversion checkout and checking in by directory. This Subversion-based workflow works reasonably well in practice, and for systems like Bazaar and git to enforce high levels of atomicity and remain usable, they must provide convenient tools to separate the changes intended for each commit.

Bazaar and git have different approaches providing such tools. Bazaar has shelve and unshelve. git has the staging area.

The most obvious way they differ is in workflow. Bazaar’s commands are optional and remove changes from the upcoming commit. git’s commands are mandatory and add changes to the upcoming commit. Here, git’s choice seems very sensible. Encouraging manual approval of each change in each commit reduces mistakes. On the other hand, Bazaar provides a convenient uncommit command that allows reversal of erroneous commits (which are often only obvious when seeing the file list as the commit is happening). All considered, I slightly prefer git’s workflow here.

Where git fails is in the theoretical foundations of the staging area. The staging area encourages the same bad behavior as in Subversion, just with more surgical control of what gets committed. Committing in git with only some changes added to the staging area still results in an “atomic” revision that may never have existed as a working copy and may not work.

Along these lines, one of the most atomicity-busting aspects of git’s staging area is that it doesn’t just mark code that needs to go into the next commit; it actually saves the hunk into the staging index. So, a developer could add code to the staging area, modify her working copy, and end up with a commit containing code that’s neither in her working copy nor in her stash. The code only ends up in the commit just made, silently filed away for someone else to get in their next merge:

This command [add] can be performed multiple times before a commit. It only adds the content of the specified file(s) at the time the add command is run; if you want subsequent changes included in the next commit, then you must run git add again to add the new content to the index.

In contrast, shelving a change in Bazaar reverts the change in the working copy. (It does save the change for later restoration with unshelve.) Because shelved changes are not in the working copy, Bazaar encourages the ultimate in atomicity: what a developer commits represents an atomic snapshot of the entire branch as represented by her working copy. And if tests pass when she commits, the same tests will pass if another developer pulls the same revision of the same branch.

Bazaar 1.11 and BzrTools 1.11.0 RPMs for Red Hat Enterprise Linux 5 and CentOS 5

These RPMs are now obsolete. Please check this blog for the latest ones.

The EPEL RPMs for Bazaar and BzrTools are quite out of date, so I rolled some experimental packages based on the Fedora 10 spec files.

Bazaar i386 Bazaar x86_64 Bazaar SRPM Bazaar Spec
BzrTools i386 BzrTools x86_64 BzrTools SRPM BzrTools Spec

One more requirement

If you don't have EPEL installed as a repository, you may have to manually download and install the python-pycurl (i386 or x86_64) package.

Creating common branch ancestry is a hard problem

One of the key features of distributed version control systems (DVCS) is support for divergent development (branching) and then merging. Most DVCS tools, including git and Bazaar, include rather elegant support for such workflows by embedding metadata about common ancestry into branches. In this post, I’ll be focusing on Bazaar.

“Common ancestry” means an identical revision shared by two branches. The most recent common ancestor typically indicates the point of branching, unless there has been a more recent merge. A successful merge between two branches establishes a new, more recent common ancestor. Identifying a common ancestor is a required step for performing automatic three-way merges to integrate changes from a foreign, divergent branch.

Typically, the branching metadata allows Bazaar to automatically determine the most recent common ancestor to use as the base revision for three-way merging. The problem comes when you need to merge two branches that do not share any common ancestors. (For the purpose of this post, I am not counting revision zero, the universal common ancestor, as a common ancestor. It is generally useless for merging.)

Creating common ancestry
If you try to merge two branches without common ancestry and without any revision identifiers, Bazaar will complain and do nothing. You can specify a revision range (like -r5..-1, which would be everything from the fifth revision to the latest on the foreign branch) and then apply the merge. Bazaar will then merge the changes in the specified revision range into your local branch and establish the last merged revision as the latest common ancestor, making future merges a breeze. Unfortunately, finishing this initial merge is where dragons lie. But before I can get into the difficulty of creating common ancestry by merging two unrelated branches, I have to briefly discuss how Bazaar handles files.

How Bazaar tracks files
Bazaar maps file paths to globally unique file IDs, and two branches without prior common ancestry will have different file IDs for the same paths, even if the files are really the same. Every time a file is added to a Bazaar branch, it gets a unique ID. As files are moved and renamed, they keep their unique IDs. So, if two people download Drupal, extract it, and independently “bzr init”, “bzr add”, and “bzr commit”, their respective README.txt files (for example) will have different file IDs.

Creating common ancestry (continued)
These globally unique file IDs cause trouble when merging from unrelated branches. When Bazaar merges two branches (related or not), two files with the same path but different file IDs create a conflict even if they contain identical content. Merging two unrelated branches with lots of shared files creates a mess of conflict files and conflict directories, and there is currently no convenient way to resolve these conflicts.

A concrete example with Drupal
A developer downloads Drupal and puts the project under version control in a fresh Bazaar branch. She discovers that Four Kitchens maintains a Bazaar branch of stable Drupal releases and decides she would like to use the Four Kitchens branch to automate installation of minor Drupal updates. She attempts to merge in the revision range from the Four Kitchens branch that would perform the upgrade. Because the file IDs in the Four Kitchens branch differ from hers, every Drupal core file and directory has a conflict despite having identical content for the base merge revision. After a bit of tedious conflict cleanup, she commits the merge and goes back to enjoying Bazaar’s generally elegant architecture.

I thought this blog post was going to give me a solution!
Nope, there’s not one out there yet. I’m currently looking at writing a custom merge handler (subclassed from the standard merge3 in Bazaar) that would intelligently handle merges where file paths do represent the same files, regardless of file IDs. Unfortunately, the file ID/path conflict is low-level in Bazaar and occurs before reaching the most modular part of merge conflict resolution.

Thanks to Robert Collins on the Bazaar project for walking me through the Bazaar internals necessary for me to explain this issue and, hopefully, solve it.