Enforcing branch commit atomicity (or, why the git staging area is bad)

With CVS, one of the few repository-wide atomic operations is tagging a local checkout. And not all that long ago, Subversion introduced mainstream users of free, open-source version control systems to full-scale atomicity. Or, at least, the ability to be atomic.

Subversion’s approach to atomicity is rooted in its centralization and hybrid branch/directory model. Because Subversion makes it hard to merge from other repositories, there’s a strong incentive to combine many projects and branches into one repository. Subversion therefore offers directory-level checkouts as well as convenient, repository-wide checkouts for developers working on multiple projects. To fit this project and branching model, Subversion performs most operations at the level of the current directory and below. (The only alternative would be performing the operations checkout-wide, which would cause behavior confusingly dependent on the choice of checkout root.)

Subversion’s model operates atomically if users run the commands from the root of a project or branch, but the this-directory-and-below model encourages bad development behavior. For example, when working on a Drupal project, it’s easy to commit just the changes for one module or theme, which creates a revision in the repository that may never have existed as a working copy and may not work. Administrators can mitigate the problem with repository-side continuous integration (CI), but even CI still doesn’t guarantee true project coherence and atomicity.
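
For illustration, here’s roughly what that looks like in a hypothetical Drupal checkout (the paths are made up):

    $ cd sites/all/modules/mymodule    # a hypothetical module directory
    $ svn commit -m "Fix the module"   # commits only changes at or below this
                                       # directory; related edits elsewhere in
                                       # the checkout are silently left out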

Bazaar, on the other hand, performs its operations (at least by default) on the entire branch, encouraging real atomic commits. This default, branch-level behavior tends to annoy developers used to separating a project’s changes into different directories of a Subversion checkout and checking in by directory. This Subversion-based workflow works reasonably well in practice, and for systems like Bazaar and git to enforce high levels of atomicity and remain usable, they must provide convenient tools to separate the changes intended for each commit.

Bazaar and git take different approaches to providing such tools: Bazaar has shelve and unshelve; git has the staging area.

The most obvious way they differ is in workflow. Bazaar’s commands are optional and remove changes from the upcoming commit. git’s commands are mandatory and add changes to the upcoming commit. Here, git’s choice seems very sensible. Encouraging manual approval of each change in each commit reduces mistakes. On the other hand, Bazaar provides a convenient uncommit command that allows reversal of erroneous commits (which are often only obvious when seeing the file list as the commit is happening). All considered, I slightly prefer git’s workflow here.
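
For reference, a minimal sketch of that reversal in Bazaar:

    $ bzr commit -m "Refactor"   # the scrolling file list reveals a mistake
    $ bzr uncommit               # removes the revision from the branch but
                                 # leaves the working tree and its changes alone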

Where git fails is in the theoretical foundations of the staging area. The staging area encourages the same bad behavior as in Subversion, just with more surgical control of what gets committed. Committing in git with only some changes added to the staging area still results in an “atomic” revision that may never have existed as a working copy and may not work.

Along these lines, one of the most atomicity-busting aspects of git’s staging area is that it doesn’t just mark code that needs to go into the next commit; it actually saves the hunk into the staging index. So, a developer could add code to the staging area, modify her working copy, and end up with a commit containing code that’s neither in her working copy nor in her stash. The code ends up only in the commit just made, silently filed away for someone else to receive in their next merge. The git documentation for add confirms this behavior:

This command [add] can be performed multiple times before a commit. It only adds the content of the specified file(s) at the time the add command is run; if you want subsequent changes included in the next commit, then you must run git add again to add the new content to the index.
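
A minimal sketch of the behavior the documentation describes (the file name is hypothetical):

    $ vi foo.c                    # make a change
    $ git add foo.c               # the index snapshots foo.c as it is right now
    $ vi foo.c                    # make a further change
    $ git commit -m "Tweak foo"   # commits the earlier snapshot, not the file
                                  # currently sitting in the working copy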

In contrast, shelving a change in Bazaar reverts the change in the working copy. (It does save the change for later restoration with unshelve.) Because shelved changes are not in the working copy, Bazaar encourages the ultimate in atomicity: what a developer commits represents an atomic snapshot of the entire branch as represented by her working copy. And if tests pass when she commits, the same tests will pass if another developer pulls the same revision of the same branch.
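
The corresponding Bazaar sequence looks roughly like this (assuming a “make test” target):

    $ bzr shelve                    # interactively set aside unwanted hunks;
                                    # they are reverted from the working tree
    $ make test                     # the tree now matches the proposed commit
    $ bzr commit -m "Add feature"
    $ bzr unshelve                  # restore the shelved changes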


Comments

You might want to have a look at “git stash”. In its normal operation mode, it saves the staging area AND your work tree in a (sort of) temporary commit. In newer versions it gained a new option, “git stash save --keep-index”, which will save only those changes that are not staged for the next commit.

After that you could test if your code compiles and then you may commit your changes and use “git stash apply” to restore your previously saved changes.
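
The sequence described is roughly (assuming a “make check” target):

    $ git add -p                     # stage just the changes to commit
    $ git stash save --keep-index    # stash the rest; the tree now matches
                                     # what is staged for the commit
    $ make check                     # test the proposed commit
    $ git commit -m "Add feature"
    $ git stash apply                # restore the unstaged changes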

By the way, your captcha is a little hard. I had to press reload 3 times before I was actually able to read anything (the lines striking through the letters made it unreadable).

Just as it’s possible to work around Subversion’s flawed atomicity by remembering to always update and commit from a branch or project root, you can work around git’s staging model (as you’ve helpfully noted in your comment).

But my objection is not that git cannot preserve atomicity; it’s that git encourages non-atomic commit behavior with its staging-based commit workflow. It takes effort to break Bazaar’s branch-level atomicity.

Sorry about the CAPTCHA difficulty. You can skip it by signing on.

Hmm, so your gripe with Git here is that it doesn’t force you into your preferred workflow? I’d say calling the Git staging area bad for that reason is a bit harsh.

In fact, I think that being able to do partial commits (even just parts of the changes to the same file) is a feature, not a bug. I use it almost every day to say no, I don’t want to commit the debugging code here, nor the development-environment-specific settings there. Having to shelve/stash and then unshelve/unstash every time you want to commit doesn’t seem to me like a good way to work.

I agree. Speed is somewhat important, and forcing that workflow on every commit is burdensome.

I also agree that Git’s ability to commit hunks of a file, as opposed to the entire file, is extremely useful, and I use it almost daily, if not repeatedly. I can fix code formatting, tweak this here and that there, all very quickly, and then commit each change separately.
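
A rough sketch of that hunk-by-hunk workflow:

    $ git add -p                  # git presents each hunk and asks whether to
                                  # stage it: yes to the formatting fix, no to
                                  # the unrelated tweak
    $ git commit -m "Fix code formatting"
    $ git add -p                  # now stage the other tweak by itself
    $ git commit -m "Tweak the output"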

If what you’re doing puts your project at risk of not being able to ‘compile’ between commits, then you can voluntarily use a stash/unstash commit workflow to run tests on every commit, while for other, less risky work, you can use the faster workflow.

Git also allows you to manipulate commits before pushing. You can amend the previous commit, run git rebase -i to change the order of commits, and squash commits together (you should run rebase after a pull before pushing anyway). You could then go through each commit and test, making changes to the commits as necessary before pushing.
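
A minimal sketch of that cleanup step (remote and branch names assumed):

    $ git fetch origin
    $ git rebase -i origin/master   # an editor lists your local commits;
                                    # reorder the lines, or change "pick" to
                                    # "squash" to fold a commit into the one
                                    # above it, then save to rewrite the series
    $ git push origin master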

The main value of having atomic and ‘perfect’ commits is 1) finding bugs and 2) having a last known “good” state. Since you push a set of commits together, each developer ends up with a batch of commits with very close timestamps. If you always ensure tests pass before pushing, then you know each “commit batch” from a developer is the last known “good” state. When I try to find where a bug was introduced, I test the commits between each of these “commit batches” to see which batch introduced the bug, then go into the batch to see which commit introduced it.

However… the easiest way to find which commit introduced a bug is through git bisect. It will do all the calculations for you to find which commit introduced the bug.
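
In sketch form (the known-good tag is hypothetical):

    $ git bisect start
    $ git bisect bad          # the current HEAD exhibits the bug
    $ git bisect good v1.0    # a revision known to work
    # git checks out a midpoint; test it, mark it good or bad, and repeat
    # until git names the first bad commit
    $ git bisect reset        # return to where you started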

Bazaar does not really “force” users into one workflow, it merely makes the most atomic workflow the default. Optionally, users can commit file by file or (with the interactive commit plugin) hunk by hunk, but both of those options have the same atomicity hazards as git’s staging area. Obviously, git also supports truly atomic workflows, but they require avoiding the staging area’s default behavior. I think “safety by default” should be a driving principle of version-control system design.

I don’t think using “bzr shelve” is significantly slower than working with “git add”. If you have more changes you want to exclude than commit, you can “bzr shelve --all” and then selectively unshelve the items you want to commit. Overall, it’s two more commands than in git: the initial shelving of all changes and the post-commit unshelving of the uncommitted ones, and that’s only if you choose to reverse the direction in which shelving works.

Perhaps it’s too much to ask, but I’d appreciate including Mercurial in this line of reasoning. I guess it’ll be closer to git, because Mercurial started as a saner and more user-friendly git.

The exclusion of Mercurial is not intended to be a snub. I’m simply not familiar enough with Mercurial in this area to discuss it. I will include Mercurial in an upcoming post about patch management.

I do invite you to follow up with how Mercurial handles branch commit atomicity. Mercurial generally shows equivalent theoretical and design sophistication to git and Bazaar, so I’m sure its approach is also the result of serious thought about these issues.

I find these kinds of “philosophical” discussions quite pointless, and sometimes slightly amusing too. I’m not sure if this blog post is a good example of that, but these discussions pretty much always contain stuff like “$VCS tends to encourage users to do $THING whereas this $OTHER_VCS’s default model emphasizes $WORKFLOW”.

The reason why I find them pointless is that I think it’s the practice that matters. It’s about what you can do and what you cannot. A good tool is flexible, and you can do many things with it, easily and conveniently. That also means many things other than what the $VCS is supposed to encourage with default settings.

The reason it’s not pointless is that proponents of git cannot continue to argue these contradictory points:

  • “There’s only a small, core command set you need to learn to use git effectively.”
  • “You can implement a workflow that satisfies $CONSTRAINT for safety, all you need is intimate knowledge of many git commands and their associated options.”

Users who learn that small, core command set should be generally constrained to the safest methods. Commands and options that compromise core concerns like branch atomicity should be the territory of “advanced” options and commands.

Moreover, distributed VCS tools offer so many options and potential workflows that the choice of default functionality is enormously important. Even advanced users like me can appreciate a well-chosen set of defaults. When I venture too far away from the defaults, I know to ask myself, “Is this really the right way to be doing this?”

OK, whatever. Keep talking and philosophizing. I don’t know what “proponents of git” (or bzr or $VCS) are doing, but I’m still saying that on a large scale people are adopting tools in terms of “can do or cannot do,” not by metaphysical theories of the correct default usage model. In other words, it’s the practice that rules in the open-source world.

Large-scale free software projects tend to use the gatekeeper development model, which git and Bazaar both support quite well (and where the atomicity flaws in git’s staging area are far less important).

The concerns I raise in this post are more likely to affect small teams collaborating on a branch. So, you’re attacking a straw man: I’ve never argued that the atomicity flaws in git’s staging area should be a basis for large-scale projects to make a VCS decision. You’re just assuming I am and then attacking that imaginary argument by countering with what you believe large-scale projects use as their primary criterion in choosing VCS tools.

It seems to me that all the Bazaar “features” are things you can do in git. git reset HEAD^ (or HEAD^^) sets the working copy back a few revisions so that you can fix a bad commit, and tagging things locally allows you to bookmark known working states. So you can get the features you want from git.
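
Concretely, something like this (the tag name is made up):

    $ git reset HEAD^      # move the branch back one commit; the changes
                           # remain in the working copy for fixing
    $ git tag known-good   # bookmark a state that passed the tests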

It strikes me that, having said that, comparing centralized VC workflows and commit behavior to distributed workflows and commit behavior isn’t always productive. In Subversion/CVS, you want the commits to reference known good states, and you expect developers to commit at the end of a day or every week, depending. You also probably expect developers to be working in the same office (though I recognize that this is very often not the case; it’s an ideal). In this mode, the commit is a sacred object and an indelible record, and development is expected to take a very linear path.

In contrast, distributed workflows expect people to commit regularly. Really regularly; hourly, even. The commit isn’t sacred, because for the most part commits are rarely merged one at a time. Development happens in a branch with a series of commits which, when the development is done, gets merged with the current working master/head, and then someone upstream pulls the merge into the main repository. Tags are used to mark useful points in the history, while commits become a tool that developers use to experiment with code/content safely, rather than a tool for managers/vendors to track and save progress to guard against experimenting developers.

So the workflow changes. In some cases, a great deal. As boring as it sounds, looking at Linus’ kernel repository (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git), paying particular attention to the commit messages, illuminates that much of the development doesn’t happen on the kernel.org servers (distribution), that commits are imported in chunks, and that they get reviewed and signed off by a lot of people before they make it in.

Git is a very non-rigid system, and it pushes a lot of “best practice” enforcement off on cultural/social norms rather than on preventing developers from suffering the pains of stupidity, which given its origins, makes total sense.

It seems to me that all the Bazaar “features” are things you can do in git.

See my comment above about why the ability to do something in git is not the measure I’m looking at.

git reset HEAD^ (or HEAD^^) sets the working copy back a few revisions so that you can fix a bad commit, and tagging things locally allows you to bookmark known working states. So you can get the features you want from git.

Yes, you can fix bad commits, but you probably wouldn’t be aware that you’re making them because your commits differ from your working copy. You could make 100 broken commits while working on a feature because you continuously exclude the same debugging code, and the exclusion breaks the code.
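
A contrived sketch of how that happens:

    $ git add -p                    # stage the feature, skipping a "debugging"
                                    # hunk that also contains a change the
                                    # feature depends on
    $ make test                     # passes, but against the full working copy
    $ git commit -m "Add feature"   # the commit lacks the excluded change and
                                    # breaks for anyone who pulls it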

It strikes me that, having said that, comparing centralized VC workflows and commit behavior to distributed workflows and commit behavior isn’t always productive.

Ironically, Bazaar does offer a centralized (“cathedral”) workflow mode using checkouts where commits go directly into a shared branch that operates very much like a Subversion repository. I frequently use this capability for co-located teams. It remains the most efficient way for teams to collaborate on shared chunks of code.
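
Setting that up is simple (the server URL is hypothetical):

    $ bzr checkout bzr+ssh://server/project/trunk
    $ cd trunk
    $ bzr commit -m "Fix widget"   # the commit goes straight to the shared
                                   # branch on the server, Subversion-style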

And this Subversion-style checkout support jibes well with Bazaar’s distributed capabilities. For major feature development for Drupal core, we often create a working group that develops its work on a branch from core. We don’t want the feature branch to live in the main core repository because its existence is temporary, but we also want the kind of tight-knit collaboration Bazaar/Subversion checkouts provide.

The commit isn’t sacred, because for the most part commits are rarely merged one at a time.

That’s true when you use git like the Linux kernel does, though it’s not true at all for the Bazaar checkouts described above. (The metaphor even works: commits to the “cathedral” are sacred.)

And, even if commits aren’t individually merged, there are plenty of workflows that don’t follow the git ideal of merges-as-completed-features:

  • Joe: “Can you push your changes to module X so I can see the updated output?”
  • Mary: “Sure, let me commit just that part.”
  • Changes are merged/pushed/pulled/updated/whatever from Mary to Joe.
  • Joe: “I’m getting the WSOD.”
  • Mary: “I’m not. Hmm. Let me see what I committed wrong.”

In Bazaar, Mary would have only the changes she’s committing in her working copy, allowing her to easily check her code’s integrity before pushing. It’s pretty common practice to check that you’re not getting syntax errors or the WSOD before committing, and Bazaar’s model ensures that that check is actually on what you’re pushing.

The workflow in git is great for the gatekeeper model where requests to merge are carefully vetted and applied, but real-world teams often push changes to a shared branch for the purpose of sharing work in progress, not just finished features. Sure, you could email merge directives (or the git equivalent) to each other to exchange work, but it’s not efficient.

As much as I’m a fan of Git, your post here makes an excellent point. In fact, it was revealed to me the other day by a simple fact:

A Git pre-commit hook which runs “make check” before committing does not guarantee anything about the commit being made.

Therefore the only way to have a sane pre-commit hook is to actually clone the current HEAD+index elsewhere, extract it as a working tree, configure and build against it, and THEN you can know that the proposed commit will succeed.
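
A minimal sketch of such a hook, assuming a “make check” target: it exports the index (the content of the proposed commit) to a scratch directory and tests that, rather than the working tree:

    #!/bin/sh
    # .git/hooks/pre-commit: test what is staged, not what is on disk
    tmp=$(mktemp -d) || exit 1
    git checkout-index --prefix="$tmp/" -a   # export the index's content
    (cd "$tmp" && make check)
    status=$?
    rm -rf "$tmp"
    exit $status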

Git’s approach has rendered pre-commit hooks fairly worthless for verifying the compilability of the proposed commit.