Your SCM may be decentralized, but your project isn’t

By Loren Segal on February 22nd, 2008 at 12:18 AM

Or, why your project doesn’t need Git

Geeks love new things. If geeks were frogs, gadgets would be lily pads. Here’s a diagram:

Figure 1: Frogs vs. Lily Pads

Geeks love to jump from lily pad to lily pad as soon as they see a new one. This is why I’m not all that surprised that people (at least in the Ruby community) are quickly ditching the Subversion ship that’s gotten so much attention over the last few years. While surprised I am not, I am slightly disappointed in people for leapfrogging to a new technology so quickly, before it’s proven to really be as cool as it sounds.

Important note: I’ve only been using Git for about a week now, so many of the technical details I provide about Git will be wrong. Please correct me. However, none of what I’m about to say has anything to do with the technical aspects of Git. Having seen the way Git is used in certain projects makes me wonder what people think they’re really getting from distributed source control management. Also note that I’m not discussing Linux development here– I know nothing about it, nor do I care. The fact that Git works great for Linus doesn’t mean it works for every other open source project.

Git is a technical masterpiece; but not all technical masterpieces are useful to you.

Git has some really nice features. Branching is effortless and hidden from the user. Merging is nice, conceptually, though resolving conflicts seems like a pain compared to Subversion. The speed/size optimizations are a must, and I sure hope the svn guys get their act together.

Note however that none of what I just credited Git for has anything to do with its decentralizededness (dictionary, please). All of this could be implemented in Subversion without changing your workflow.

Programming is not quite like editing Wikipedia

The supposed advantage of decentralized SCMs is that anyone can contribute code just by running a git clone and then making and sharing changes. Everybody has "their own branch" that they can develop on. But the truth is that anyone who tells you this is simply giving you a false sense of reality.

In real life, you don’t just download the repository, make a change and get a guarantee that your work will be merged back into the main branch for everyone to use (you made the change using SCM so you could share it, after all). In real life, projects are well guarded from the outside world and have a few gatekeepers known as maintainers. These people are usually the project owners/creators. In real life, you deal with having to convince these maintainers that your code deserves to be in the main branch.

The only advantage to git is that once you deal with all the politics, you can theoretically have the maintainers merge your Git branch with theirs really easily– though in reality most projects still would rather take patches using ticketing systems. In reality, workflow trumps technology.

Your workflow is the ultimate bottleneck.

To really understand why decentralized SCM is a complete waste for 90% of your projects, you must first step back and look at how you work. Let’s describe your average open source project:

  1. Most open source projects are small. Not everything is Gnome/wine/Linux.
  2. Most projects I see using git have about 10-12 active developers, with about 3-5 active committers. It can easily be fewer.
  3. More specifically, these projects have far fewer developers than users. Most of the people who download the source only do so to compile it– never to edit.
  4. Sometimes there is only one main committer with one or two backups. Watch project timelines, you’ll see only a handful of names– the rest will be your odd patch.
  5. 99.99% of all open source projects inevitably make one official release for each set of changesets.
  6. Such a release is usually hosted in one centralized location with maybe a few mirrors strictly for distribution’s sake.

Is anyone coming to a scary realization here? As decentralized as you attempt to make your project, you will always run into a single point of failure: your workflow.

I’m currently watching merb-core development as an example of one of these projects and I’ve noticed that the workflow is essentially equivalent to one with a centralized repository. Someone will submit a patch and have it committed by one of three main committers. If your patch doesn’t make it to the main branch, you’re simply out of luck. Sure, you could use your patch locally, but you could do this with any body of source code whatsoever, .git, .svn or .tar.gz. This is really no different from Subversion to anyone outside the core development team.

If your project has only one or two active committers or falls under any of the above categories, do yourself a favour and don’t waste your time installing git on your server. You won’t be benefiting from its features because your project and workflow will not have changed.

So who really benefits from distributed source control?

Core developers do. The truth is, Git isn’t so much a great DVCS as it is a private whiteboard for "pre-commits" to the main repository. Git can make it a lot easier to pass around changes before finalizing them, which would mean fewer broken builds on the main repository. That’s a good thing, and almost worth a two-tier setup (see diagrams below). But really, this is nothing Subversion cannot do with almost as little effort.

Why Subversion can do what Git does

This is what a git development workflow normally looks like:

 Figure 2: How git development works

The outer repository in this diagram is bundled with a release server (web server, most likely) and ticketing system for patches. The "git" blocks are machines with individual branches for each core developer (abstracted from their physical machines in case they use github or something). I didn’t draw all the connecting lines, but enough to show where the bottleneck lies. More importantly, it shows that you’re not really using Git as a decentralized development platform (sorry).

Now let’s try this setup with Subversion:

 Figure 3: The same scenario with Subversion

Notice that the Tier 1 workflow does not change. Instead, imagine a Subversion repository (it doesn’t have to be the same physical repository) where each developer has their own branch and "publishes" their changes by committing to that branch. This development workflow is 100% equivalent to using, say, github, to share changes. Literally– it’s exactly the same. No, really, it is. In this scenario, Joe can merge Larry’s changes by simply– merging them into his branch. When a final release is made, the few maintainers will merge the code that they got from other branches back into trunk, potentially tagging the changeset.

In fact, not only is this workflow exactly the same as Git’s, but it has a side effect that nearly makes it more powerful than using Git: in an optimistic development environment, the maintainers could give out write access for branches to people from the outside world. I could get "Loren’s branch", and start developing my changes in my own little sandbox. This would be similar to git, but the visibility of my code would be much higher in that the core developers would be able to keep tabs on changes that non-core developers are making without having me ping them about it *. I would no longer be a second-class citizen with my own git repository far off at some URI in the public world (see git diagram), but instead I would be developing in the same location where the core team is. I have no clue why people think this is a bad thing.

* To be fair, git could do what I just described, but the developers would need to manage links to all the outsider repositories / track them all with relatively complex, currently-non-free-or-very-private software (github being one). The infrastructure for doing this in Subversion is built-in and implicit.

In summary, group me with all the other people who are skeptical of distributed source control management, please.

15 Responses to “Your SCM may be decentralized, but your project isn’t”

  1. Chris McGrath Says:

    Have you ever actually used subversion merging? I’m guessing not. I happily use git for projects where I’m the only developer because it works so much better than svn for branching and merging.

    You seem to think that having the central maintainers approve and create a branch each time an external developer wants to try something is better than having that external developer just go and do stuff and ping the maintainers when it’s ready. Can you explain how having the main guys on a project doing admin instead of code will lead to better software?

    There’s nothing to stop you having a ticketing system, but instead of going through the hassle of creating and uploading patches you can just give your git repo address and the SHA-1 id. This means the main developers don’t have to manually download patches and figure out where to apply them from.

    People aren’t switching to git because it’s new and shiny, they’re switching because it offers a completely different approach that is more flexible and powerful than svn. I think you’ve completely missed the point here.

  2. Jakub Narebski Says:

    * “Merging is nice, conceptually, though resolving conflicts seems like a pain compared to Subversion.”

    Merging in Subversion is a PITA, unless you use one of the (IIRC incompatible) extensions like svnmerge or SVK (the latter also gives replication and off-line commits). Subversion simply does not have an automated way to compute the merge base: you have to provide it when merging, like in old CVS. This might change in the upcoming SVN 1.5, though.
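
For readers unfamiliar with the term, a merge base is the most recent commit that two lines of development have in common; a three-way merge compares both sides against it. The sketch below is only an illustration of that idea on a toy history. The commit names and the `parents` dictionary are made up for the example, and this is not a description of how Git or Subversion actually implement it.

```python
from collections import deque


def ancestors(parents, commit):
    """Return the set holding `commit` and every commit reachable from it."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def merge_bases(parents, a, b):
    """Return the common ancestors of `a` and `b` that are not themselves
    ancestors of another common ancestor (the "nearest" shared commits)."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other)
                   for other in common)
    }


# Hypothetical history: trunk is r1 <- r2 <- r3, and a branch forks
# from r2 as r2 <- b1 <- b2. Each commit maps to its parent commits.
parents = {
    "r1": [],
    "r2": ["r1"],
    "r3": ["r2"],
    "b1": ["r2"],
    "b2": ["b1"],
}

print(merge_bases(parents, "r3", "b2"))  # {'r2'}
```

A tool that records parent links for every commit can work this out on its own; without that record, the person merging has to supply the starting revision by hand.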

    How is resolving conflicts more difficult in Git than in Subversion? AFAIK both use the same rcsmerge/diff3 -E format for textual conflicts, for CLI. Git can deal with renamed files during merging; I don’t know about Subversion.

    * “Why Subversion can do what Git does”
    The presented setup lacks (at least) two very powerful features of Git: private topic branches and off-line commits.

    One branch per developer in a “staging repository” (BTW, how do you want to sync those two repositories? I think you can’t do that using only Subversion) as the equivalent of developers having their own private repositories is not enough. When working on a change which would take more than one commit, one would want to create a separate topic branch for that. With a central “staging repository” you have to 0.) have permission to create a branch in the repository, 1.) take care because branches are unique, 2.) post work which is not ready for inclusion, and perhaps you don’t want others to see this ‘dirty’ state.

    As to “currently-non-free-or-very-private software (github being one)” for hosting central, distribution, publishing repository for a project: there is repo.or.cz which uses set of bash scripts, freely available; Savannah uses Savane-cleanup (fork/branch of Savane), also freely available; Gitorious also provides its engine under GPLv2.

  3. Rein Henrichs Says:

    It looks like you forgot to tag this one “humor”. You might want to fix that before people get the wrong idea. If you had an “ironic unintentional satire” tag, that would be better.

  4. Wilson Bilkovich Says:

    Any SCM that does not let me check in code and switch branches while I am on an airplane is useless to me.

    I did this with Subversion via svk, but there are too many moving parts to make most people comfortable with that solution.

    Further, it is common for other non-svk devs to push back against the ‘all in one commit’ synchronization that svk offers: you end up combining two totally unrelated changes.

    Subversion is a lily-pad, Git is a jet-ski.

  5. Loren Segal Says:

    I’m in a non-confrontational mood today. I happily accept everyone’s opinion and points, but I’d like to emphasize the fact that frogs like lily pads, not jet skis.

  6. Wilson Bilkovich Says:

    I must not be a frog then.

  7. Mr eel Says:

    For small projects Git is ideal. Let’s consider the smallest amount of coders on a project, like… just me.

    With git I cd into a directory, init the repository, do some coding. Check it in. Done. With SVN I have to have the repo in one place, then import into it, then check out, then code, then commit. I end up with two sets of files for the one project. I found that annoying.

    Git makes it so easy that I tend to use it on even smaller projects.

    Also, I really think the cheap branching shouldn’t be underrated. It has really changed the way I develop. I tend to keep related changes in their own branches, which makes committing much easier and lets me switch between tasks easily, without getting confused over which files belong to which change.

    Concerning merb-core as an example of distributed SCM; it is not analogous to submitting a patch. When you do a git pull you’re merging the _history_ of two separate but related repositories.

    This gives an amazing amount of flexibility. It’s possible for a contributor to make large breaking changes to code, while still checking them into their own repo. These changes can then later be merged into main repo, along with the associated history.

    Finally, pulling from a remote repository is heaps better than applying patches and checking in changes IMO. A minor thing to be sure, but I’ve always found patching annoying.

  8. James McKinney Says:

    “Any SCM that does not let me check in code and switch branches while I am on an airplane is useless to me.”

    Yeah, because, fuck man, I live half my life in an airplane….

  9. Matt Todd Says:

    One thing that would not be possible with SVN would be to make another branch on your own branch/copy unless you can convince the admins that you need two branches or move all the files around into subdirectories for each “private branch”.

    But really, neither of those are acceptable solutions because, either way, you’re taking away time and adding unneeded complexity for such a simple task.

    The biggest problem with SVN is, when it comes to branching, having to have the admins create a branch for you and all the work on your end to convince them to do so for you and on their end actually evaluating your request and performing the necessary action. That’s time wasted for two people, not just one…

  10. Loren Segal Says:

    @ Matt:

    “Permissions” and “convincing” in Subversion repositories are merely a configuration issue, though; with proper write access already there this is a non-issue, and with a little infrastructure you can automate branch creation / setting up write access for anonymous users. Yes, it’s a little more awkward just because it’s not natively supported, but the benefit is that everybody knows where to find your changesets.

  11. Jakub Narebski Says:

    Another wonderful feature of Git (if used with some restraint) is the ability to rewrite history.

    How many times have you had a patch ready to send in your email client, or been doing a last review in your editor before sending, only to notice a bug in a commit or a typo in a commit message? Or perhaps you decided that a documentation update should go into the same commit? “git commit --amend” is your friend.

    Or perhaps, when creating a series of commits (clean history, better review, bisectability) introducing some new feature, you noticed in the middle of the series that some change to the infrastructure would make introducing this feature much easier; a change which should go at the beginning of the series. Or noticed a bug in an earlier commit? Or decided that the change had grown too much and the commit should be split? “git rebase --interactive”, or cherry-pick, or StGit / Guilt is your friend.

    Of course this makes sense only if you can have private development, in short only for distributed version control systems.

    ----
    @Loren Segal: “everybody knows where to find your changesets”

    With Git, or any other DSCM, you can always have a central publishing repo, like e.g. Linus’s tree on kernel.org, which is the authoritative source to “get changesets”.

  12. Loren Segal Says:

    @ Jakub:

    Administrating your repository and editing previous changesets does not inherently require a distributed system. That’s purely up to the implementation details of the SCM you’re using. If anything, shame on Subversion for not adding these features- but they *can* exist without changing workflow.

    Like I said before, I admit that Git has great *technical* advantages over Subversion. I can understand why that has caused people to start moving away from svn– heck, I’m doing it myself. However, most of what I’ve heard up until now has not been anything that is completely out of the scope of Subversion’s work model. I still won’t completely abandon ship until the SVN community/devs make it clear that they will not attempt to improve the places where Git completely destroys them.

    Subversion has its own advantages, they just don’t lie in the “technical” end, which everyone here is focusing on. It’s not about the brilliance of uber-unique SHA1 hashes, amazing tree-like filesystems and seamless branching/merging, but rather about simplifying workflow from the non-technical end. For instance, being able to check out just 3 relevant files in a repository of 5000 has enormous advantages, and this is something Git has no interest in from what I know.

  13. Jakub Narebski Says:

    @Loren:

    Rewriting (changing) *published* history is a big no-no. In centralized SCM committing means publishing. In distributed SCM publishing (push) is separated from committing changes. So IMHO rewriting history makes sense *only* for distributed version control systems, regardless of technical issues / implementation details.

    As to “partial checkout” (the ability to check out just 3 relevant files out of 5000): it plays merry hell with whole-tree (whole-project) commits, and encourages gigantic monolithic repositories. As to Git having no interest in this feature: for some time now Git has had submodule support, which is the proper way of dividing a project; it is something akin to svn:externals “done right” ;-). “Partial checkouts” is an often requested feature, and from what I can see Nguyen Thai Ngoc Duy is planning to implement it.

    Subversion has just too many issues; I wouldn’t enumerate them here. And Git can be used also in centralized fashion, as centralized SCM.

    One more thing: as Linus said in his Google Tech Talk, for projects with a very large community you want to have a “network of trust”, because the maintainer does not scale. So IMHO DSCM is better for such projects. For a single-developer project DSCMs are much easier to set up than centralized SCMs; it is just ” init” in the top dir of your project. So…

  14. Jakub Narebski Says:

    @James:
    >> Any SCM that does not let me check in code and switch branches while I am on an airplane is useless to me.

    > Yeah, because, fuck man, I live half my life in an airplane.

    It doesn’t need to be an airplane; it can be a train, bus or tram. It is enough if you don’t have a network connection. Or your central server can be down, or the network is congested and it is responding slowly.

    Off-line commits are a nice feature to have.

  15. Anonymouse Says:

    “More importantly, that you’re not really using git as a decentralized development platform (sorry).”

    No, you don’t understand. It’s only the “official maintainer releases” that are centralized. i.e. only the official maintainers can create them. This is a tautology.

    On the other hand, the actual development of the code *was* decentralized because different developers were able to ‘sync up’ without an “official” release. You could do this with a central server, but it creates bottlenecks and artificial class distinctions (see below.)

    “But really, this is nothing Subversion cannot do with almost as little effort.”

    You are proposing sweeping changes to SVN. You do a lot of hand-waving on exactly how it will work, so it’s hard to comment.

    First, you are creating a lot of central bandwidth and storage for trees that may or may not ever pan out.

    But real distributed source control management systems have the feature that *ANYONE* can be a maintainer and make their own releases. (I’ll admit that’s not useful in a corporate setting, but we’re talking Open Source.)

    In the central SVN model, the core developers create “2nd class developers” who can never be maintainers in their own right. Nobody can collaborate except at the whim of the core developers. If the core developers get stingy with the commit bit (or “branch bit” in your scenario), then the project stagnates. (See the XFree86/Xorg fiasco.)

    “Official maintainer” is a *social* designation, not a technical one. Therefore, you can declare yourself the “official maintainer” of Loren’s repo. You shouldn’t be at the mercy of someone else to host it.

    Linus has a kernel tree, but he will be the first to tell you that his tree isn’t special at all. (In fact, historically, almost no distros shipped his tree.) Linus only gets to decide what’s in his tree. Others (Andrew Morton, Alan Cox, the PowerPC arch guy, the SCSI subsystem maintainer) get to decide what’s in their trees. If the PowerPC guy wants to sync with my tree, he shouldn’t need anyone’s permission, and I shouldn’t need an account on a central server.

    It’s ironic that the SVN maintainers may look at your suggestion and decide “no, we’re not going in that direction”, and you’ll probably wish you had a distributed SCM so you could work on your idea without their permission (and without a major fork like EMACS/EGCS/etc).
