DVCS for everyone

Tuesday, December 07, 2010

Or, an illustrated guide to DVCS...

As you’ve probably realized, I’ve started writing a series of blog posts, trying to be as “educational” as possible, and covering topics that range from the history of SCM to branching patterns to the way in which the folks working on the Linux kernel make the best use of branching patterns.

My challenge for today is to try to turn the distributed development tiger into a friendly cat, unable to scare anyone. I know a good fraction of the readers are not really scared of DVCS at all (you Git, Hg, Plastic and Bazaar coders out there!) but since we released our free version one month ago I’ve been receiving more and more emails from people saying things like “hey, yes, I need better branching and merging but... what’s all this noise about being distributed?”.

So here goes, I’ll try to give you a very easy explanation.

Version control scenarios

The main scenario you’ve been using while getting your hands dirty with Subversion, CVS, Perforce, Team Foundation Server or any of the beautiful version control systems out there is something like the following:

One server handling version control and a bunch of coders “checking in” to it. Easy and comfortable, isn't it?

Then you’ve heard of “distributed development” at places like github.com or at any of the open source projects that have migrated away from Subversion to Git during the last two years (like Ruby or Mono, just to mention a couple).

If you’re a hard core OSS guy, you probably don’t have doubts about what it looks like, but if you’re working in a different environment (maybe small companies, corporations with tight rules and so on) maybe it is not so clear. The first idea that comes to my mind is something like the following:

The big change is that instead of having a central server, now everyone is “running their own version control system” (whether it is a real server or just a .git directory sitting in your working copy is not really important at this stage) and exchanging commits (checkins) in a peer-to-peer way. If you want to, you can unplug from the network, leaving all those internet connection and speed issues behind. (Yes, I know it’s 2010 but we developers hate waiting and speed is never enough for us, is it?)

But, for the more “centralized-all-under-control” of you out there… I bet this will look like a total mess! I hear you crying “what are you talking about? No central location? Are you crazy?”.

Well, give me a second. Pure DVCS is about … well, about being totally distributed. But reality is a little bit different, so behind the DVCS buzz we all end up having a “central copy” somewhere (or a “master copy” if you prefer). Kernel developers rely on Linus having the “real thing” and the rest of the open source projects… well, they rely on github.com or something similar, as the following figure shows:

So... don’t panic! There will be a master copy of your beloved source code!

Developers can continue exchanging modifications in a peer-to-peer way, but at the end of the day they’ll be pushing their changes to the central location.

In fact, if you think about it, the different “servers” around won’t be exact copies of each other (not required at least) but the important stuff will be safely stored on the master and replicated among the “developers” for greater safety. (An experimental branch will be only on your computer, or maybe shared by one of your colleagues. But the changes for 3.0 release will be shared by everyone and stored on the master server.)

Networking is key

If you look at the centralized vs distributed scenarios, you’ll clearly see there’s a big difference in terms of the networks where they operate. Centralized is great on LANs: speed is high, latency is low and… well, everything is great with a gigabit-per-second cable plugged into your laptop, isn’t it?

Distributed development grew up because developers always need to work faster than the available network supports. Face it, working against a local repository on your machine is faster than accessing an office server, whether it’s LAN-based or Internet-based.

(There are a bunch of other reasons, too, basically related to branching and merging as I always try to explain, but, ok, speed plays its role.)

Don’t let them fool you...

You know what marketers do best... yes, try to fool all of us ;). I hope none of my marketing friends is listening but, you know, nowadays “distributed development” is a big word and everyone is trying to hop aboard the “distributed ship”.

Let me explain: there are a few truly distributed version control systems out there: Git, Mercurial, Bazaar, Darcs from the Open Source world (yes, I know I’m missing your favorite one, I always do... ); and BitKeeper and Plastic SCM in the commercial dimension. (My beloved Plastic can work in centralized mode too... hey, yes, we’re flexible.) Period.

But, “distributed” is cool, so a guy from the marketing department at company X stands up and says “we’re distributed too” without even changing a line of code of their ancient system. And I find myself having to explain why my system is distributed and the other is not… crazy!

Well, a picture is worth a thousand words:

If you take a centralized system and make it listen on a TCP port for clients connecting over the Internet… it is still a centralized system! Period. Ok, I feel better now.

Multi-site development

Well, some of you are probably thinking now “God! I just wanted to link some offices together, what is this guy talking about?”.

And if you’re thinking that… you’ve probably heard about “multi-site”, haven’t you? Remember the extremely expensive ClearCase Multi-Site functionality? Wasn’t it great? Well, I have to admit I’d like to be able to charge as much as Rational did for it but, you know, we’re not in the 90s anymore. What good old CC was able to do for a fortune is now available for free or at a moderate cost.

If you’ve got several teams working on different sites, then you’re most likely used to the following picture:

Just to clarify: the “cloud” is “The Internet” not a super-cool cloud computing thing.

Well, here are the challenges: while developers at “Site A” enjoy the full speed of their LAN connection to their central server, the developers at “Site B” are jealous because they must access the server over slower connections, so they have to wait… (and to make things even more dramatic, I drew the diagram using single-monitor desks for the folks at Site B… :D).

Unless the two sites are connected by some sort of super-cool-fast connection, this setup will end up causing problems. It happened in the past, it happens today and it will happen tomorrow. And connections, nowadays, never seem to be fast or reliable enough. (Ok, now I’m waiting for the comment from some guy at “mega-corp”, saying they have a trillion-bps connection that only costs half a million a year and everything just works. Thanks. Yes, we know you’re sooo rich.)

The good news is that the same “technology” used to create distributed teams over the Internet (DVCS) is available to connect your distant offices and move the “Site B” developers to the “premier league”.

Each site will have its own “central server” and then communication between sites will only happen when the “central servers” replicate changes back and forth. It can be done on demand, it can be done once a day, or it can be done continuously. The point is that “it can be done”.
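
To make the idea a bit more concrete, here is a minimal Python sketch of two site servers that only talk to each other when they replicate. It is a toy model, not how any real system is implemented: the names SiteServer, checkin, replicate_from, sync and wan_transfers are all made up for the illustration, and changesets are just strings.

```python
class SiteServer:
    """Toy "central server" for one site (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.changesets = set()
        self.wan_transfers = 0  # changesets received from another site

    def checkin(self, cset):
        # LAN-speed operation: only the local site server is touched.
        self.changesets.add(cset)

    def replicate_from(self, other):
        # The only place where cross-site traffic happens.
        missing = other.changesets - self.changesets
        self.changesets |= missing
        self.wan_transfers += len(missing)
        return missing


def sync(a, b):
    """Replicate in both directions so both sites hold the same changesets."""
    a.replicate_from(b)
    b.replicate_from(a)


site_a = SiteServer("Site A")
site_b = SiteServer("Site B")

# Developers at each site check in against their own server.
site_a.checkin("a1")
site_a.checkin("a2")
site_b.checkin("b1")

# On demand, once a day, or continuously: the servers exchange changes.
sync(site_a, site_b)
```

Whether sync runs on demand, nightly or in a loop is just a scheduling decision; the checkins themselves never wait for the link between the sites.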

Distributed is for you... yes it is!

If you’re still thinking distributed “is not for me”, let me try once more. Suppose you’re working on your centralized team with a centralized server but one day you feel like working from home (or going to the customer’s site to fix this really urgent bug -- you know, the kind of things real coders do). Then you’ll be like the guy in the following picture:

What’s the problem here? Well, you’ll be facing the same “slow-downs” that the folks at “Site B” had. Can you live with it? Ok, fine, but wouldn’t it be nicer to have something better?

(Please note I gave you a joystick so you can have some fun while waiting for the SCM to finish pushing bits through the Internet :P).

Even on a centralized team it would be great to have the freedom to go offline sometimes, wouldn’t it? And that’s basically what the next picture is all about:

You’ll end up checking in locally, being able to create branches, check differences, walk the history and so on. And you’ll push changes back to the central server only when you feel like doing it… (maybe while you try the latest GT5 after a long day of work… ok, just kidding!).

Short guide through DVCS essentials

I won’t try to come up with a full description of distributed version control operations because I’ve already done that in the past (check http://codicesoftware.blogspot.com/2010/03/distributed-development-for-windows.html and http://codicesoftware.blogspot.com/2008/07/distributed-software-development.html). But I think it is a good idea to wrap up with a quick explanation of the main operations, so that newcomers to distributed feel it is not hard at all.

While the main “version control” operations are “checkin” and then probably “branch and merge”, the main distributed operations are: clone, pull and push. And they’re depicted in the following diagram:

Basically, you start with an empty repository and fill it by cloning an existing one from somewhere else.

Then you’d like to get the new changes someone put on the central server, wouldn’t you? You’ll use a “pull” for that.

And finally you’d like to make your changes available on the central server, too (or some peer’s copy). You’ll use “push” for that.
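
If you prefer code to pictures, the three operations can be sketched in a few lines of Python. This is only a toy model under obvious assumptions: the Repo class and its methods are invented for the example, a repository is nothing more than a list of changeset ids, and no real tool works exactly like this.

```python
class Repo:
    """Toy repository: just an ordered list of changeset ids."""

    def __init__(self, name):
        self.name = name
        self.changesets = []

    def commit(self, cset):
        # A local checkin: no network involved.
        self.changesets.append(cset)

    def clone(self, name):
        # Start a new repository as a full copy of this one.
        copy = Repo(name)
        copy.changesets = list(self.changesets)
        return copy

    def pull(self, other):
        # Bring in the changesets that `other` has and we lack.
        new = [c for c in other.changesets if c not in self.changesets]
        self.changesets.extend(new)
        return new

    def push(self, other):
        # Send our new changesets to `other`: a pull seen from the other side.
        return other.pull(self)


central = Repo("central")
central.commit("c1")

laptop = central.clone("laptop")  # clone: get everything once
central.commit("c2")              # someone else checks in on the server
pulled = laptop.pull(central)     # pull: fetch what is new
laptop.commit("c3")               # work locally, even offline
pushed = laptop.push(central)     # push: publish the local work
```

Note that in this toy a push is simply a pull seen from the other side, which is a handy way to remember that both operations move the same thing: changesets the other replica does not have yet.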

I definitely skipped the fun part here: what happens if two developers made concurrent changes in two “replicas”? For now just consider that they’ll be able to sort it out before doing a “push”, because the system will detect the situation and ask them to merge the changes. This should be enough until you fully jump into the gory details...
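
For the curious, here is a rough Python sketch of that “detect and ask for a merge” step. It is a simplification with made-up names (NeedsMerge, push, pull_and_merge): each replica is modeled as a plain set of changeset ids and a push is refused when the remote holds something the local replica lacks. Real tools track parent links between changesets instead of comparing sets, and the actual content merge is not modeled here.

```python
class NeedsMerge(Exception):
    """Raised when the remote holds changesets the local replica lacks."""


def push(local, remote):
    """Return the remote contents after a push, or raise NeedsMerge."""
    missing = remote - local
    if missing:
        raise NeedsMerge("pull and merge first: " + ", ".join(sorted(missing)))
    return set(local)


def pull_and_merge(local, remote):
    """Bring in the remote-only changesets and record a merge changeset."""
    missing = remote - local
    if not missing:
        return set(local)
    merge_id = "merge:" + "+".join(sorted(missing))
    return local | missing | {merge_id}


base = {"c1"}
alice = base | {"a1"}  # Alice works in her replica
bob = base | {"b1"}    # Bob works concurrently in his

central = push(bob, base)  # Bob pushes first: nothing to merge

try:
    central = push(alice, central)
    rejected = False
except NeedsMerge:
    rejected = True  # the system detected the concurrent change

alice = pull_and_merge(alice, central)  # Alice merges locally...
central = push(alice, central)          # ...and now the push goes through
```

The point of the sketch is the order of events: the merge happens in Alice’s replica, before the push, so the central copy never sees a half-resolved state.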

Enjoy.

2 comments:

  1. You're forgetting a few important benefits. First, you can create private branches, which reduces namespace pollution in the central server and allows you to hide some of the churn while experimenting with changes. Commit policies may also be laxer privately. Second, you can share changes easier. You can push your branch to a fellow developer while others would rely on diff, which lacks some important functions like whether a file was added, removed, or moved. Last, operations are faster not just because of avoiding network latency as mentioned but also because the central server is much less loaded. This is an important distinction because those in the corporate world may still be using network mounts and come to the conclusion there's no benefit to them. Just ask the Perforce guys what kind of machine you'd need to buy to host your software.

    One big disadvantage comes with cloning. While it might be acceptable to you to have a large repository on one server, it might not be acceptable to have lots of big repositories everywhere. Big databases of tests come to mind but also binaries. I don't know about PlasticSCM but none of the other OS tools supports partial cloning. Another related problem is a repository with a large number of files, which even though not large can take a long time to work with. Git supports this use case through "sparse checkout" but the interface is fairly unfriendly. Bazaar has a feature called "Filtered Views" to solve this problem but it is only available in development mode and the interface is even less friendly.

  2. @Anonymous: thanks for the comments, they help make the post much more complete.

    Yes, in Plastic we do support partial cloning, and it will be one of the key features in 4.0 (because we'll be able to replicate from a cset).

    Agreed about performance on central server.

    Tried to keep a very simple picture, anyway.

    Thanks,

    pablo
