plastic 700 sec – git 1200 sec – a c# development story

Friday, December 09, 2011, 16 Comments

The short story: take a code tree of 192,818 files in 33,877 directories (overall size: 5.75GB) and check it in with your favorite version control tool. Plastic SCM needs 713 secs, Git needs 1287 secs.

Yes, we’re faster than Git!!!!!!

And yes: a C# program can outperform a well-written C program by 44%!!!

(Ok, we’d be running circles around “gitty” had we chosen C++ :P)

The story – the beginning

We started plastic scm back in 2005 and we decided to go for C# because it was much faster to develop with than C/C++. I missed C++ for a while but “.net remoting” (I was a DCOM fan) changed my mind.

We only used C# because Mono existed. A true SCM must be multiplatform. Mono was there, so we went for C#. The first time we added a source tree to plastic was in September 2005 or so. It was the Quake source code: 1200 files (about 30 MB). It took 11 hours to complete. (Yes, you read correctly, eleven hours!!!)

Then we took NHibernate out of the picture and developed the “sql” datalayer (look for the datalayersql assembly when you download plastic) that we’re still using today (with a ton of improvements), and things started to speed up.

We released Plastic SCM 1.0 back in November 2006 at TechEd Developers in Barcelona. It wasn’t the fastest thing on earth but it already had some of the best branching and merging in town. (Check this for some historical plastic scm photos)

You have to pay for your mistakes

And we did. The first thing was the design of the “communication layer” between the client and the server. We came up with a neat and supposedly well designed set of interfaces. Thanks to Mono.Remoting it was all like “invoking local methods”. Isn’t it good?

NO.

Interface Oriented Design explains why very well. At first (for newbies) it can seem much better to have something like this:

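interface CheckinInterface
{
    // One call (and one network round trip) per file.
    void Checkin(File file);
}

than this:

interface CheckinInterface
{
    // One call (and one network round trip) for the whole set of files.
    void Checkin(File[] files);
}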

But it is simply wrong. Over the network, the fewer round trips, the better.
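To make the round-trip argument concrete, here is a minimal sketch, not Plastic's actual code. `FakeServer` is a hypothetical stand-in in which every method call counts as one network round trip, and file paths are plain strings instead of the `File` type used above.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the remote server: each public call
// represents one network round trip.
class FakeServer
{
    public int RoundTrips { get; private set; }
    public int StoredFiles { get; private set; }

    // Fine-grained call: one round trip per file.
    public void Checkin(string path)
    {
        RoundTrips++;
        StoredFiles++;
    }

    // Block-based call: one round trip for the whole batch.
    public void Checkin(string[] paths)
    {
        RoundTrips++;
        StoredFiles += paths.Length;
    }
}

static class RoundTripDemo
{
    // Checks in every file with its own call and returns the round trips used.
    public static int CheckinOneByOne(string[] paths)
    {
        var server = new FakeServer();
        foreach (var path in paths)
            server.Checkin(path);
        return server.RoundTrips;
    }

    // Checks in all files with a single call and returns the round trips used.
    public static int CheckinAsBlock(string[] paths)
    {
        var server = new FakeServer();
        server.Checkin(paths);
        return server.RoundTrips;
    }

    static void Main()
    {
        string[] paths = Enumerable.Range(0, 1000)
            .Select(i => "src/file" + i + ".cs")
            .ToArray();

        Console.WriteLine("one by one: " + CheckinOneByOne(paths) + " round trips");
        Console.WriteLine("as a block: " + CheckinAsBlock(paths) + " round trips");
    }
}
```

With 1000 files the first variant pays for 1000 round trips and the second for one. Whatever the real latency per call is, it gets multiplied by that count.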

Of course it wasn’t as simple as that. It meant really redesigning most operations to work “block based” instead of individually, something we finally set in stone back in Nov 2010 when we started Plastic SCM 4.0.

Now every data transfer is minimized (and there’s still work to do) and ready to work with “bulks”. So, the bigger the checkin op is, the faster we are compared to other systems out there.

Being faster than Git

At the beginning we wanted to be faster than Perforce, and we got there: our “update” (downloading the code to a workspace, “checkout” in git/svn jargon) became faster than our competitors’ long ago.

But the folks at Perforce dropped their “fast scm motto” once Git became mainstream (now they’re at a different party) and beating Perforce wasn’t fun anymore.

We focused on scalability due to business requirements for a while() and we still do! But beating all competitors on a single “speed up” test was sort of a goal.

Changes on the database backend

Plastic stores all data and metadata in a database: it can be MySQL, Firebird, SQL Server, SQL Server CE, Oracle and now also PostgreSQL.

In order to speed up we had to dramatically reduce the number of data transfers to the database. In SQL Server we did that using “bulk copy” selectively when possible (a huge checkin will activate it), so 8 months ago we were consistently beating git with the same data tree using SQL Server...

Yes!! It is possible to insert a tree of 200k items into SQL Server (using the network stack and everything) faster than putting it into git’s hidden directory (ok, database :P).
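The “selective” part of the bulk copy idea can be sketched roughly as follows. This is an illustration under assumptions, not the real datalayer: the threshold value, the table name and the column names are all hypothetical, and the sketch only stages the rows in memory. With a real SQL Server connection, the resulting `DataTable` is the kind of object `SqlBulkCopy.WriteToServer` accepts.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

static class SelectiveBulk
{
    // Hypothetical cut-over point; the post does not say what value Plastic uses.
    public const int BulkThreshold = 1000;

    // Small checkins keep using plain INSERTs; huge ones switch to bulk copy.
    public static bool ShouldUseBulk(int itemCount)
    {
        return itemCount >= BulkThreshold;
    }

    // Packs the items into an in-memory DataTable so they can be sent
    // to the database in one operation instead of one statement per row.
    public static DataTable ToTable(IEnumerable<KeyValuePair<long, string>> items)
    {
        var table = new DataTable("items"); // hypothetical table name
        table.Columns.Add("id", typeof(long));     // hypothetical column
        table.Columns.Add("path", typeof(string)); // hypothetical column
        foreach (var item in items)
            table.Rows.Add(item.Key, item.Value);
        return table;
    }

    static void Main()
    {
        var items = new List<KeyValuePair<long, string>>();
        for (long i = 0; i < 200000; i++)
            items.Add(new KeyValuePair<long, string>(i, "dir/file" + i));

        if (ShouldUseBulk(items.Count))
        {
            DataTable table = ToTable(items);
            // With a real connection, this table would be handed to
            // SqlBulkCopy.WriteToServer(table) in a single operation.
            Console.WriteLine("bulk path: " + table.Rows.Count + " rows staged");
        }
        else
        {
            Console.WriteLine("row-by-row path: " + items.Count + " INSERT statements");
        }
    }
}
```

The point of the threshold is that opening a bulk channel has a fixed cost, so it only pays off once the checkin is large enough.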

The current test

The current test I’m writing about today was performed using:

  • Plastic 4.0.237.2 (sqlite backend)
  • Git 1.7.8.msysgit.0

We’re running Windows 7 on a DELL XPS 13 laptop (2 to 3 years old) with 8GB RAM.

We’re using the sqlite backend, which is very good for distributed usage (I’ve been using it for more than 2 years now on my laptop) but doesn’t work well with concurrency.

Future steps

We have always written our data and metadata to SQL databases. They work simply great, even faster than the file system (http://codicesoftware.blogspot.com/2008/09/firebird-is-faster-than-filesystem-for.html) under certain circumstances.

But we’re considering a file system based backend (custom, closed, sort of what git/hg do today) to be used by distributed developers (main server can still be on SQL Server or MySql) and speed up some operations even more... (if possible! :P)

16 comments:

  1. No offense, but your sample set doesn't represent a normal scenario as far as *my* use of git goes.

    Do you have similar data for other operations and scenarios? I don't think your average project has over 33,000 directories, nor over 192,000 files. Without data for more typical uses, it's very hard to get a complete picture of how Plastic compares to Git. Maybe Git is better with fewer directories? Maybe not, but we don't know without more information.

  2. Do you think 192k files is too big?

    TB of size and 200k-500k files is what gaming companies request on a daily basis.

    Same for other industries.

    Now, obviously Git is quite, quite fast. That's why I'm glad to be faster doing add/ci.

    It doesn't mean git isn't faster on other operations.

    Linus Torvalds is a genius, and being better than his piece of software is not easy.

  3. Wasn't git famously written in a week or three?

    Pssst if you move from an SQL backend to, say, LevelDB I'd anticipate a noticeable speed boost all over again

  4. Good point! I'll give a try to leveldb.

    Well, git was written in a week... but it has been actively maintained and improved for 6 years!!!!

  5. Not that I disagree with you, but can git be faster on its "born" environment? I mean, git on linux vs plastic on linux?!
    NTFS is way slower that most of linux FS.

  6. NTFS is way slower than most of linux FS. ***

  7. Ok, if we beat it there... What would you do?? ;) I bet we beat it on our native env and git in its...

  8. I wasn't challenging you.. hehehe
    It would be much better to me to have these arguments when talking about plastic to other people.

    If you beat git on linux, I'll pay you a beer when you come to Brazil! :D

  9. (a) I too suspect that Plastic performance will be thwarted when you run git on any platform other than Windows. On windows, add/ci are _notoriously_ slow; this information is widespread, so it is quite easy to beat git on windows, in that department.

    (b) "We only used C# because Mono existed. A true SCM must be multiplatform" seems to contradict the statement "I bet we beat it on our native env and git in its...". But it isn't very clear what you're suggesting there

    (c) Can we have full disclosure? Is there a script we can run to reproduce this benchmark?

    Oh, and since these operations are largely IO-bound, using C++ likely wouldn't change the outcome at all.

  10. @amber: yeah, yeah, yeah... ok, I'm sure you can do it better... :P

    Last time we tried we consistently beat Git, like today, which is like twice as fast as Hg, which is like ten times faster than Perforce. SVN took 8 hours to complete.

    So yes, not bad, I think, considering we're not only fast, we also give a whole bunch of tools that no other SCM has.

    That being said... YES, we'll beat Git on its own operating system, Linux, too, no problem. BUT, one thing: Git will run on ITS kernel, you know, on the kernel written by the same guys who wrote the Git code... is it fair? I think it will be as unfair for Plastic as running Git on Windows is.

    And, yes, Windows is our primary platform. Yes, 90% of our users are on Windows (despite the fact that our biggest servers run on linux), so we're basically providing not only the best native SCM for Windows users... but also the fastest.

    The source code: two times the mono source code (only one copy is not big enough) + some big binaries, to make it closer to what our gaming customers require. I'll post the details anyway.

    Ok, it's been a long day, I know Git is a masterpiece of design, but just allow me, for one day, to go to sleep thinking our tiny 10-developer team beat one of the big monsters out there... the fastest.

    :P

  11. @cidico: keep these beers cold... 'cause you're going to lose them!!

    I just came back from watching "real steel" so I feel like beating big monsters today... :P

    Just in case you want to read the test against hg and also using smaller repos:

    http://codicesoftware.blogspot.com/2011/04/unscientific-40-benchmark-test.html

    And yes, we still beat them all...

    (ok, probably we won't on linux... who knows... last time we were only as fast as git there)

  12. Interesting that you selectively enable bulk.

    Did you benchmark:

    insert ....
    select (1, 'a')
    union
    select (2, 'b')
    union
    select (3, 'c')
    ...

    as well? I suspect that it will improve performance on other databases and would also enable you to speed up the case where you have just a few items and it's not worth opening a full bulk channel.

    Planning for this dynamic SQL isn't normally a problem.

    @Cidico: Have you tried benchmarking NTFS after explicitly disabling the flush of drive write caches and making it 'unsafe'?

  13. I too would like to see a benchmark of Plastic on Windows vs git on Linux, the same operation.
    Plastic on Linux would suffer from mono performance too much, I believe, although it would be nice to see it in the comparison table.

  14. I thought this was pretty interesting. Even if the numbers were a bit old, I decided to try to reproduce this.

    I'm using Linux kolya 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2 x86_64 GNU/Linux

    I did all this on the webkit source tree found on github.

    Versions:
    hg 2.2.2
    git 1.8.3.1.381.g2ab719e.dirty
    cm 4.1.10.454

    I have no idea what backend cm is using. I did a local installation as root of it. The gui doesn't work =(.

    Anyway here's the stats:
    cm
    real 8m24.532s
    user 5m39.941s
    sys 0m54.559s

    git
    real 4m41.698s
    user 2m9.960s
    sys 0m51.971s

    hg
    real 6m55.230s
    user 4m33.601s
    sys 0m54.727s


    For those of you who are interested in the scripts used:
    #!/bin/sh
    cm mkrep webkit@localhost:8087
    cm mkwk webkit .
    cm add -R *
    cm ci

    #!/bin/sh
    git init
    git add .
    git commit -am "initial"

    #!/bin/sh
    hg init
    hg add
    hg commit -m "init"

  15. This comment has been removed by the author.

  16. Hello there,

    I've just repeated the results with the latest Git & PlasticSCM versions.

    I'm using a smaller workspace (160K files, 10K dirs, 1.5G) with similar results:

    * 1.8.1.msysgit.1 -> 201s add + 48s commit -> 249 s

    * PlasticSCM 4.2.37.455 -> 7s add + 128s checkin -> 135 s

    The tests were run in Windows 7 x64.

    Enjoy it!
