2000 checkins per minute and counting

Thursday, May 16, 2013

Do you want to know how Plastic SCM performs under really heavy load? If you’re a Plastic enthusiast or just want to learn about the product… then keep on reading.

Why are we so obsessed with performance and scalability?

Version control performance is critical for lots of software teams out there. It is not restricted to a single industry: you can have big teams with big projects in aerospace, game dev, finance, telecom and many more. And the thing is that sometimes they can’t work in a distributed fashion, so they put a big burden on the version control server’s shoulders...

That’s why we regularly run load tests in Amazon to simulate lots of version control users putting a huge load on a single server… And this time we’re going through a quite interesting scenario: how does the server react when it is under a 2000 checkins per minute load?

The big “gang” theory

Big is a concept that changed quite a lot for us while evolving Plastic over the years. I still remember our first ever full checkin of the Quake source code that we used for testing. It is no bigger than 1200 files (twelve hundred) but it took… a while. That was back in 2005.

Later we considered a source tree of around 20k files big.

But things changed and today a good source tree is one with about 500k files, 40-80k directories and many GB in the working copy. And big, really big, is 1.5 million files.

What we consider a big number of concurrent users also grew a little bit: back in the day 50 was sort of ‘hey, are they trying to crash it’ but today we have customers with 1200 concurrent coders on the same project on a daily basis… on a 350k-file source tree.

The phantom bots menace

Here is the scenario we’re talking about: 300 bot clients crazily loading a single server. The point is not really figuring out how much load the server can handle. Today the question is: as a developer doing regular operations, how much am I affected if my server is under heavy load?

Or, in other terms: as a developer, can I work smoothly in a big team, or will team size be a bottleneck for the server so that even the simplest operation takes ages?

That’s why the scenario is as follows (thanks to our designer :P): 1 human developer doing operations while 300 bots overload the server with concurrent operations at 2000 checkins per minute on a 350k-file repository loaded up with a good changeset count.
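To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes the checkins are spread evenly across the bots and over time, which is a simplification (the real bots are not that regular):

```python
# Back-of-the-envelope load figures for the scenario above.
# Assumption: checkins are spread evenly across bots and over time.
CHECKINS_PER_MINUTE = 2000
BOTS = 300

server_checkins_per_second = CHECKINS_PER_MINUTE / 60
checkins_per_bot_per_minute = CHECKINS_PER_MINUTE / BOTS
seconds_between_checkins_per_bot = 60 / checkins_per_bot_per_minute

print(f"server: {server_checkins_per_second:.1f} checkins/s")
print(f"each bot: {checkins_per_bot_per_minute:.2f} checkins/min, "
      f"one every {seconds_between_checkins_per_bot:.0f} s")
# prints:
# server: 33.3 checkins/s
# each bot: 6.67 checkins/min, one every 9 s
```

So every bot checks in roughly every 9 seconds, far more often than a human developer would.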

Why not distributed?

You might ask: “hey, but Plastic SCM is a Distributed Version Control System, why are you using a central server instead of each developer checking in to their own servers hence removing the bottleneck??”

The answers are:

  • Yes, we can do that: use the distributed mode “if you can” and problems are over.
  • Noticed the “if you can” in the bullet point above? It is there for a reason: there are cases where distributed is not an option, at least not the option for everything.

When distributed is not an option

You read right: sometimes teams can’t go fully distributed. There are some reasons for this:

  • Not every team member is a programmer used to the dvcs paradigm. So asking them to checkin AND push is overkill.
  • The repo is SOOO big that for whatever reason you don’t want to have a copy on each team member’s machine. We can get around this one with Plastic SCM because we allow “partial replica” (a fairly unique feature that cost us some headaches to implement but is now helping us get more traction) but it is still not an option for some teams.

There are more reasons, but these two are a reality for many teams, especially in the gaming industry.

A few more reasons (for game devs) why “conventional” dvcs gets out of the picture:

  • They need exclusive checkout for artists. They DO need it. This one normally takes git and mercurial off the table. Plastic supports exclusive locking even in distributed scenarios. We implemented a global list on a server acting as ‘lock master’ that the rest of the servers can check (you can’t lock while you’re offline… supporting that is doable but would be quite complicated and error prone).
  • They also have big files (1 GB or more), which also tends to be “game over” for Hg and Git and opens the door for Plastic.
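The ‘lock master’ idea from the first bullet can be sketched in a few lines. This is only an illustration of a single global lock list, not Plastic’s actual implementation: the `LockMaster` class, its method names and the example paths are all hypothetical.

```python
import threading


class LockMaster:
    """Hypothetical global lock list; NOT Plastic's real implementation."""

    def __init__(self):
        self._owners = {}                # path -> user holding the exclusive checkout
        self._guard = threading.Lock()   # the server handles requests concurrently

    def try_lock(self, path, user):
        """Grant the lock if the path is free or already held by this user."""
        with self._guard:
            return self._owners.setdefault(path, user) == user

    def unlock(self, path, user):
        """Release the lock only if this user is the one holding it."""
        with self._guard:
            if self._owners.get(path) == user:
                del self._owners[path]
                return True
            return False


master = LockMaster()
print(master.try_lock("/art/hero.psd", "alice"))  # True: the file was free
print(master.try_lock("/art/hero.psd", "bob"))    # False: alice holds it
master.unlock("/art/hero.psd", "alice")
print(master.try_lock("/art/hero.psd", "bob"))    # True: released by alice
```

Because every server asks the same list, two artists on different sites can never hold the same file at once, which is also why the lock cannot be taken while offline.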

But still, sometimes big teams working on big projects (and this can be true for smaller teams dealing with big projects too), simply want to simplify things: 1 server per location (running replicas across the different sites and taking advantage of both the distributed nature of the product plus the partial replica) and team members accessing the site server, plus only some of them working replicated on their laptops, at home and so on.

Then what is the issue with traditional centralized servers?

Easy: they can’t scale while branching and merging in a big team (they have problems with small ones too, but it simply explodes on big ones).

If creating a branch takes ages (like minutes or so) or if your backend gets locked on a frequent basis… then you’re in trouble.

As you all know, the version control business is about replacing existing products on teams: almost everyone has a version control system nowadays, so our job with Plastic is to prove we have more features, we’re faster, easier… the usual :P

But we have tough competitors, so that’s why we have to create this sort of test to show what the server can do.

Setting up the server

The server we used for this test is one of the biggest we have set up for testing so far: 16 cores and 60 GB of RAM.

But the Plastic server binaries are always the same whether they run on a laptop using 80 MB of RAM or on a big, loaded test server using lots of system memory.

There are some tweaks to consider. The main one is the number of “trees” to be cached in memory. Since Plastic SCM 4 the server deals a lot with “changeset trees”, basically the source tree you have on a certain cset. Since different changesets tend to share most of their source trees, caching them is fast and easy.

When you run Plastic on your laptop the server normally has to deal with only a single user, so it is all about having only a few trees cached (the one you’re working on basically) and loading more on demand (reusing most of the nodes) whenever you merge, diff or switch to a different branch.
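Why is caching many changeset trees cheap when they share most of their nodes? A minimal sketch of the general technique (structural sharing between immutable trees) follows. The `Node`, `with_file` and `count_nodes` names are hypothetical and this is not how the Plastic server is actually written:

```python
class Node:
    """Tree node treated as immutable: a file has content, a directory has children."""
    __slots__ = ("content", "children")

    def __init__(self, content=None, children=None):
        self.content = content
        self.children = children or {}


def with_file(root, path, content):
    """Return a new root where `path` holds `content`, copying only nodes on that path."""
    name, _, rest = path.partition("/")
    children = dict(root.children)   # shallow copy: untouched subtrees are shared
    if rest:
        child = root.children.get(name, Node())
        children[name] = with_file(child, rest, content)
    else:
        children[name] = Node(content=content)
    return Node(children=children)


def count_nodes(node, seen):
    """Count node objects reachable from `node` that are not already in `seen`."""
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node.children.values())


cset0 = Node()
for path in ("src/a.c", "src/b.c", "doc/readme.txt", "art/hero.psd"):
    cset0 = with_file(cset0, path, "v1")
cset1 = with_file(cset0, "src/a.c", "v2")   # a checkin touching one file

seen = set()
print(count_nodes(cset0, seen))   # 8 nodes for the first tree
print(count_nodes(cset1, seen))   # only 3 new nodes: root, src and a.c
```

The second tree costs three extra nodes instead of eight, and on a real source tree with hundreds of thousands of files the ratio is far more favorable.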

But in order to speed things up for a big team where users will be requesting lots of trees, it is good to keep as many of them in memory as possible.

This is the main tweak we configure for big servers: modify db.conf and add:

<MaxCachedTrees>50000</MaxCachedTrees>

Which will cache up to 50k trees (and the number can be bigger if you need to).

There are also options for servers under really heavy load that let them load the most recently used trees on startup, before they begin serving clients, and hence reach a stable state ASAP. To set this up, add the following two entries to db.conf:

<LastUsedCsetsLoad>40</LastUsedCsetsLoad>

Which specifies the size of the LRU cache.

<LastUsedCsets>c:\cache\cache.txt</LastUsedCsets>

The file to cache all the info.
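To illustrate what these settings are about, here is a minimal sketch of an LRU cache of trees that can be warmed up before serving clients. It only shows the general idea: `TreeCache`, `load_tree` and the list of ids are hypothetical, and neither the server internals nor the format of the cache file are described here.

```python
from collections import OrderedDict


class TreeCache:
    """Hypothetical LRU cache of changeset trees keyed by changeset id."""

    def __init__(self, max_trees, load_tree):
        self.max_trees = max_trees
        self.load_tree = load_tree     # callable: cset id -> tree (the slow path)
        self._trees = OrderedDict()    # least recently used first

    def get(self, cset_id):
        if cset_id in self._trees:
            self._trees.move_to_end(cset_id)   # mark as most recently used
            return self._trees[cset_id]
        tree = self.load_tree(cset_id)         # cache miss
        self._trees[cset_id] = tree
        if len(self._trees) > self.max_trees:
            self._trees.popitem(last=False)    # evict the least recently used
        return tree

    def warm_up(self, recent_cset_ids):
        """Preload trees before serving clients (ids ordered oldest first)."""
        for cset_id in recent_cset_ids:
            self.get(cset_id)

    def recently_used(self):
        """Ids worth persisting at shutdown, least recently used first."""
        return list(self._trees)


loads = []

def load_tree(cset_id):
    loads.append(cset_id)              # record every slow load
    return f"tree-{cset_id}"

cache = TreeCache(max_trees=2, load_tree=load_tree)
cache.warm_up([10, 11])                # loaded before the first client arrives
cache.get(10)                          # hit: nothing is loaded
cache.get(12)                          # miss: loads 12 and evicts 11
print(loads)                           # [10, 11, 12]
print(cache.recently_used())           # [10, 12]
```

With a warm cache the first requests after a restart are hits instead of slow loads, which is the point of preloading the last used changesets.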

Bot clients

In order to create the clients we used what we call a “carnage test” (the name says it all) which is an executable file using the PNUnit framework (our extension to NUnit to create distributed tests) and linking to the Plastic internal client API.

This way we can create quite lightweight clients that use fewer resources than a regular one (for instance, they do not create a regular workspace but run all operations in memory, putting even more burden on the server’s shoulders) and are quite easy to run with virtually no setup.

The watcher

While the bots are running we run some manual operations on a client machine to check how it is affected by the server being under heavy load.

In the past we just measured bots finishing their workloads, which is ok but doesn’t really give you a perspective on “what does it feel like to work on a server used by lots of developers”.

Numbers measuring completion time for 300 clients just tell you whether it scales or not (or just multiplies the single-user test result by “n”) but not really the feeling.
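The watcher idea boils down to timing the same operation on an idle server and on a loaded one and comparing the two. A minimal sketch of that measurement follows. The helper names are hypothetical, and the timed operation is a stand-in, not a real Plastic client call:

```python
import statistics
import time


def time_operation(operation, repeats=5):
    """Run `operation` several times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """The median shows the typical feel; the maximum shows the worst stall."""
    return {"median": statistics.median(durations), "max": max(durations)}


def slowdown(idle, loaded):
    """How many times slower the typical operation is under load."""
    return summarize(loaded)["median"] / summarize(idle)["median"]


# Stand-in for a real client operation such as a checkin or a branch switch.
durations = time_operation(lambda: sum(range(100_000)))
print(summarize(durations))
```

Running `time_operation` once with the bots stopped and once with them running, then passing both lists to `slowdown`, gives a single number for how much the human developer is affected.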

Next steps

Now that we have gone through the initial tests, we’re creating a much bigger repository with 3+ years of history and a different type of “bot client”: something really simulating a developer instead of crazily doing operations and burning CPU.

The well-finished load test page

We’ve created a quite interesting page describing the scenario here and including a screencast of the “watcher” client.
