multi-instance: a fault tolerant on-premise server

Wednesday, March 13, 2019, by Pablo Santos

Over the last four months, part of the team was quite busy working on an evolution for the Plastic server.

We wanted to provide Enterprise Edition customers with a fault-tolerant, zero-downtime server for their on-premise setups.

This is the story of how we achieved it.

Fault and failure

The following is a quote from the fabulous book "Designing Data-Intensive Applications" which I strongly recommend. Some friends consider it the "Code Complete" of data design.

Fault is not the same as failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore, it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.

This is exactly what we tried to achieve—make the Plastic server fault tolerant.

The problem: big servers can't have downtime

We package Plastic in various forms.

  • We have a Team Edition, suitable for teams under 15 that run Plastic on their own servers.
  • We also have Cloud Edition, where we run a multi-tenant cloud server that attends requests from multiple customers.
  • Finally, we have Enterprise Edition, which teams with more than 15 members run on their own servers too.

Enterprise Edition serves teams as small as 15 and scales up to serve corporations with thousands of concurrent users.

For small teams running Team Edition, restarting the server to perform an upgrade is normally not a big issue, and downtime in the event of a fault is not that critical.

But in large setups, even a few minutes of downtime affects hundreds or thousands of developers. Yes, they might be working in distributed mode and hence be more tolerant of the main server being down, but with high user counts, chances are many will be affected anyway.

How we solved this with Plastic Cloud

Plastic Cloud already has a solution for this: we run multiple nodes on different virtual machines, so if a node goes down due to a malfunction or during an upgrade, there will always be other nodes attending requests.

We run on Microsoft Azure infrastructure. The Plastic server as a whole is made of a number of roles, each of them a virtual machine managed by the Azure infrastructure (application as a service). Azure provides a load balancer that takes care of redirecting traffic to the right role. During an upgrade, Azure takes care of starting new nodes with the new version and stopping the old ones.

The setup is based on having a shared state between the nodes, which is achieved by a combination of SQL Azure (a scalable cluster of SQL Servers provided by Azure), Azure Blob Storage and Redis.

Having multiple roles means that the entire system can be dynamically horizontally scaled to attend to the increased load. And relying on well-known components like SQL Azure, Blob Storage and Redis means the underlying infrastructure is scalable too.

Each of the Plastic Cloud roles shares more than 90% of the code with a traditional on-premise server. I mean, the server you can run on your laptop just by doing plasticd --console is almost the same as the one that serves all our cloud traffic. The key difference is that we replaced the local caches (dictionaries, lists and the like), synchronized by simple in-memory mechanisms (locks, mutexes, etc.), with more complex caches and inter-machine synchronization systems.
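
To give an idea of the kind of seam that makes this possible, here is a minimal sketch, assuming a hypothetical cache interface with one implementation per edition (the names and shapes are assumptions, not the real code):

  using System.Collections.Concurrent;

  // The server code depends on an abstraction along these lines.
  interface IRepositoryCache
  {
      bool TryGet(string key, out string value);
      void Put(string key, string value);
  }

  // Single-process, on-premise flavor: a plain in-memory cache kept
  // consistent with the runtime's own synchronization primitives.
  class InMemoryCache : IRepositoryCache
  {
      readonly ConcurrentDictionary<string, string> mData = new();

      public bool TryGet(string key, out string value) =>
          mData.TryGetValue(key, out value);

      public void Put(string key, string value) => mData[key] = value;
  }

  // Cloud flavor: the same interface backed by a shared store (such as
  // Redis), so every role sees the same state (implementation omitted).
  class SharedCache : IRepositoryCache
  {
      public bool TryGet(string key, out string value)
      {
          value = null; // would query the shared store here
          return false;
      }

      public void Put(string key, string value)
      {
          // would write the entry to the shared store here
      }
  }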

Unfortunately, we couldn't reuse exactly the same setup for on-premise Enterprise servers.

What makes on-premise different from cloud?

The key difference is the need for setup simplicity. When you run a cloud server, you have full control of the environment. You control the operating system, the database, the caches, the upgrades, everything. So, even if the setup is more complicated, the result pays off.

But when your customer is in charge of their own infrastructure the picture changes.

You need to support a bunch of operating system variations: macOS, Windows, and a wide number of Linux flavors.

The only way to do this successfully is by reducing complexity. Installing Redis is a logical step for the cloud, where we have full control of the deployment. But asking customers, even huge corporate ones, to install one creates a barrier to a successful deployment. We want Plastic to be self-contained: you install the Plastic server and it must package all the possible dependencies.

For years we depended on RDBMSs like SQL Server and MySQL (we still support up to seven different ones to store data and metadata). This meant that customers needed a working MySQL installation on Linux to make a Plastic server work. It was not a blocking requirement, but it introduced friction.

That's why (together with performance reasons) we developed our own storage for repository data and metadata; we named it Jet, inspired by Mach 3 Blackbirds.

Jet increased performance and also reduced server installation complexity. Now, an external pre-configured RDBMS was no longer required.

That's why we didn't even think of asking our Enterprise customers to set up a cluster of SQL Servers, a distributed cache system, and multiple virtual machines behind a load balancer to run a fault-tolerant Plastic server.

We wanted a simpler solution that required almost zero configuration and very low installation prerequisites.

Enter the multi-process Plastic server

Most of our Enterprise customers, even in large corporations, run their big Plastic central servers on a single virtual machine. As explained above, this greatly simplifies installation and reduces friction during evaluation.

They just install the Plastic server on a virtual machine in their datacenter and immediately get a performance boost compared to their previous version control system.

Single-machine setups meant we couldn't copy the exact setup we use for Plastic Cloud.

But we could reuse most of the concepts.

The main challenge for a single-process Plastic server is a potential fault. What if an unexpected situation makes the process crash (unlikely) or stop responding? The obvious solution is a restart, with system downtime. But downtime was exactly what we wanted to avoid at all costs with this new version of the Plastic server.

Entire machines going down wasn't that common in our experience, since IT teams take care of that scenario.

Then we thought about having a multi-process Plastic server: multiple processes attending requests in round-robin fashion behind a load balancer running on the same machine. In the event of a process going down, the still-running ones would continue handling the load, and a new process would be immediately spawned to replace the failed one.

The challenges were:

  • Creating a watchdog system to monitor Plastic server processes.
  • Making the Jet storage capable of handling multiple read and write processes. (The original design assumed a single process accessing the Jet storage, for maximum performance; we heavily relied on memory-mapped files that were exclusively accessed by the Plastic server process.)
  • Removing in-process caches to avoid inconsistencies between processes while still delivering high performance.

Overall system structure

This diagram shows the details of the system structure of the multi-instance configuration:

(Notice that I'm using my own interpretation of Simon Brown's C4 model).

Watchdog

We developed a new piece of software that we called the watchdog. It is responsible for starting up and monitoring the multiple Plastic server processes.

It runs based on a JSON configuration file that basically defines what nodes are available and how to run them.

A typical configuration file looks as follows:
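
The exact schema is not reproduced here; the following is only a minimal sketch of what such a file could contain, with three nodes and field names that are assumptions rather than the actual format:

  {
    "nodes": [
      { "name": "node0", "path": "nodes/node0" },
      { "name": "node1", "path": "nodes/node1" },
      { "name": "node2", "path": "nodes/node2", "spare": true }
    ]
  }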

The previous configuration defines a "spare node", meaning that only 2 of the 3 nodes will run at a given time; the third one is used to perform upgrades.

These are the key capabilities of the watchdog (a sketch of the monitoring loop follows the list):

  • Start Plastic server processes, finding free ports for them to listen on.
  • Monitor processes for crashes. If a process no longer exists, launch a new one to replace it.
  • Monitor processes for health. The watchdog sends "ping" commands (through a dedicated file-based channel) to each process. If the process doesn't respond in a timely fashion, the watchdog starts a controlled shutdown and finally kills the process after starting a new one to replace it.
  • Listen for Plastic server alarms. The Plastic server can send alarms to the watchdog in the event of a malfunction. This way the watchdog knows if one of the processes is not behaving correctly and will ask it to perform a controlled shutdown. Before the shutdown, it will launch a new replacement process to avoid possible service outages.
  • The watchdog can also recycle processes periodically. If configured to do so, the watchdog will periodically ask processes to perform controlled shutdowns after replacing them with new ones.
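
As a rough illustration of that loop, here is a minimal C# sketch; StartNode, PingNode and RequestControlledShutdown are hypothetical stand-ins for the real process spawning, the file-based ping channel and the controlled-shutdown message, so this is not the actual watchdog code:

  using System;
  using System.Collections.Generic;
  using System.Diagnostics;
  using System.Threading;

  class WatchdogSketch
  {
      static void Main()
      {
          var nodes = new Dictionary<string, Process>
          {
              ["node0"] = StartNode("node0"),
              ["node1"] = StartNode("node1")
          };

          while (true)
          {
              foreach (string name in new List<string>(nodes.Keys))
              {
                  Process node = nodes[name];

                  // 1) Crash detection: if the process is gone, replace it.
                  if (node.HasExited)
                  {
                      nodes[name] = StartNode(name);
                      continue;
                  }

                  // 2) Health check: if the node doesn't answer the ping in
                  //    time, start a replacement first, then ask the old
                  //    process for a controlled shutdown.
                  if (!PingNode(name, timeoutMs: 5000))
                  {
                      nodes[name] = StartNode(name);
                      RequestControlledShutdown(node);
                  }
              }

              Thread.Sleep(10000);
          }
      }

      static Process StartNode(string name)
      {
          // Hypothetical: the real watchdog starts each node with its own
          // configuration and finds free ports for it to listen on.
          return Process.Start("plasticd", "--console");
      }

      static bool PingNode(string name, int timeoutMs)
      {
          // Hypothetical stub: the real ping goes through a dedicated
          // file-based channel.
          return true;
      }

      static void RequestControlledShutdown(Process node)
      {
          // Hypothetical stub: ask the node to stop gracefully, and only
          // kill it if it doesn't finish in time.
      }
  }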

We implemented the watchdog using .NET Core. We have been using .NET Core for more than one year now to implement web code, but this is the first official .NET Core deployment we have done for an on-premise component.

Changes in Plastic server—graceful shutdown

The watchdog needs to shut down the Plastic server in a controlled manner; it can't just go and kill it, but must wait for the server to stop gracefully.

What we implemented was the following sequence (sketched in code after the list):

  • The watchdog sends a message to the server asking it to stop.
  • The server closes the socket accept loop. No more incoming connections are accepted.
  • Then, it waits for the current requests to finish. Suppose there are 10 requests inside the server. Their associated sockets are not closed, so the requests will be attended and new ones could potentially enter the server.
  • Once there are zero requests in progress, the server closes all the connections (all the sockets), so no more requests can enter.
  • And then the process shuts down.
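
Here is a minimal sketch of that sequence, with assumed names rather than the actual Plastic server code:

  using System;
  using System.Net.Sockets;
  using System.Threading;

  class GracefulShutdownSketch
  {
      readonly TcpListener mListener;
      int mRequestsInProgress; // incremented/decremented by request handlers

      internal GracefulShutdownSketch(TcpListener listener)
      {
          mListener = listener;
      }

      internal void Shutdown()
      {
          // 1) Close the accept loop: no new connections are accepted.
          mListener.Stop();

          // 2) Wait for the requests currently inside the server to finish.
          //    Their sockets stay open, so they are attended normally.
          while (Interlocked.CompareExchange(ref mRequestsInProgress, 0, 0) > 0)
              Thread.Sleep(100);

          // 3) Close all remaining client connections so nothing else enters.
          CloseAllClientSockets();

          // 4) And then the process exits; the watchdog already started a
          //    replacement node before requesting this shutdown.
          Environment.Exit(0);
      }

      void CloseAllClientSockets()
      {
          // The real server tracks its client sockets and closes them here.
      }
  }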

Changes in Plastic server—Jet multi-process

One of the toughest changes was to prepare Jet to work (and perform) in a multi-process environment. Not that it took that much to implement, but it really needed to be thoroughly tested.

From a 3000-foot point of view, the changes were simple:

  • Jet can operate in two modes: high-perf and low-perf.
  • Low-perf means that if the server needs to access, say, the changeset list, it will open the file, load the index, access the data, and close it.
  • High-perf means that the index is preloaded, so that subsequent accesses know where to go for data. This eliminates the need to "deserialize" the index on every access, and saves a lot of time.
  • The problem is that if multiple processes need to access the same repos, keeping the indexes open wouldn't be valid, because of the high risk of inconsistencies. This would mean multi-proc Jet would need to fall back to low-perf mode, with a huge performance impact under heavy load.

What we did was simple: we effectively rely on low-perf mode for multi-proc, but heavily improved the access patterns to remove bottlenecks and use local read-only caches when possible.

For instance, one of the most time-consuming operations is tree loading. This means loading all the files and directories of a given changeset. Fortunately, trees are immutable, so they can be safely cached by multiple nodes without fear of suffering from inconsistencies. So, we optimized the caches to achieve this behavior.
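
Since the real Jet code is not public, the following is just a sketch of that idea: a per-process, read-only cache keyed by changeset, filled through the "low-perf" open-read-close path (all names are assumptions):

  using System.Collections.Concurrent;

  class TreeCacheSketch
  {
      // Trees are immutable: once a changeset's tree is written it never
      // changes, so each server process can cache it locally without any
      // cross-process invalidation.
      readonly ConcurrentDictionary<long, Tree> mCache = new();

      internal Tree GetTree(long changesetId)
      {
          return mCache.GetOrAdd(changesetId, LoadTreeLowPerf);
      }

      static Tree LoadTreeLowPerf(long changesetId)
      {
          // "Low-perf" Jet access: open the storage, locate the tree,
          // deserialize it and close, so several processes can safely
          // read the same repository.
          return new Tree();
      }

      internal class Tree
      {
          // Files and directories of a given changeset (omitted).
      }
  }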

After that, about a dozen performance tweaks were implemented to remove negative impacts when possible.

Changes in Plastic server—cross-process communication

Most Plastic operations can be completed in a single request, but others require many. Take replica as an example. It works as follows:

  • Upload metadata.
  • Wait for the server to process all the metadata.
  • Meanwhile, the client makes "get status" requests to check how everything is going.

This flow means that issues might happen in a multi-process scenario.

Suppose you start your request and get attended by node1. Later you ask for status, but for some reason a new socket connection is created and directed to node2. When you ask node2 for the operation that actually started with node1, it won't have a good answer.

In Cloud, we fixed this in a very simple way: replicas use a direct connection to a given role. That is, the client negotiates with the server to obtain a direct connection to the role, bypassing the load balancer. This is something the Azure infrastructure can provide.

But this "sticky connection" solution was not compatible with the deployment simplicity we were looking for: it would mean opening more ports on the firewall, not just a single one. So, we discarded this choice.

How could we solve the problem, then? Very simple: if node2 receives a request asking for the status of replica abcd17, which actually started on node1, it queries the other running nodes and returns the answer to the caller. It requires some inter-process communication, but we were able to solve it quite easily without bending the current server design.
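
A sketch of that idea follows; the names and the INodeChannel abstraction are assumptions, not the real inter-process API:

  using System.Collections.Generic;

  class ReplicaStatusSketch
  {
      // Operations started by this node, keyed by operation id.
      readonly Dictionary<string, string> mLocalOperations = new();

      // Channels to the other server processes running on the machine.
      readonly List<INodeChannel> mOtherNodes = new();

      internal string GetReplicaStatus(string operationId)
      {
          // 1) If this node started the operation, answer directly.
          if (mLocalOperations.TryGetValue(operationId, out string status))
              return status;

          // 2) Otherwise ask the other running nodes and relay the first
          //    answer found to the caller.
          foreach (INodeChannel node in mOtherNodes)
          {
              string remote = node.QueryOperationStatus(operationId);
              if (remote != null)
                  return remote;
          }

          return "unknown operation";
      }

      internal interface INodeChannel
      {
          string QueryOperationStatus(string operationId);
      }
  }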

Operations that survive node restarts

This is one test we performed several times:

  • Two nodes are started on the server.
  • A client starts downloading 1GB of data, or doing a checkin of 1GB.
  • While the data is being transferred, we start cycling the nodes: we start a new node1, then we stop the old node1, then we do the same for node2.
  • At the end of the update/checkin operation, the nodes responding to the client are different from the ones that were initially running. The client doesn't notice the change happening underneath.

How upgrade works

Suppose we have a server running with 2 nodes, node0 and node1, and one spare node: node2.

> watchdog -- status
 node             | status       |   pid | running | started             | reason       | details
 -----------------+--------------+-------+---------+---------------------+--------------+-----------------
 node0            | ok           |  1728 |   0 min | 2019-02-26 10:16:23 |              | Last hour: 0 requests - 0 ms avg - load index: 0.000%
 node1            | ok           | 18896 |   0 min | 2019-02-26 10:16:23 |              | Last hour: 0 requests - 0 ms avg - load index: 0.000%

Then we run an upgrade as follows:

> watchdog upgrade serverbl3001.zip
 node             | status       |   pid | running | started             | reason       | details
 -----------------+--------------+-------+---------+---------------------+--------------+-----------------
 node0            | ok           |  1728 |   0 min | 2019-02-26 10:16:23 |              | Last hour: 0 requests - 0 ms avg - load index: 0.000%
 node1            | ok           | 18896 |   0 min | 2019-02-26 10:16:23 |              | Last hour: 0 requests - 0 ms avg - load index: 0.000%
 Spare node defined: node2. New spare node: node1
 Stopping monitor...
 Stopped
 Upgrading node2
 - Node 'node2' not running, no need to stop
 - Unpacking bl3001.zip to node: node2
 - Unpacking done
 - Starting node 'node2'
 - Started
 Upgrading node0
 - Stopping node 'node0'
 - Stopped
 - Unpacking bl3001.zip to node: node0
 - Unpacking done
 - Starting node 'node0'
 - Started
 Upgrading node1
 - Stopping node 'node1'
 - Stopped
 - Unpacking windowsserverbl3001.zip to node: node1
 - Unpacking done
 - Not starting node 'node1' because it's a spare one
 Monitor started again

As you can see, the watchdog first identifies a spare node and upgrades it. Then, it starts the upgraded node, and then continues stopping, upgrading and starting the rest of the nodes.

In the case above, node1 will be the new spare node, because node2, the former spare, is now up and running and it makes no sense to stop it again just after being upgraded.

The process can be summarized as follows (a short sketch of the order follows the list):

  • node0 and node1 are running.
  • node2 binaries are updated.
  • node2 is started. At this point 3 nodes are running instead of just 2.
  • Immediately after, node0 is stopped. Only 2 nodes are up now.
  • Then node0 is upgraded and restarted.
  • Finally, node1 is stopped, upgraded, but not restarted since 2 nodes are already running.
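
A minimal sketch of that rolling order, assuming hypothetical UnpackPackage/StartNode/StopNode helpers that stand in for the real watchdog actions:

  using System.Collections.Generic;

  class UpgradeSketch
  {
      static void Upgrade(List<string> runningNodes, string spareNode, string package)
      {
          // 1) Upgrade and start the spare node first, so capacity never drops.
          UnpackPackage(spareNode, package);
          StartNode(spareNode);

          // 2) Then cycle the nodes that were already running: stop, upgrade,
          //    start. The last one stays stopped and becomes the new spare.
          for (int i = 0; i < runningNodes.Count; i++)
          {
              string node = runningNodes[i];
              StopNode(node);
              UnpackPackage(node, package);

              bool isNewSpare = (i == runningNodes.Count - 1);
              if (!isNewSpare)
                  StartNode(node);
          }
      }

      // Hypothetical helpers standing in for the real watchdog actions.
      static void UnpackPackage(string node, string package) { }
      static void StartNode(string node) { }
      static void StopNode(string node) { }
  }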

Conclusion and availability

If you follow our release notes, you probably noticed we didn't release that many big features over the last 3 months. The reason was that we were busy developing the multi-instance server.

The new server is the perfect solution for Enterprise Edition setups that require zero downtime. Zero downtime means the server becomes fault tolerant, but also that zero-downtime upgrades are now available. You can roll out new versions of Plastic without suspending the service.

This new version is available right now, so contact us if you think it will benefit your team.

The regular server installers available at plasticscm.com still don't include the watchdog or the configuration required to run in multi-instance, so we'll be providing companies who contact us with detailed instructions.

Eventually, we'll make the multi-instance setup part of the default Enterprise installation, and configuring it will be just a matter of enabling a few flags. At the time of this writing, though, while the changes are available on the official Enterprise Edition, the actual setup is not yet included.

Pablo Santos
I'm the CTO and Founder at Códice.
I've been leading Plastic SCM since 2005. My passion is helping teams work better through version control.
I had the opportunity to see teams from many different industries at work while I helped them improve their version control practices.
I really enjoy teaching (I've been a University professor for 6+ years) and sharing my experience in talks and articles.
And I love simple code. You can reach me at @psluaces.
