Monitoring a Plastic server performance

Wednesday, June 14, 2017 0 Comments

In this blog post we will explain how you can easily analyze your Plastic SCM Server logs to get very useful information of our server behavior and identify potential issues.

The Plastic SCM installation provides a tool (plasticlogstats.exe) that is able to parse your server logs and generate detailed reports.

Plasticlogstats

plasticlogstats is a tool included with Plastic SCM (you can find it inside the Plastic SCM server directory from a long time ago) but we have recently included a new feature that lets you analyze the ChannelCall logs to find hot spots, slow calls and more.

It is a text based GUI, fully interactive, that basically replaces the old analysis we used to manually do processing the data logs and then importing them to Excel files (the new analysis via plasticlogstats requires less RAM memory and it's faster).

Main features and how to run the log analyzer

You just need to run this command:

plasticlogstats.exe analyzer --file= plastic.channelcall.log.txt

The default server log configuration stores the ChannelCall logs in your server folder with the following file name pattern: plastic.channelcall.log.txt.xxxxxxxx

If you want to analyze the data from the last days, you'll need to unify your ChannelCall logs in a single file so you can process and analyze all of them at the same time:

type plastic.channelcall* >> channelcall.txt

Some of the main features of the log analyzer are:

  • Group the Plastic SCM server calls by day/hour/method.
  • Review the different methods inside each group (and then reorganize them in days and hours).
  • Monitor parameters like "average milliseconds per call", which give you a very good understanding of how fast your Plastic SCM server is responding.
  • It also calculates the percentage of fast (<500ms), slow (500-1000ms) and super-slow (>1000ms) calls.

Analyzing a real scenario

Let's analyze the logs of real Plastic SCM server scenario. We will check how the server is responding and we will try to identify potential performance issues.

As we commented, we just need to run the analyzer command and we get the following output:

The initial view shows the server information per day. This information is useful to check the general server behavior, to determine the week days with more/less load, to identify patterns...

The main parameters to monitor are:

  • load - This number represents the server load. It is calculated as: proc(sec) / 3600. We could consider load <1 a good symptom.
  • calls - Number of calls the server is processing.
  • fast (<500ms), slow (500-1000ms) and sslow (>1000ms) calls: The fastest the better :). Fast calls should be always >90%.

Debugging our example, we can see the load is between 0.19 and 0.32 during the working days but pretty lower during the weekend (it makes sense). The calls number depends on the day but it's always over 120K and >98% of the calls are fast. With an average ms/call of X we can conclude that the server behavior looks pretty good.

At the bottom of the GUI, we can see the available keys to navigate between the different views. If we select a specific day (Friday) and clicking on "ENTER", we can get the day information per hour. This way, we can easily debug more specific issues. If we have an estimated time when the issue happened, we can now check the server behavior at this time.

If the slow/sslow values increase a lot in a short period of time, it's an indicator that something wrong happened. After identifying a performance issue and the time it happened, we can review the plastic.debug.log.txt log of this period of time to get extended information.

There could be many reasons to explain a server performance problem but we normally start reviewing the principal machine parameters (RAM, CPU, disk...).

In our example, we can see the server load grows to ~1 (around 7 am) but even when the load is higher the fast calls are >98%.

It's also interesting to check the information per method (clicking on "M") and also then sort the methods per the slow column (clicking on "S"). This way, we easily see the slowest operations and identify potential bottlenecks. Some methods make sense to be slower so we need to be cautious when analyzing this information. Also as the number of calls gets lower, the slow information gets more irrelevant.

At the bottom of the GUI we can see all the sorting options so we can always reorganize the data based on our needs:

Debugging a server issue

Let me explain now a real scenario where we were able to detect a server issue with the information provided by plasticlogstats.

In the following screenshot, we can see the server behavior a specific day (14-11-2016). Reviewing the data, is pretty easy to identify very high values for load and ms/call at 2:00 (even when most of the calls are fast).

Clicking on "ENTER" to analyze the data per method at this specific time, we can also easily identify that there was one single GetChangesetTree call that seemed to be running forever. This method call was spoiling the data because the rest of the calls are very fast:

After detecting this problem via plasticlogstats, and helping us with the server debug log file, we were able to determine the reason why this call was never ending and we implemented a solution to avoid it to happen again in the future :) .

Summary

We have learnt how to easily monitor your Plastic SCM server via plasticlogstats and get useful information like:

  • Days/hours with more server load.
  • Verify that your server is responding as fast as we expect.
  • Identify when a server performance issue happened so we can focus in this period of time to debug a problem.

0 comentarios: