Top ten metrics you should be monitoring for your server infrastructure

You can only know what you observe. So it’s important to make sure you’re looking in the right place. If you’re not actively looking at your network’s latency, for example, you won’t know if your players are having lag problems. (Aside from the complaints, of course.)

Over the years, we’ve settled on ten metrics that you should monitor if you want your multiplayer game to run without problems. Here they are.

1. CPU usage

This is how much of the processor’s capacity the server is using, and you’ll want to measure it for each server. A high percentage isn’t necessarily a problem. In fact, you don’t want to waste compute power.

But high usage can warn you that a server is close to being overloaded, or point to inefficiencies in your processes. For example, if you get a sudden spike in usage, it might be worth optimizing your code.
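Here’s a rough sketch of how that percentage can be derived on Linux, where the kernel exposes cumulative CPU time counters in /proc/stat. Usage is the share of non-idle time between two samples. The function names and the one-second interval are our own assumptions, not a fixed recipe.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice]
    # Guest time is already included in user/nice, so only the first eight count.
    values = [int(v) for v in line.split()[1:9]]
    idle = values[3] + values[4]  # idle + iowait
    total = sum(values)
    return total - idle, total


def cpu_percent(prev, curr):
    """CPU usage (0-100) between two (busy, total) samples."""
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * busy_delta / total_delta


def sample_cpu_percent(interval=1.0):
    """Linux only: read /proc/stat twice and return usage over the interval."""
    def read():
        with open("/proc/stat") as f:
            return parse_cpu_line(f.readline())

    first = read()
    time.sleep(interval)
    return cpu_percent(first, read())
```

Keeping the parsing separate from the sampling means you can feed the same calculation from whatever agent already collects the counters.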

2. Memory usage

This is how much of your RAM you’re using. Exactly how much RAM you’ll need will depend on the size of your game. But as a general rule of thumb, you probably need about 2 GB for the main image.

Monitoring your memory usage isn’t just about making sure the server runs smoothly. It also helps you spot any memory leaks. For example, you might notice that the memory usage slowly creeps upward over time. This is a sign that you’ve got a memory leak somewhere in your code. And, eventually, the server is going to run out of memory.
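One way to turn “slowly creeps upward” into a number is to fit a straight line through your memory samples and look at the slope. This is a minimal sketch, assuming you already collect (timestamp in seconds, bytes used) pairs; the function names and the threshold are hypothetical and you’d tune them for your own game.

```python
def growth_per_hour(samples):
    """Least-squares slope of (timestamp_seconds, used_bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return 0.0  # all samples share one timestamp, so there is no trend to fit
    return num / den * 3600


def looks_like_leak(samples, threshold_bytes_per_hour):
    """True when memory grows faster than the (assumed) acceptable rate."""
    return growth_per_hour(samples) > threshold_bytes_per_hour
```

A steady upward slope isn’t proof of a leak on its own, since caches also grow while they warm up. Treat it as a prompt to investigate rather than a verdict.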

3. Disk input/output

This is how much you’re reading from and writing to storage. While it’s important to know how quickly you’re transferring data, you should be more interested in the percentage of the disk’s bandwidth that you’re using.

If your disk I/O percentage is particularly high, you’re likely to run into a bottleneck fairly quickly. It doesn’t matter how fast your players can connect: if the server can’t read or write to the disk quickly enough, the whole server is going to slow down. At that point, you might need to consider splitting the load.
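As a sketch of what that percentage can look like in practice: on Linux, each line of /proc/diskstats includes the total milliseconds the device has spent doing I/O. The change in that counter over an interval gives you a “busy” percentage. The helper names below are our own, and we’re assuming you sample the same device twice.

```python
def parse_io_ms(diskstats_line):
    """Return (device_name, ms_spent_doing_io) from one /proc/diskstats line."""
    parts = diskstats_line.split()
    # parts[0:3] are major, minor and device name; parts[12] is time spent doing I/O (ms).
    return parts[2], int(parts[12])


def disk_busy_percent(io_ms_prev, io_ms_curr, interval_seconds):
    """Share of the interval (0-100) during which the device had I/O in flight."""
    if interval_seconds <= 0:
        return 0.0
    busy = 100.0 * (io_ms_curr - io_ms_prev) / (interval_seconds * 1000.0)
    return min(busy, 100.0)
```

Bear in mind this measures busy time, not remaining capacity. Devices that serve requests in parallel, such as SSDs, can report close to 100% and still have headroom, so it’s worth reading alongside your throughput numbers.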

4. Disk space usage

This is how much free disk space you have. Not all of your data will necessarily be on the host machine. You could be storing user data in a separate database with another provider, for example.

But you’ll at least be storing some of that data locally, such as the logs for your current match. And running out of disk space will cause everything to grind to a halt. So how often do you move that data to a more central location?
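To put a number on “how often”, you can compare the free space with how quickly you’re filling it. Here’s a small sketch using Python’s standard shutil.disk_usage; the write rate is something you’d have to measure yourself, and the function names are ours.

```python
import shutil


def free_percent(path="/"):
    """Percentage of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def hours_until_full(free_bytes, bytes_written_per_hour):
    """Rough time left before the disk fills, at a constant (assumed) write rate."""
    if bytes_written_per_hour <= 0:
        return float("inf")
    return free_bytes / bytes_written_per_hour
```

If the hours left are fewer than the gap between your log transfers, you either need to move data more often or keep less of it locally.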

5. Network throughput

This is about tracking how much data is actually moving through your network. How much bandwidth are you using? The more data you need to transfer, the more likely you’ll end up with a bottleneck.

The biggest culprit here is sending irrelevant data back and forth between the player and server. You want to optimize your data packets as much as possible, so that you’re only sending essential data. For example, if you’re running a multiplayer FPS, do you really need to send the player’s current health with each movement?
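To make the health example concrete, here’s a sketch that compares two made-up packet layouts. The formats, tick rate and player count are all hypothetical, and the sizes cover the payload only, not the network headers.

```python
import struct

# Hypothetical wire formats, little-endian with no padding:
# player id (uint32), then x, y, z position (float32 each), then optionally health (uint8).
MOVE_ONLY = struct.Struct("<Ifff")
MOVE_WITH_HEALTH = struct.Struct("<IfffB")


def bytes_per_second(packet_size, updates_per_second, players):
    """Payload bytes per second for one update stream per player."""
    return packet_size * updates_per_second * players
```

With these layouts, the movement packet is 16 bytes and the version with health is 17. One byte sounds trivial, but it’s roughly 6% of every movement update, and the same thinking applies to every other field you send by habit.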

6. Network latency

Latency is the amount of time it takes for data to travel from the player to the server and back. If you have high latency, it doesn’t matter how much bandwidth you have – your players are going to experience lag.

This is quite difficult to monitor, as it’ll depend entirely on each player’s route to your server. You can’t change the speed of light – so if a player is on the other side of the world from the server, they’re going to have higher latency. Instead, you want to make sure that you’re hosting the match as close as possible to the players.
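If you measure each player’s round-trip time to a few candidate locations before the match starts, picking the host becomes a simple comparison. This is only a sketch: the region names are invented, and using the median is our assumption. You might prefer to minimize the worst player’s latency instead.

```python
from statistics import median


def best_region(rtt_ms_by_region):
    """Return the region whose median round-trip time across players is lowest.

    `rtt_ms_by_region` maps a region name to a list of per-player RTTs in milliseconds.
    """
    return min(rtt_ms_by_region, key=lambda region: median(rtt_ms_by_region[region]))
```

For example, best_region({"eu-west": [20, 35, 30], "us-east": [90, 110, 95]}) picks "eu-west".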

7. Error rates

You want to keep track of how many errors you get on the server and what types of errors they are. The more you can categorize them, the easier it’ll be to spot patterns.

These will help you spot any potential bugs in your game image, but can also point you towards misconfigurations on the server itself. Or it might highlight a potential resource problem.
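Here’s a minimal sketch of that categorizing, assuming a hypothetical log format where error lines look like "ERROR <category> <message>". Your own logs will differ, so treat the parsing as a placeholder.

```python
from collections import Counter


def categorize(line):
    """Return the error category of a log line, or None if it is not an error."""
    parts = line.split(maxsplit=2)
    if len(parts) >= 2 and parts[0] == "ERROR":
        return parts[1]
    return None


def error_counts(lines):
    """Count errors per category."""
    return Counter(cat for cat in map(categorize, lines) if cat is not None)


def error_rate(lines):
    """Fraction of log lines that are errors (0.0 when there are no lines)."""
    lines = list(lines)
    if not lines:
        return 0.0
    errors = sum(1 for line in lines if categorize(line) is not None)
    return errors / len(lines)
```

Tracking the counts per category, rather than one overall total, is what lets you tell a networking problem apart from a misconfiguration.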

8. Number of open files

Keep an eye on how many open file descriptors you have in the operating system. You should know how many files are necessary to keep your game running.

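Monitoring this number will help you spot any file descriptor leaks and make sure that the server doesn’t hit the operating system’s limit. On Linux, network sockets count toward that limit too, so a leak can eventually stop the server from accepting new connections. You want to know about these early; otherwise, you could end up with the server crashing.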
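As a sketch of how you might check this from inside a process on Linux: count the entries in /proc/self/fd and compare them with the soft limit from Python’s resource module. The function names are our own, and the /proc path won’t exist on other operating systems.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Linux only: number of open file descriptors, counted from /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage_percent(open_count, soft_limit):
    """Open descriptors as a percentage of the soft limit."""
    if soft_limit <= 0:
        return 0.0  # guard against a missing or unlimited limit
    return 100.0 * open_count / soft_limit


def current_fd_usage_percent():
    """Usage for the current process against its RLIMIT_NOFILE soft limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return fd_usage_percent(open_fd_count(), soft)
```

A percentage that keeps climbing while the player count stays flat is the pattern to watch for.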

9. System load

This is roughly how many processes are running on, or waiting for, the CPU. Think of it like the queue for the processor. There will always be some processes waiting – that’s perfectly normal. But you want to see how big that queue is and how long it takes to clear.

If your system load stays higher than the number of processor cores you have, it probably means that you don’t have enough CPU to handle running your game. At that point, you’ll need to consider adding more processor cores.
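Load only means something relative to the number of cores, so it helps to normalize it. Here’s a small sketch using os.getloadavg and os.cpu_count from Python’s standard library (both work on Unix-like systems); the function names are ours.

```python
import os


def load_per_core(load_average, cores):
    """Load average divided by core count; above 1.0 means work is queuing."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def current_load_per_core():
    """One-minute load average per core for this machine (Unix-like systems)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)
```

A load of 6 is comfortable on a 16-core server (about 0.4 per core), but on a 4-core server it means processes are waiting.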

10. Total number of containers running

You should always have a clear idea of how many containers you’re running on each server. Each container will likely need dedicated processor cores, so you’re going to run into a hard limit.

Keeping track of how many containers you have running at any given time will make it easier for you to decide how much of your other resources you can allocate to each container. Likewise, you don’t want to spin up a new container, only to find you don’t have enough RAM to handle it.
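That check is simple enough to sketch out. The numbers and names below are hypothetical, and a real scheduler would track more than cores and RAM, but the principle is the same: know what’s already committed before you start something new.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cores: int
    ram_gb: float


def can_fit(server, running, new):
    """True if `new` fits on `server` alongside the containers already running."""
    used_cores = sum(c.cores for c in running)
    used_ram = sum(c.ram_gb for c in running)
    return (used_cores + new.cores <= server.cores
            and used_ram + new.ram_gb <= server.ram_gb)


def max_containers(server, per_container):
    """Hard limit on identical containers, set by whichever resource runs out first."""
    return min(server.cores // per_container.cores,
               int(server.ram_gb // per_container.ram_gb))
```

On a 16-core, 32 GB server with containers that each need 2 cores and 6 GB, the cores would allow eight containers but the RAM only allows five.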

Set up alerts for them all

With each of these metrics, you’ll want to determine where your limits are. One way you could do this is by setting up zones:

Green zone. This is where you expect the metric to be. If it’s here, everything is running well.
Orange zone. It’s okay to enter this zone, but you want an alert to fire and the event to be logged. If you’re entering the orange zone a lot, it could be an early warning sign that something isn’t working correctly.
Red zone. This is when you need to act to fix something. Set up an automated alert so you can take action.

Make sure your red zone triggers early enough that you can still solve the problem. It’s not the disaster scenario itself; it’s the warning that one is on its way.

It’s just as important the other way

You might also consider a blue zone, before the green. If you’re in the blue zone a lot, you probably need to scale down. It’s a sign that you’re wasting resources.
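Put together, the four zones boil down to three thresholds per metric. Here’s a minimal sketch, assuming a metric where higher is worse (such as CPU percentage). The threshold values in the example are made up, so pick your own for each metric.

```python
def zone(value, blue_below, orange_from, red_from):
    """Classify a metric value into the blue, green, orange or red zone."""
    if not blue_below <= orange_from <= red_from:
        raise ValueError("thresholds must be in ascending order")
    if value >= red_from:
        return "red"
    if value >= orange_from:
        return "orange"
    if value < blue_below:
        return "blue"
    return "green"
```

With zone(value, 20, 70, 85), a CPU reading of 10 is blue, 50 is green, 75 is orange and 90 is red.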

Or come to us to handle it all

We can orchestrate your multiplayer infrastructure for you on our own network or help you set up your infrastructure so you can easily monitor it yourself. Either way, we can help you make sure that everything is running smoothly. And that you’re in that sweet spot where you have enough leeway, but not too much. Get in touch and let’s chat.