Reading Server Graphs: Connected Users

I’ve spent the last several years working on multi-user server systems in two different companies. Both those companies had a giant monitor hanging off a wall showing a graph of connected users. It won’t give you detailed diagnostic information, but it is a good indicator for the health of your servers, and your product generally. If you learn to notice certain patterns in your user graph, it can also save you precious time when things go wrong.

Assumptions

The patterns I’m describing assume a common architecture: you have some boxes in the front with your traffic balanced between them, and some stuff behind them that’s not balanced. For the purposes of this discussion, it doesn’t matter whether you have one app or several in the front, whether there’s a service tier, or what kind of data storage you use. Anything that’s redundant is going to be called distributed, and anything that’s not is going to be called central. You need some way to track user sessions, and the ability to detect disconnects within a few minutes. Some of these graphs also depend on a load balancer that’s configured to keep a single user session on the same distributed server.
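To be concrete about what “connected users” means here, this is a rough sketch (in Python, with made-up names) of how the number can be derived from session heartbeats: a session counts as connected until its last heartbeat is older than some timeout, which is what gives you disconnect detection within a few minutes.

from datetime import datetime, timedelta

# Hypothetical sketch: a session counts as connected until its last heartbeat
# is older than the timeout.
HEARTBEAT_TIMEOUT = timedelta(minutes=3)

def connected_users(last_heartbeats, now=None):
    """Count sessions whose most recent heartbeat is still fresh.

    last_heartbeats maps a session id to the datetime of its last heartbeat.
    """
    now = now or datetime.utcnow()
    return sum(1 for seen in last_heartbeats.values()
               if now - seen <= HEARTBEAT_TIMEOUT)

Sampled once a minute or so, that count is the line on the wall monitor.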

Healthy

This is how a healthy system should look. I’m showing two views: one broken into a couple of regions, and another showing just the total over a shorter time span. It’ll be a lot easier to show these patterns on the zoomed-in graph, so I’ll use that as a baseline for the following examples.

I was surprised when I first saw the smooth wave-like pattern a connected user graph makes. These examples use a pure sine wave because it was easy to produce, but it’s pretty close to what I’ve seen on real systems. The waves might get a little higher and wider on weekends, but it’s always a smooth line when things are normal.
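For the curious, the synthetic healthy line is nothing fancier than the sketch below; the baseline, amplitude, and sampling interval are made up, like everything else in these examples.

import math

# A made-up daily cycle of connected users, sampled once a minute.
BASELINE = 50000    # average connected users
AMPLITUDE = 30000   # swing between the daily peak and trough
MINUTES_PER_DAY = 24 * 60

def healthy_users(minute):
    phase = 2 * math.pi * (minute % MINUTES_PER_DAY) / MINUTES_PER_DAY
    return BASELINE + AMPLITUDE * math.sin(phase)

week = [healthy_users(m) for m in range(7 * MINUTES_PER_DAY)]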

The numbers and data shown are totally fabricated, and do not represent any of the systems I’ve worked on. My focus here is on the disruptions to the lines. I have observed all of the patterns I’m showing in real production environments, some of them numerous times.

Central Component Malfunction / Failure

This is the worst case for a server system, and as you can see, the results are drastic. You can tell it’s a central component because the number of connected users drops very close to zero. You’ll also note that I show the connected users shooting back above the norm. This happens because users try to reconnect, once or several times, while the system is unresponsive. It’s a pattern you will notice during most malfunctions.

Distributed Component Malfunction

This is a much more common occurrence in a server system: a server starts to malfunction without losing the ability to respond to network traffic. The load balancer doesn’t detect a failure, but users have serious trouble using the app. You will see a noticeable fluctuation in the graph as users disconnect and reconnect, slowly getting pushed to working servers. This is one of the reasons it’s important to have sticky sessions on your load balancer.

Distributed Server Failure

When one of your distributed servers fails outright, you see a much more sudden gouge, its depth roughly one server’s share of your connected users. The graph returns to normal fairly quickly once the load balancer detects the bad server and users finish logging back in.

Application Overloaded

Performance limits can be distinguished from the failures above because they get worse as the number of connected users increases. This example shows a hard wall, but the severity you observe will depend on how your system is breaking down. The key indicator is the subtle twitching that gets progressively worse as the pressure builds. The deep downward spikes occur as various parts of the system start throwing large quantities of errors.

Central Component Performance

This is what it looks like when a central component starts to have a performance issue. When it occurs off-peak, it’s a good clue that some critical system is acting up. If it’s not obvious what’s wrong, here are a few things you can check: failed hard drives in your storage system, hardware errors in your system logs, unusual latency on a heavily-used API, or perhaps someone running an ad-hoc query on the production database.

Denial-of-Service Attack

Denial-of-service attacks are awful and, unfortunately, effective. They look different from network gear failures because attackers have trouble ramping up load generators quickly. Once they do get going, though, your networking gear will usually start failing, and nothing will get through until they stop. Most DoS attacks are network-level, so you shouldn’t see increased activity or connections before or during the attack.

Television Ad

Advertising should increase your number of connected users. If your ad hits a lot of people at the same time, as a TV ad does, you’ll see a bump like this. There will be a spike just as the ad airs, a bit of hang time, and then a trickle back down to normal. The size of the bump will depend on the effectiveness and reach of your ad.

Television Event

This is one of my favourite patterns. It’s what happens when there’s a big event that your audience is interested in; an example would be a sports site during the Super Bowl. You see a dip while it airs, then the graph goes back to normal when it ends. If people don’t like the event, the line might start returning to normal sooner.

Monitoring Failure

A flat line like this is almost never real, except maybe during deliberate maintenance windows. If it’s not obvious why you’re flat, you should check that your monitoring and graphing systems are collecting data correctly.
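A cheap safeguard, sketched below with hypothetical names, is a dead-man check on the metrics feed itself: if the newest sample in the connected-users series is too old, alert on the monitoring pipeline instead of trusting the flat line.

from datetime import datetime, timedelta

# Hypothetical dead-man check: a stale metrics feed means the flat line is a
# monitoring problem, not a product problem.
MAX_STALENESS = timedelta(minutes=5)

def metrics_are_fresh(latest_sample_time, now=None):
    now = now or datetime.utcnow()
    return now - latest_sample_time <= MAX_STALENESS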

InstallUtil and BadImageFormatException – Facepalm

I had a frustrating issue at work this week: one that was easy to fix, but embarrassingly difficult to find. I came pretty close to giving up, which is not a solution I often explore, but in the end we figured it out and got everything working.

A member of our operations team was installing a Windows service I’d built to monitor some stuff in our production environment. I’ve made a few Windows services in my day, and installed them many times on many machines. I’d even installed this one on my development machine with no issue. In our staging environment, however, this is what we got:

C:\Install\TheService>C:\Windows\Microsoft.NET\Framework64\v4.0.30319\InstallUtil.exe TheService.exe
Microsoft (R) .NET Framework Installation utility Version 4.0.30319.1
Copyright (c) Microsoft Corporation. All rights reserved.

Exception occurred while initializing the installation:
System.BadImageFormatException: Could not load file or assembly 'file:///C:\Monitoring\Service\TheService.exe' or one of its dependencies. An attempt was made to load a program with an incorrect format.

We checked the likely things: the framework version, the platform the app was built for, even re-copying the files in case they somehow got corrupted. When these didn’t work, we started trying more radical things: forcing all assemblies to 32 bit, even running the service as an executable to see if there was some error in the app.

In my defence, we are both experienced engineers, and I’m not the only person who missed it. Look closely at the command line we used:

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\InstallUtil.exe

Long version: Service applications in Visual Studio 2010 are 32 bit by default, and that’s a reasonable default for them to have. We were trying to install the 32 bit service with the 64 bit version of InstallUtil. InstallUtil loads the target assembly to read its installation instructions, but you can’t load a 32 bit assembly from a 64 bit process (or vice versa). If you try, you get a BadImageFormatException.
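The fix follows from that: run the assembly through the InstallUtil whose bitness matches it. For a 32 bit service on a 64 bit machine, that’s the copy under Framework rather than Framework64:

C:\Windows\Microsoft.NET\Framework\v4.0.30319\InstallUtil.exe TheService.exe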

Short version: Two numbers derailed my entire afternoon.

It would have been nice if the error message from InstallUtil had been a little more specific, but I suppose this isn’t a common problem. At least I got a good reminder about the importance of checking the small details when the big ones aren’t bearing fruit.

Working Together and Having Fun

We did one of our monthly releases at work this week. Releases can be stressful and frustrating, and take a lot of methodical preparation to get right. It can be thankless work too; the only time a user notices a release is when it goes badly. We do our releases early on a weekday to minimize impact, so if anything does go wrong, there aren’t many bodies around to help out. It’s not much fun, but it’s important work that needs to be done.

One thing that makes the experience considerably more enjoyable for me is the team of coworkers that come in to help. There’s a handful of us, each with our own responsibilities. I deploy the applications while someone else monitors the database and another person tests the system to make sure it’s running normally. We back each other up, help out where we can, and make decisions together when they need to be made quickly.

Pressure is never a desirable thing in a work environment, but one benefit is that it quickly builds trust among the people facing it together. The people I work with are really fantastic: smart, dedicated, and fun. We come from different cultures, like different kinds of food, and have different hobbies and different tastes in music, but we still find things to talk about, and reasons to laugh and smile.

I’ve been trying out ways to make releases more fun. We had a potluck breakfast once, and a release soundtrack made from team favourites another time. This time I made breakfast burritos and, as a joke, a doughnut salad topped with an espresso sauce. We played a few songs during the waiting periods, and had as much fun as anyone could that early in the morning.

A plate of chopped doughnuts topped with an espresso jelly.

Thanks to the people involved, this release went well despite a few hiccups.