Reading Server Graphs: Connected Users

I’ve spent the last several years working on multi-user server systems at two different companies. Both of those companies had a giant monitor hanging on a wall showing a graph of connected users. It won’t give you detailed diagnostic information, but it is a good indicator of the health of your servers, and of your product generally. If you learn to notice certain patterns in your user graph, it can also save you precious time when things go wrong.

Assumptions

The patterns I’m describing assume a common architecture: you have some boxes in the front that have your traffic balanced between them, and some stuff behind them that’s not balanced. For the purposes of this discussion, it doesn’t matter if it’s one or more apps in the front, if you have a service tier, or what kind of data storage you use. Anything that’s redundant is going to be called distributed, and anything that’s not is going to be called central. You need some way to track user sessions, and the ability to detect disconnects within a few minutes. Some of these graphs also depend on a load balancer that’s configured to keep a single user session on the same distributed server.
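How you count connected users doesn’t matter much, but to make it concrete: if your sessions end up in a database with some kind of heartbeat, a query roughly like this one is enough to feed the graph. The table and column names here are invented for this post, not taken from any real system.

-- Hypothetical sketch: count sessions with a heartbeat in the last five minutes.
-- The Sessions table and its columns are assumptions made for this example.
SELECT
    ServerName,
    COUNT(*) AS ConnectedUsers
FROM dbo.Sessions
WHERE LastHeartbeatUtc > DATEADD(MINUTE, -5, GETUTCDATE())
GROUP BY ServerName;

Counting per server as well as in total makes it easier to tell distributed problems from central ones later on.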

Healthy

This is how a healthy system should look. I’m showing two views: one broken into a couple of regions, and another showing just the total over a shorter time span. It’ll be a lot easier to show these patterns on the zoomed-in graph, so I’ll use that as a baseline for the following examples.

I was surprised when I first saw the smooth wave-like pattern a connected user graph makes. These examples use a pure sine wave because it was easy to produce, but it’s pretty close to what I’ve seen on real systems. The waves might get a little higher and wider on weekends, but it’s always a smooth line when things are normal.

The numbers and data shown are totally fabricated, and do not represent any of the systems I’ve worked on. My focus here is on the disruptions to the lines. I have observed all of the patterns I’m showing in real production environments, some of them numerous times.

Central Component Malfunction / Failure

This is the worst case for a server system, and as you can see, the results are drastic. You can tell it’s a central component because the number of connected users drops very close to zero. You’ll also note that I show the connected users shooting back above the norm. This happens because users try to reconnect, often more than once, when the system becomes unresponsive. It’s a pattern you will notice during most malfunctions.

Distributed Component Malfunction

This is a much more common occurrence in a server system: a server starts to malfunction without losing the ability to respond to network traffic. The load balancer doesn’t detect a failure, but users have serious trouble using the app. You will see serious fluctuation in the graph as users disconnect and reconnect, slowly getting pushed to working servers. This is one of the reasons it’s important to have sticky sessions on your load balancer.

Distributed Server Failure

When one of your distributed servers fails outright, you see a much more sudden gouge, proportional to that server’s share of your users. The graph returns to normal fairly quickly once the load balancer detects the bad server and users finish logging back in.

Application Overloaded

Performance limits can be distinguished because they get worse as the number of connected users increases. This example shows a hard wall, but the severity you observe will depend on how your system is breaking down. The key indicator is the subtle twitching that gets progressively worse as the pressure builds. The deep downward spikes occur as various parts of the system start throwing large quantities of errors.

Central Component Performance

This is what it looks like when a central component starts to have a performance issue. When it occurs off peak, it’s a good clue that some critical system is acting up. If it’s not obvious what’s wrong, here are a few things you can check: failed hard drives in your storage system, hardware errors in your system logs, unusual latency in a heavily-used API, or perhaps someone running an ad-hoc query on the production database.

Denial-of-Service Attack

Denial-of-service attacks are awful and, unfortunately, effective. They look different from network gear failures because attackers have trouble ramping up load generators quickly. Once they do get going, though, your networking gear will usually start failing, and nothing will get through until they stop. Most DoS attacks are network-level, so you shouldn’t see increased activity or connections before or during the attack.

Television Ad

Advertising should increase your number of connected users. If your ad hits a lot of people at the same time, as a TV ad does, you’ll see a bump like this. There will be a spike just as the ad airs, a bit of hang time, then a trickle back down to normal. The size of the bump will depend on the effectiveness and reach of your ad.

Television Event

This is one of my favourite patterns. It’s what happens when there is a big event that your audience is interested in; an example would be a sports site during the Super Bowl. You see a dip while it airs, then you go back to normal when it ends. If people don’t like the event, the line might start returning to normal sooner.

Monitoring Failure

A flat line like this is almost never real, except maybe during deliberate maintenance windows. If it’s not obvious why you’re flat, you should check that your monitoring and graphing systems are collecting data correctly.

InstallUtil and BadImageFormatException – Facepalm

I had a frustrating issue at work this week: one that was easy to fix, but embarrassingly difficult to find. I came pretty close to giving up, which is not a solution I often explore, but in the end we figured it out and got everything working.

A member of our operations team was installing a Windows service I’d built to monitor some stuff in our production environment. I’ve made a few Windows services in my day, and installed them many times on many machines. I’d even installed this one on my development machine with no issue. In our staging environment, however, this is what we got:

C:\Install\TheService>C:\Windows\Microsoft.NET\Framework64\v4.0.30319\InstallUtil.exe TheService.exe
Microsoft (R) .NET Framework Installation utility Version 4.0.30319.1
Copyright (c) Microsoft Corporation. All rights reserved.

Exception occurred while initializing the installation:
System.BadImageFormatException: Could not load file or assembly 'file:///C:\Monitoring\Service\TheService.exe' or one of its dependencies. An attempt was made to load a program with an incorrect format.

We checked the likely things: the framework version, the platform the app was built for, even re-copying the files in case they had somehow been corrupted. When those didn’t work, we started trying more radical things: forcing all assemblies to 32-bit, even running the service as an executable to see if there was some error in the app.

In my defence, we are both experienced engineers, and I’m not the only person who missed it. Look closely at the command line we used:

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\InstallUtil.exe

Long version: Service applications in Visual Studio 2010 are 32-bit by default, and this is a reasonable default for them to have. We were trying to install the 32-bit service with the 64-bit version of InstallUtil. InstallUtil loads the target assembly to access its installation instructions, but you can’t load a 32-bit assembly from a 64-bit process (or vice versa). If you try, you get a BadImageFormatException.
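The fix is to run the InstallUtil that matches the assembly: the 32-bit copy lives under Framework rather than Framework64.

C:\Install\TheService>C:\Windows\Microsoft.NET\Framework\v4.0.30319\InstallUtil.exe TheService.exe

If you have the Windows SDK installed, running corflags against the assembly will also tell you whether it was built 32-bit, which would have pointed us at the mismatch a lot sooner.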

Short version: Two numbers derailed my entire afternoon.

It would have been nice if the error message from InstallUtil had been a little more specific, but I suppose this isn’t a common problem. At least I got a good reminder about the importance of checking the small details when the big ones aren’t bearing fruit.

Doubling Data for Performance Testing

Or: The Most Impressive T-SQL Script I’ve Ever Written

I was recently working on a new application. After three months in the field, users were starting to complain about performance issues. We had done some limited performance tuning for the first release, and more as part of the second release, but new issues were popping up as more data got entered into the system. We could have continued fixing issues as they came up, one release at a time, but we wanted to get ahead of the problem, and the client wanted to know that the system would remain usable without developer intervention for a few years at least.

The nature of the business was such that the rate of data entry and the number of concurrent users shouldn’t change much over time. This meant that increasing the amount of data would be a good approximation of how the system would look in the future. How do you increase the amount of data in a normalized relational database? It would be difficult to enter realistic test data manually, and unrealistic test data can cause unrealistic test results. Bad results mean we waste time fixing issues that would never appear, and never get a chance to detect issues that will.

I proposed creating a T-SQL script that would double all the existing data. Running it once would simulate six months of usage, running it again would simulate a year, and twice more would put us at four years. Once it was approved, it took me a couple of days to finish, with about half of that going to testing and debugging. The result is the most impressive T-SQL script I’ve ever written.

The script I wrote didn’t do the actual duplication; it combined hand-entered metadata with data from sys.tables and sys.columns to generate a much bigger script that did. Not only was it faster to write this way, it also made issues easier to fix, and it allowed the script to be reused after columns were added to the database.
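To give a flavour of the generation step, here is a sketch with invented names, not the real script: the catalog views give you everything needed to build a column list per table, which the generator then drops into an INSERT statement.

-- Illustrative sketch only: build a comma-separated column list for one table
-- from the catalog views. The table name is invented for this example.
DECLARE @TableName sysname = N'Orders';
DECLARE @ColumnList nvarchar(max);

SELECT @ColumnList = STUFF((
    SELECT N', ' + QUOTENAME(c.name)
    FROM sys.columns c
    JOIN sys.tables t ON t.object_id = c.object_id
    WHERE t.name = @TableName
    ORDER BY c.column_id
    FOR XML PATH('')), 1, 2, N'');

PRINT N'INSERT INTO dbo.' + QUOTENAME(@TableName) + N' (' + @ColumnList + N')';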

I had one table variable with a list of tables to copy, and another defining the relationships between foreign keys and their source tables. Most foreign keys were named the same in every table they appeared in, so a single mapping was often enough for all of them. Most of the mappings could be derived from the list of tables itself, so only a few extra relationships and special column rules had to be entered manually.
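The metadata itself was nothing fancy. Something in this spirit, with table and column names invented for illustration:

-- Rough shape of the driving metadata; all names are invented for this example.
DECLARE @TablesToCopy TABLE
(
    TableName sysname PRIMARY KEY
);

DECLARE @KeyMappings TABLE
(
    ColumnName  sysname,   -- the foreign key column, wherever it appears
    SourceTable sysname    -- the table that owns that key
);

INSERT INTO @TablesToCopy (TableName)
VALUES (N'Customers'), (N'Orders'), (N'OrderLines');

-- In the real script most mappings were derived from the table list;
-- only the exceptions had to be entered by hand.
INSERT INTO @KeyMappings (ColumnName, SourceTable)
VALUES (N'CustomerId', N'Customers'),
       (N'OrderId', N'Orders');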

Another factor that helped was our use of GUIDs for all primary keys. Because they can be determined before inserting a row, it was possible to generate the mapping from old to new keys at the start of the script. I could also use a single insert statement for each table, and the order of execution only mattered where foreign key constraints existed.
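Here’s a sketch of that idea with two invented tables; the real script generated statements like these rather than hand-writing them.

-- Sketch only; the tables and columns are invented for this example.
-- GUID keys can be generated up front, so the whole old-to-new mapping
-- exists before any rows are copied.
CREATE TABLE #KeyMap
(
    OldId uniqueidentifier PRIMARY KEY,
    NewId uniqueidentifier NOT NULL
);

INSERT INTO #KeyMap (OldId, NewId)
SELECT CustomerId, NEWID() FROM dbo.Customers
UNION ALL
SELECT OrderId, NEWID() FROM dbo.Orders;

-- One insert per table, swapping both primary and foreign keys for their
-- new values. Parents (Customers) are copied before children (Orders)
-- because of the foreign key constraint.
INSERT INTO dbo.Customers (CustomerId, Name)
SELECT km.NewId, c.Name
FROM dbo.Customers c
JOIN #KeyMap km ON km.OldId = c.CustomerId;

INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate, Total)
SELECT pk.NewId, fk.NewId, o.OrderDate, o.Total
FROM dbo.Orders o
JOIN #KeyMap pk ON pk.OldId = o.OrderId
JOIN #KeyMap fk ON fk.OldId = o.CustomerId;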

The results were tremendous. We found a bunch of issues we wouldn’t have found otherwise, and had a fairly solid indication of how the application would behave years in the future.