Today we have a guest post from ‘Presenting for Geeks’ author Dirk Haun. In “What’s our Status?” Dirk muses on how red and green status lights don’t really help you to fully understand the status of a complex system. He may not have a solution but instead calls for ideas in the hope that it will help get a discussion going.
Part of my day job is to watch over our Jenkins, which is controlling the build system that builds our software for a variety of different platforms. Therefore, people often come to me and ask: “What’s our status?” By which they mean: “Is everything working? And if it isn’t, can you give me a quick summary?”
I found that it’s really hard to answer this supposedly simple question, and I think it has to do with the traffic light metaphor that we’re using to visualise the status of our systems.
Let me explain. Cultural differences aside, we usually use a green light to indicate that everything is okay and a red light to signal that something is wrong. That works for a simple system, where green means that it works and red means that it doesn’t.
But what about a more complex system that consists of several subsystems? If you have 9 subsystems and 3 of them are red – what’s the status of the entire system? The convention is that the aggregated state of such a system is the “worst” of the states of the subsystems; so in our example, it would be red. While that correctly indicates that there is a problem, it may cause unnecessary panic in some circumstances.
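To make the "worst state wins" convention concrete, here is a minimal sketch in Python. The `Status` enum and the `aggregate_worst` function are names invented for this illustration; they are not part of Jenkins or any other tool.

```python
from enum import IntEnum


class Status(IntEnum):
    """Two-state traffic light; a higher value means a worse state."""
    GREEN = 0
    RED = 1


def aggregate_worst(statuses):
    """Return the worst status among the subsystems.

    An empty list is treated as GREEN, which is an assumption of this sketch.
    """
    return max(statuses, default=Status.GREEN)


# The example from the text: 9 subsystems, 3 of them red.
subsystems = [Status.GREEN] * 6 + [Status.RED] * 3
print(aggregate_worst(subsystems).name)  # prints RED
```

Note that the result is the same whether one subsystem is red or all nine are, which is exactly the loss of information described above.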
What we are missing is a way to visualise (and verbalise) a status that’s somewhere in between red and green. I’ve been discussing this problem with a variety of people, including members of the Jenkins community, over the years. I’ve seen some interesting suggestions. For example, you could put all the status information of all the subsystems on one screen. You won’t be able to immediately comprehend the details but it will at least give you an idea of the overall status; something along the lines of “it’s not okay, but not so bad either”. Actually, you could simplify this visualisation and replace it with a simple red/green bar graph and maybe even stick a percentage value on to it.
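The bar graph with a percentage could look something like the following sketch. It renders a plain-text bar; the function name `health_bar`, the bar width, and the characters used are arbitrary choices for this illustration.

```python
def health_bar(statuses, width=18):
    """Render a text bar showing the share of working subsystems.

    `statuses` is a list of booleans, True meaning "green".
    """
    total = len(statuses)
    if total == 0:
        return "no subsystems"
    ok = sum(1 for s in statuses if s)
    filled = round(width * ok / total)
    percent = round(100 * ok / total)
    return "[" + "#" * filled + "." * (width - filled) + f"] {percent}% green"


# 9 subsystems, 3 of them red, as in the earlier example.
print(health_bar([True] * 6 + [False] * 3))
# prints [############......] 67% green
```

This conveys "not okay, but not so bad either" at a glance, although it still treats every subsystem as equally important.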
Most suggestions assume that all the systems we are monitoring are equal. In practice, however, that’s often not the case. Where I work, for example, 60% of our customers are on Windows. Which means that the status of the Windows systems is more important than the status of one of the more exotic systems that we also support. One idea to compensate for that was to make the important systems bigger in a visualisation, but that doesn’t really scale. When you have 26 different configurations to monitor and visualise, you can’t really tell them apart any more.
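One way to account for unequal importance without resizing anything on screen is to weight each system in the percentage. The sketch below assumes customer share as the weight. Only the 60% for Windows comes from the text above; the other platforms, their shares, and the name `weighted_health` are made up for illustration.

```python
# Hypothetical customer shares in percent.
# Only the Windows figure is taken from the text; the rest are invented.
CUSTOMER_SHARE = {"windows": 60, "linux": 25, "solaris": 10, "aix": 5}


def weighted_health(green, weights):
    """Return the percentage of total weight that is currently green.

    `green` maps a platform name to True (working) or False (broken).
    Platforms missing from `green` are counted as broken.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not sum to zero")
    healthy = sum(w for name, w in weights.items() if green.get(name, False))
    return 100 * healthy / total


windows_down = {"windows": False, "linux": True, "solaris": True, "aix": True}
aix_down = {"windows": True, "linux": True, "solaris": True, "aix": False}

print(f"{weighted_health(windows_down, CUSTOMER_SHARE):.0f}%")  # prints 40%
print(f"{weighted_health(aix_down, CUSTOMER_SHARE):.0f}%")      # prints 95%
```

In both cases exactly one of four systems is red, yet the weighted numbers differ sharply, which is closer to how the situation would actually feel.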
So, maybe we need to take a step back and look at the problem from a different perspective. Who’s actually asking the question, “What’s our status?”
There’s my boss, for example. When he comes to me and asks “What’s our status?”, he’s really only interested in the high-level view of things. He would need to know if the Windows systems aren’t working, since they are important and he’s probably going to get calls from angry customers soon. He’s less concerned about the more exotic systems. He’d say “Just fix it” – and switch to more important things.
On the other hand, when someone from support comes to me, they usually have a very specific request. They have a ticket open with a customer who’s on Solaris 10 on SPARC, 64 bit, and they are really only interested in that specific system now. They don’t care if the Windows build isn’t working, since that’s not what the customer is using. So that’s a very different view of the system.
Then there’s the system administrator’s perspective, i.e. my perspective. I’m not really interested in things that work; I’m interested in things that don’t work. But I’d still like to get them in an order that tells me which ones to address first. Again, a completely different view of the system.
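These three perspectives can be thought of as different queries over the same data. The following is only a sketch of that idea: the `Build` record, the platform names, the customer shares (apart from the 60% for Windows), the 20% threshold, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    platform: str
    ok: bool
    customer_share: int  # percent of customers; hypothetical values


BUILDS = [
    Build("windows", True, 60),
    Build("linux", False, 25),
    Build("solaris10-sparc64", False, 10),
    Build("aix", True, 5),
]


def boss_view(builds, threshold=20):
    """High-level view: mention only broken platforms with many customers."""
    broken = [b.platform for b in builds
              if not b.ok and b.customer_share >= threshold]
    if not broken:
        return "all important platforms OK"
    return "broken: " + ", ".join(broken)


def support_view(builds, platform):
    """Support view: the status of one specific platform only."""
    for b in builds:
        if b.platform == platform:
            return "OK" if b.ok else "broken"
    return "unknown platform"


def admin_view(builds):
    """Admin view: broken platforms only, most important first."""
    failing = [b for b in builds if not b.ok]
    failing.sort(key=lambda b: b.customer_share, reverse=True)
    return [b.platform for b in failing]


print(boss_view(BUILDS))                          # prints broken: linux
print(support_view(BUILDS, "solaris10-sparc64"))  # prints broken
print(admin_view(BUILDS))  # prints ['linux', 'solaris10-sparc64']
```

The underlying data is identical in all three cases; only the filter and the ordering change with the person asking.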
So I guess what I’m trying to say is this: There is no “one solution”. We need something more flexible that can be adapted to the requirements of all these different users. It also makes me wonder if we’re trying to reinvent the wheel here. If we look at other industries – manufacturing, power plants, etc. – they often have complex systems that consist of many subsystems. Surely they must have had the same problem? So maybe someone out there in some other industry has already solved that problem and we simply don’t know about it.
Which is where you come in. Have you ever seen or heard about such a solution? I’d be very interested to hear about it! Please feel free to leave a comment below or contact me directly. Thank you.
You can purchase Dirk Haun’s ebook ‘Presenting for Geeks’ over on the website: http://developerpress.com/en/presenting