Dear IT Director,
When we have a major outage, my team jumps into action quickly, but sometimes they get in each other’s way or try the same thing multiple times.
– Impatient in Indianapolis
Been there, done that, got the t-shirt. Even if we do everything right to prevent outages, they will happen. Let’s look at a few ways to improve your team’s response.
Communicate to the rest of the company
You likely saw a flood of tickets come in when the outage started. Responding to every ticket during a major outage makes little sense. Unless the outage affects your communications systems, sending an email or group chat message is the most effective way to let people know that (1) you know about the outage and (2) you are working on it. Keep it to 2-3 sentences with the minimal facts. People don’t want details; they want to know that you are working to fix it.
Who do you send it to? The default for a major system outage is the entire company. However, if you have usage stats for a major system, and you have done the work ahead of time to create a mailing list or chat channel for it, sending the message only to the people that use the system is even better. Why bother those that don’t use the system?
Send frequent messages. Each message should say something like “We will send an update when the problem is fixed or in one hour.” Without regular updates, longer outages start to feel like a black hole for users. It is OK to give estimates if you have them (often you won’t). The key here is full transparency with the rest of the organization so they can plan their work while the systems are down.
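If you want the update wording ready before the next outage, a small template helper is one option. The sketch below is only an illustration: the function name `build_update`, its parameters, and the exact wording are assumptions, not part of any particular ticketing or chat tool.

```python
from datetime import datetime, timedelta


def build_update(system, now, next_update_minutes=60, eta=None):
    """Build a short outage notice: what we know, and when the next update comes."""
    next_update = now + timedelta(minutes=next_update_minutes)
    sentences = [f"We know about the {system} outage and are working on it."]
    if eta:
        # Only include an estimate when the team actually has one.
        sentences.append(f"Our current estimate for a fix is {eta}.")
    sentences.append(
        "We will send an update when the problem is fixed "
        f"or by {next_update:%H:%M}, whichever comes first."
    )
    return " ".join(sentences)


# Example: an outage notice written at 09:30 with no estimate yet.
message = build_update("email", datetime(2024, 1, 1, 9, 30))
print(message)
```

The message stays at two or three sentences, and the promised time of the next update is computed rather than typed by hand, so nobody has to do clock math in the middle of an outage.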
Communicate within the team
As you mentioned in your question, your team jumps in and wants to help. Create a conference call or group chat where everyone can join and share information. There will probably be a small number of people actually trying things to fix the problem. If they can quickly post status and requests to everyone in one place, they can stay focused on the fix, and others can help where they can.
Have one or more people stay on top of what the users are seeing. This is especially helpful when the system starts to come back up, and it can also help you understand the ripple effect of the outage.
As their manager, you can listen in or monitor the chat to know what is going on. This means you don’t have to bother people to find out status.
Keep track of try/fail cycles
The team may try several different things to fix the outage. It is important to keep track of what they tried so you can unwind the attempts, and so nobody repeats something that has already been tried. Often configuration settings get changed, software gets reset, or data gets changed, all in the name of trying to fix the problem. If those changes don’t fix the outage, then they are just sitting there as technical debt that will confuse someone in the future.
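A shared document or a pinned chat message is enough to keep this record. If your team prefers something scriptable, here is a minimal sketch of an attempt log. The names `Attempt` and `AttemptLog`, the field names, and the sample entries are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Attempt:
    who: str
    change: str              # what was changed
    undo: str                # how to put it back
    fixed_outage: bool = False
    reverted: bool = False
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AttemptLog:
    def __init__(self):
        self.attempts = []

    def record(self, who, change, undo, fixed_outage=False):
        attempt = Attempt(who, change, undo, fixed_outage)
        self.attempts.append(attempt)
        return attempt

    def already_tried(self, change):
        # Simple case-insensitive match so nobody repeats the same attempt.
        wanted = change.strip().lower()
        return any(a.change.strip().lower() == wanted for a in self.attempts)

    def to_unwind(self):
        # Failed attempts not yet reverted, newest first, so they are
        # undone in the reverse of the order they were applied.
        pending = [a for a in self.attempts
                   if not a.fixed_outage and not a.reverted]
        return list(reversed(pending))


log = AttemptLog()
log.record("Sam", "Restarted the mail relay", undo="Nothing to undo")
log.record("Priya", "Raised DB connection limit", undo="Restore previous limit")
log.record("Sam", "Failed over to standby", undo="Fail back", fixed_outage=True)

for attempt in log.to_unwind():
    print(f"{attempt.who}: {attempt.change} -> {attempt.undo}")
```

Once the outage is over, `to_unwind()` gives the cleanup list: every change that did not fix the problem and is still in place. Checking `already_tried()` before starting something helps keep two people from doing the same thing twice.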
Major outages suck. A coordinated response by the team can help.
Good luck, Impatient in Indianapolis,
The IT Director
If you would like to ask a question, send an email to firstname.lastname@example.org.