Mean Time to Restore Service - Improve your MTR by 50-60%

The key lies in asking the right questions of the right people so that you can identify the correct restoration criteria and restore service quickly and accurately. It is easier than you think when you follow a process that can be used in any type of incident. It is all about three components: the OBJECT, the FAULT and the UNIQUENESS of the fault.

When pilots are trained on flight simulators, they learn to quickly build a snapshot of what is happening and what is not happening. Based on split seconds of exposure to bits of factual information, they are trained to make “snap” deductions and take the most appropriate actions.

COMPONENTS ONE & TWO (OBJECT AND FAULT)

“Outage” and “crash” are not words you want to hear on your watch in an IT environment, because this kind of situation spells “doom and gloom” for the person accountable for the function. You need the skills to ask questions that break down these general and vague descriptions so that you and your team can get started on the real issue at hand. Working from information that is too general will stretch your investigation cycle way beyond the time you have for resolution.

The first reports normally describe a consequence of a fault. You need to chase down the specifics of the fault. For instance, “Web page down” describes a consequence of the fault. It could have over 50 different possible causes. However, by asking “What do you mean by DOWN?” or “Can you be more specific?” you will get a more accurate description, such as “checkout page dropping”. This sounds more like a workable fault and leaves far fewer possible causes to investigate.

So, by asking certain rehearsed questions (just as the pilot does), you will create a much more accurate snapshot of the fault. At the same time, you give the incident investigation team access to the correct minimal information from the outset.

How do you do that? You must identify a single, specific object as the focal point, and then a single, specific fault associated with that object. The ultimate aim is to concentrate on the fault. The description “web page down” is not specific enough for an effective investigation. Such lack of specificity will cause you “to boil the ocean” in search of solutions. The reported issue of a web site being “down” does not describe a fault, but rather an end state or consequence of a fault. So the trick is to get the team to identify the right fault to start the investigation. Unless you start the investigation with singularity and specificity, you and your team will waste a lot of time on “trial and error” fixes.
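To make the idea concrete, here is a minimal sketch of how a team might record a fault statement as one object plus one fault, and flag wording that describes a consequence rather than a fault. The class name, field names and the list of vague terms are assumptions made for illustration; they are not part of any standard tool.

```python
from dataclasses import dataclass

# Hypothetical list of consequence words; adapt it to your own environment.
VAGUE_TERMS = {"down", "outage", "crash", "broken", "slow"}


@dataclass
class FaultStatement:
    obj: str    # the single object, e.g. "checkout page"
    fault: str  # the single fault on that object, e.g. "dropping"

    def vague_terms(self) -> set:
        """Return any consequence words found in the statement."""
        words = set(f"{self.obj} {self.fault}".lower().split())
        return words & VAGUE_TERMS

    def is_specific(self) -> bool:
        """True when both parts are filled in and no vague term is present."""
        filled = bool(self.obj.strip()) and bool(self.fault.strip())
        return filled and not self.vague_terms()


first_report = FaultStatement(obj="web page", fault="down")
refined = FaultStatement(obj="checkout page", fault="dropping")

print(first_report.is_specific())  # False: "down" is a consequence, not a fault
print(refined.is_specific())       # True
```

A check like this cannot judge whether the fault is the right one. It only prompts the team to ask “Can you be more specific?” before the investigation starts.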

COMPONENT THREE (UNIQUENESS OF FAULT)

The next challenge is to look at what is unique about the identified fault. If a fault is a typical fault, your SMEs should be able to fix it in a few minutes. But if they cannot find the fix, it generally means there is something unique/odd/weird about the fault. The uniqueness could be in the location, user experience, timing, pattern of frequency or size of the fault.

Let’s get back to our example of the checkout page dropping. What is unique about this dropping? Is it dropping a specific data field? Is it occurring when you click the “checkout” button? Is it only happening after 4 p.m. every day? The answers to these questions will also quickly give you an idea of the IMPACT.

Identifying the uniqueness of the fault immediately suggests potential reasons for the fault. For example, if the dropping only happens after 4 p.m. every day and only at a certain location, this should lead you and your team to think of new possible reasons straight away.
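As a rough sketch of this step, the snippet below summarises what a set of reported occurrences have in common in terms of timing and location. The record layout, the function name and the sample data are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Occurrence:
    hour: int      # hour of day (0-23) when the fault was seen
    location: str  # site or office that reported it


def shared_traits(occurrences: list) -> dict:
    """Summarise what every reported occurrence has in common."""
    if not occurrences:
        raise ValueError("need at least one occurrence")
    hours = [o.hour for o in occurrences]
    locations = {o.location for o in occurrences}
    return {
        "earliest_hour": min(hours),
        "latest_hour": max(hours),
        # Only set when every occurrence comes from the same place.
        "single_location": next(iter(locations)) if len(locations) == 1 else None,
    }


# Illustrative data only, not taken from a real incident.
seen = [
    Occurrence(hour=16, location="Branch A"),
    Occurrence(hour=17, location="Branch A"),
    Occurrence(hour=18, location="Branch A"),
]

traits = shared_traits(seen)
print(traits)
# {'earliest_hour': 16, 'latest_hour': 18, 'single_location': 'Branch A'}
```

In this made-up case every occurrence falls at or after 16:00 and comes from one location, which is exactly the kind of uniqueness that narrows the list of possible reasons.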

Based on these three components (a single object, a single fault and any unique characteristics of the fault), you and your team will be able to identify the IMPACT and also the requirements for restoration. That is a lesson in its own right.

POTENTIAL IMPACT ON INCIDENT & PROBLEM MANAGEMENT

So what is the impact on Incident and Problem Management staff? We’ve found the following benefits:

It gets the whole investigation team laser-focused from the outset, avoiding time lost to trial-and-error efforts.
It makes impact much easier to determine, and less time is spent getting to the bottom of the incident: two for the price of one!
It improves the quality of the data collected. That in turn allows data to be transformed into information, helping the investigation team reach a complete understanding of the incident and problem in a shorter period of time.
The common approach, with templates and worked questions, provides the investigation team with a proven methodology, resulting in higher confidence levels when approaching an incident or problem situation.

If you follow this simple advice, you will reduce the mean time to restore service (MTR) by at least 50%!
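To see what a 50% reduction means in practice, here is a small worked calculation. The restoration times are invented for illustration; substitute figures from your own incident records.

```python
from statistics import mean

# Hypothetical restoration times, in minutes, for five past incidents.
restore_minutes = [120, 90, 240, 60, 150]

baseline_mtr = mean(restore_minutes)  # average time to restore service
target_mtr = baseline_mtr * 0.5       # what a 50% reduction would look like

print(f"Baseline MTR: {baseline_mtr} minutes")  # Baseline MTR: 132 minutes
print(f"Target MTR:   {target_mtr} minutes")    # Target MTR:   66.0 minutes
```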

Blog - IT CSI