Author: Jörgen Blomberg, Consultant, Architect, ITSM specialist
Why you need problem management in your organization
If you ask an ITSM/ITIL expert to name the process in service management you should implement first, chances are that they will say “Configuration Management”. I tend to agree. If you don’t know what to manage, you’ll have a hard time managing it. Configuration Management doesn’t bring very much by itself in terms of benefits to the IT organization, though. It’s mainly a supporting process that other processes depend on, which is what makes it so important.
So, if you ask the same person what process organizations usually start implementing first, they’ll probably say “Incident”, “Problem” or a combination of the two. The reason for this is that “incidents” and “problems” are very specific things that most people working in IT know about. Most organizations also have some kind of support process that already includes parts of the incident process and can look like a good candidate for “upgrading”. Another factor is that compared to these processes, Configuration Management seems very abstract and it can be hard to make a good business case for implementing it.
This “low hanging fruit” approach to implementing ITSM has led to quite a few unsuccessful implementations. This, and the tendency to combine incident and problem management, has in turn given Problem Management an undeserved bad name.
In my opinion, if you’re looking for the one process that really improves your IT organization and the systems it is responsible for, Problem Management is the place to start.
“So, what’s the problem?”
The definition of Problem Management is, “the process responsible for managing the lifecycle of problems.” That means that the process includes activities for:
- Identifying problems – to know we have them.
- Classifying and logging problems – to keep track of them.
- Prioritizing problems – so we know that we address the most important ones.
- Communicating knowledge about problems – so that users and support staff know that the problem exists and how to work around it if possible.
- Investigating problems – so we find the root cause and not just the symptoms.
ITIL and other frameworks will describe these activities for you, and there are several tools on the market that help keep track of the lifecycle and documentation. Most activities are very straightforward and based on best practices that have been around for decades.
So, it isn’t a hard process to understand, but it can still be hard to implement.
“If you only have a hammer, every problem looks like a nail” – Abraham Maslow
In ITSM, a problem is never the same as an incident. A problem can cause one or more incidents, and incidents can indicate that you have a problem, but not always. For example: A headache is an incident – take an aspirin or rest a while and the incident is resolved. Recurring headaches indicate a problem – you should probably see a doctor to find the cause.
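The relationship between incidents and problems can be illustrated with a minimal sketch. It assumes incidents are logged with a symptom category and that a recurrence threshold of three is a reasonable trigger for opening a problem investigation; the `Incident` class, the `problem_candidates` function, the ticket IDs and the threshold are all hypothetical and not taken from ITIL or any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    ticket_id: str  # hypothetical ticket identifier
    symptom: str    # hypothetical symptom category, e.g. "headache"


def problem_candidates(incidents, threshold=3):
    """Return the symptoms that recur at least `threshold` times.

    A recurring symptom only *suggests* an underlying problem;
    someone still has to investigate the root cause.
    """
    counts = Counter(incident.symptom for incident in incidents)
    return sorted(s for s, count in counts.items() if count >= threshold)


incidents = [
    Incident("INC-1", "headache"),
    Incident("INC-2", "sprained ankle"),
    Incident("INC-3", "headache"),
    Incident("INC-4", "headache"),
]

# Each incident is resolved on its own; only the recurring one is flagged.
print(problem_candidates(incidents))  # ['headache']
```

Each incident still gets its own quick resolution. The flagged symptom is what would be handed over to problem management for a proper investigation.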
So, why is having a combined incident/problem process so bad?
One reason is that it tries to accomplish two very different things at once, with the same tools.
The purpose of incident management is to keep the day-to-day activities of the organization going, and perhaps most importantly, to keep the users and other stakeholders happy.
As a user reporting an incident you’re probably thinking something like “I need to do this, but something is preventing me from getting it done.” That “something” is the incident. In that mindset, the thing that will fully satisfy you is a way to get whatever you were supposed to do done. You are not interested in the root causes of the incident, or that it will take several months to resolve those root causes. You’re only happy when you either get a reply that the incident is resolved – “It works now, try again!” – or when you get a reasonable workaround that lets you get on with what you were supposed to do – “Try this instead, until we get back to you with a permanent solution”. This should happen before the user gives up and workdays or customers are lost, so we’re talking resolution times in the order of hours.
Problem Management, on the other hand, involves deep investigation of root causes and possibly development, testing and release activities. This can, and should be allowed to, take time.
Mixing these processes leads to one of these scenarios (or both):
- Using a process based on problem management for resolving incidents will leave you with frustrated users who have to wait too long to get their immediate needs satisfied.
- Using a process based on incident management for resolving problems will never address the root cause (because there is no time) and the systems will deteriorate.
Going back to my headache example: You either have to go to a hospital every time you have a headache, or die from an undiagnosed disease you’ve been trying to cure with buckets of aspirin.
“Somebody else’s problem” – Douglas Adams (1)
Another common issue involves managing the resources needed for problem management.
Two of the activities I listed earlier, identifying and investigating problems, usually require deep knowledge about the systems. It is often harder to pinpoint a problem in a system than it was to build the system in the first place. This means that efficient problem management would involve some of the most technically competent people in the IT organization, and those with the most experience with the systems involved. It will come as no surprise that these individuals will also be the ones who already have a lot of other responsibilities and very little unscheduled time in their calendars. It is also very likely that they are people the organization cannot afford to lose due to frustration or an unreasonable workload.
When a serious problem occurs, these key resources will usually get involved – simply because no one else knows what to do. If they are not already part of the problem management process and have time allocated for investigating problems, any time they spend will be “stolen” from other activities like projects or other maintenance development. The end result will either be that those other activities will be delayed, or that the key resources will have to put in extra hours. This “stolen” or “extra” time is also both significant and fairly constant. A ballpark estimate based on experience from several medium to large IT organizations is around 20% of the normal work hours over a month.
This adds up to a serious risk for any organization without a plan for allocating resources to problem management. After seeing many extremely competent IT professionals quit in frustration, or end up on extended sick leave after having to deal with unmanageable workloads, I cannot overemphasize this risk.
“Houston, we have a problem” – Jack Swigert, Apollo 13 (2)
All but the most dysfunctional organizations have ways to deal with problems. They might call them “issues”, “incidents” or “support tickets” instead of “problems” and have no formal process for handling them – still, when the server goes down for the third time in a week and the users are starting to riot, someone is bound to have a look at what went wrong.
In small organizations or for non-critical systems this kind of “implicit” problem management could very well be good enough. On the other hand, if you have any experience from larger organizations or complex system environments with poorly implemented Problem Management, you’ll probably recognize one or more of these situations:
- Business driven development consumes most, if not all, of the time from key resources. This means that the people who have the capability to investigate and resolve the most complex problems never have the time to.
- New functionality will usually be prioritized over resolving existing non-critical problems. Each new addition to the platform potentially increases the probability that an existing problem will cause incidents and be harder to address as complexity increases. The non-critical problem can suddenly turn into a very critical one.
- Knowing that a problem exists, and that nobody has the time or mandate to address it, is frustrating. This is bad for morale and can cause conflicts between IT and other departments dependent on the systems.
“The problem is not that there are problems. The problem is expecting otherwise and thinking that having problems is a problem.” – Theodore Isaac Rubin
By now, I hope you see why I think problem management is important, and why it is well worth the effort to implement a process for it.
In closing I’d like to list a few steps towards preparing for a successful implementation of problem management:
- Understand that problems occur, and that if they are left unresolved the capabilities of the systems and organization will deteriorate over time.
- Understand that all IT organizations have a process for resolving problems, explicit or implicit, and the difference is whether you have control over the process or not.
- Understand the difference between incident and problem management.
- Realize that resolving problems is at least as valuable to the business as adding new functionality. Prioritize resources accordingly.
- Get acceptance and commitment, from all levels of the organization, that Problem Management is important and needs allocated time from key resources.
- Have a realistic view of how much time and resources you need to commit to problem management. If key resources are allocated 20% to problem management, they cannot be allocated 100% to other activities.
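The arithmetic behind that last point can be sketched as follows. The 20% share is the ballpark estimate given earlier; the 160 work hours per month and the `monthly_allocation` function are assumptions made purely for illustration.

```python
def monthly_allocation(work_hours, problem_percent):
    """Split a month's work hours into problem management and everything else."""
    problem_hours = work_hours * problem_percent / 100
    other_hours = work_hours - problem_hours
    return problem_hours, other_hours


# Assumption: a full-time month of 160 work hours.
WORK_HOURS = 160

problem_hours, other_hours = monthly_allocation(WORK_HOURS, 20)
print(problem_hours)  # 32.0 hours reserved for problem management
print(other_hours)    # 128.0 hours left for projects and other work

# If the person is still planned at 100% on other activities, the
# problem management hours come on top as delays or overtime.
overcommitted_hours = WORK_HOURS + problem_hours
print(overcommitted_hours)  # 192.0
```

Under these assumptions, a key resource planned at 100% on projects is in practice booked for roughly 192 hours in a 160-hour month.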
(1) “An SEP is something we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem…. The brain just edits it out, it’s like a blind spot. If you look at it directly you won’t see it unless you know precisely what it is. Your only hope is to catch it by surprise out of the corner of your eye.” – Douglas Adams, Life, the Universe and Everything (the third book in The Hitchhiker’s Guide To The Galaxy series)
(2) Swigert, the command module pilot on Apollo 13, actually said “Houston, we’ve had a problem here”. That phrase was then repeated by the mission commander, James Lovell. In the 1995 movie Apollo 13, Lovell, played by Tom Hanks, says “… we have a problem” and that version stuck.