
Problem Manager


Introduction

To understand where this role comes into the picture, think of a Problem Manager as someone who wears the hat of an investigator and deep-dives into whatever the Incident Manager leaves behind for investigation. Problem Manager roles are often merged with other roles, but should never be merged with the Incident Manager role.

Problem Managers have a crucial role in the IT services industry, and service providers would typically look out for a person with an ‘investigative’ and ‘fact-finding’ mindset. Product companies and service providers that specialise in niche skills would be on the lookout for problem managers. Even if there is no advertised job, they would still value the mindset.

1. What is ‘systems thinking’?

A system is made up of many sub-systems, e.g. servers, cloud solutions, integration components, front-end technology and back-end technology, and each requires different kinds of skills and specialization, so it is difficult to form a team that can cater to the entire span. Consequently, the sub-systems start working in silos and end up competing or, at worst, conflicting with each other. The prevailing mindset is about ‘throwing it over the wall’. This reduces the efficiency of the system as a unit, which in turn reduces the business value delivered, increases the lead time for implementing changes and adds management oversight overheads for portfolio managers.

‘Systems thinking’ is a mindset that requires the IT organization to think of the entire system ‘as a whole’, as opposed to parts of it. Even if you can visualize this as an ‘ideal situation’, you can guess that it requires a certain level of organizational maturity and a certain type of organizational culture, one where people and departments share knowledge transparently and roles and responsibilities are clearly segregated. DevOps, with its tooling and automation, is one step in that direction, but there is more to it. The life of a problem manager is likely to be easier in an organization that identifies with ‘systems thinking’.

2. What is a ‘known error’?

A known error has a history of past occurrence, possibly more than once. However, unlike a new incident, its root cause and its solution or workaround are known and documented. If the incident repeats, the support analyst can look up the documentation (we call this the ‘known error database’ or KEDB) and follow the steps to resolution.

If errors are ‘known’, why aren’t we resolving them? The reasons are many: maybe there are other higher-priority items, maybe the management does not want to invest in a service that will soon be retired, maybe there is already a change being implemented in parallel, or maybe the customer is unwilling to pay a premium for a permanent resolution.

‘Known errors’ play an important role in enabling a Service Desk for efficiency. The Service Desk should have access to this database and a proper way to correlate the newly reported incident with the known error and possibly inform the customer about the workaround. In certain cases, as in end-user-computing – implementing the workaround may be a self-help step, e.g. restarting the laptop to implement a security patch.
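As a rough illustration of how a Service Desk tool might correlate a newly reported incident with the KEDB, here is a minimal sketch. It assumes a simple keyword-overlap match; the `KnownError` fields, the `match_known_errors` helper, the entry IDs and the `min_overlap` threshold are all hypothetical and not taken from any particular ITSM product.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    """One KEDB entry; all field names here are illustrative assumptions."""
    error_id: str
    symptoms: set[str]   # keywords describing the symptoms
    workaround: str


def match_known_errors(description, kedb, min_overlap=2):
    """Return KEDB entries that share at least `min_overlap` keywords
    with the incident description, best match first."""
    words = set(description.lower().split())
    scored = [(len(words & ke.symptoms), ke) for ke in kedb]
    hits = [(score, ke) for score, ke in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in hits]


# Hypothetical KEDB content, loosely based on the examples in this article.
kedb = [
    KnownError("KE-001", {"printer", "paper", "jam"},
               "Default the printer to single-side printing."),
    KnownError("KE-002", {"laptop", "slow", "startup"},
               "Restart the laptop to apply the pending patch."),
]

matches = match_known_errors("Warehouse printer shows paper jam again", kedb)
for ke in matches:
    print(ke.error_id, "->", ke.workaround)
```

A real Service Desk would more likely match on categories, configuration items or a search index, but the principle of looking up the workaround before escalating is the same.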

3. How are Problem Management teams formed?

Problems need to be resolved once the root cause has been found. This requires the formation of a team (or squad) that prioritizes and focuses on this activity for a pre-defined period. You can liken this to the definition of a ‘project’. Problem management teams are generally temporary and should be composed of individuals who have subject matter expertise and systems knowledge. Once the objective is achieved, the team can disband or start working on similar problems. Problem management teams may comprise members from the incident management team (i.e. those who have first-hand experience of the incident), the technical team (if there is an infrastructure impact), the application management teams (depending on which applications are impacted), development teams, testing teams and even the service improvement teams (whenever we have ‘proactive problem management’).

At any given point in time, there may be multiple problem management teams working on different problems. Each of these teams selects a leader who follows the problem management process under the watchful eyes of the process owner, i.e. the problem manager.

4. Someone tells you that Incidents and Problems are one and the same. Do you agree?

They aren’t the same. 

Incidents are events that cause disruption to normal provisioning of services. Incidents are real-time and happen when a system is ‘on’ and catering to business needs and users. 

Due to some fault within an IT system, incidents may become repetitive. E.g. a printer is sometimes unable to print documents back-to-back (double-sided). As a result, multiple users end up with a ‘paper jam’ and documents do not get printed. Now, if this printer is in the warehouse where the shipment documents need to be printed and delivery trucks dispatched, it is a major incident, because without the paper documentation the trucks just stay there, causing the supply chain to slow down. The technical team may default the printer to single-side printing, but this means that double the paper will be used. So, this can only be a ‘workaround’, i.e. a temporary solution until a permanent one is found.

The Problem Management process would typically look for the permanent fix. The first step is to find the root cause, i.e. why this happens only at times. E.g. is it user-specific? Is it really a hardware problem (like misalignment) or a software problem (e.g. the printer drivers)? After investigation, the problem management team will propose the most feasible solution, which could be anything from keeping a standby printer to replacing the printer.

5. Is Problem Management always reactive?

No, we also have ‘proactive problem management’. Proactive problem management originates in service operations, but most of its activities fall under Continual Service Improvement (CSI). Every IT service provider should consciously work on proactive problem management, as it reduces the probability of incidents occurring. Remember that every incident is undesirable: it disrupts normal business and, consequently, causes customer dissatisfaction because customers are not receiving the expected service.

Proactive problem management can be done in many ways. One is pain-value analysis, which combines the incident count, duration, severity and other weighting factors to arrive at ‘pain levels’ for a system or a component of the system (in ITIL terms, per configuration item (CI)). Do this for all the systems in scope and you have a full picture of what is most troublesome for you. Then work on a proactive plan to reduce the ‘pain’. Another great way is to do a Pareto Analysis and find the group of incidents that take up the most effort or cause the most SLA breaches. Again, work on a proactive, targeted plan for the top 20% of this group.
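The two analyses above can be sketched in a few lines. This is only an illustration under stated assumptions: the incident records, the `SEVERITY_WEIGHT` table and the formula (duration multiplied by a severity weight, summed per CI) are made up for the example, and a real pain-value model would use whatever weighting factors the organization agrees on.

```python
from collections import defaultdict

# Hypothetical incident records:
# (configuration item, duration in hours, severity where 1 is the worst)
incidents = [
    ("warehouse-printer", 2.0, 1),
    ("warehouse-printer", 1.5, 1),
    ("mail-server", 0.5, 3),
    ("hr-portal", 4.0, 2),
    ("warehouse-printer", 3.0, 2),
    ("mail-server", 0.5, 4),
]

# Assumed weights: the more severe the incident, the more it hurts.
SEVERITY_WEIGHT = {1: 4, 2: 3, 3: 2, 4: 1}


def pain_by_ci(records):
    """Sum duration * severity weight per CI; the incident count is
    reflected implicitly because every incident adds to the total."""
    pain = defaultdict(float)
    for ci, duration_hours, severity in records:
        pain[ci] += duration_hours * SEVERITY_WEIGHT[severity]
    return dict(pain)


def pareto(pain, threshold=0.8):
    """Return the CIs, highest pain first, that together account for
    at least `threshold` of the total pain."""
    total = sum(pain.values())
    selected, running = [], 0.0
    for ci, value in sorted(pain.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(ci)
        running += value
        if running / total >= threshold:
            break
    return selected


pain = pain_by_ci(incidents)
print(pain)          # pain level per CI
print(pareto(pain))  # the CIs to target first
```

With this sample data, the warehouse printer and the HR portal together account for most of the pain, so a proactive plan would start with those two.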

A final note – proactive problem management could be one of the key differentiators for an IT service provider in a competitive environment.

6. What are the most important pre-requisites for Problem Management?

The most important prerequisite for problem management is the availability of data. Without data, the problem management process does not have a starting point. Imagine a scenario where a new employee finds on day 1 that the company laptop allotted to him is very slow. He goes to the technical team, who make some changes to the anti-virus software, and it works better. After a week, the laptop has slowed down again, and the technical team now disables some background services. After a month, the employee is back with the technical team for the same issue, and this time they disable some browser plug-ins.

In the absence of a ticketing system, there would be no data for the above, and no one would remember the history of this employee’s laptop. A ticketing system ensures that every incident is logged and that adequate information is stored with it; in our example, that would already be three incidents. If the same employee turns up for the fourth time, maybe it is high time to do a deeper investigation, but you will not be able to do this if you do not have the previous data with you.
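To make the point concrete, here is a small sketch of how logged tickets can reveal a candidate for deeper investigation. The ticket fields, the IDs and the threshold of three incidents per asset are assumptions chosen to mirror the laptop example; they do not come from any specific ticketing tool.

```python
from collections import Counter

# Hypothetical ticket export; in practice this would come from the ticketing system.
tickets = [
    {"id": "INC-101", "asset": "LAPTOP-042", "summary": "Laptop very slow"},
    {"id": "INC-117", "asset": "LAPTOP-042", "summary": "Laptop slow again"},
    {"id": "INC-123", "asset": "PRINTER-007", "summary": "Paper jam"},
    {"id": "INC-160", "asset": "LAPTOP-042", "summary": "Laptop slow, third visit"},
]


def problem_candidates(tickets, threshold=3):
    """Return the assets with at least `threshold` logged incidents."""
    counts = Counter(ticket["asset"] for ticket in tickets)
    return sorted(asset for asset, n in counts.items() if n >= threshold)


candidates = problem_candidates(tickets)
print(candidates)
```

Without the three logged incidents, the laptop would never show up in such a list, which is exactly why data availability comes first.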

And it is not just Incident data; there should be data from the other ITIL processes as well, such as Asset Management, Change Management and Configuration Management. Only then can the problem analysis happen.

7. Do you know any problem-solving techniques?

One of the simplest techniques to solve a problem is brainstorming. Brainstorming sessions are usually chaired by a moderator, and the people in attendance contribute their ideas about the solution in a round-robin fashion until the meeting ends or people have run out of ideas. The moderator ensures that the session proceeds without anyone evaluating the feasibility of the ideas.

Cause-effect analysis is used to drill down from the problem back to the possible causes. Causes are usually categorized under people, processes, products (e.g. technology), partners (i.e. suppliers) and ‘Mother Nature’ (something beyond control, e.g. natural calamity). This is done diagrammatically, resulting in an Ishikawa (or fish-bone) diagram.

The Kepner-Tregoe method is a simple process for structured problem solving where the solutioning is kept separate from the cause identification process. The phases are – describing the problem, identifying possible causes, evaluating the possibilities, and confirming the real cause.

Fault Tree Analysis links events with Boolean logical operators (AND, OR, XOR) and graphically indicates the chain of events that leads to the failure (or the problem).
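A fault tree can be evaluated mechanically once the basic events are known. The sketch below is a minimal illustration, assuming a tree encoded as nested tuples; the `evaluate` helper, the event names and the example tree are hypothetical.

```python
from functools import reduce
from operator import xor


def evaluate(node, events):
    """Evaluate a fault tree node.

    A node is either a basic-event name (str), looked up in `events`,
    or a tuple (gate, child, child, ...) with gate in AND / OR / XOR.
    """
    if isinstance(node, str):
        return events[node]
    gate, *children = node
    values = [evaluate(child, events) for child in children]
    if gate == "AND":
        return all(values)
    if gate == "OR":
        return any(values)
    if gate == "XOR":
        # For two inputs: true when exactly one of them is true.
        return reduce(xor, values)
    raise ValueError(f"unknown gate: {gate}")


# Top event: shipment documents cannot be printed.
# Assumed logic: printing fails if both printers are down, or if the driver is faulty.
tree = ("OR",
        ("AND", "primary_printer_down", "standby_printer_down"),
        "driver_fault")

events = {"primary_printer_down": True,
          "standby_printer_down": False,
          "driver_fault": False}
print(evaluate(tree, events))  # standby printer still works, so the top event does not occur
```

Reading the tree from the top event downwards shows which combinations of basic events are enough to cause the failure.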

In the Component Failure Impact Analysis technique, all the components are mapped, and we identify which of these are single point of failure and which have failovers. While this could be a proactive problem management technique, the Service Outage Analysis is a reactive technique where past outages are analysed – which was the most significant and what are the addressable parameters. These parameters can be categorized in the same manner as in “Cause-effect analysis”.
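The Component Failure Impact Analysis described above boils down to a mapping exercise. Here is a minimal sketch, assuming a hand-maintained table of components; the component names, the services and the `single_points_of_failure` helper are illustrative only.

```python
# Hypothetical mapping: component -> (has a failover?, services that depend on it)
components = {
    "db-server": (False, ["order-entry", "reporting"]),
    "web-server": (True, ["order-entry"]),
    "warehouse-printer": (False, ["shipping-docs"]),
}


def single_points_of_failure(components):
    """Return {component: impacted services} for components without a failover."""
    return {
        name: services
        for name, (has_failover, services) in components.items()
        if not has_failover
    }


spof = single_points_of_failure(components)
for name, services in spof.items():
    print(f"{name}: failure would impact {', '.join(services)}")
```

In practice this information would be drawn from the configuration management data rather than typed in by hand.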

Problem Reviews happen after the solution has been implemented, the objective being to find out what worked and what could have worked better. Again, the same categories as in “Cause-effect analysis” are used for this.

8. If you are looking for a Problem Manager for your organization, what are the basic skills that you will be looking for?

The biggest competence for a Problem Manager is the ability to connect the dots in a situation where there are seemingly unrelated events. Anyone may be able to acquire the necessary technical and functional skills via training and learning on the job, but connecting the dots is an ability that comes with experience and possibly talent.

In an IT services scenario, Problem Managers are usually experts in a particular technology, e.g. server administration, or functional experts e.g. SAP Master Data Governance (SAP MDG).

The other quality I would look for is analytical capability. Providing solutions is only a secondary responsibility of the Problem Manager; often there are experts from the technical, operations and change management teams who will work on providing the solution. The primary role of the Problem Manager is to derive the Problem statement, in short: what exactly is the issue? The Problem Manager has tools and techniques at his disposal, which we have already discussed. Identifying the real problem is key to allocating and investing resources to resolve it.

9. What is a Problem Record? When does it start and when is it considered closed?

A 'Problem' is possibly the cause of one or more incidents; this cause is unknown at the time a Problem Record is created (based on symptoms). The Problem Record is the lifecycle artefact of the Problem, from detection to closure.

A Problem Record is uniquely identified. It contains information such as - when it was detected, the problem owner, symptoms, affected users, IT services and business functionality, a problem category and priority (that is a function of urgency and impact), configuration item (CI) correlation, status history, progress log and linkages to other problems, incidents, known errors or knowledge database items.
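The remark that priority is a function of urgency and impact can be illustrated with a small sketch. The three-level scales and the `impact + urgency - 1` formula below are one commonly used convention, taken here as an assumption; the actual matrix is defined by each organization.

```python
def priority(impact: int, urgency: int) -> int:
    """Map impact and urgency (1 = high, 3 = low) to a priority
    from 1 (most critical) to 5 (least critical).

    Assumed scheme for illustration: priority = impact + urgency - 1.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    return impact + urgency - 1


# Print the full matrix: rows are impact, columns are urgency.
for impact in (1, 2, 3):
    print([priority(impact, urgency) for urgency in (1, 2, 3)])
```

A problem with high impact but low urgency therefore lands in the middle of the scale, the same as one with low impact and high urgency.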

To close a Problem Record, the impacted parties must have accepted that the solution has been implemented and has solved the problem. If a workaround was put in place, it should now be withdrawn, and the relevant parties (e.g. the Service Desk, the application management teams or the technical teams) informed that the workaround is no longer required. All linked incidents may now be closed, and the documentation for the impacted CIs must be updated to reflect any changes made to them. Last but not least, the Problem Record should be updated, and only then should it be marked as 'closed'.

10. Have you heard of RCA? What is it?

RCA is an acronym for Root Cause Analysis. RCA is a collection of problem solving methods used to uncover the actual reason for an issue happening - in the ITSM scenario this could be repeated incidents, degradation of service quality etc. Using RCA, a problem is first defined (what to solve), understood and analysed (diagnosis) and finally solved (fixing the 'what').

The 'root cause' is a low-level (or in-depth) non-conformance - the epicentre from which it all starts. RCA ensures that the fix is applied at the epicentre rather than at any higher level. Workarounds or 'band-aid' fixes may be temporarily applied at the higher levels (which are possibly just symptoms) as an interim measure. It is generally more cost-effective to fix at the epicentre than anywhere else - this is the logic behind doing an RCA.

Even in the ITSM scenario, root causes may not necessarily lie in technical components. E.g. the problem of repeated malfunctioning of a printer may not be related to the printer setup or printer drivers. RCA may reveal that it was a poorly qualified technician that serviced the printer last time, so the issue is with resource skills (people) and nothing else. Imagine the costs of using a band-aid approach in this case, where you keep replacing cartridges to achieve better results.


