InterviewSolution
| 1. |
Incident Manager |
|
Answer» Introduction
The Incident Manager role is one of the most critical roles in the ITIL® world. When break-fix service is provided by an IT service provider, the Incident Management process becomes central to service provisioning. Incidents are governed by Service Level Agreements (SLAs), and failure to meet the SLAs is likely to result in a financial penalty being imposed on the service provider. Proper management of incidents is key to successfully delivering break-fix services. Since the Incident Manager owns the Incident Management process, the day-to-day responsibility of ensuring smooth service becomes key to fulfilment of the contract.
The Incident Manager role can be a good stepping-stone for support analysts to get into higher IT Service Management roles. An Incident Manager gets good exposure to the relationship between day-to-day operations and service level management. Incident Managers need to be very thorough with the Incident Management process being followed because of the continuous time pressure arising from the SLAs in force.
1. What is an incident? How does it relate to any other event?
Events happen all the time within an IT system. An event is defined as a detectable change of state that has significance for IT management or for the delivery of the IT service. Events are generally notifications created by the IT service monitoring tools or by the system itself (i.e. the Configuration Item (CI)). For example, when disk space usage reaches a certain percentage (the event), an alert may be generated, but an incident may not be, as there is still some capacity remaining. However, when the disk becomes 100% full (another event) and no more data can be written, an incident will be generated because the normal function of writing to the disk can no longer happen.
Incidents can also be reported by users to the Service Desk or through a self-help tool.
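The disk-space example above can be sketched in a few lines of code. This is only an illustration, not taken from any monitoring product: the 80% alert threshold, the function name `classify_disk_event` and the `Outcome` labels are all assumptions made for the example (the text only says "a certain percentage").

```python
from enum import Enum


class Outcome(Enum):
    NO_ACTION = "no action"
    ALERT = "alert"
    INCIDENT = "incident"


# Hypothetical threshold; the text only says "a certain percentage".
ALERT_THRESHOLD_PERCENT = 80.0


def classify_disk_event(used_percent: float) -> Outcome:
    """Map a disk-usage event to the follow-up it warrants."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent >= 100.0:
        # Disk is full: writes fail, so the normal service is interrupted.
        return Outcome.INCIDENT
    if used_percent >= ALERT_THRESHOLD_PERCENT:
        # Capacity remains, so this event only raises an alert.
        return Outcome.ALERT
    return Outcome.NO_ACTION
```

Every call represents an event, but only the full-disk case is turned into an incident, which mirrors the distinction drawn in the answer.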
Sophisticated incident management systems also allow users to log an incident by sending an email to a predefined email address. Incidents may also be logged by technical staff, the Service Desk, and the first and second lines of support.
As the above scenario shows, incidents can be defined as unplanned interruptions to an IT service or reductions in the quality of a service. A failed CI can also warrant an incident even when the service is not disrupted: for example, the failure of one component of a continuous availability (CA) cluster does not interrupt the service, but it still needs to be fixed by logging an incident (usually generated automatically by the event). The impact of an incident may be reduced by implementing a workaround, e.g. restarting a failed CI.
All incidents are subject to SLAs. Response and resolution SLAs involve timescales that are predefined per incident priority, and failure to meet them results in escalation, customer dissatisfaction and, consequently, financial penalty. Major incidents are a special kind of incident with shorter timescales and greater urgency.
2. You are the IT Delivery Manager for a project where the systems are not very stable. Due to cost pressures, you are contemplating having the same person look after Incident and Problem Management. Will this work?
To answer this question, let us first understand the role of the Incident Manager. An incident is an event that disrupts the normal functioning of the system. The responsibility of the Incident Manager is to fix the incident or provide a workaround that makes the system work as close to normal as possible. The Incident Manager must fulfil the SLAs that have been agreed. It does not really matter whether a deep investigation into why the incident happened has been carried out.
Once the resolution or the workaround is in place, the system is expected to start functioning normally or nearly so.
To find out what caused the incident, a deep dive into the symptoms and the context in which the incident occurred is necessary. The Problem Manager's role is that of an investigator, and he must involve other groups such as application management, operations management and the product vendor. The Problem Manager must also analyze data that can be requested from the Service Desk. All this typically takes much longer, and SLA timescales may not leave room for it.
A good Incident Manager is expected to be SLA-focused and would not be concerned with finding the root cause of the incident. A good Problem Manager, on the other hand, will try to get to the root cause no matter how much time and effort it takes. An Incident Manager’s responsibility is to make the system work again quickly after an incident occurs; a Problem Manager’s responsibility is to prevent the incident from happening again.
Considering the above, it is evident that the Incident Manager and Problem Manager roles pull in nearly opposite directions, and combining them is therefore not a good idea. At the same time, note that they share one common goal: to provide a smooth service.
3. What are the responsibilities of an Incident Manager?
The Incident Manager is the owner of the Incident Management process. As process owner, he is responsible for monitoring the effectiveness of, and compliance with, the Incident Management process. He is also responsible for seeking suggestions for improving the process in the long run. On a day-to-day basis, the Incident Manager manages the work of the first- and second-line support staff.
He may assign incidents based on individual workload and monitor the number of incidents flowing through the process. For example, at any given time he should be aware of how many incidents are being worked on by the team, how many are queued up with his staff, how many are waiting for user input, and so on. He also needs to do a quality check on the incidents, e.g. whether an incident has been put on hold for the right reasons and whether the solution has been adequately documented.
The Incident Manager is also responsible for managing the Incident Management tool and should ideally have in-depth operating knowledge of it. Since the Incident Management process is bound by SLAs, he needs to ensure that the support staff are trained to use the tool effectively. Improper use of the tool may reflect poorly on the service levels.
When major incidents occur, the Incident Manager acts as the pivot around whom all the other groups (support staff, Service Desk, application management, technical management, leadership team and the impacted stakeholders) revolve. He is accountable for ensuring that the major incident process is followed, key people are kept informed, and a post-incident major incident review takes place. The Incident Manager’s role is extremely dynamic; most of the time he will be on his toes.
4. Have you heard about first, second and third lines of support?
These are different roles in the Incident Management process. The first line is usually the Service Desk. It acts as the first point of contact for users, who can reach it by raising tickets in a self-help ticketing tool, over the phone, via online chat or by e-mail. Being the first line, the Service Desk is responsible for logging all incident and service request details accurately, categorizing them and assigning the tickets to the right queue or group (second line).
Often, the Service Desk is equipped with the information needed for an initial investigation and diagnosis of the reported issue. For example, when a user reports a browser-related issue, the Service Desk may guide the user step by step to check the proxy settings. If the settings are not as expected, they will guide the user to enter the correct details. Only if this does not resolve the issue will an incident be logged and assigned to the second line of support.
The second line of support comprises staff who are more technical than the Service Desk but are still not the technical or application specialists. If the team is big enough, it may include a certain number of more technical or functional staff. They are not responsible for communicating with end users, so they can stay focused on resolving the incident.
The third line consists of the specialists. They have much deeper technical or functional knowledge than the second line and will typically be organized into their own functions or departments, such as server administrators, network support and Active Directory support. An incident may require multiple third-line groups for resolution.
5. Describe the incident lifecycle.
Every incident goes through the following lifecycle:
6. What are the differences between ‘prioritization’ and ‘categorization’?
Categorization may be aligned to the asset hierarchy in the Configuration Management Database (CMDB). Proper categorization helps the support group narrow down and zero in on the potential source of the incident. A correct categorization hierarchy frees up more SLA time for the investigation and diagnosis of the incident. Since categorization is mostly done by the Service Desk, they should be given the relevant guidance to categorize correctly.
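As a rough sketch of how a category hierarchy can drive assignment, the snippet below routes an incident to a support group by matching the most specific known category prefix. The category names, the `ROUTING` table, the default group and the function `route_incident` are all hypothetical; a real hierarchy would come from the CMDB and the organisation's own support structure.

```python
# Hypothetical mapping from category path to support group.
ROUTING = {
    ("hardware", "server"): "server administrators",
    ("hardware", "network"): "network support",
    ("software", "directory services"): "active directory support",
}

# Assumed fallback when no category prefix matches.
DEFAULT_GROUP = "second line support"


def route_incident(category_path: tuple[str, ...]) -> str:
    """Return the group mapped to the most specific matching category prefix."""
    normalized = tuple(part.strip().lower() for part in category_path)
    # Try the full path first, then progressively shorter prefixes.
    for depth in range(len(normalized), 0, -1):
        group = ROUTING.get(normalized[:depth])
        if group is not None:
            return group
    return DEFAULT_GROUP


print(route_incident(("Hardware", "Server", "Disk")))
```

The more precisely the Service Desk categorizes, the deeper the match and the less time is lost reassigning the ticket between groups.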
7. When does an incident get escalated? What do you mean by escalation?
Any incident is first logged with the Service Desk. If the Service Desk cannot resolve it within the agreed timeframe, or realizes that resolution requires deeper expertise, it escalates the incident to the next level. Similarly, when the second line of support realizes that it needs more in-depth expertise from the application or operations management team, it escalates the incident to the third line of support. This is called functional escalation. The number of functional escalation levels may vary from project to project.
If the project uses a commercially available software product, some incidents may require the intervention of the product team, which is typically an external organisation. This is also a functional escalation. In this case, however, there may be Operational Level Agreements (OLAs) or Underpinning Contracts (UCs) with the external support group.
There is another kind of escalation called hierarchical escalation. If an incident has high priority, e.g. a Priority 1 incident, the IT managers must be informed regardless of whether the time to resolve it has elapsed and regardless of whether the group currently handling the incident is capable of resolving it. Priority 1 incidents typically arise when services are unavailable, causing high customer dissatisfaction; IT managers must therefore know about them from the moment the incident is logged. IT managers may respond to such escalations by assigning additional staff or calling in subject matter experts early in the incident lifecycle.
Hierarchical escalation can also be triggered by the user or customer, i.e. the person who logged the incident. One reason for doing so is that the user is not satisfied with the resolution provided. Such a resolution is called ‘not first-time-right’.
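The two escalation types described above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not how any particular tool implements it: the per-line time limits, the `Incident` fields and the function `escalations_due` are hypothetical, and escalation to an external vendor and user-triggered escalation are not modelled.

```python
from dataclasses import dataclass

# Hypothetical time limits (minutes) before each line must escalate.
LINE_TIME_LIMIT_MINUTES = {1: 30, 2: 120}


@dataclass
class Incident:
    priority: int                  # 1 is the highest priority
    minutes_at_current_line: int
    current_line: int              # 1 = Service Desk, 2 = second line, 3 = third line
    needs_deeper_expertise: bool = False


def escalations_due(incident: Incident) -> set[str]:
    """Return which escalation types apply to the incident right now."""
    due = set()
    # Hierarchical: Priority 1 incidents are reported to IT managers
    # from the moment they are logged, regardless of elapsed time.
    if incident.priority == 1:
        due.add("hierarchical")
    # Functional: hand over to the next line on timeout or lack of expertise.
    limit = LINE_TIME_LIMIT_MINUTES.get(incident.current_line)
    if limit is not None:
        timed_out = incident.minutes_at_current_line >= limit
        if timed_out or incident.needs_deeper_expertise:
            due.add("functional")
    return due


print(escalations_due(Incident(priority=1, minutes_at_current_line=45, current_line=1)))
```

Note that the two checks are independent: a Priority 1 incident that has just been logged is escalated hierarchically even though no functional escalation is due yet.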
The number of levels, the timescales and the quality-of-resolution requirements for both functional and hierarchical escalation need to be agreed, and SLA targets defined in the contract. Major support tools such as Remedy and ServiceNow usually include automatic escalation management functionality that product experts can configure as needed.
8. Who owns an Incident?
Incident ownership always remains with the Service Desk, regardless of where the incident is in its lifecycle, its priority, or whether it has been hierarchically or functionally escalated. The Incident Manager owns the Incident Management process, but the Service Desk owns the incident. The Service Desk remains responsible for tracking the progress of the incident, keeping users informed about the status of its resolution and ultimately ensuring that the incident is closed.
When an incident is being resolved by another support group, e.g. the second or third line of support, an effective communication mechanism is needed to keep the Service Desk updated on progress. This is usually achieved through the Incident Management tool by updating fields such as ‘solution description’. Often, support groups update this field only when the incident has finally been resolved, which leads to internal miscommunication. Because the Service Desk does not know the status, it cannot give the correct status to the user or customer who logged the incident, which may ultimately result in customer dissatisfaction.
For hierarchically escalated or high-priority incidents, the Service Desk acts as a communication hub. The Incident Manager must keep it appropriately informed and leverage it to ensure that communication flows horizontally, vertically and even externally to the customer.
The Incident Manager must maintain a constructive working relationship with the Service Desk manager and the Service Desk.
9. What are the typical metrics for an Incident?
Incident Management reports are produced by the Incident Manager. Since the Service Desk owns every incident, the Incident Manager must prepare these reports in close collaboration with the Service Desk and the support groups handling the incidents; usually, quantitative data is provided by the Service Desk and qualitative data (e.g. on functional and technical aspects) by the support groups. Many modern Incident Management tools provide automatic dashboards that give senior IT Service Management a real-time view of the status of the Incident Management process. Some important metrics for incidents are as follows:
10. What is a service request? Who works on it?
A Service Request (SR) is logged by a user seeking information or advice, requesting a standard change, or requesting access to an IT service. The request fulfilment process is put in place so that standard services can be provided when needed without going through the approval cycle every time. Standard services are those for which pre-approval exists, e.g. when a new employee needs access to the company intranet.
SRs may also be logged when the user needs information or advice, e.g. when an account holder calls the Service Desk to enquire about a change of her mailing address in the system or to check the status of an incoming payment. In this case an SR is logged for the record (e.g. for later audit) and to ensure that the time spent by the Service Desk in responding to the user can be justified.
Some SRs recur frequently, so a predefined process flow can be devised that includes the steps and information needed to fulfil the request, the individuals who must be involved, the timelines for fulfilment and the escalation paths when it is not fulfilled. These are predefined SR models. They ensure consistency of resolution and assign accountability for the model to the appropriate service groups. For example, a new employee may be provided with a laptop that has a ‘standard build’ with specific software. The cost of each new laptop and the installation of the specific software have already been approved by IT and possibly the Information Security department. There is no need to seek approval every time a new employee joins.
As with incidents, ownership of SRs lies with the Service Desk. In an IT system, service requests will generally follow predefined models.
Often, there can be confusion about what qualifies as an incident versus a service request. It may so happen that a few incidents cause SR models to be created or changed. For example, in an IT service provider organisation, it is observed that most new employees report incidents related to the pre-installed anti-virus software on their laptops. After analysing these incidents, the IT department may decide to install different anti-virus software in future standard builds. |
|