Incident Management Process:
24x7 Response and Control
April 6, 2005
V1.12
Revision History
Version | Date | Author | Notes
1.08 | 23 Feb 2005 | Nan McKenna | (Initial tracked version)
1.09 | 15 Mar 2005 | Erik Cummings | Extract “return to work” as Appendix C, add proposed 15/30 minute response times. Add “Revision History” page. Differentiate between “Initial PCG Incident Classification” and “Final Incident Classification”. Added PCG Process Flowchart.
1.10 | 22 March 2005 | Erik Cummings | Updated Revision History Table. From header, removed “Draft”. In header, body of document, moved “Operations Excellence” to top left margin, placed “Incident Management Process” top, right margin. Re-applied styles, numbering, and organization.
1.11 | 23 March 2005 | Bruce Campbell | Added ‘On-Call’ to Appendix E. Added ‘Management On-Call’ to Appendix F. Turned off numbering in Appendixes E & F. Re-organized appendixes so that process flow diagrams were “one-after-the-other”. Updated references to the various appendixes throughout document.
1.12 | 04 April 2005 | Erik Cummings | Reworded Section 2.3 a ‘Note’. Removed Appendix C (PCG Process). Renumbered Appendix DC and any references to it. Changed Appendix D (now C! - Communications Matrix): removed Contact Action, 1st and 2nd Level Notification columns; added Client Comm Interval and SME Work Started columns. Added new Appendix D – Priority and Internal Response Time Commitments. Added new definitions – Priority, Impact, Urgency.
Table 1 Revision History
8/8/2008
v1.12
Page ii
Table of Contents
Revision History
Table of Figures
List of Tables
1.0 Executive Summary
  1.1 Document Contents
  1.2 Intended Audience
2.0 Background
  2.1 Primary Responsibilities of the Production Control Group
  2.2 Incident reporting and escalation techniques will:
  2.3 Additional Responsibilities of the Production Control Group
3.0 Roles and Definitions
4.0 Process Review
  4.1 Process Outline
  4.2 Incident Detection and Reporting
  4.3 Incident Level Classification: See Appendices C and D
  4.4 Incident Notification
  4.5 Incident Escalation
  4.6 Incident Resolution
  4.7 Post-Incident Activities
5.0 Detailed Incident Control Process
  5.1 Detailed Process Flow Explanation Table. Reference Appendix A
6.0 High-Level Incident Process Explanation
  6.1 Detailed Process Explanation: See Appendix B
7.0 Outstanding Issues
  7.1 A common paging system is required
  7.2 Definition of Service Hours
  7.3 Definition of “availability,” “outage,” and “service degradation”
  7.4 Service-level procedures for client notification
Appendix A Incident Management Process Flowchart
Appendix B High-Level Incident Management Process Flow
Appendix C Incident Level Communications Matrix
Appendix D Priorities and Internal Response Times
Appendix E On-Call Guidelines
  Guideline Purpose
  Duties
  Responsibilities
  Communications
  Communications Elements
  Notification Protocol
  Initial Communications Tracking
  Response Protocol
  Scheduling
Appendix F Management On-Call Guidelines
  Return-To-Work Guidelines
Table of Figures
Figure 1 Incident Detection and Reporting
Figure 2 High-Level Incident Management Process Flow
List of Tables
Table 1 Revision History
Table 2 Detailed Incident Control Process
Table 3 Explanation of High-Level Incident Management Process Flow
Table 4 Incident Level Classification Matrix
Table 5 Return-To-Work Guidelines
Operations Excellence
Incident Management Process
1.0 Executive Summary
1.1 Document Contents
This document describes the processes by which the new Production Control Group will respond to, manage, and resolve incidents quickly and efficiently. It includes on-call definitions and guidelines, escalation processes, process flow diagrams, and data tables. It also sets general expectations, defines roles and responsibilities, and provides general guidelines.
1.2 Intended Audience
This document is intended for executive-level and management personnel and for ITSS personnel, including everyone involved in this process, such as Subject Matter Experts (SMEs), Technical Leads, Line Managers, Systems Administrators, DBAs, project leaders, and facilities personnel.
2.0 Background
Most services supported by ITSS are expected to be available 24x7. It is therefore in the best interest of the ITSS Shared Services workgroups, and of ITSS as a whole, to establish a combined staff, the Production Control Group (PCG), dedicated to proactively managing and responding to events as they occur. Eventually, the PCG will evaluate incidents and, depending on the severity of the event, escalate them to upper management. In some situations, more experienced technical personnel will take action to effect repairs and/or restore services. As the PCG acquires experience, and as ITSS adds monitoring and troubleshooting capability, the PCG will assume additional incident response responsibilities.
2.1 Primary Responsibilities of the Production Control Group
Managing and controlling a widespread service outage, including incident reporting and escalation.
2.2 Incident reporting and escalation techniques will:
a. Specify a point of contact (owner) for all issues and ensure that services are restored through the prudent use of departmental resources, including documentation of the incident from its beginning to its resolution.
b. Effectively manage the communication of information within ITSS when there are issues that actually or potentially impact ITSS-supported services or facilities.
c. Proactively respond to issues that impact ITSS-supported services and facilities; evaluate, classify, escalate, and manage service restoration efforts as efficiently and expeditiously as possible, up through incident resolution.
2.3 Additional Responsibilities of the Production Control Group
Note: It is anticipated that a single shift of the PCG will NOT be fully occupied by resolving issues. Because of this, the supplemental duties and tasks detailed below will be assigned.
1. Assist offsite Subject Matter Experts by performing requested tasks, such as visual inspections of hardware and cycling the power on equipment as instructed.
2. Manage and prepare magnetic media for rotation, offsite shipment, and storage, including organizing and filing transmittal logs.
3. Control building and facility access, and escort vendors to restricted areas for the inspection, maintenance, and repair of equipment.
4. Monitor building, facility, and data center environmental systems, such as air conditioning, fire suppression, and lighting, and log the times and results of the monitoring activity.
5. After normal working hours, perform 1st tier triage of reported issues; classify and escalate as necessary.
6. Receive and log calls from end users, generate Remedy tickets, and escalate as necessary.
7. Set up video/telephone conferences.
8. Accept and sign for emergency delivery of replacement parts from vendors.
9. Perform other tasks deemed necessary by department supervision.
3.0 Roles and Definitions
• Account Manager – A member of the ITSS Account Management team in Client Support who is responsible for the relationship with one or several key clients (e.g. GSB, H&S, Libraries)
• Client – A primary paying customer of ITSS services and support
• End User – Person who directly uses a service. An end user could be internal or external to ITSS. End users are directly impacted during an outage, and generally have an established relationship with the Client or Service Owner
• Impact – Level of effect on the Stanford Campus. This is relative to the Campus as a whole, not specifically to the client. (Values = Campus-Wide, Major – School or Dept wide, Minor – Group or Single User, and Non-Service Affecting)
• Incident Manager – The Shared Services Line Manager who is designated as responsible for a specific incident
• Incident/Event/Problem/Issue – For the purposes of this document, these terms are intended to mean a failure of any component of any system or service, and are used interchangeably throughout this document
• ITSS Client Support – Group which does client relations, account management, functional analysis, sales & marketing, documentation, software licensing, end user training, and Help Desk and CRC support
• ITSS Engineering and Projects – Group which does technology R&D, service enhancements, new product and service projects
• ITSS Shared Services – Group which does operations
• ITSS Strategic Planning – Includes technology strategy & architecture and finance groups
• Line Manager – Workgroup managers in ITSS Shared Services
• On-Call Subject Matter Expert (SME) – SME (see below) who is designated to be available to respond to reported outages, triage the incident, perform the needed tasks to restore services, assist other workgroups in the restoration process, or determine which other members within their own workgroup are needed to assist in service restoration
• Operations Owner – The ITSS staff person who has the ultimate authority for a service, including its functionality and approval for any changes to the service
• Priority – Level of response and effort directed towards resolving an incident. It is determined by the inherent service level commitment of the service, as well as a combination of Urgency and Impact. Priority is sometimes referred to as “severity”. (Values = Urgent, High, Medium, Low)
• Product Manager – Owns product quality and client satisfaction for a service
• Production Control Group (PCG) – Group which will perform monitoring and basic problem determination and evaluation, escalation, communication and, in some cases, incident resolution
• Subject Matter Expert (SME) – Any technical ITSS staff person whose job requires extensive technical knowledge of network and service components and their related requirements. SMEs are considered experts and possess a detailed knowledge of service functionality, restoration, and component/service repair.
• Satellite Operations Center (SOC) – The SOC is a partner with the University Emergency Operations Center (EOC) during Level 2 (major building fire, extended power outage) or Level 3 (major earthquake or extensive flooding) emergencies. The ITSS SOC team provides real-time field information to the EOC as well as coordinating and directing emergency responses.
• Urgency – End user or client’s assessment of the importance and/or urgency of the issue as it affects their ability to perform their work. This value is provided by the customer. (Values = Urgent, High, Medium, Low)
4.0 Process Review
4.1 Process Outline
Note: There are six major steps in this process, from incident detection through root cause analysis and the implementation of preventative measures.
4.2 Incident Detection and Reporting
a. An incident can be reported by:
1. An end user
2. A client
3. An SME
4. Automated monitoring
b. It is important that information be shared between and among groups.
c. The process for reporting problems differs between “normal” working hours (8:00 A.M. to 5:00 P.M., M-F) and after hours.
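The distinction between normal working hours and after hours can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the path labels (`help_desk`, `pcg_after_hours`) are hypothetical, the 5:00 P.M. boundary is treated as already after hours, and holidays are ignored because the document does not define them.

```python
from datetime import datetime, time

# Normal working hours as stated in section 4.2: 8:00 A.M. to 5:00 P.M., M-F.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(17, 0)


def is_normal_working_hours(moment: datetime) -> bool:
    """Return True when `moment` falls on a weekday between 8:00 and 17:00.

    Assumption: the end boundary is exclusive, so 17:00 counts as after hours.
    """
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= moment.time() < BUSINESS_END


def reporting_path(moment: datetime) -> str:
    """Pick a reporting path label (hypothetical labels, not official names)."""
    return "help_desk" if is_normal_working_hours(moment) else "pcg_after_hours"
```

For example, a report received at 10:30 P.M. on a Wednesday would follow the after-hours path, where the PCG decides whether the call is urgent.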
4.3 Incident Level Classification: See Appendices C and D
This includes assigning a severity level to the incident and entering the incident into the Remedy incident tracking system.
4.4 Incident Notification
This includes notification to an ITSS Incident Manager and to clients. It also includes outage information posted on the SU Web site and Cable TV, informational messages left on the designated voice mail box, email sent to designated personnel, and other client notification as deemed appropriate.
4.5 Incident Escalation
This includes escalation to the ITSS Incident Manager, and any subsequent escalation calls deemed necessary. Note that the severity level dictates whom to contact in the management chain of command, and when to provide them with status reports. Additionally, the PCG will determine whether or not the incident needs to be escalated to the SOC.
4.6 Incident Resolution
a. This covers work performed during the incident itself, with responsibilities as follows:
b. The Incident Manager is responsible and accountable for the overall recovery effort, performing the following functions:
1. Establishing recovery priorities
2. Coordinating and delegating responsibilities as they relate to the recovery effort
3. Issuing requests for additional resources
4. Ensuring the participation of critical internal and external support groups and vendors, such as the recall of media from the off-site storage vendor, or the purchase of replacement parts and equipment
5. Reviewing and approving tactical plans
6. Communicating incident status to ITSS management/executives as needed
7. Working with Client Support to approve and authorize the release of information to other schools and departments
c. SMEs and Line Managers are responsible for analyzing technical problems and making technical decisions, implementing tactical plans, and communicating to other SMEs as well as the Incident Manager.
d. The PCG is responsible for coordination of the incident resolution effort and for communication as deemed necessary.
4.7 Post-Incident Activities
This covers the activities after the incident is resolved.
1. Ensure that any post-incident cleanup is completed.
2. Perform root cause analysis of the incident.
3. To avoid similar future incidents, determine what process improvements and preventative measures can be put into place.
4. Implement changes in process or technical support as appropriate.
5. Ensure that the PCG receives feedback and input from the user community.
6. Perform client follow-up and ensure that an incident response quality survey form is available for end-user and client feedback.
5.0 Detailed Incident Control Process
5.1 Detailed Process Flow Explanation Table. Reference Appendix A
Process # | Process Name | Detailed Description | Action By
Incident Detection and Reporting
1 | Problem Reporting: End Users | End users will call 5-HELP or use the web at http://helpsu.stanford.edu/. Telephone calls are directed to the ITSS Help Desk, where the problem is evaluated. If the Help Desk (any tier) determines that this is an urgent incident, the call/ticket should be directly escalated to the PCG. | End User
2 | Problem Reporting: Clients | In most cases, clients should call 5-HELP or use the web at http://helpsu.stanford.edu/. In some special cases, clients may have direct access to the PCG for reporting problems and receiving updates. In this case, skip to step 12. | Client
3 | Problem Reporting: End Users After Hours | If an end user calls 5-HELP after hours, the user will get the recorded phone tree. Users can choose to get through to the PCG directly, or leave a recorded message. For after-hours calls, the PCG will determine whether the call is urgent. If the issue is not urgent, the PCG will enter a ticket in Remedy for review the following business day. | End User, PCG
4 | Problem Reporting: Monitoring to SMEs | In some cases, monitoring may notify an SME of a problem before a user, client, or the PCG. If the issue is urgent, escalate directly to the PCG for coordination and entry into Remedy. | SME
5 | Problem Reporting: Monitoring to PCG | Monitoring reports information directly to the PCG. | PCG
6 | Resolve? | Help Desk assesses whether the ticket can be resolved at this point. If so, the Help Desk will resolve and close. | Help Desk
7 | Urgent? | If the ticket cannot be resolved, Help Desk determines whether the ticket should be forwarded to SME/Help Desk Tier 2 or to the PCG. | Help Desk
8 | Forward To SME | If the case does not appear to be severity Urgent/High, forward to SME. | Help Desk
9 | Resolve Quickly? | Can the case be resolved by the SME and is it Severity Level Medium/Low? | SME
10 | Enter Solution In Remedy | If the SME can quickly resolve the case, enter solution in Remedy and close ticket. | SME
11 | Forward To PCG | If the SME determines that there is impact beyond a simple fix and the Severity Level is Urgent/High, notify the PCG. | SME/PCG
Classification
12 | Assign Severity Level | Assign a severity level to the incident, using the standard ITSS categories (see Appendix C and D). The severity levels govern: the level of action to be taken by the Production Control Group; notification and escalation guidelines; time intervals in which to provide status reports; time intervals in which to initiate escalation and management decision processes. | PCG
13 | Enter In Remedy | Enter a ticket for the incident into the Remedy Help Desk application. | PCG
Table 2 Detailed Incident Control Process
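The Help Desk and SME decision points in Table 2 (steps 6 through 11) can be read as a small routing function. The sketch below is one possible reading, not an official implementation: the function names and the returned action labels are hypothetical, the Help Desk checks "Resolve?" before "Urgent?" to follow the step order of the table, and the `continue_sme_work` outcome is an assumption for the case the table does not cover (a Medium/Low case the SME cannot resolve quickly).

```python
from enum import Enum


class Severity(Enum):
    URGENT = "Urgent"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Severities that Table 2 routes to the PCG.
ESCALATE_TO_PCG = {Severity.URGENT, Severity.HIGH}


def help_desk_route(can_resolve: bool, severity: Severity) -> str:
    """Steps 6-8: resolve and close, else forward to the PCG or to an SME."""
    if can_resolve:
        return "resolve_and_close"          # step 6
    if severity in ESCALATE_TO_PCG:
        return "forward_to_pcg"             # step 7, urgent branch
    return "forward_to_sme"                 # step 8


def sme_route(can_resolve_quickly: bool, severity: Severity) -> str:
    """Steps 9-11: close quick Medium/Low fixes, notify the PCG for Urgent/High."""
    if severity in ESCALATE_TO_PCG:
        return "forward_to_pcg"             # step 11
    if can_resolve_quickly:
        return "enter_solution_in_remedy"   # step 10
    return "continue_sme_work"              # assumption: not specified in Table 2
```

Keeping the two roles in separate functions mirrors the "Action By" column, so each function only encodes decisions that the named group actually makes.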
6.0 High-Level Incident Process Explanation
6.1 Detailed Process Explanation: See Appendix B
Process Name | Detailed Description | Action By
Notification
SME | Notify appropriate SME(s) if necessary, using AMCOM on-call system. | PCG
Update itss-servicealerts@lists | Send a message to [email protected]. Message information will include: the date and time, a brief description of the problem, and, if available, the estimated time of resolution/restoration. | PCG
Post Messages To Web, Phone, TV | Web: Update status on down.stanford.edu. Telephone: In the event of a major network failure, update the designated voicemail box: 7-DOWN. SU Cable TV: ITSS can have pre-worded messages set for broadcast, where the group can just fill in the blanks. | PCG
Escalation
Notify Line Manager | Contact the Shared Services Line Manager of the affected system. If a Line Manager is unavailable, use the AMCOM system to determine the backup. | PCG
Determine Incident Manager | If the incident falls into the area of a single Line Manager, that Line Manager will contact the Incident Manager. If multiple Line Managers are involved, they must determine a single Incident Manager. | Shared Services Line Managers
Send Email | Send first email to appropriate lists/clients, based on Service Level Agreements. Use the [email protected] list for campuswide outages; the Incident Manager should approve any messages which go to this list. | PCG, Incident Manager
Escalate To Senior Management | The Severity Level (see Appendix C and D) will determine the escalation to management. | PCG
Resolution
Incident Management | The Incident Manager will take ownership of the problem and manage the incident. Responsibilities: establish priorities; coordinate and delegate responsibilities in regard to the recovery effort; request additional internal or external resources; ensure and manage the participation of critical internal and external support groups and vendors; review and approve tactical plans; communicate incident status to ITSS management/executives as needed; work with Client Support to release information as needed to clients/users across campus. |
Resolve Incident | SMEs are responsible for analyzing technical problems, implementing tactical plans, and communicating with other SMEs and with the PCG. | SMEs
Post Resolution Information To Web, Phone, TV | Message information will include: the date and time, a brief description of the problem, and, if available, the estimated time of resolution/restoration. Web: Update status on down.stanford.edu. Telephone: In the event of a major network failure, update the designated voicemail box: 7-DOWN. SU Cable TV: ITSS can have pre-worded messages set for broadcast, where the group can just fill in the blanks. | PCG
Post Incident Analysis
Complete Cleanup Tasks | Determine whether cleanup is required, and identify who will own and perform the additional clean-up tasks. | SME, PCG
Root Cause Analysis | It is the responsibility of the manager of the PCG to initiate root cause analysis, collecting as much information as possible, and to ensure that any information which will help in resolving future incidents is entered into the related Remedy ticket for future use. | PCG Manager
Incident Prevention | Determine processes which can be implemented to prevent a repeat of the incident. | Shared Services Managers, SMEs
Client/User Follow-up | Ensure selected members of the recovery team make follow-up calls to the affected users, to solicit their constructive comments. Share results of the analysis with workgroups and clients where appropriate. | PCG
Quality Survey | ITSS will make an on-line survey available for user/client feedback, and for ITSS staff. The PCG is responsible for tallying survey results and making them available to the appropriate ITSS staff and managers. | PCG
Table 3 Explanation of High-Level Incident Management Process Flow
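Table 3 lists three elements for every status message: the date and time, a brief description of the problem, and, if available, the estimated time of resolution/restoration. A minimal sketch of assembling such a message follows. The function name, field labels, and timestamp format are assumptions made for illustration; the document does not prescribe a message layout.

```python
from datetime import datetime
from typing import Optional


def build_status_message(
    reported_at: datetime,
    description: str,
    estimated_restoration: Optional[datetime] = None,
) -> str:
    """Assemble the message elements named in Table 3.

    The labels and the timestamp format are illustrative assumptions.
    """
    lines = [
        f"Date/Time: {reported_at:%Y-%m-%d %H:%M}",
        f"Problem: {description.strip()}",
    ]
    # The estimate is optional: Table 3 says to include it only "if available".
    if estimated_restoration is not None:
        lines.append(f"Estimated restoration: {estimated_restoration:%Y-%m-%d %H:%M}")
    return "\n".join(lines)
```

The same text could then be reused for the mailing list, the down.stanford.edu status page, and the 7-DOWN voicemail script, which keeps the three channels consistent.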
7.0 Outstanding Issues
7.1 A common paging system is required
a. AMCOM for manual paging
b. What to use for automated paging from monitoring systems?
7.2 Definition of Service Hours
7.3 Definition of “availability,” “outage,” and “service degradation”
7.4 Service-level procedures for client notification
Appendix A Incident Management Process Flowchart
Reference Table 2 Detailed Incident Control Process. Note that the circled numbers in the flowchart correspond to the process numbers in Table 2.
[Flowchart: Incident Detection & Reporting. Lanes: End User, Client, Automated Monitoring, SME, Help Desk, PCG. Steps 1 through 13 as listed in Table 2, from Report Problem: HelpSU/5-HELP through Enter Incident Ticket In Remedy.]
Figure 1 Incident Detection and Reporting
Appendix B High-Level Incident Management Process Flow
[Flow diagram. Phases: Detection & Reporting, Classification, Escalation, System Status Notification, Resolution, Post Incident Activities. Participants: Client, End User, HelpSU/5-HELP Help Desk Tier 1, Subject Matter Expert, Monitoring, Production Control Group, Line Manager, Duty Manager, Account Manager, PCG Manager, SOC/EOC. The Remedy Database is updated at each phase. Self-service status channels: itss-service-alert@lists, down.stanford.edu, 7-DOWN.]
Figure 2 High-Level Incident Management Process Flow
Appendix C Incident Level Communications Matrix
Level | Description | Client Update Interval | SME Work Started w/in:*
Urgent | A major service outage with significant and immediate business impact and no workaround. Large number of users; outage of significant length; no available workaround; mission/business critical. | Initial notification: Immediate. On-going: 30 minutes | ½ hour
High | A major service outage or degradation with significant business impact and an unsustainable workaround. Multiple users; work performance reduced; mission/business critical. | Initial notification: Immediate. On-going: 1 hour | 1 hour
Medium | A service outage or degradation with an acceptable workaround. Service-affecting; minimal performance degradation; affects non-critical business function. | As applicable. By SME working issue. | 4 business hours
Low | Non service-affecting. Cosmetic problem; system enhancement. | Upon issue resolution or as applicable. By SME working issue. | 1 business day
Incident Examples:
• Fire suppression system activation in data center
• Failure of storage system (storage area network, SAN)
• Failure of a server of a sensitive client or user
• Loss of electrical power
• Entire network switch, closet and/or building outages
• Failure of 1 or more high priority services – e.g. Exchange, Oracle Financials, HRMS, PeopleSoft
• Large denial of service attacks; successful hacking; loss or altering of data; theft of data; simultaneous virus infections
• SU telephony systems
• Severely degraded performance
• Smaller denial of service attacks
• Cannot connect to the internet, send or receive email
• Hardware failure, cannot access data, cannot print
• Degraded performance
• Previously requested enhancements to a system
Table 4 Incident Level Classification Matrix
* Note: This column indicates the maximum amount of time that will elapse before a technician begins working on an incident. Times will generally be much shorter for all severities.
Appendix D Priorities and Internal Response Times
Note: The following table refers to Priority, not to Urgency or Impact. Priority combines Urgency, Impact, and the existing Service Level Commitments for the service in question. This distinction is important: Urgency is offered by the customer, while Priority is assigned by the Helpdesk, PCG, and/or SME involved, from a system-wide perspective.
Usage: These Priority levels (and the associated Urgency and Impact values) are used to track incidents as they are reported and worked on. Priority, Urgency, and Impact each correspond directly to a Remedy ticket field.
Priority | Description | Committed Service Hours | PCG Call Initiate | SME Call Response | Escalation Interval | SME Work Started
Urgent | A major service outage with significant and immediate business impact and no workaround. Large number of users; outage of significant length; no available workaround; mission/business critical. | 24x7 | Immediate | 15 minutes | 10 minutes | 30 minutes
High | A major service outage or degradation with significant business impact and an unsustainable workaround. Multiple users; work performance reduced; mission/business critical. | 24x7 | Immediate | 15 minutes | 10 minutes | 1 hour
Medium | A service outage or degradation with an acceptable workaround. Service-affecting; minimal performance degradation; affects non-critical business function. | 8-5, M-F | Ticket Assignment/e Mail | As appropriate (work begins, work update, work completed) | Standard SME Group Remedy settings | 4 business hours
Low | Non service-affecting. Cosmetic problem; system enhancement. | 8-5, M-F | Ticket Assignment/e Mail | As appropriate (work begins, information required, work completed) | Standard SME Group Remedy settings | 1 business day
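The commitments in the Appendix D table lend themselves to a simple lookup structure. The following sketch encodes the table values as data and computes when an SME call-back is due for the two 24x7 priorities. The class, dictionary, and function names are hypothetical, and the Medium and Low rows keep their business-hour commitments as plain text because the document leaves the definition of service hours as an outstanding issue (section 7.2).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Commitment:
    service_hours: str
    sme_call_response: Optional[timedelta]    # None where the table gives no fixed time
    escalation_interval: Optional[timedelta]  # None where Remedy group settings apply
    sme_work_started: str                     # kept as text, as written in the table


# Values copied from the Appendix D table.
COMMITMENTS = {
    "Urgent": Commitment("24x7", timedelta(minutes=15), timedelta(minutes=10), "30 minutes"),
    "High": Commitment("24x7", timedelta(minutes=15), timedelta(minutes=10), "1 hour"),
    "Medium": Commitment("8-5, M-F", None, None, "4 business hours"),
    "Low": Commitment("8-5, M-F", None, None, "1 business day"),
}


def sme_callback_due(priority: str, paged_at: datetime) -> Optional[datetime]:
    """Return the call-back deadline, or None when no fixed time is committed."""
    commitment = COMMITMENTS[priority]
    if commitment.sme_call_response is None:
        return None
    return paged_at + commitment.sme_call_response
```

For an Urgent incident paged at 02:00, the sketch gives a call-back deadline of 02:15; for Medium and Low it returns `None`, reflecting that those rows are handled through ticket assignment rather than a timed call.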
Appendix E On-Call Guidelines
Guideline Purpose
To generally define and standardize:
• On-call duties and responsibilities
• A methodology for communications and engagement of problem determination and resolution
• On-call scheduling
• Response expectations/guidelines and general escalation processes in the event the 24x7 on-site group is engaged in an on-going event or incident
System-generated notifications will continue to be handled within the required time frames by the individual SME groups.
Duties
Requirements for on-call responsibility must be identified in the appropriate job descriptions, including: carrying a pager, cell phone, availability of the employee’s home phone number, and email.
Responsibilities
• Share on-call responsibilities with other members of the work group
• Begin working on the event as soon as notified. This may require working from home or traveling to work. The decision to make a physical appearance at work depends on the circumstances of the event, such as “swapping” hardware components or an on-site appearance by a vendor.
Communications
• Teleconference Phone Bridge – Telecom will have a teleconference number available to technical personnel and the PCG. This will be used when the expertise of multiple SMEs is required to resolve an incident. It will also allow the technical staff to communicate as a group. Additionally, the PCG will be able to determine the status of the incident first-hand and keep management informed without management actually being involved in the conference call.
• The AMCOM system will be the primary contact information/procedures lookup and paging tool for the 24x7 on-site groups.
• Staff will provide and track individual work group on-call schedules.
• The work group establishes the rotation.
• Members of the work groups are responsible for maintaining and keeping current the contact and coverage information in the on-call database.
Communications Elements
• Required communications devices: pager or cell phone, personal phone.
• Additional communications devices as recommended by the SME groups: DSL, Treo, wireless laptop, email.
Notification Protocol
• Initial outgoing page
• Re-page in 10 minutes
• If a call-back is NOT received from the designated on-call SME within 15 minutes, begin escalation to the next on-call person, including re-contacting the primary on-call person and the on-call Shared Services manager on all subsequent pages.
• Recipient to confirm garbled pages, follow call-back protocol.
Initial Communications Tracking
• Use AMCOM system for initial communications tracking
Response Protocol
• 15 minute call-back
• Within 30 minutes, be actively engaged in problem determination and resolution
• Actively engaged via: home system, wireless laptop, on-site
• SME groups may establish accelerated response profiles based upon their response criticality
Scheduling
• By SME group design
• SME schedule to be established and published in AMCOM system
• SME contact instructions to be included
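The Notification Protocol and Response Protocol above amount to a short timeline: page, re-page at 10 minutes, escalate at 15 minutes without a call-back, and active engagement within 30 minutes. A sketch of that timeline follows. The function names and action labels are hypothetical, and measuring the 30-minute engagement window from the initial page is an assumption, since the guideline does not state the starting point.

```python
from datetime import datetime, timedelta

# Intervals taken from the Notification Protocol and Response Protocol.
REPAGE_AFTER = timedelta(minutes=10)
ESCALATE_AFTER = timedelta(minutes=15)
ENGAGED_WITHIN = timedelta(minutes=30)  # assumption: counted from the initial page


def paging_timeline(initial_page: datetime) -> dict:
    """Return the key checkpoints that follow an initial outgoing page."""
    return {
        "initial_page": initial_page,
        "re_page": initial_page + REPAGE_AFTER,
        "begin_escalation": initial_page + ESCALATE_AFTER,
        "actively_engaged_by": initial_page + ENGAGED_WITHIN,
    }


def next_action(initial_page: datetime, now: datetime, callback_received: bool) -> str:
    """Decide what the pager operator should do at time `now`."""
    if callback_received:
        return "none"
    elapsed = now - initial_page
    if elapsed >= ESCALATE_AFTER:
        return "escalate"   # next on-call person, plus primary and on-call manager
    if elapsed >= REPAGE_AFTER:
        return "re_page"
    return "wait"
```

SME groups that establish accelerated response profiles, as the Response Protocol allows, could be modeled by substituting shorter intervals for the three constants.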
Appendix F Management On-Call Guidelines
Return-To-Work Guidelines
These guidelines are for Management to consider when an on-call representative has worked extended hours due to an outage/issue. They should be used to ensure there is always an effective on-call representative, while protecting the on-call SME from excessive work time.
If the primary on-call SME has already worked consecutive extended hours, or multiple shifts, and a new event has occurred, either the manager will provide a backup and notify the backup of their modified on-call status, or the entire group of SMEs will decide on an alternate SME to be used in this situation.
To allow staff members who are involved with an after-hours call-out on Sunday through Thursday to obtain adequate rest, the following is provided as a sample set of guidelines for a return-to-work policy:
On-Call SME works until | Report to work no later than
0200 | 1100
0300 | 1200
0400 | 1300
0500 | Take rest of day off
Table 5 Return-To-Work Guidelines
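Table 5 can be expressed as a small lookup. The sketch below treats each "works until" value as an upper bound, so a call-out that ends at 0230 falls into the 0300 row. That rounding rule, the function name, and the `None` result for end times the table does not cover (later than 0500, or before midnight) are all assumptions, since the table only lists four exact times.

```python
from datetime import time
from typing import Optional

# Rows of Table 5: (on-call SME works until, report to work no later than).
RETURN_TO_WORK = [
    (time(2, 0), "1100"),
    (time(3, 0), "1200"),
    (time(4, 0), "1300"),
    (time(5, 0), "Take rest of day off"),
]


def return_to_work(worked_until: time) -> Optional[str]:
    """Look up the sample guideline for a call-out ending at `worked_until`.

    Assumption: times between rows round up to the next row; times outside
    the table return None and are left to management judgment.
    """
    for cutoff, guidance in RETURN_TO_WORK:
        if worked_until <= cutoff:
            return guidance
    return None
```

Because the guidelines are described as a sample policy, the table is kept as data so that a work group could substitute its own rows without changing the lookup.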