Service Desk Incident Triage Matrix

July 17, 2016 | Author: a4agarwal | Category: N/A


Incident Management Process:

24x7 Response and Control

April 6, 2005

V1.12

Revision History

Version 1.08, 23 Feb 2005, Nan McKenna
Version 1.09, 15 Mar 2005, Erik Cummings
Version 1.10, 22 March 2005, Erik Cummings
Notes for versions 1.08 through 1.10:
• (Initial tracked version)
• Extract “return to work” as Appendix C, add proposed 15/30 minute response times.
• Add “Revision History” page.
• Differentiate between “Initial PCG Incident Classification” and “Final Incident Classification”.
• Added PCG Process Flowchart
• Updated Revision History Table
• From header, removed “Draft”
• In header, body of document, moved “Operations Excellence” to top left margin, placed “Incident Management Process” top, right margin
• Re-applied styles, numbering, and organization

Version 1.11, 23 March 2005, Bruce Campbell
Notes:
• Added ‘On-Call’ to Appendix E
• Added ‘Management On-Call’ to Appendix F
• Turned off numbering in Appendixes E & F
• Re-organized appendixes so that process flow diagrams were “one-after-the-other”
• Updated references to the various appendixes throughout document

Version 1.12, 04 April 2005, Erik Cummings
Notes:
• Reworded Section 2.3 a ‘Note’
• Removed Appendix C (PCG Process)
• Renumbered Appendix DC and any references to it.
• Changed Appendix D (now C! - Communications Matrix). Removed Contact Action, 1st and 2nd Level Notification columns. Added Client Comm Interval and SME Work Started columns.
• Added new Appendix D – Priority and Internal Response Time Commitments
• Added new definitions – Priority, Impact, Urgency

Table 1 Revision History

8/8/2008

v1.12

Page ii

Table of Contents

Revision History
Table of Figures
List of Tables
1.0 Executive Summary
    1.1 Document Contents
    1.2 Intended Audience
2.0 Background
    2.1 Primary Responsibilities of the Production Control Group
    2.2 Incident reporting and escalation techniques will:
    2.3 Additional Responsibilities of the Production Control Group
3.0 Roles and Definitions
4.0 Process Review
    4.1 Process Outline
    4.2 Incident Detection and Reporting
    4.3 Incident Level Classification: See Appendices C and D
    4.4 Incident Notification
    4.5 Incident Escalation
    4.6 Incident Resolution
    4.7 Post-Incident Activities
5.0 Detailed Incident Control Process
    5.1 Detailed Process Flow Explanation Table. Reference Appendix A
6.0 High-Level Incident Process Explanation
    6.1 Detailed Process Explanation: See Appendix B
7.0 Outstanding Issues
    7.1 A common paging system is required
    7.2 Definition of Service Hours
    7.3 Definition of “availability,” “outage,” and “service degradation”
    7.4 Service-level procedures for client notification
Appendix A Incident Management Process Flowchart
Appendix B High-Level Incident Management Process Flow
Appendix C Incident Level Communications Matrix
Appendix D Priorities and Internal Response Times
Appendix E On-Call Guidelines
    Guideline Purpose
    Duties
    Responsibilities
    Communications
    Communications Elements
    Notification Protocol
    Initial Communications Tracking
    Response Protocol
    Scheduling
Appendix F Management On-Call Guidelines
    Return-To-Work Guidelines


Table of Figures

Figure 1 Incident Detection and Reporting
Figure 2 High-Level Incident Management Process Flow


List of Tables

Table 1 Revision History
Table 2 Detailed Incident Control Process
Table 3 Explanation of High-Level Incident Management Process Flow
Table 4 Incident Level Classification Matrix
Table 5 Return-To-Work Guidelines


Operations Excellence
Incident Management Process

1.0 Executive Summary

1.1 Document Contents

This document describes the processes the new Production Control Group will use to respond to, manage, and resolve incidents quickly and efficiently. It includes on-call definitions and guidelines, escalation processes, process flow diagrams, and data tables. It also sets general expectations, defines roles and responsibilities, and provides general guidelines.

1.2 Intended Audience

This document is intended for executive-level and management personnel and for ITSS personnel, including everyone involved in this process: Subject Matter Experts (SMEs), Technical Leads, Line Managers, Systems Administrators, DBAs, project leaders, and facilities personnel.

2.0 Background

It is expected that most services supported by ITSS are available 24x7. Given this expectation, it is in the best interest of the ITSS Shared Services workgroups, and of ITSS as a whole, to establish a combined staff, the Production Control Group (PCG), dedicated to proactively managing and responding to events as they occur. Eventually, the role of the PCG will include evaluating incidents and, depending on the severity of the event, escalating to upper management. In some situations, the more experienced technical personnel will take action to effect repairs and/or restore services. As the PCG acquires experience, and as ITSS adds monitoring and troubleshooting capability, the PCG will assume additional incident response responsibilities.

2.1 Primary Responsibilities of the Production Control Group

Managing and controlling a widespread service outage, including incident reporting and escalation.

2.2 Incident reporting and escalation techniques will:

a. Specify a point of contact (owner) for all issues and ensure that services are restored through the prudent use of departmental resources, including documentation of the incident from its beginning to its resolution.

b. Effectively manage the communication of information within ITSS when there are issues that actually or potentially impact ITSS-supported services or facilities.

c. Proactively respond to issues that impact ITSS-supported services and facilities; evaluate, classify, escalate, and manage service restoration efforts as efficiently and expeditiously as possible, up through incident resolution.

2.3 Additional Responsibilities of the Production Control Group

Note: It is anticipated that any single shift of the PCG will NOT be consumed by continuously resolving issues. Because of this, the supplemental duties and tasks detailed below will be assigned.

1. Assist offsite Subject Matter Experts by performing requested tasks, such as visual inspections of hardware and recycling the power on equipment as instructed.
2. Manage and prepare magnetic media for rotation, offsite shipment, and storage, including organizing and filing transmittal logs.
3. Control building and facility access, and escort vendors to restricted areas for the purposes of inspection, maintenance, and repair of equipment.
4. Monitor building, facility, and data center environmentals, such as air conditioning, fire suppression system, lighting, and so on, and log times and results of the monitoring activity.
5. After normal working hours, perform 1st tier triage of reported issues, and classify and escalate as necessary.
6. Receive and log calls from end users, generate Remedy tickets, and escalate as necessary.
7. Set up Video/Telephone conferences.
8. Accept and sign for emergency delivery of replacement parts from vendors.
9. Perform other tasks deemed necessary by department supervision.

3.0 Roles and Definitions

• Account Manager – A member of the ITSS Account Management team in Client Support who is responsible for the relationship with one or several key clients (e.g. GSB, H&S, Libraries)
• Client – A primary paying customer of ITSS services and support
• End User – Person who directly uses a service. An end user could be internal or external to ITSS. End users are directly impacted during an outage, and generally have an established relationship with the Client or Service Owner
• Impact – Level of effect or impact on the Stanford Campus. This is relative to the Campus as a whole, not specifically to the client. (Values = Campus-Wide, Major – School or Dept wide, Minor – Group or Single User, and Non-Service Affecting)
• Incident Manager – The Shared Services Line Manager who is designated as responsible for a specific incident
• Incident/Event/Problem/Issue – For the purposes of this document, these terms are intended to mean a failure of any component of any system or service, and are used interchangeably throughout this document
• ITSS Client Support – Group which does client relations, account management, functional analysis, sales & marketing, documentation, software licensing, end user training, and Help Desk and CRC support
• ITSS Engineering and Projects – Group which does technology R&D, service enhancements, new product and service projects
• ITSS Shared Services – Group which does operations
• ITSS Strategic Planning – Includes technology strategy & architecture and finance groups
• Line Manager – Workgroup managers in ITSS Shared Services
• On-Call Subject Matter Expert (SME) – SME (see below) who is designated to be available to respond to reported outages, triage the incident, perform the needed tasks to restore services, assist other workgroups in the restoration process, or determine which other members within their own workgroup are needed to assist in service restoration
• Operations Owner – The ITSS staff person who has the ultimate authority for a service, including its functionality and approval for any changes to the service
• Priority – Level of response and effort directed towards resolving an incident. It is determined by the inherent service level commitment of the service, as well as a combination of Urgency and Impact. Priority is sometimes referred to as “severity”. (Values = Urgent, High, Medium, Low)
• Product Manager – Owns product quality and client satisfaction for a service
• Production Control Group (PCG) – Group which will perform monitoring and basic problem determination and evaluation, escalation, communication and, in some cases, incident resolution
• Subject Matter Expert (SME) – Any technical ITSS staff person whose job requires extensive technical knowledge of network and service components and their related requirements. SMEs are considered experts and possess a detailed knowledge of service functionality, restoration, and component/service repair.
• Satellite Operations Center (SOC) – The SOC is a partner with the University Emergency Operations Center (EOC) during Level 2 (major building fire, extended power outage) or Level 3 (major earthquake or extensive flooding) emergencies. The ITSS SOC team provides real-time field information to the EOC as well as coordinating and directing emergency responses.
• Urgency – End user or client’s assessment of the importance and/or urgency of the issue as it affects their ability to perform their work. This value is provided by the customer. (Values = Urgent, High, Medium, Low)

4.0 Process Review

4.1 Process Outline

Note: There are six major steps in this process, from the time of incident detection through root cause analysis and the implementation of preventative measures.

4.2 Incident Detection and Reporting

a. An incident can be detected and reported by:
1. An end user
2. A client
3. An SME
4. Automated monitoring

b. It is important that information be shared between and among groups.

c. The process for reporting problems differs between “normal” working hours (8:00 A.M. to 5:00 P.M., Monday through Friday) and after hours.
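The distinction between normal working hours and after hours can be sketched as a small check. This is a minimal illustration rather than part of the process itself: the function name is hypothetical, the timestamp is assumed to already be in campus local time, holidays are ignored, and treating 5:00 P.M. exactly as after hours is an assumption.

```python
from datetime import datetime, time

# Normal working hours as stated in section 4.2:
# 8:00 A.M. to 5:00 P.M., Monday through Friday.
NORMAL_START = time(8, 0)
NORMAL_END = time(17, 0)


def is_normal_working_hours(moment: datetime) -> bool:
    """Return True if `moment` falls within normal working hours.

    Assumptions (not in the source): `moment` is in campus local time,
    holidays are not considered, and 5:00 P.M. itself counts as after hours.
    """
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and NORMAL_START <= moment.time() < NORMAL_END
```

A caller could use the result to decide whether a report follows the Help Desk path or the after-hours path handled by the PCG.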

4.3 Incident Level Classification: See Appendices C and D

This includes assigning a severity level to the incident and entering the incident into the Remedy incident tracking system.

4.4 Incident Notification

This includes notification to an ITSS Incident Manager and to clients. Outage information is posted on the SU Web site and Cable TV, informational messages are left on the designated voice mail box, email is sent to designated personnel, and other client notification is made as deemed appropriate.

4.5 Incident Escalation

This includes escalation to the ITSS Incident Manager, and any subsequent escalation calls deemed necessary. Note that the severity level will dictate whom in the management chain of command to contact, and when to provide them with status reports. Additionally, the PCG will determine whether or not the incident needs to be escalated to the SOC.

4.6 Incident Resolution

a. This covers work performed during the incident itself, with responsibilities as follows:

b. The Incident Manager is responsible and accountable for the overall recovery effort, performing the following functions:
1. Establishing recovery priorities
2. Coordinating and delegating responsibilities as they relate to the recovery effort
3. Issuing requests for additional resources
4. Ensuring the participation of critical internal and external support groups and vendors, such as the recall of media from the off-site storage vendor, or the purchase of replacement parts and equipment
5. Reviewing and approving tactical plans
6. Communicating incident status to ITSS management/executives as needed
7. Working with Client Support to approve and authorize the release of information to other schools and departments

c. SMEs and Line Managers are responsible for analyzing technical problems and making technical decisions, implementing tactical plans, and communicating to other SMEs as well as the Incident Manager.

d. The PCG is responsible for coordination of the incident resolution effort and for communication as deemed necessary.

4.7 Post-Incident Activities

This covers the activities after the incident is resolved.
1. Ensure that any post-incident cleanup is completed.
2. Perform root cause analysis of the incident.
3. To avoid similar future incidents, determine what process improvements and preventative measures can be put into place.
4. Implement changes in process or technical support as appropriate.
5. Ensure that the PCG receives feedback and input from the user community.
6. Perform client follow-up and ensure that an incident response quality survey form is available for end-user and client feedback.

5.0 Detailed Incident Control Process

5.1 Detailed Process Flow Explanation Table. Reference Appendix A

Incident Detection and Reporting

1. Problem Reporting: End Users
   Action by: End User
   End users will call 5-HELP or use the web at http://helpsu.stanford.edu/. Telephone calls are directed to the ITSS Help Desk, where the problem is evaluated. If the Help Desk (any tier) determines that this is an urgent incident, the call/ticket should be directly escalated to the PCG.

2. Problem Reporting: Clients
   Action by: Client
   In most cases, clients should call 5-HELP or use the web at http://helpsu.stanford.edu/. In some special cases, clients may have direct access to the PCG for reporting problems and receiving updates. In this case, skip to step 12.

3. Problem Reporting: End Users After Hours
   Action by: End User, PCG
   If an end user calls 5-HELP after hours, the user will get the recorded phone tree. Users can choose to get through to the PCG directly, or leave a recorded message. For after-hours calls, the PCG will determine whether the call is urgent. If the issue is not urgent, the PCG will enter a ticket in Remedy for review the following business day.

4. Problem Reporting: Monitoring to SMEs
   Action by: SME
   In some cases, monitoring may notify an SME of a problem before a user, client, or the PCG. If the issue is urgent, escalate directly to the PCG for coordination and entry into Remedy.

5. Problem Reporting: Monitoring to PCG
   Action by: PCG
   Monitoring reports information directly to the PCG.

6. Resolve?
   Action by: Help Desk
   The Help Desk assesses whether the ticket can be resolved at this point. If so, the Help Desk will resolve and close.

7. Urgent?
   Action by: Help Desk
   If the ticket cannot be resolved, the Help Desk determines whether the ticket should be forwarded to SME/Help Desk Tier 2 or to the PCG.

8. Forward To SME
   Action by: Help Desk
   If the case does not appear to be severity Urgent/High, forward to SME.

9. Resolve Quickly?
   Action by: SME
   Can the case be resolved by the SME, and is it Severity Level Medium/Low?

10. Enter Solution In Remedy
    Action by: SME
    If the SME can quickly resolve the case, enter the solution in Remedy and close the ticket.

11. Forward To PCG
    Action by: SME/PCG
    If the SME determines that there is impact beyond a simple fix and the Severity Level is Urgent/High, notify the PCG.

Classification

12. Assign Severity Level
    Action by: PCG
    Assign a severity level to the incident, using the standard ITSS categories (see Appendix C and D). The severity levels govern:
    • Level of action to be taken by the Production Control Group
    • Notification and escalation guidelines
    • Time intervals in which to provide status reports
    • Time intervals in which to initiate escalation and management decision processes

13. Enter In Remedy
    Action by: PCG
    Enter a ticket for the incident into the Remedy Help Desk application.

Table 2 Detailed Incident Control Process
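The Help Desk portion of Table 2 (steps 6 through 13) can be sketched as a routing function. This is one possible reading of the table and of the Appendix A flowchart, not a definitive implementation: the `Ticket` fields and the `route` function are hypothetical names, and the sketch assumes that any case the SME cannot resolve quickly is forwarded to the PCG at step 11.

```python
from dataclasses import dataclass

# Step names copied from Table 2; the routing logic below is an interpretation.
STEP_NAMES = {
    6: "Resolve?",
    7: "Urgent?",
    8: "Forward To SME",
    9: "Resolve Quickly?",
    10: "Enter Solution In Remedy",
    11: "Forward To PCG",
    12: "Assign Severity Level",
    13: "Enter In Remedy",
}


@dataclass
class Ticket:
    help_desk_can_resolve: bool            # outcome of step 6
    looks_urgent_or_high: bool             # outcome of step 7
    sme_can_resolve_quickly: bool = False  # outcome of step 9


def route(ticket: Ticket) -> list[int]:
    """Return the Table 2 step numbers a Help Desk ticket passes through."""
    steps = [6]
    if ticket.help_desk_can_resolve:
        return steps  # Help Desk resolves and closes
    steps.append(7)
    if ticket.looks_urgent_or_high:
        return steps + [12, 13]  # escalate to the PCG for classification
    steps += [8, 9]
    if ticket.sme_can_resolve_quickly:
        return steps + [10]
    # Assumption: anything the SME cannot resolve quickly goes to the PCG.
    return steps + [11, 12, 13]
```

For example, `[STEP_NAMES[n] for n in route(Ticket(False, False, True))]` lists the names of the steps for a non-urgent ticket that an SME resolves quickly.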

6.0 High-Level Incident Process Explanation

6.1 Detailed Process Explanation: See Appendix B

Notification

SME
   Action by: PCG
   Notify appropriate SME(s) if necessary, using the AMCOM on-call system.

Update itss-servicealerts@lists
   Action by: PCG
   Send a message to [email protected]. Message information will include: the date and time, a brief description of the problem, and, if available, the estimated time of resolution/restoration.

Post Messages To Web, Phone, TV
   Action by: PCG
   • Web: Update status on down.stanford.edu
   • Telephone: In the event of a major network failure, update the designated voicemail box: 7-DOWN
   • SU Cable TV: ITSS can have pre-worded messages set for broadcast, where the group can just fill in the blanks.

Escalation

Notify Line Manager
   Action by: PCG
   Contact the Shared Services Line Manager of the affected system. If a Line Manager is unavailable, use the AMCOM system to determine the backup.

Determine Incident Manager
   Action by: Shared Services Line Managers
   If the incident falls into the area of a single Line Manager, that Line Manager will contact the Incident Manager. If multiple Line Managers are involved, they must determine a single Incident Manager.

Send Email
   Action by: PCG, Incident Manager
   Send the first email to the appropriate lists/clients, based on Service Level Agreements. Use the [email protected] list for campus-wide outages; the Incident Manager should approve any messages which go to this list.

Escalate To Senior Management
   Action by: PCG
   The Severity Level (see Appendix C and D) will determine the escalation to management.

Resolution

Incident Management
   The Incident Manager will take ownership of the problem and manage the incident. Responsibilities:
   • Establish priorities
   • Coordinate and delegate responsibilities in regard to the recovery effort
   • Request additional internal or external resources
   • Ensure and manage the participation of critical internal and external support groups and vendors
   • Review and approve tactical plans
   • Communicate incident status to ITSS management/executives as needed
   • Work with Client Support to release information as needed to clients/users across campus

Resolve Incident
   Action by: SMEs
   SMEs are responsible for analyzing technical problems, implementing tactical plans, and communicating with other SMEs and with the PCG.

Post Resolution Information To Web, Phone, TV
   Action by: PCG
   Message information will include: the date and time, a brief description of the problem, and, if available, the estimated time of resolution/restoration.
   • Web: Update status on down.stanford.edu
   • Telephone: In the event of a major network failure, update the designated voicemail box: 7-DOWN
   • SU Cable TV: ITSS can have pre-worded messages set for broadcast, where the group can just fill in the blanks.

Post Incident Analysis

Complete Cleanup Tasks
   Action by: SME, PCG
   Determine whether cleanup is required, and identify who will own and perform the additional cleanup tasks.

Root Cause Analysis
   Action by: PCG Manager
   It is the responsibility of the manager of the PCG to initiate root cause analysis, collecting as much information as possible, and to ensure that any information which will help in resolving future incidents is entered into the related Remedy ticket for future use.

Incident Prevention
   Action by: Shared Services Managers, SMEs
   Determine processes which can be implemented to prevent a repeat of the incident.

Client/User Follow-up
   Action by: PCG
   Ensure selected members of the recovery team make follow-up calls to the affected users, to solicit their constructive comments. Share results of the analysis with workgroups and clients where appropriate.

Quality Survey
   Action by: PCG
   ITSS will make an on-line survey available for user/client feedback, and for ITSS staff. The PCG is responsible for tallying survey results and making them available to the appropriate ITSS staff and managers.

Table 3 Explanation of High-Level Incident Management Process Flow
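The message content required for service alerts (date and time, a brief description, and the estimated time of resolution when available) can be sketched as a small formatter. This is an illustration only: the function name, field labels, and timestamp format are assumptions, and sending the message to any list is deliberately left out.

```python
from datetime import datetime
from typing import Optional


def format_service_alert(
    occurred_at: datetime,
    description: str,
    estimated_resolution: Optional[datetime] = None,
) -> str:
    """Build the alert text from the three elements named in Table 3.

    The labels and the YYYY-MM-DD HH:MM layout are hypothetical choices,
    not a format prescribed by the process document.
    """
    lines = [
        f"Date and time: {occurred_at:%Y-%m-%d %H:%M}",
        f"Problem: {description}",
    ]
    # The estimated time of resolution is included only if available.
    if estimated_resolution is not None:
        lines.append(f"Estimated resolution: {estimated_resolution:%Y-%m-%d %H:%M}")
    return "\n".join(lines)
```

The same text could be reused for the web status page, the voicemail script, and the email notification so that all channels carry consistent information.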

7.0 Outstanding Issues

7.1 A common paging system is required
a. AMCOM for manual paging
b. What to use for automated paging from monitoring systems?

7.2 Definition of Service Hours

7.3 Definition of “availability,” “outage,” and “service degradation”

7.4 Service-level procedures for client notification

Appendix A Incident Management Process Flowchart

Reference: Table 2, Detailed Incident Control Process. Note that the circled numbers in the flowchart correspond to the process numbers in Table 2.

[Flowchart: Incident Detection & Reporting. Participants: End User, Client, Automated Monitoring, SME, Help Desk, PCG. Decision points: Resolve?, Urgent?, Resolve Quickly?. Outcomes: Forward To SME (Help Desk Tier 2) For Additional Analysis, Enter Solution In Remedy, Forward Directly To PCG, Determine Severity Level, Enter Incident Ticket In Remedy.]

Figure 1 Incident Detection and Reporting

Appendix B High-Level Incident Management Process Flow

[Diagram: process stages Detection & Reporting, Classification, Escalation, System Status Notification, Resolution, and Post Incident Activities. Participants: Client, End User, HelpSU/5-HELP Help Desk Tier 1, Subject Matter Expert, Monitoring, Production Control Group, Line Manager, Duty Manager, Liaison, SOC/EOC, Account Manager, PCG Manager. The Remedy Database is updated at each stage. Self-service status channels: itss-service-alert@lists, down.stanford.edu, 7-DOWN.]

Figure 2 High-Level Incident Management Process Flow

Appendix C Incident Level Communications Matrix

Urgent
   Description: A major service outage with significant and immediate business impact and no workaround.
   • Large number of users
   • Outage of significant length
   • No available workaround
   • Mission/business critical
   Incident examples:
   • Fire suppression system activation in data center
   • Loss of electrical power
   • Entire network switch, closet and/or building outages
   • Failure of 1 or more high priority services – e.g. Exchange, Oracle Financials, HRMS, PeopleSoft
   • Large denial of service attacks; successful hacking; loss or altering of data; theft of data; simultaneous virus infections
   • SU telephony systems
   Client update interval: Initial notification: immediate. On-going: 30 minutes
   SME work started within:* ½ hour

High
   Description: A major service outage or degradation with significant business impact and an unsustainable workaround.
   • Multiple users
   • Work performance reduced
   • Mission/business critical
   Incident examples:
   • Failure of storage system (storage area network, SAN)
   • Failure of a server of a sensitive client or user
   • Severely degraded performance
   • Smaller denial of service attacks
   Client update interval: Initial notification: immediate. On-going: 1 hour
   SME work started within:* 1 hour

Medium
   Description: A service outage or degradation with an acceptable workaround.
   • Service-affecting
   • Minimal performance degradation
   • Affects non-critical business function
   Incident examples:
   • Cannot connect to the internet, send or receive email
   • Hardware failure, cannot access data, cannot print
   • Degraded performance
   Client update interval: As applicable. By SME working issue.
   SME work started within:* 4 business hours

Low
   Description: Non service-affecting.
   • Cosmetic problem
   • System enhancement
   Incident examples:
   • Previously requested enhancements to a system
   Client update interval: Upon issue resolution or as applicable. By SME working issue.
   SME work started within:* 1 business day

Table 4 Incident Level Classification Matrix

* Note: This column indicates the maximum amount of time that will transpire before a technician begins working on an Incident. Times will generally be much faster for all severities.
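The on-going client update intervals in Table 4 can be sketched as a lookup that computes when the next client update is due. This is a minimal sketch under stated assumptions: the names are hypothetical, the 30-minute and 1-hour values reflect a reading of the Urgent and High rows, and Medium and Low return no fixed time because the table gives only "as applicable".

```python
from datetime import datetime, timedelta
from typing import Optional

# On-going client update intervals as read from Table 4.
# None means no fixed interval: updates are "as applicable",
# provided by the SME working the issue.
CLIENT_UPDATE_INTERVAL = {
    "Urgent": timedelta(minutes=30),
    "High": timedelta(hours=1),
    "Medium": None,
    "Low": None,
}


def next_client_update(level: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next client update is due, or None if not time-boxed."""
    interval = CLIENT_UPDATE_INTERVAL[level]
    if interval is None:
        return None
    return last_update + interval
```

An unknown level raises `KeyError`, which is intentional in this sketch so that a mistyped level is not silently treated as low priority.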

Operations Excellence

Incident Management Process

Appendix D Priorities and Internal Response Times Note: The following table refers to Priority, not to Urgency or Impact. Priority is a combination of the combined Urgency, Impact, and existing Service Level Commitments for the service in question. This is an important concept to adhere to – Urgency is offered by the customer, Priority is assigned by the Helpdesk, PCG, and/or SME involved from a system-wide perspective. Usage: These Priority levels (and the associated Urgency and Impact values) are used to track incidents as they are reported and worked on. Each of Priority, Urgency, and Impact relate directly to Remedy ticket fields.

Priority: Urgent
Description: A major service outage with significant and immediate business impact and no workaround.
• Large number of users
• Outage of significant length
• No available workaround
• Mission/business critical
Committed Service Hours: 24x7
PCG Call Initiate: Immediate
SME Call Response: 15 minutes
Escalation Interval: 10 minutes
SME Work Started: 30 minutes

Priority: High
Description: A major service outage or degradation with significant business impact and an unsustainable workaround.
• Multiple users
• Work performance reduced
• Mission/business critical
Committed Service Hours: 24x7
PCG Call Initiate: Immediate
SME Call Response: 15 minutes
Escalation Interval: 10 minutes
SME Work Started: 1 hour

Priority: Medium
Description: A service outage or degradation with an acceptable workaround.
• Service-affecting
• Minimal performance degradation
• Affects non-critical business function
Committed Service Hours: 8-5, M-F
PCG Call Initiate: Ticket assignment/e-mail
SME Call Response: As appropriate (work begins, work update, work completed)
Escalation Interval: Standard SME Group Remedy settings
SME Work Started: 4 business hours

Priority: Low
Description: Non service-affecting.
• Cosmetic problem
• System enhancement
Committed Service Hours: 8-5, M-F
PCG Call Initiate: Ticket assignment/e-mail
SME Call Response: As appropriate (work begins, information required, work completed)
Escalation Interval: Standard SME Group Remedy settings
SME Work Started: 1 business day
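The response commitments in the Appendix D table can be held as a simple lookup, which may help when checking whether an incident is still within its committed window. The following is a minimal sketch only: the dictionary layout, the key names, and the `work_start_deadline` function are illustrative assumptions and do not correspond to actual Remedy fields. Business-hour commitments (Medium and Low) are kept as text because computing them would need a business calendar that this document does not define.

```python
from datetime import datetime, timedelta

# Values transcribed from the Appendix D table. The structure and key names
# are hypothetical, chosen for illustration only.
RESPONSE_TIMES = {
    "Urgent": {
        "service_hours": "24x7",
        "sme_call_response": timedelta(minutes=15),
        "escalation_interval": timedelta(minutes=10),
        "sme_work_started": timedelta(minutes=30),
    },
    "High": {
        "service_hours": "24x7",
        "sme_call_response": timedelta(minutes=15),
        "escalation_interval": timedelta(minutes=10),
        "sme_work_started": timedelta(hours=1),
    },
    "Medium": {
        "service_hours": "8-5, M-F",
        "sme_call_response": None,        # "As appropriate" in the table
        "escalation_interval": None,      # standard SME Group Remedy settings
        "sme_work_started": "4 business hours",
    },
    "Low": {
        "service_hours": "8-5, M-F",
        "sme_call_response": None,
        "escalation_interval": None,
        "sme_work_started": "1 business day",
    },
}


def work_start_deadline(priority, reported_at):
    """Return the latest time SME work should begin for a 24x7 priority.

    Returns None for priorities whose commitment is expressed in business
    hours, since that calculation needs a business calendar.
    """
    limit = RESPONSE_TIMES[priority]["sme_work_started"]
    if isinstance(limit, timedelta):
        return reported_at + limit
    return None


if __name__ == "__main__":
    reported = datetime(2005, 4, 6, 2, 0)
    print(work_start_deadline("Urgent", reported))
    print(work_start_deadline("Medium", reported))
```

For an Urgent incident reported at 02:00, the sketch gives a work-start deadline 30 minutes later; for Medium and Low it deliberately returns no clock time.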


Appendix E On-Call Guidelines

Guideline Purpose

To generally define and standardize:
• On-call duties and responsibilities
• A methodology for communications and engagement of problem determination and resolution
• On-call scheduling
• Response expectations/guidelines and general escalation processes in the event the 24x7 on-site group is engaged in an on-going event or incident

System generated notifications will continue to be handled within the required time frames by the individual SME groups.

Duties

Requirements for on-call responsibility must be identified in the appropriate job descriptions, including: carrying a pager, cell phone, availability of the employee’s home phone number, and email.

Responsibilities

• Share on-call responsibilities with other members of the work group
• Begin working on the event as soon as notified

This may require working from home or traveling to work. The decision to make a physical appearance at work depends on the circumstances of the event, such as “swapping” hardware components or an on-site appearance by a vendor.

Communications

Teleconference Phone Bridge: Telecom will have a teleconference number available to technical personnel and the PCG. This will be used when the expertise of multiple SMEs is required to resolve an incident. It will also allow the technical staff to communicate as a group. Additionally, the PCG will be able to determine the status of the incident first-hand and keep management informed without management actually being involved in the conference call.

The AMCOM system will be the primary contact information/procedures lookup and paging tool for the 24x7 on-site groups.

Staff will provide and track individual work group on-call schedules.


The work group establishes the rotation. Members of the work groups are responsible for keeping the contact and coverage information in the on-call database current.

Communications Elements

• Required communications devices: pager or cell phone, personal phone.
• Additional communications devices as recommended by the SME groups: DSL, Treo, wireless laptop, email.

Notification Protocol

• Initial outgoing page
• Re-page in 10 minutes
• If a call-back is NOT received from the designated on-call SME within 15 minutes, begin escalation to the next on-call person, including re-contacting the primary on-call person and the on-call Shared Services manager on all subsequent pages.
• Recipient to confirm garbled pages, follow call-back protocol.

Initial Communications Tracking

• Use AMCOM system for initial communications tracking

Response Protocol

• 15 minute call-back
• Within 30 minutes, be actively engaged in problem determination and resolution
• Actively engaged via:
  - Home system
  - Wireless laptop
  - On-site
• SME groups may establish accelerated response profiles based upon their response criticality

Scheduling

• By SME group design
• SME schedule to be established and published in AMCOM system
• SME contact instructions to be included
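The Notification Protocol timings above (re-page at 10 minutes, escalate if no call-back within 15 minutes) can be expressed as a small decision function. This is a sketch under stated assumptions, not part of the AMCOM system: the function name, the action labels, and the choice to begin escalation exactly at the 15-minute mark are all illustrative.

```python
# Timings taken from the Notification Protocol; names are hypothetical.
REPAGE_AFTER_MIN = 10      # re-page the primary on-call SME
ESCALATE_AFTER_MIN = 15    # no call-back by now: escalate to next on-call


def next_paging_action(minutes_since_initial_page, callback_received):
    """Return the next paging step for an incident.

    Assumption: escalation starts at exactly 15 minutes without a call-back.
    On escalation, the primary on-call person and the on-call Shared
    Services manager are included on all subsequent pages.
    """
    if callback_received:
        return "none"
    if minutes_since_initial_page >= ESCALATE_AFTER_MIN:
        return "escalate"
    if minutes_since_initial_page >= REPAGE_AFTER_MIN:
        return "re-page"
    return "wait"


if __name__ == "__main__":
    for minutes in (5, 10, 15):
        print(minutes, next_paging_action(minutes, callback_received=False))
```

SME groups that establish accelerated response profiles could substitute shorter values for the two constants.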


Appendix F Management On-Call Guidelines

Return-To-Work Guidelines

These guidelines are for Management to consider if an on-call representative has worked extended hours due to an outage/issue. They should be used to ensure there is always an effective on-call representative, while protecting the on-call SME from overly extensive work-time.

If the primary on-call SME has already worked consecutive extended hours, or multiple shifts, and a new event has occurred: either the manager will provide a backup and notify the backup of their modified on-call status, or the entire group of SMEs will make a decision on the selection of an alternate SME to be used in this situation.

To allow staff members who are involved with an after-hours call-out on Sunday through Thursday to obtain adequate rest, the following is provided as a sample set of guidelines for a return-to-work policy:

On-Call SME works until    Report to work no later than
0200                       1100
0300                       1200
0400                       1300
0500                       Take rest of day off

Table 5 Return-To-Work Guidelines
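The sample return-to-work guidelines in Table 5 can be read as a lookup keyed on the time the call-out ended. The sketch below is illustrative only. The table lists whole hours, so rounding an in-between end time up to the next listed row is an assumption, as is returning no guidance for end times before 0200 or after 0500, which the table does not cover. The function and variable names are hypothetical.

```python
from datetime import time

# Rows transcribed from Table 5: (on-call SME works until, guidance).
RETURN_TO_WORK = [
    (time(2, 0), "1100"),
    (time(3, 0), "1200"),
    (time(4, 0), "1300"),
    (time(5, 0), "Take rest of day off"),
]


def return_to_work(worked_until):
    """Return the Table 5 guidance for a call-out ending at `worked_until`.

    Assumption: a time between two rows is rounded up to the later row.
    Times outside 0200-0500 are not covered by the table and return None.
    """
    if worked_until < RETURN_TO_WORK[0][0] or worked_until > RETURN_TO_WORK[-1][0]:
        return None
    for cutoff, guidance in RETURN_TO_WORK:
        if worked_until <= cutoff:
            return guidance
    return None


if __name__ == "__main__":
    print(return_to_work(time(2, 0)))
    print(return_to_work(time(3, 30)))
```

Because these are sample guidelines for Management to consider, any such lookup would be a starting point for a decision rather than a rule.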

