Top 14 Considerations for Addressing Data Center Facilities Management Risks


Meeting operational challenges in the data center requires organization, planning, and focus

By Stephen Burgess

Data center facilities managers face enormous pressure every day. Operating a complex, ever-changing facility is a considerable challenge, especially given the increasing business demands and budget pressures prevalent in the industry. Yet successful data center facilities managers continue to meet that challenge. Uptime Institute has compiled these 14 considerations for reducing risk in the data center, which facilities managers can use to identify and minimize problems affecting operations.

1. Overtime

Sustained overtime rates of 10% or more can produce chronically overworked facility personnel, which correlates very strongly with increased rates of incidents, including outages and even serious injury or loss of life. Staffing a facility properly and aligning workloads to the real needs of the facility is the best way to eliminate chronic, fatigue-inducing overtime (OT), maximize personnel safety, and minimize the potential for outages, given that the vast majority of outages are due to operator error.
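As a rough illustration of how the 10% figure might be tracked, the sketch below computes an overtime rate per pay period and flags a sustained run. The definition of the rate (overtime hours divided by regular hours), the three-period window, and the function names are assumptions made for the example, not part of the Uptime Institute guidance.

```python
def overtime_rate(regular_hours: float, overtime_hours: float) -> float:
    """Overtime as a fraction of regular scheduled hours (assumed definition)."""
    if regular_hours <= 0:
        raise ValueError("regular_hours must be positive")
    return overtime_hours / regular_hours


def sustained_overtime(periods, threshold=0.10, min_consecutive=3):
    """Return True if the overtime rate meets or exceeds `threshold`
    for `min_consecutive` periods in a row.

    `periods` is a sequence of (regular_hours, overtime_hours) tuples.
    The three-period window is a hypothetical choice of what "sustained" means.
    """
    run = 0
    for regular, overtime in periods:
        if overtime_rate(regular, overtime) >= threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a period under the threshold breaks the run
    return False
```

A site would feed this with payroll or shift-log totals per technician or per team; a single busy period does not trip the flag, but a persistent pattern does.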

Trying to save money by running a facility with a very lean staff is among the riskiest decisions a data center owner can make, given the cost of facility operations relative to the cost of the facility and of the IT assets it supports.

2. Critical Spares

Ensuring the ready availability of spare parts in the event of a loss of infrastructure redundancy or availability is essential for mission critical facilities such as data centers. These critical spare parts can be stored on-site or provided by vendors. Where there is vendor dependency, due diligence should be applied to ensure the availability of the required spare parts. This can be stipulated in the service level agreements (SLAs) of the service and maintenance contracts held by the data center owner.

Developing a comprehensive critical spares inventory starts with a single point of failure (SPOF) analysis of the data center design. Most high-quality data centers do not actually have high-impact SPOFs, so what really matters is identifying which failures would reduce redundancy. For example, if an uninterruptible power supply (UPS) module unexpectedly goes into static bypass, there is typically no loss of critical load; rather, the impact is a loss of redundancy of fully conditioned, battery-backed power. The availability of a UPS critical spares kit will dramatically reduce the time spent at the reduced redundancy level.

An effective critical spares kit should include large circuit breakers or automatic/manual/static transfer switches (ATS/MTS/STS). With breakers in particular, the need for a critical spare often manifests itself during scheduled maintenance, such as triennial or 5-year maintenance and inspections that use primary injection testing. This is particularly important in older facilities, where a large, expensive breaker may be difficult to procure, if it can be located at all.

Finally, inventory management is critical to maintaining any critical spares kit. The inventory should be a detailed asset list with robust controls for ensuring parts readiness (such as breaker certifications) and timely replenishment whenever a part is taken from the inventory.

3. Diesel Fuel

Ensuring a reliable supply of usable diesel fuel can be a source of numerous concerns. These include:

• Suppliers. For most data centers, formal contracts should be in place with at least three local suppliers. These contracts must have well-defined service level agreements (SLAs) that guarantee fuel delivery quantities and delivery times.
• Certificate of Conformance. Every fuel supplier should be required to maintain and comply with a Certificate of Conformance to ensure any fuel delivered conforms to the ASTM D975 standard. Additional language should be included that forbids any contamination with biofuels.
• Fuel quality and polishing. Because most fuel is stored on site for a very long time (even decades), fuel quality must be actively maintained. Data center managers should polish the fuel (multi-stage filtering and circulation) regularly, test it regularly, remove water, and manage additives. An independent laboratory that specializes in diesel fuel should be retained to address phenomena such as stratification and self-polymerization, unless this service is included in generator service and maintenance contracts. Fuel quality can be maintained by:
  • Using it
  • Polishing it with permanently installed systems
  • Having a vendor polish it during an on-site visit at least annually
  • Having a vendor remove the fuel and replace it with fresh fuel conforming to ASTM D975
• Acceptance testing. Any fuel received on site should be sampled with tools such as a “bacon bomb,” taking samples at several depths once the fuel truck has been parked for at least 15 minutes to let the fuel settle adequately. Classic tests such as the visual beaker test (“bright and clear”) should be performed for every fuel delivery. This scrutiny, along with regular samples sent to an independent laboratory and the fuel vendors’ compliance with their Certificates of Conformance, will help ensure that the fuel delivered is always contaminant-free and chemically correct.
• Correct fuel filter size. The simplest of errors, an incorrect filtration specification (micron size) for fuel filters, has caused some of the most dramatic data center failures. After changing fuel filters, data center managers should load bank the engine generators at 100% load to guard against fuel starvation. This also validates the quality of the fuel.

4. Emergency Operating Procedures (EOPs)

EOPs are pre-approved, fully scripted responses to high-impact abnormal conditions that could reasonably occur. They should be developed for the ten most likely such conditions.

Most modern data centers do not actually require a physical response to unexpected abnormal conditions. The real purpose of the EOP is to verify the condition of the facility and correctly escalate and report it. The other essential purpose of a well-developed EOP library is to ensure that facility operators do not try to be heroes, which often makes matters worse and can endanger personnel. Training personnel to follow EOPs helps prevent the hero response. Eight essential EOPs might include:

• Loss of municipal power
• Loss of municipal water
• Activation of the fire alarm, including sustained level three detection, charged pipes, or a dry agent dump event
• Recovery from emergency power off (EPO) activation
• Loss of controls/programmable logic controller (PLC) or automation of either mechanical or electrical systems
• Loss of chilled water flow
• Generator failure to start
• UPS in static bypass

5. Drills

Having well-written EOPs means little if facility personnel are not familiar with them. The best way to maintain a high level of operational readiness is to regularly simulate all the scenarios addressed by the site’s EOP library. These simulations are usually referred to as site drills. The more realistic the site drills, the better. Site drills are important refresher training that should be conducted in any data center.

In a live data center, there is usually very limited opportunity, if any, to replicate the actual infrastructure conditions that warrant the use of EOPs. With a few exceptions such as scheduled pull-the-plug tests, many data center owners are uncomfortable with the notion of abruptly disconnecting pumps, chillers, computer room air conditioners, and other equipment to trigger authentic BMS alarms and require personnel to interpret them and exercise the appropriate EOPs. Given this limitation, effective drilling requires the use of visual aids and props to safely simulate abnormal conditions or behavior of real infrastructure. For example, printouts of building management system (BMS)/emergency power management system (EPMS) graphics, switchgear annunciators, and human machine interface (HMI) screens, combined with various signs and markings that can be taped to computer screens, panel boards, and equipment, can help simulate the abnormal conditions anticipated by the site’s EOP library.

The operations team should drill using the actual procedures in use at the facility. This produces a detailed historical document that accurately measures the performance of the drill. Any drill conducted should produce one or more completely filled out EOPs for the scenario. These documents should be filed and retained as formal site training records. Formal site drills must be scheduled around any planned maintenance activity, which means they need full visibility to data center operations management and approval through the formal change management process and policies established to control all activities in the data center facility environment.

6. A Procedure-Based Control Methodology

Any and all interaction with data center facility infrastructure should be done according to pre-approved, detailed, and fully vetted procedures. These include:

• Methods of procedure (MOP). A detailed and scripted activity for formally scheduled and approved preventive and corrective maintenance activities. MOPs ideally capture all details about the purpose of the maintenance and everyone involved with it. A good MOP has very detailed steps to complete the activity, including time stamps, initial blocks, and signature fields.
• Standard operating procedures (SOP). Any routine interaction that involves a basic change of state or configuration of the infrastructure, often to support planned maintenance, should be controlled with a well-written SOP. SOPs share many features of the MOP, such as time stamps and operator-annotated steps.

Many data centers require procedure libraries that include hundreds of documents. Such a large collection of documents requires a formal policy that defines how these documents are written, reviewed, and formally approved for use. These policies should also address revision and formatting processes and controls.

Finally, SOPs and MOPs are meaningless if they are not followed. Procedure deviation is a major cause of incidents and outages. Experienced facility technicians can become cavalier and complacent, especially with the repetition of large maintenance evolutions. Therefore, it is crucial that management enforce strict adherence to the steps in all procedures and provide training to ensure the procedures are understood.

7. Safety Program

Any facility or portfolio must have a safety program that complies with the requirements of the local authority having jurisdiction (AHJ/LAHJ). Having a current NFPA 70E compliant program is especially important for data centers (OSHA defers to NFPA 70E for electrical safety in the workplace). This includes a complete, fully tested personal protective equipment (PPE) kit and associated lockout-tagout (LOTO) kit for hazardous energy isolation from both electrical and mechanical sources. Fully formalized and AHJ/LAHJ compliant safety programs entail writing various program definitions, policies, and procedures that explicitly define how safety is managed and administered for the facility.

8. Short Circuit Coordination Study (SCCS) and Arc Flash Assessment

A facility must have a current SCCS and associated arc-flash hazard assessment, with arc-flash stickers correctly placed in all areas of the environment. All breakers must be verified to have trip unit settings set to those recommended in the SCCS.

9. Battery Monitoring System

Analogous to fuel for engine generators, having a UPS means nothing if the batteries do not respond when the UPS input voltage goes away or out of tolerance (loss of city power or severe power quality problems). Using a battery monitoring system that provides real-time condition and predictive maintenance capabilities with associated alarming is the best way to achieve full confidence in the UPS batteries. If no battery monitoring system exists, then quarterly battery inspections should be performed with industry standard tools. This is especially important for valve regulated lead-acid (VRLA) absorbent glass mat (AGM) batteries because the cells usually fail open. One open cell in a 40-jar string renders the whole string useless.
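The open-cell point follows from the jars being wired in series: current must pass through every jar, so one open jar breaks the whole path. A minimal sketch of that logic, with a hypothetical `Jar` record and function name chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Jar:
    """One battery jar in a series string (hypothetical record layout)."""
    label: str
    open_circuit: bool = False


def string_available(jars) -> bool:
    """A series string can deliver current only if no jar has failed open."""
    return not any(jar.open_circuit for jar in jars)


# Build a 40-jar string, then mark a single jar as failed open.
string_a = [Jar(f"J{i:02d}") for i in range(1, 41)]
healthy_before = string_available(string_a)
string_a[17].open_circuit = True
healthy_after = string_available(string_a)
```

In this model `healthy_before` is true and `healthy_after` is false, even though 39 of the 40 jars are still good.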

Real-time data provided by contemporary battery monitoring systems not only validates the availability of the battery plant but also allows very accurate measurement of its capacity and of a reasonable expected end-of-life replacement period, typically extending VRLA-type battery retention by 25% or more. Such an extension of battery utilization amounts to a very significant operational cost deferral, given the multimillion-dollar value of many data center battery installations.

Battery spares should be purchased from the same batch as the battery installation and should be kept in the same environmental and charging conditions as those connected to the UPS itself, so that the spares age and degrade at the same rate as the batteries in use. In this way, when a battery develops unacceptably high internal resistance and must be changed, the replacement battery has functional characteristics very similar or nearly identical to those of the other batteries in the string. This avoids upsetting or unbalancing the charging voltage applied to the other batteries in the string.

The battery monitoring system should ideally be extended to the UPS battery spares and the batteries used for starting engine generators. Using real-time, condition-based maintenance rather than the more common time-interval replacement produces confidence in these batteries. Using a battery monitoring system for such components generates reliable expectations of them, results in their maximum utilization, and reduces maintenance.
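One way to picture condition-based replacement is to compare each jar's latest internal resistance reading with its own commissioning baseline and flag only the jars that have drifted too far. The sketch below assumes readings in milliohms keyed by jar label and a 30% rise limit; both the data layout and the limit are hypothetical, and real limits would come from the battery manufacturer or the monitoring system vendor.

```python
def jars_to_replace(baseline_mohm, latest_mohm, max_rise=0.30):
    """Return sorted labels of jars whose internal resistance has risen
    more than `max_rise` (as a fraction) above their own baseline.

    Both arguments map jar label -> resistance in milliohms.
    Jars with no latest reading are skipped rather than guessed at.
    """
    flagged = []
    for label, base in baseline_mohm.items():
        latest = latest_mohm.get(label)
        if latest is None:
            continue
        if (latest - base) / base > max_rise:
            flagged.append(label)
    return sorted(flagged)


baseline = {"J01": 4.0, "J02": 4.0, "J03": 4.1}
latest = {"J01": 4.2, "J02": 5.6}  # J03 has no new reading yet
flagged = jars_to_replace(baseline, latest)
```

Here only `J02` is flagged, because its reading is 40% above baseline while `J01` has risen 5%.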

Deploying a high-quality battery monitoring system does not preclude physical inspections of the battery plant, which should include visual checks of all battery connections and torque checks of connector fasteners. A combination of a battery monitoring system and periodic physical inspection will ensure the maximum reliable utilization of a data center’s battery plant.

10. Training

Training is a complicated topic that can cover many components and activities. The only formal training curriculum in many data centers relates to corporate compliance (how to be a company employee), not actual facilities activities or knowledge. This is because many facilities rely on informal on-the-job training (OJT). While this approach can be effective, it means that achieving fully qualified staff depends on a large number of undocumented quality variables, with the quality "fully qualified" being a largely subjective determination. Informal OJT may also be deficient in key areas because it is a largely reactive approach.

At a minimum, a formalized training program and curriculum can be divided into two main categories: operational readiness and planned activities. Formal training for operational readiness includes mastering the facility’s sequence of operations (SOO) for electrical and mechanical systems and the integrated system SOO that describes how all systems work together in concert. This training often involves studying the alarms generated by the controls, BMS, and EPMS in order to respond correctly to them, often leading to the use of an EOP for critical impact alarms. Studying a facility’s SOOs and the alarms the monitoring systems generate can enable the staff to respond correctly to any abnormal facility condition.

Formal training related to planned activities should focus on things like access control, vendor escort and supervision, and the use of procedures to conduct what are mostly preventive maintenance activities. Thus, this training might include policy review, courses, and materials focused on the use of procedures, where the approved procedures are located, how to write a procedure, the use of the change management system, the use of the maintenance management system, the basis of the maintenance program, the navigation of the BMS/EPMS, and other shift presence and site rounds requirements.

11. Maintenance

A high-quality maintenance program keeps equipment in like-new condition and maximizes its reliability, performance, and lifespan. At a minimum, all major equipment assets should be maintained to original equipment manufacturer (OEM) recommendations. Expanding maintenance considerations to include the recommendations of ASHRAE, the International Electrical Testing Association (NETA), the National Electrical Manufacturers Association (NEMA), the Institute of Electrical and Electronics Engineers (IEEE), the National Fire Protection Association (NFPA), ASTM International, and the American National Standards Institute (ANSI), as well as design engineer recommendations and authorized contractor recommendations, further enhances the maintenance standard of the facility. Once the facility is fully informed in this way, service and maintenance contracts can be configured beyond the conservative and sometimes excessive recommendations from the OEMs.

Maintenance should be performed at the minimum frequency needed to keep equipment in good condition, minimizing abnormal behavior and maximizing the efficiency and life of the asset. Typical intervals are monthly, quarterly, semi-annual, or annual. Often this interval can be less frequent than OEM recommendations, which can be overly conservative. Since scheduled maintenance usually involves some direct manipulation of equipment, facilities should be wary of “maintenance-induced failure,” a phenomenon in which unnecessary interactions with equipment increase the potential for human error and incidents. The minimum frequency of interaction with equipment should be the level of interaction that captures its condition and keeps the asset in like-new condition. Any greater frequency is excessive, offers no real benefit to the equipment, consumes personnel resources, and increases the risk of incidents.

In one case, a data center with 100 large air handling units (AHUs) determined that there was no real benefit to performing monthly or quarterly preventive maintenance inspections. Those inspections were removed from the maintenance calendar and replaced by enhanced semi-annual inspections that still kept the equipment in like-new condition but greatly reduced workload and unnecessary interaction with the equipment, allowing those resources to be better applied elsewhere in the environment.
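The scale of that reduction is easy to work out. The sketch below simply multiplies the asset count by visits per year for each interval; the interval names and the function are illustrative, and the count ignores the fact that an enhanced semi-annual inspection may take longer per visit than a routine one.

```python
# Visits per asset per year for each maintenance interval.
VISITS_PER_YEAR = {
    "monthly": 12,
    "quarterly": 4,
    "semi-annual": 2,
    "annual": 1,
}


def annual_inspections(asset_count: int, interval: str) -> int:
    """Number of scheduled equipment interactions per year for one asset class."""
    return asset_count * VISITS_PER_YEAR[interval]


ahu_monthly = annual_inspections(100, "monthly")          # 1200 visits
ahu_quarterly = annual_inspections(100, "quarterly")      # 400 visits
ahu_semi_annual = annual_inspections(100, "semi-annual")  # 200 visits
```

For 100 AHUs, moving from monthly to semi-annual inspections removes 1,000 equipment interactions a year, each of which was an opportunity for human error.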

The industry currently follows several dominant maintenance methodologies, with most plans combining traditional condition-based maintenance, run-to-fail, and predictive maintenance. Because of their sheer size and high levels of redundancy and resiliency, some very large data centers may find it cost effective to let some asset classes operate until they begin to show degraded performance, at which point maintenance can be performed to restore the normal operating condition. Such approaches have to be carefully considered in order to ensure risk is appropriately addressed. Ultimately, the goal of any maintenance plan should be the elimination of incidents due to abnormal equipment behavior or excessive interaction with the equipment, using the most cost-effective approach.

Deferred maintenance, or skipping of maintenance due to scheduling or resource issues, must be aggressively avoided, especially when the deferral is a consequence of pushback from the IT organization against intrusive or redundancy-reducing maintenance. Ultimately, postponing important maintenance can be counterproductive. Any deferred maintenance should be recorded, tracked, and communicated to IT asset stakeholders to ensure it gets appropriate managerial visibility and resolution.

Predictive maintenance programs such as infrared scanning of power distribution systems, vibration analysis of rotating assemblies, and lubrication oil analysis are powerful ways of getting advance warning of potential equipment degradation. Predictive maintenance can capture potential problems early, well before they begin to impair the performance of critical equipment. The key to predictive maintenance is creating an equipment baseline and then trending the data being collected in order to detect unusual rates of rise in degraded condition indicators.

A well-formulated maintenance program requires a maintenance management system, or MMS. An effective MMS contains all the asset information and the scheduling, approval, and tracking information needed to complete all recurring and corrective maintenance activities. The MMS can be flat-file or computer based. The primary benefit of a computer-based MMS is resource tracking and administration (staff hours and work orders completed on time), coupled with a relational database that can quickly access all aspects of scheduled and recorded maintenance activities. Whether computerized or not, a key requirement for any MMS is the capture and accessibility of maintenance history per asset. This makes it possible to clearly trend maintenance per asset as well as to meet SLA compliance requirements and client due diligence information requests.

12. Access Control and Vendor Supervision

Only authorized personnel should be allowed into critical infrastructure areas. Therefore, an access control policy and some type of physical system must be in place to control traffic into the facility, with measures in place to keep access lists current and enforced. Vendors must also be screened, qualified, and supervised based on area and activity in the facility. The standard approach to vendors is complete supervision in addition to formal compliance with the facility’s policy documents, often referred to as critical facility or data center “house rules,” which list and define allowed and non-allowed activities and what to do in the case of abnormal situations or emergencies.

13. SOO, Integrated Systems Testing (IST), and Major Switchgear Validation

Most normal, steady-state automation is continuously verified in any running data center; however, the most important automation is often merely assumed to work. Specifically, in the event of a loss of municipal power, many data centers are stressed in a way that hasn’t happened since the facility was originally commissioned. Coupled with a lack of preparation due to poor EOPs and failure to drill, loss of utility power can be a make-or-break moment for a data center.

Maintenance programs often overlook the importance of preventive maintenance inspections of the programmable logic controllers (PLCs) for the switchgear. These inspections should cover protective relays, power quality meters (PQMs), ATS/MTS/STS programming, firmware revisions, and the PLCs used in generator paralleling switchgear lineups. Additionally, operator interaction with the human-machine interface (HMI) and other high-level normal mode override functions can change the originally intended configuration of the automation settings over time.

Without a regular (at least annual) pull-the-plug (PTP) test, neither the automation nor the switchgear itself is validated to perform as expected. Many data centers are averse to the PTP test, with IT departments and customers pushing back on any such testing with the mistaken idea that such testing is not needed and exposes them to unneeded risk. In addition to regularly performing a PTP test, there are many routine checks of the PLC environment that should be conducted as regularly as any other scheduled maintenance of major infrastructure assets.

14. Change Management

A robust change management system should be put in place for any activity that crosses pre-established level of risk (LOR) criteria. The change management system should include a formal review process based on a well-defined LOR matrix that captures and ranks all activities that can occur at the data center. Basically, any activity with real potential for impact on the data center must be formally scheduled and then approved by accountable persons in the data center facilities and IT organizations before it can occur.
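To make the LOR idea concrete, the sketch below ranks a few activities and returns which organizations must approve each one. The activity names, the four-level scale, and the approval threshold are all hypothetical placeholders; an actual matrix would be defined by the site's own change management policy.

```python
# Hypothetical LOR matrix: activity -> level of risk (1 = lowest, 4 = highest).
LOR_MATRIX = {
    "site rounds": 1,
    "filter replacement": 2,
    "ups maintenance bypass": 3,
    "pull-the-plug test": 4,
}

# Hypothetical criterion: anything at or above this level needs formal approval.
APPROVAL_THRESHOLD = 2


def required_approvals(activity: str) -> list:
    """Return the organizations that must approve `activity` before it is scheduled.

    Activities missing from the matrix raise KeyError, so that unranked work
    is ranked first instead of slipping through unreviewed.
    """
    level = LOR_MATRIX[activity]
    if level >= APPROVAL_THRESHOLD:
        return ["facilities", "IT"]
    return []
```

With these placeholder values, `required_approvals("site rounds")` returns an empty list, while `required_approvals("pull-the-plug test")` requires sign-off from both facilities and IT.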

Stephen Burgess is a consultant with Uptime Institute Professional Services. He performs reviews and assessments for Tier topology design and constructed facility certifications, assessments for Operational Sustainability certifications and the M&O Stamp of Approval as well as teaching the Accredited Tier Specialist (ATS) course.
