Flight Data Recorder System Management by a Haider

January 7, 2017 | Author: adnanscientist33 | Category: N/A

Share Embed Donate

Report this link

Short Description

Download Flight Data Recorder System Management by a Haider...

Description

Flight data recorder-System Management A Haider [email protected] Research scientist Dept of Electrical Engineering State University of New York Boeing Aircraft B-747 System modifications

Abstract Mismanagement of the persistent state of a system—all the executable files, configuration settings and other data that govern how a system functions—causes reliability problems, security vulnerabilities, and drives up operation costs. Recent research traces persistent state interactions—how state is read, modified, etc.—to help troubleshooting, change management and malware mitigation, but has been limited by the difficulty of collecting, storing, and analyzing the 10s to 100s of millions of daily events that occur on a single machine, much less the 1000s or more machines in many computing environments. We present the Flight Data Recorder (FDR) that enables always-on tracing, storage and analysis of persistent state interactions. FDR uses a domain-specific log format, tailored to observed file system workloads and common systems management queries. Our lossless log format compresses logs to only 0.5-0.9 bytes per interaction. In this log format, 1000 machine-days of logs—over 25 billion events—can be analyzed in less than 30 minutes. We report on our deployment of FDR to 207 production machines at MSN, and show that a single centralized collection machine can potentially scale to collecting and analyzing the complete records of persistent state interactions from 4000+ machines. Furthermore, our tracing technology is shipping as part of the Windows Vista OS.

A Haider

Page 1

1. Introduction Misconfigurations and other persistent state (PS) problems are among the primary causes of failures and security vulnerabilities across a wide variety of systems, from individual desktop machines to large-scale Internet services. MSN, a large Internet service, finds that, in one of their services running a 7000 machine system, 70% of problems not solved by rebooting were related to PS corruptions, while only 30% were hardware failures. In [24], Oppenheimer et al. find that configuration errors are the largest category of operator mistakes that lead to downtime in Internet services. Studies of wide-area networks show that misconfigurations cause 3 out of 4 BGP routing announcements, and are also a significant cause of extra load on DNS root servers [4,22]. Our own analysis of call logs from a large software company’s internal help desk, responsible for managing corporate desktops, found that a plurality of their calls (28%) were PS related.[1] Furthermore, most reported security compromises are against known vulnerabilities—administrators are wary of patching their systems because they do not know the state of their systems and cannot predict the impact of a change [1,26,34]. PS management is the process of maintaining the “correctness” of critical program files and settings to avoid the misconfigurations and inconsistencies that cause these reliability and security problems. Recent work has shown that selectively logging how processes running on a system interact with PS (e.g., read, write, create, delete) can be an important tool for quickly troubleshooting configuration problems, managing the impact of software patches, analyzing hacker break-ins, and detecting malicious websites exploiting web browsers [17,35-37]. Unfortunately, each of these techniques is limited by the current infeasibility of collecting and analyzing the complete logs of 10s to 100s of millions of events generated by a single machine, much less the 1000s of machines in even a medium-sized computing and IT environments. There are three desired attributes in a tracing and analysis infrastructure. First is low performance overhead on the monitored client, such that it is feasible to always be collecting complete information for use by systems management tools. The second desired attribute is an efficient method to store data, so that we can collect logs from many machines over an extended period to provide a breadth and historical depth of data when managing systems. Finally, the analysis of these large volumes of data has to be scalable, so that we can monitor, analyze and manage today’s large computing environments. Unfortunately, while many tracers have provided low-overhead, none of the state-of-the-art technologies for “always-on” tracing of PS interactions provide for efficient storage and analysis. We present the Flight-Data Recorder (FDR), a high-performance, always-on tracer that provides complete records of PS interactions. Our primary contribution is a domainspecific, queryable and compressed log file format, designed to exploit workload characteristics of PS interactions and key aspects of common-case queries—primarily that most systems management tasks are looking for “the needle in the haystack,” searching for a small subset of PS interactions that meet well-defined criteria. The result is a highly A Haider

Page 2

efficient log format, requiring only 0.47-0.91 bytes per interaction, that supports the analysis of 1000 machine-days of logs, over 25 billion events, in less than 30 minutes. We evaluate FDR’s performance overhead, compression rates, query performance, and scalability. We also report our experiences with a deployment of FDR to monitor 207 production servers at various MSN sites. We describe how always-on tracing and analysis improve our ability to do after-the-fact queries on hard-to-reproduce incidents, provide insight into on-going system behaviors, and help administrators scalably manage largescale systems such as IT environments and Internet service clusters. In the next section, we discuss related work and the strengths and weaknesses of current approaches to tracing systems. We present FDR’s architecture and log format design in sections 3 and 4, and evaluate the system in Section 5. Section 6 presents several analysis techniques that show how PS interactions can help systems management tasks like troubleshooting and change management. In Section 7, we discuss the implications of this work, and then conclude. Throughout the paper, we use the term PS entries to refer to files and folders within the file system, as well as their equivalents within structured files such as the Windows Registry. A PS interaction is any kind of access, such as an open, read, write, close or delete operation.

2. Related Work In this section, we discuss related research and common tools for tracing system behaviors. We discuss related work on analyzing and applying these traces to solve systems problems in Section 6. Table 1compares the log-sizes and performance overhead of FDR and other systems described in this section for which we had data available [33,11,21,20,40]. Table 1: Performance overhead and log sizes for related tracers. Performance Overhead

A Haider

Log size

Log Size

(B/event)

(MB/machineday)

FDR

Flight Data Recorder System Management by a Haider

Short Description

Description

Comments

We need your help!