SQL Server 2012 Enterprise Information Management Whitepaper
Enterprise Information Management with Microsoft SQL Server 2012 Delivering Credible Consistent Data to Every Organization
Writer: Graeme Malcolm (CM Group) Published: January 22, 2013 Applies to: SQL Server 2012
Summary: Enterprise Information Management (EIM) is a growing priority for organizations that want to gain a competitive advantage by basing key business decisions on credible, consistent data. The challenges involved in implementing an effective EIM solution include:
• Integrating data from an increasing number of diverse sources, and in a growing number of formats, into a common platform for decision making
• Empowering information workers who understand the business to manage data governance, while ensuring IT maintains control
SQL Server 2012 provides a comprehensive platform for EIM, which makes it possible to:
• Integrate any data from applications and systems across the enterprise
• Make trusted decisions based on cleansed and standardized data
• Empower business users to manage data governance and easily gain insights from the data
For the latest information, see http://www.microsoft.com/sqlserver/en/us/solutionstechnologies/enterprise-information-management.aspx.
Copyright ©2012 Microsoft Corporation. All rights reserved. This document is provided "as-is." Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.
Contents
Introduction
The SQL Server 2012 EIM Platform
  SQL Server Integration Services
  SQL Server Data Quality Services
  SQL Server Master Data Services
Integrating Data across the Enterprise and Beyond
  SSIS Connection Managers and Data Sources
  SSIS Extensibility
Empowering Business Users
  Empowering Users to Manage Data Quality
  Empowering Business Users to Manage Master Data
    Case Study – Areva
Make Trusted Decisions on Credible, Consistent Data
Conclusion
Introduction
An increasingly competitive and difficult business environment means that organizations need to gain any competitive advantage they can in terms of making smart business decisions. Most organizations recognize the value of basing decisions on credible, consistent data, and at a time when businesses, their customers, and third-party services on the Web are generating increasing volumes of information, there’s no shortage of data to inform those decisions. The problem is that data is usually created and stored in isolated application silos with varied levels of consistency and accuracy, and the challenges of integrating and standardizing the data can prevent companies from getting the comprehensive “single view of the truth” needed to drive effective decision making.

“Everywhere you look, the quantity of information in the world is soaring. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is difficult enough. Analysing it, to spot patterns and extract useful information, is harder still. Even so, the data deluge is already starting to transform business, government, science and everyday life.” The Economist: “The Data Deluge” (Feb 2010)

Many organizations are looking to enterprise information management (EIM) as a way to integrate, consolidate, and cleanse data for decision making. A good EIM solution can integrate day-to-day business operations and support data warehousing and business intelligence (BI) to help organizations learn from their data and become more effective. As early as 2010, Gartner observed: “In 2010, exploiting business data and information is a nearly universal priority for organizations. During the next decade, mastering EIM is considered a critical success factor by CEOs, CFOs, CIOs and other business executives. 
These executive leaders know their organizations generate an enormous amount of high-value data and information; however, most also believe that much of the value is untapped and underutilized in managing their organizations” Gartner: “Business Leaders Must Learn Why, When and How to Exploit Enterprise Information Management” (October 2010). The reason for this trend is clear. Executives in many organizations believe that by bringing together as much information as possible into a trusted source of data for decision making, they can improve the financial performance of their organizations. In 2011, Gartner included the following strategic planning assumption in its research on data management and integration: “Through 2015, enterprises integrating high-value and diverse new information types and sources into a coherent information management infrastructure, will financially outperform their industry peers by more than 20%.” Gartner: “Key Issues for Data Management and Integration, 2011” (Feb 2011)
Microsoft SQL Server 2012 builds on the data integration and management features of previous releases to provide a comprehensive platform for EIM, and support data warehousing and business intelligence solutions that empower businesses to make critical business decisions on trusted data. Moreover, Microsoft’s data platform is designed to enable organizations to capitalize on the wealth of business knowledge held by information workers – enabling business users to take on the role of data stewards, and manage data quality and consistency with minimal support from IT specialists.
The SQL Server 2012 EIM Platform SQL Server 2012 provides all the components needed for an effective EIM solution in a single product. Key components of SQL Server 2012 that help you build an EIM solution are:
• SQL Server Integration Services
• SQL Server Data Quality Services
• SQL Server Master Data Services

These technologies work together to create an EIM solution that supports other SQL Server technologies for data warehousing and BI, and which ensures that the entire business decision making ecosystem begins and ends with the business user. Figure 1 shows how SQL Server and other Microsoft technologies work together to provide a user-centric approach to business decision making.
Figure 1: A user-centric approach to business decision making
SQL Server Integration Services
SQL Server Integration Services (SSIS) is a platform for data integration that provides a comprehensive and extensible solution for extract, transform, and load (ETL) operations. In an EIM context, SSIS provides a workflow and data flow engine that you can use to integrate data from virtually any data source into an ecosystem for business decision making. You can use Integration Services to automate tasks such as copying or downloading files, sending e-mail messages in response to events, updating data warehouses, cleaning and mining data, and managing SQL Server objects and data. With SSIS, you can unlock and integrate data from industry-standard third-party sources such as SQL Server, Oracle, Teradata, DB2, SAP, CRM systems, SharePoint, real-time data feeds, cloud-based applications, and more.

SSIS consists of a workflow engine that you can use to automate control flow tasks and data flows. Data flows consist of a sequence of data sources, transformations, and destinations arranged as a pipeline through which data is passed between buffers. The buffer-based nature of the data flow pipeline enables ETL developers to maximize data throughput and optimize the overall performance of the data flow. ETL developers can use SQL Server Data Tools, a graphical development interface built on the Visual Studio environment, to create SSIS packages. Each package encapsulates a control flow, which may in turn contain multiple data flows. SQL Server Data Tools provides a simple-to-use, highly productive development environment that makes it possible for developers to quickly create and deploy complex ETL solutions.
Figure 2: Creating an SSIS Package in SQL Server Data Tools
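The buffer-based data flow pipeline described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SSIS API: the function names and the two-row buffer size are assumptions chosen to show how rows move between a source, a transformation, and a destination in buffers rather than one at a time.

```python
# Illustrative sketch (not the SSIS API): a data flow as a pipeline of
# source -> transformations -> destination, passing rows between buffers.

def source(rows, buffer_size=2):
    """Yield fixed-size buffers of rows read from an upstream data source."""
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            yield buffer
            buffer = []
    if buffer:  # flush the final, partially filled buffer
        yield buffer

def uppercase_city(buffers):
    """A transformation: operate on each buffer as it flows through."""
    for buffer in buffers:
        yield [{**row, "City": row["City"].upper()} for row in buffer]

def destination(buffers):
    """Collect transformed buffers; a real destination would load a table."""
    out = []
    for buffer in buffers:
        out.extend(buffer)
    return out

rows = [{"City": "Seattle"}, {"City": "Redmond"}, {"City": "Tacoma"}]
result = destination(uppercase_city(source(rows)))
```

Because each stage is a generator, buffers stream through the pipeline without materializing the whole dataset, which is the essence of how a buffer-based data flow sustains high throughput.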
SQL Server 2012 introduces a new project-level deployment model for SSIS packages, enabling organizations to deploy and manage multiple related SSIS packages as a single unit. You can define multiple execution environments, with associated configuration settings in the form of variables that can be mapped to project-level parameters defined in the SSIS project. Projects are deployed to an SSIS catalog on a SQL Server instance, and can be managed with SQL Server Management Studio. You can also schedule execution of individual SSIS packages by creating SQL Server Agent jobs, enabling you to create fully automated ETL solutions that power your EIM data integration processes.
Figure 3: SSIS Project Deployment and Management
When you have deployed a project in an SSIS catalog, you can monitor details of package execution easily through built-in reporting and status tracking, as shown in figure 4. This enables you to verify or troubleshoot package execution and monitor performance over time.
Figure 4: Monitoring SSIS Package Execution
SQL Server Data Quality Services
The ability to integrate data from multiple data sources into a data warehouse to support business decision making is clearly of great benefit to organizations seeking a competitive advantage. However, decisions must be based on data that is trusted to be accurate, consistent, and complete. Microsoft SQL Server 2012 Data Quality Services (DQS) is a new offering in SQL Server 2012 that provides an approachable, knowledge-based data quality solution for organizations of all sizes, allowing customers to cleanse, match, standardize, and enrich their data to deliver trusted information for business intelligence, data warehouse, and transaction processing workloads. End users can even cleanse their personal files in unmanaged documents.

Organizations can leverage the business knowledge of their users to create knowledge bases that define known values and validation rules for the data domains used in data records for business entities. For example, you might create a knowledge base for customer data that defines the data domains, or fields, that are commonly used in customer records (such as Customer ID, First Name, Last Name, Gender, Email, Street Address, City, State, and Country). You can then perform knowledge discovery against existing data to identify known values for these fields (such as “California” and “Washington” for the State field), and define rules to validate any new domain values as they are discovered (such as a rule to ensure that all Email values contain a “@” character, or that all Gender values begin with “M” or “F”). DQS provides a client application for managing knowledge bases, as shown in figure 5.
Figure 5: Data Quality Services Client Application
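The domain validation rules described above (Email must contain “@”; Gender must begin with “M” or “F”) can be expressed as simple predicates per domain. The sketch below is illustrative only and does not use the DQS API; the rule set and function names are assumptions.

```python
# Illustrative sketch (not the DQS API): validation rules for data domains,
# as described above -- Email must contain "@", Gender must begin with M or F.

rules = {
    "Email": lambda v: "@" in v,
    "Gender": lambda v: v[:1].upper() in ("M", "F"),
}

def validate(record):
    """Return the names of the domains whose values fail their rule."""
    return [domain for domain, rule in rules.items()
            if domain in record and not rule(record[domain])]

good = {"Email": "jenny@example.com", "Gender": "F"}
bad = {"Email": "jenny.example.com", "Gender": "Unknown"}
```

In DQS itself, such rules are defined interactively per domain in the knowledge base; new values discovered during cleansing are then checked against them automatically.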
As well as defining validation rules for domains in a knowledge base, you can identify synonyms and common data entry errors for domain values, and specify a leading value to which all instances of these values should be corrected. For example, your knowledge discovery might reveal that records for customers who live in California most commonly have a State value of “California”; but application users often enter alternative values with the same meaning, such as “CA” or “Calif.”, or mistype the value and accidentally enter “Californa”. Customer records with variants of the same state value might have minimal impact in the line-of-business application in which they are entered, but if the data in that application is to be used for analysis or reporting that aggregates values by state, the presence of multiple values for the same state can result in misleading information on which to base business decisions. To avoid this problem, you can identify these as known values in the DQS knowledge base, and specify that they are synonyms that should always be corrected to a leading value of “California”. Then, when you use DQS to perform data cleansing, the resulting cleansed data will include consistent values for the State domain. Figure 6 shows a DQS knowledge base in which a Country/Region domain includes the leading value “United Kingdom”, and several synonyms for this value that should be corrected.
Figure 6: Correcting domain values
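Conceptually, synonym correction is a lookup from every known variant to its leading value. The following sketch is illustrative, not the DQS API; the variant list is taken from the California example above.

```python
# Illustrative sketch (not the DQS API): correcting known synonyms and
# misspellings of a domain value to a single leading value during cleansing.

leading_values = {
    "California": ["CA", "Calif.", "Californa"],
}

# Invert the mapping: each known variant points at its leading value.
corrections = {
    variant: leading
    for leading, variants in leading_values.items()
    for variant in variants
}

def cleanse_state(value):
    """Replace a known synonym with its leading value; pass others through."""
    return corrections.get(value, value)
```

After cleansing, every variant of the same state collapses to one value, so aggregations by state no longer split a single state across several labels.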
While a DQS knowledge base is often primarily based on your own organization’s institutional knowledge about business-specific data, there are some cases where it can be useful to incorporate external knowledge for common types of data, such as postal address or telephone number validation. The Microsoft Windows Azure Marketplace includes several commercial datasets that are specifically designed for data cleansing and validation and for which you can purchase a subscription. When you have subscribed to one of these datasets, you can use it as reference data for a domain in a DQS knowledge base and supplement your own business-specific data validation and value correction rules. For example, figure 7 shows how external data, purchased in the Windows Azure Marketplace, can be used to validate and correct company names in a Company domain by referencing a comprehensive dataset of US registered companies.
Figure 7: Using external reference data in a DQS knowledge base
You can perform data cleansing interactively with the DQS client application by specifying a data source such as an Excel spreadsheet or a table in a SQL Server database, and mapping the fields in the data source to domains in the knowledge base. Additionally, you can incorporate data cleansing into ETL processes by using the Data Cleansing transformation in an SSIS data flow, as shown in figure 8.
Figure 8: Incorporating DQS data cleaning into an SSIS data flow
As well as using DQS for data cleansing, you can create matching policies and perform data matching to identify and consolidate duplicate records for the same business entity. For example, it’s possible that a customer has registered on your organization’s e-commerce Web site as “Jenny Russell”, but also made a purchase in a physical store where the name has been recorded as “Jennifer Russell”. The organization now has multiple customer records for the same customer, which will affect the accuracy of any reporting or analysis that aggregates data by customer. With DQS, you can create a matching policy that compares multiple domains across records, assigning a weighted value for fields that are exact or approximate matches. So your matching policy might compare customer records on FirstName, LastName, Address, Email, and DateOfBirth domains. When multiple records have enough matching domains to satisfy the matching policy, DQS identifies the records as possible duplicates. For example, if a dataset includes a record for Jenny Russell and a record for Jennifer Russell, but the address, email, and date of birth values for the two records are the same, you can reasonably assume that these records might relate to the same customer.
Figure 9: A Matching Policy
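The weighted matching policy described above can be sketched as a per-domain score that accumulates when two records agree. This is an illustrative sketch, not the DQS matching engine; the weights and the threshold of 80 are assumptions chosen for the example, and real DQS matching also supports approximate (similar, not just exact) matches.

```python
# Illustrative sketch (not the DQS API): a matching policy that compares two
# records domain by domain, sums the weights of matching domains, and flags
# pairs whose total score reaches a threshold as possible duplicates.

weights = {"FirstName": 10, "LastName": 20, "Address": 25,
           "Email": 25, "DateOfBirth": 20}
THRESHOLD = 80  # minimum score to treat two records as a possible match

def match_score(a, b):
    """Sum the weights of the domains on which the two records agree."""
    return sum(w for domain, w in weights.items()
               if a.get(domain) == b.get(domain))

def possible_duplicates(a, b):
    return match_score(a, b) >= THRESHOLD

jenny = {"FirstName": "Jenny", "LastName": "Russell",
         "Address": "1 Main St", "Email": "jr@example.com",
         "DateOfBirth": "1980-04-01"}
jennifer = dict(jenny, FirstName="Jennifer")
```

Here the two records differ only on FirstName, so they score 90 of a possible 100 and are flagged as a likely duplicate pair, mirroring the Jenny/Jennifer Russell example.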
The data cleansing and data matching functionality in DQS can help organizations manage the quality and integrity of their data, and help ensure that decisions are based on trusted information.
SQL Server Master Data Services
Master Data Services is the SQL Server solution for master data management, focused on the creation, maintenance, and storage of master data structures used for object mapping, reference data, metadata management, and dimensions and hierarchies for data integration operations. This includes business intelligence and data warehousing, as well as integration between operational systems. With the Master Data Services Add-in for Excel in SQL Server 2012, business users can directly manage existing database or data warehouse dimensions and hierarchies from within Excel without IT intervention, while IT retains oversight to track and reverse changes made by the business.

With DQS, an organization can apply knowledge about individual data field values to cleanse datasets and identify duplicate records. However, large enterprises often need to maintain data representations of core business entities in multiple applications and systems across the business. For example, a company might store employee data in an HR management system and also in a payroll application; or it might store product data in a stock management system and in an e-commerce product catalog. When the same business entities are represented in multiple systems, it can be useful to maintain a definitive, master record for each entity to ensure that any data relating to a specific entity is consistent across the enterprise. You may approach this challenge by designating one of your application data stores as the master system of record for a given type of business entity (for example, you could use the HR management system as the definitive source of information for employees), or you could create a separate master data hub that ensures consistency across all systems. 
The discipline of maintaining a central data definition for business entities is commonly called master data management (MDM), and SQL Server Master Data Services (MDS) provides a SQL Server-based solution that you can use to implement MDM for any kind of business entity.
Figure 10: Managing data models with Master Data Services
As figure 10 shows, MDS enables you to create master data models for your core business entities. These models contain entity definitions, which in turn define the data attributes for each entity. You can also organize your entities into hierarchical relationships, so for example a product might belong to a subcategory, which in turn belongs to a category. After you have created a master data model, you can manage the data entities in the model to define their attributes (which you can categorize into multiple attribute groups for specific applications or user scenarios). Figure 11 shows the attributes defined for a Product entity.
Figure 11: An entity and its attributes
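The entity-and-hierarchy structure described above can be sketched as related lookup tables, where each product points at a subcategory and each subcategory at a category. This is an illustrative sketch only, not the MDS object model; the entity codes and names are invented for the example.

```python
# Illustrative sketch (not the MDS object model): entities with attributes,
# arranged in a category -> subcategory -> product hierarchy.

categories = {"BK": "Bikes"}
subcategories = {"MB": {"Name": "Mountain Bikes", "Category": "BK"}}
products = {"P1": {"Name": "Mountain-100", "Subcategory": "MB"}}

def hierarchy_path(product_code):
    """Walk a product's ancestors from the top of the hierarchy down."""
    product = products[product_code]
    sub = subcategories[product["Subcategory"]]
    return [categories[sub["Category"]], sub["Name"], product["Name"]]
```

In MDS, these relationships are modeled as domain-based attributes, and the hierarchy is derived from them rather than maintained by hand.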
When you have defined the entities and attributes in your master data model, MDS provides staging tables that you can use to load data into the model. Additionally, you can create subscription views for the entities and hierarchies you have defined so that applications can retrieve master data from the model by simply submitting regular Transact-SQL queries. This database-oriented architecture for transferring data into and out of the master data model makes it easy to build a master data hub, in which new data is loaded into the MDS model to be brought under the governance of master data management, and applications can consume master data to ensure enterprise-wide consistency. In many cases, SSIS is used as the “engine” to manage the flow of data into and out of the master data hub as shown in figure 12.
Figure 12: Using SSIS to insert and extract master data
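The subscription-view pattern described above means consuming applications retrieve master data with ordinary SQL queries against a view. The sketch below uses Python's in-memory sqlite3 module purely as a stand-in for the SQL Server database that hosts real MDS subscription views; the table, view, and column names are assumptions for the example.

```python
# Illustrative sketch: a subscription view exposes master data to consuming
# applications through ordinary SQL queries. sqlite3 stands in here for the
# SQL Server database that would host the actual MDS subscription views.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mds_Product (Code TEXT, Name TEXT, ListPrice REAL)")
conn.executemany(
    "INSERT INTO mds_Product VALUES (?, ?, ?)",
    [("P1", "Mountain-100", 3399.99), ("P2", "Road-150", 3578.27)],
)

# The subscription view a consuming application would query.
conn.execute(
    "CREATE VIEW ProductMaster AS "
    "SELECT Code, Name, ListPrice FROM mds_Product"
)
rows = conn.execute(
    "SELECT Code, Name FROM ProductMaster ORDER BY Code"
).fetchall()
```

Because the interface is just a view, any application that can issue a SQL query can consume governed master data without knowing anything about the MDS model behind it.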
When your master data model has been populated with data, you can view and manage the data instances of the entities it defines, and create custom hierarchies and collections of entities for specific business scenarios. For example, you could create an explicit hierarchy of products that are sold through a specific retail partner channel, as shown in figure 13.
Figure 13: An explicit hierarchy
You can also use MDS to validate the data in your master data model by applying custom business rules. For example, you could define a rule that verifies that all product prices are greater than zero, as shown in figure 14.
Figure 14: Defining a business rule
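In spirit, a business rule is a named condition evaluated against each master data record, and validation reports the rules a record violates. The sketch below is illustrative, not the MDS business-rule engine; the rule names and the second rule are assumptions added for the example.

```python
# Illustrative sketch (not MDS business rules): named rules applied to master
# data records, like the "price greater than zero" rule described above.

business_rules = [
    ("Price must be greater than zero", lambda p: p["ListPrice"] > 0),
    ("Name is required", lambda p: bool(p.get("Name"))),
]

def validate_member(product):
    """Return the names of the business rules this record violates."""
    return [name for name, rule in business_rules if not rule(product)]

valid = {"Name": "Mountain-100", "ListPrice": 3399.99}
invalid = {"Name": "", "ListPrice": 0}
```

In MDS, members that fail validation are flagged in the model (and in the Excel add-in) so that data stewards can correct them before the data is published to consuming systems.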
MDS includes many more features that enable you to implement complex MDM solutions and ensure that consistent data representations of key business entities are used across the enterprise. The combination of this ability to manage master data with MDS, the data cleansing and matching functionality of DQS, and the data integration capabilities of SSIS, creates a comprehensive platform for EIM.
Integrating Data across the Enterprise and Beyond One of the key aims of an EIM solution is to consolidate the information from multiple, disparate sources and provide users with a “single version of the truth” on which to base their decisions. One of the main challenges to achieving this consolidation in many organizations is that the required data is locked in discrete application silos, or needs to be obtained from external sources.
SSIS Connection Managers and Data Sources Earlier in this paper, you learned how SSIS provides a platform for creating ETL solutions that integrate data from multiple sources. One of the key benefits of SSIS is the broad range of data connectivity it supports, from relational database systems to XML and flat files or Excel workbooks. The primary way in which SSIS connects to data sources is through an extensible architecture of connection managers, a significant number of which are provided “out of the box” in SSIS.
Figure 15: SSIS Connection Managers
Figure 15 shows a range of connection managers, including ODBC and OLEDB connection managers that can be used to connect to a wide range of common data sources, including SQL Server, Oracle, DB2, MySQL, and other database systems. You can even connect to and consume data from cloud-based databases in SQL Azure. Additionally, connection managers are available for enterprise applications such as SAP and Teradata. SSIS also includes a large number of connection managers for commonly used data file formats, such as Excel, XML, or comma-delimited text files. You can combine these with control flow tasks to manage file system resources, FTP connections, and Web services to create complex workflows that process and consume data files.
SSIS data flows can include distributed transactions for data sources that support them, so you can use them to create reliable ETL processes that produce consistent data. You can also use the checkpoint capability of SSIS to restart failed data flows without repeating workflow tasks that have already completed successfully. If your data resides in SQL Server or Oracle databases, new features in SQL Server 2012 make it easier than ever to identify and extract modified data through enhanced support for Change Data Capture (CDC). These features make it easy to detect data that has changed since the previous data extraction cycle, and restrict data retrieval to include only the modified rows. This significantly improves the performance of your ETL workflows while ensuring that the information your organization uses to make business decisions reflects the latest version of the data.
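The incremental-extraction effect that CDC support provides can be sketched as a high-water-mark pattern: each ETL cycle retrieves only the rows modified since the previous cycle. This is an illustrative sketch only; real CDC in SQL Server tracks changes from the transaction log using log sequence numbers rather than the modified-date column assumed here.

```python
# Illustrative sketch (not the SQL Server CDC API): incremental extraction
# using a high-water mark, so each ETL cycle retrieves only the rows
# modified since the previous cycle.

def extract_changes(rows, last_watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    changed = [r for r in rows if r["ModifiedDate"] > last_watermark]
    new_watermark = max(
        (r["ModifiedDate"] for r in changed), default=last_watermark
    )
    return changed, new_watermark

rows = [
    {"Id": 1, "ModifiedDate": "2012-06-01"},
    {"Id": 2, "ModifiedDate": "2012-06-15"},
]
changed, watermark = extract_changes(rows, "2012-06-10")
```

Persisting the watermark between package executions is what lets the next cycle skip rows it has already processed, which is the performance win the CDC features deliver.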
SSIS Extensibility
If your data resides in a source for which no connection manager is provided in SSIS, you can take advantage of the extensibility of SSIS and either procure or create a custom connection manager to suit your needs. SSIS components are based on Microsoft .NET base classes, and developers can easily create components that inherit from these classes to implement custom connection managers, data sources and destinations, transformations, and control flow tasks. For example, many organizations store business data in SharePoint lists, and need to consume this data in ETL processes. A custom component to consume SharePoint list data is available from the CodePlex Web site (www.codeplex.com), and you can install and use this to integrate SharePoint data into your EIM solution. You can also create your own custom data sources, transformations, and destinations by using the Script component, which is provided as standard in SSIS. The Script component enables you to implement custom functionality by creating a Visual Studio Tools for Applications (VSTA) script in a supported language such as C# or Visual Basic, as shown in figure 16.
For more information about integrating SharePoint data into an SSIS data flow, see http://technet.microsoft.com/en-us/library/hh368261.aspx.
Figure 16: Implementing a custom script in an SSIS data flow
The ability to create custom components and scripts makes it possible to build data integration solutions for virtually any data. For example, you could integrate event data from a data stream generated by plant machinery or sensors by creating a custom SSIS solution that consumes data from SQL Server StreamInsight.
Empowering Business Users
One of the key differentiators of the Microsoft solution for EIM when compared to competitors is the notion that business data belongs to the business, not to IT. The IT department is great at managing application and data infrastructure, but knowledge of what that data actually means, and how it should be cleansed and made consistent, is best held by the information workers who use it in their day-to-day roles. SQL Server 2012 gives IT specialists the tools they need to build a comprehensive data integration solution and manage data governance and compliance across the data infrastructure, but it also gives business users intuitive tools that they can use to manage the quality and integrity of their own data.
For more information about using StreamInsight and SSIS, see http://technet.microsoft.com/en-us/library/gg567304.aspx.
Empowering Users to Manage Data Quality The DQS client application provides an intuitive wizard-based tool with which business users can create and manage knowledge bases, and perform data quality tasks such as data cleansing or matching, as shown in figure 17. This ability to manage data quality with minimal technology or database expertise makes it possible for business users to take on the role of “data steward”, and manage the integrity of the data used by the business.
Figure 17: A wizard-based approach to data quality management
After performing a data cleansing or matching operation with the DQS client application, users can export the results as a Microsoft Excel workbook, as shown in figure 18. This enables them to use a familiar tool to examine and verify the suggestions that DQS has generated before applying them to production data.
Figure 18: Data cleansing results in Microsoft Excel
Empowering Business Users to Manage Master Data
Excel is also the primary tool with which business users can manage master data. With SQL Server 2012 Master Data Services, information workers can use the MDS Add-In for Microsoft Excel to create master data models and manage existing database or data warehouse dimensions and hierarchies from within Excel, as shown in figure 19. Excel provides a familiar and intuitive environment for managing master data, and business users can build and publish master data models quickly and efficiently, without specialist support from IT or external consultants.
Figure 19: Managing a master data model with Excel
When the master data model is built, Excel continues to provide a user-friendly environment for adding and editing entity records in the model, using standard Excel functionality to type individual attribute values or to copy and paste entire ranges of cells that represent multiple entity instances. Users can also save and share queries against the master data model, and even validate data against the business rules defined in MDS, as shown in figure 20.
Figure 20: Validating master data against business rules in Excel
Case Study – Areva
Areva is a French nuclear power company that provides governments and utilities with solutions for low-carbon power generation. A critical part of the company’s business is providing its customers with up-to-date, accurate data. This was a challenge because Areva had no centralized management of its master data. Instead, the company stored and managed customer data at each individual subsidiary site, where multiple databases were used. As a result, data was often redundant and, in some cases, inaccurate. To validate the data and make it useful for customers, Areva employees had to manually make changes, a costly and time-consuming process that often meant customers did not receive data on time.

In October 2011, Areva began to implement a new centralized master data hub based on Microsoft SQL Server 2012 Enterprise data management software, focusing on SQL Server 2012 Master Data Services and Data Quality Services. The new solution consolidates dozens of databases containing legal information, internal organizational data, and specific customer data. It uses SQL Server 2012 Integration Services to clean and transform data.
Areva takes advantage of SQL Server 2012 Master Data Services to enforce the company’s processes, validations, and rules to provide the most correct data. This eliminates the processes it previously had in place to keep the company’s data current. With SQL Server 2012 Data Quality Services the company further improves data quality by profiling, cleansing, and matching its most critical data. In addition, Areva is using the SQL Server 2012 Master Data Services Add-In for Excel, a new tool that gives users the ability to gather data and publish it to the database from within Microsoft Excel 2010 spreadsheet software, which the company had already been using. Areva used these new technologies to build an identity access management application that stores centralized, updated personal employee data from the IT, Financial, and Human Resources departments of all 330 of the company’s subsidiaries. The application includes a 2-gigabyte database with 48,000 rows of data in its largest table. The new solution improves data quality and timeliness. It also helps Areva make better business decisions. What’s more, because the solution is based on familiar tools, it doesn’t require special training. Finally, Areva saves time and money by eliminating manual data cleansing.
Make Trusted Decisions on Credible, Consistent Data
The overall aim of any EIM solution is to enable business users to rely on the data they use to make critical business decisions. With SQL Server 2012, business users can use DQS to define and manage the knowledge bases on which data cleansing and matching rely; and they can manage the consistency of business entity data through Master Data Services. The result of this user-centric approach is a solution that maximizes the value of business data, quickly and cost-effectively. To complete the solution, users must be able to easily consume the standardized data they have created, and use it to make effective business decisions. To be of any practical use, a user-centric EIM solution must support user-centric BI. SQL Server 2012 leads the way in self-service BI, delivering insights from trusted data directly to users. SQL Server PowerPivot, shown in figure 21, provides a massively scalable but easy-to-use, Excel-based data analysis tool with which business users can slice and dice data, and easily share their analysis through SharePoint.
Figure 21: Analyzing data with PowerPivot
SQL Server 2012 also introduces Power View, a user-centric tool for interactively visualizing data in an intuitive and easy-to-use interface, as shown in figure 22. The ability for business users to take on the role of data steward with DQS and MDS, and to directly analyze and visualize data with self-service BI tools like PowerPivot and Power View, enables them to take an active role in the complete EIM lifecycle. This user-centric approach empowers organizations to use their IT resources to manage data infrastructure and integration processes effectively, while reducing the burden on IT to manage business data and analytics. The result is a lower overall cost of implementing EIM and a dynamic approach to business decision making that promotes business responsiveness and flexibility.
Figure 22: Interactive data visualization with Power View
Conclusion
A good Enterprise Information Management (EIM) solution should start and end with the business users who drive the success of the company. SQL Server 2012 empowers business users to manage the quality, integrity, and standardization of the data they use every day, allowing them to trust that they are making decisions on credible, consistent data. In this model, IT still retains oversight of the organization’s data infrastructure. With SQL Server 2012 Integration Services, Master Data Services, and Data Quality Services, you can easily bring together data from all across your enterprise, and use the data quality and governance rules defined by the business to create a reliable, trusted source of data for business decision making.
For more information: http://www.microsoft.com/sqlserver/: SQL Server Web site http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. Tell us, on a scale of 1 (poor) to 5 (excellent), how you would rate this paper and why you have given it this rating. For example:
• Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
• Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release. Send feedback.