BODS30: Data Services - Data Quality Management
SAP BusinessObjects Data Services
Participant Handbook
Course Version: 96
Course Duration: 2 Day(s)
Material Number: 50104426
An SAP course - use it to learn, reference it for work
Copyright

Copyright © 2011 SAP AG. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.
Trademarks

•	Microsoft®, WINDOWS®, NT®, EXCEL®, Word®, PowerPoint® and SQL Server® are registered trademarks of Microsoft Corporation.
•	IBM®, DB2®, OS/2®, DB2/6000®, Parallel Sysplex®, MVS/ESA®, RS/6000®, AIX®, S/390®, AS/400®, OS/390®, and OS/400® are registered trademarks of IBM Corporation.
•	ORACLE® is a registered trademark of ORACLE Corporation.
•	INFORMIX®-OnLine for SAP and INFORMIX® Dynamic Server™ are registered trademarks of Informix Software Incorporated.
•	UNIX®, X/Open®, OSF/1®, and Motif® are registered trademarks of the Open Group.
•	Citrix®, the Citrix logo, ICA®, Program Neighborhood®, MetaFrame®, WinFrame®, VideoFrame®, MultiWin® and other Citrix product names referenced herein are trademarks of Citrix Systems, Inc.
•	HTML, DHTML, XML, XHTML are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.
•	JAVA® is a registered trademark of Sun Microsystems, Inc.
•	JAVASCRIPT® is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.
•	SAP, SAP Logo, R/2, RIVA, R/3, SAP ArchiveLink, SAP Business Workflow, WebFlow, SAP EarlyWatch, BAPI, SAPPHIRE, Management Cockpit, mySAP.com Logo and mySAP.com are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other products mentioned are trademarks or registered trademarks of their respective companies.
Disclaimer

THESE MATERIALS ARE PROVIDED BY SAP ON AN "AS IS" BASIS, AND SAP EXPRESSLY DISCLAIMS ANY AND ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WITH RESPECT TO THESE MATERIALS AND THE SERVICE, INFORMATION, TEXT, GRAPHICS, LINKS, OR ANY OTHER MATERIALS AND PRODUCTS CONTAINED HEREIN. IN NO EVENT SHALL SAP BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR PUNITIVE DAMAGES OF ANY KIND WHATSOEVER, INCLUDING WITHOUT LIMITATION LOST REVENUES OR LOST PROFITS, WHICH MAY RESULT FROM THE USE OF THESE MATERIALS OR INCLUDED SOFTWARE COMPONENTS.
About This Handbook

This handbook is intended to complement the instructor-led presentation of this course and serve as a source of reference. It is not suitable for self-study.
Typographic Conventions

American English is the standard used in this handbook. The following typographic conventions are also used.

Type Style: Description
Example text: Words or characters that appear on the screen. These include field names, screen titles, pushbuttons as well as menu names, paths, and options. Also used for cross-references to other documentation, both internal and external.
Example text: Emphasized words or phrases in body text, titles of graphics, and tables.
EXAMPLE TEXT: Names of elements in the system. These include report names, program names, transaction codes, table names, and individual key words of a programming language, when surrounded by body text, for example SELECT and INCLUDE.
Example text: Screen output. This includes file and directory names and their paths, messages, names of variables and parameters, and passages of the source text of a program.
Example text: Exact user entry. These are words and characters that you enter in the system exactly as they appear in the documentation.
<Example text>: Variable user entry. Pointed brackets indicate that you replace these words and characters with appropriate entries.
Icons in Body Text

The following icons are used in this handbook. Their meanings, in order, are:
•	For more information, tips, or background
•	Note or further explanation of a previous point
•	Exception or caution
•	Procedures
•	Indicates that the item is displayed in the instructor's presentation
Contents

Course Overview
	Course Goals
	Course Objectives
Unit 1: Describing Data Services
	Defining Data Services
Unit 2: Using Data Quality Transforms
	Data Quality Transforms
Unit 3: Using Address Cleanse Transforms
	Using the Address Cleanse Transforms
Unit 4: Using Data Cleanse Transforms
	Using the Data Cleanse Transform
Unit 5: Matching and Consolidating Data
	Using the Match Wizard
	Configuring the Match Transform
	Performing Post-Match Processing
	Consolidating Matching Records
	Using Advanced Match Strategies
Course Overview

SAP BusinessObjects Data Quality enables you to parse, cleanse, standardize, consolidate, and enhance records. In this two-day course, you will learn about cleansing address and firm data, and about matching and consolidating records. As a business benefit, being able to create efficient data quality projects lets you use the transformed data to help improve operational and supply chain efficiencies, enhance customer relationships, create new revenue opportunities, and optimize return on investment from enterprise applications.
Target Audience

This course is intended for the following audiences:
•	Consultants and project team members responsible for implementing data quality projects
•	Customer users responsible for administering and managing data quality projects
Course Prerequisites

Required Knowledge:
•	Thorough knowledge of ETL (Extraction, Transformation, and Loading) data processes
•	Thorough knowledge of defining source and target metadata in the SAP BusinessObjects Designer
•	Thorough knowledge of creating batch jobs
•	Thorough knowledge of using functions, scripts, and variables
•	Thorough knowledge of using Data Integrator Platform transforms
Course Goals

This course will prepare you to:
•	Use Data Quality transforms to parse, standardize, cleanse, enhance, and match records
•	Consolidate matching records to form a best record
Course Objectives

After completing this course, you will be able to:
•	Complete an Address Cleanse transform
•	Work with global address data
•	Complete a Data Cleanse transform
•	Set up matching using the Match Wizard
•	Consolidate matching records to form a best record
Unit 1: Describing Data Services

Unit Overview

Data Services is a graphical interface for creating and staging jobs for data integration and data quality purposes. After completing this lesson, you will be able to:
•	Describe the purpose of Data Services
•	Describe the Data Services architecture
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit Objectives

After completing this unit, you will be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit Contents

Lesson: Defining Data Services
Lesson: Defining Data Services

Lesson Overview

Data Services is a graphical interface for creating and staging jobs for data integration and data quality purposes.
Lesson Objectives

After completing this lesson, you will be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Business Example

For reporting in SAP NetWeaver Business Warehouse, your company needs data from diverse data sources, such as SAP systems, non-SAP systems, the Internet, and other business applications. You should therefore examine the technologies that SAP NetWeaver BW offers for data acquisition.
Describing Data Services

BusinessObjects Data Services provides a graphical interface that allows you to easily create jobs that extract data from heterogeneous sources, transform that data to meet the business requirements of your organization, and load the data into a single location.

Note: Although Data Services can be used for both real-time and batch jobs, this course covers batch jobs only.

Data Services combines both batch and real-time data movement and management with intelligent caching to provide a single data integration platform for information management from any information source and for any information use.
Figure 1: Data Services Architecture
This unique combination allows you to:
•	Stage data in an operational data store, data warehouse, or data mart.
•	Update staged data in batch or real-time modes.
•	Create a single environment for developing, testing, and deploying the entire data integration platform.
•	Manage a single metadata repository to capture the relationships between different extraction and access methods and provide integrated lineage and impact analysis.
Data Services performs three key functions that can be combined to create a scalable, high-performance data platform. It:
•	Loads Enterprise Resource Planning (ERP) or enterprise application data into an operational data store (ODS) or analytical data warehouse, and updates in batch or real-time modes.
•	Creates routing requests to a data warehouse or ERP system using complex rules.
•	Applies transactions against ERP systems.
Data mapping and transformation can be defined using the Data Services Designer graphical user interface. Data Services automatically generates the appropriate interface calls to access the data in the source system. For most ERP applications, Data Services generates SQL optimized for the specific target database (Oracle, DB2, SQL Server, Informix, and so on). Automatically-generated, optimized code reduces the cost of maintaining data
warehouses and enables you to build data solutions quickly, meeting user requirements faster than other methods (for example, custom-coding, direct-connect calls, or PL/SQL). Data Services can apply data changes in a variety of data formats, including any custom format using a Data Services adapter. Enterprise users can apply data changes against multiple back-office systems singularly or sequentially. By generating calls native to the system in question, Data Services makes it unnecessary to develop and maintain customized code to manage the process. You can also design access intelligence into each transaction by adding flow logic that checks values in a data warehouse or in the transaction itself before posting it to the target ERP system.
The Data Services Architecture

Data Services relies on several unique components to accomplish the data integration and data quality activities required to manage your corporate data. Data Services includes the standard components:
•	Designer
•	Repository
•	Job Server
•	Engines
•	Access Server
•	Adapters
•	Real-time Services
•	Address Server
•	Cleansing Packages, Dictionaries, and Directories
•	Management Console
This diagram illustrates the relationships between these components.

The Data Services Designer

Data Services Designer is a Windows client application used to create, test, and manually execute jobs that transform data and populate a data warehouse. Using Designer, you create data management applications that consist of data mappings, transformations, and control logic.
Figure 2: Data Services Designer Interface
You can create objects that represent data sources, and then drag, drop, and configure them in flow diagrams. Designer allows you to manage metadata stored in a local repository. From the Designer, you can also trigger the Job Server to run your jobs for initial application testing.

The Data Services Repository

The Data Services repository is a set of tables that holds user-created and predefined system objects, source and target metadata, and transformation rules. It is set up on an open client/server platform to facilitate sharing metadata with other enterprise tools. Each repository is stored on an existing Relational Database Management System (RDBMS).
Figure 3: Data Services Repository
There are three types of repositories:
•	A local repository (known in Designer as the Local Object Library) is used by an application designer to store definitions of source and target metadata and Data Services objects.
•	A central repository (known in Designer as the Central Object Library) is an optional component that can be used to support multiuser development. The Central Object Library provides a shared library that allows developers to check objects in and out for development.
•	A profiler repository is used to store information that is used to determine the quality of data.
The Data Services Job Server

Each repository is associated with at least one Data Services Job Server, which retrieves the job from its associated repository and starts the data movement engine. The data movement engine integrates data from multiple heterogeneous sources, performs complex data transformations, and manages extractions and transactions from ERP systems and other sources. The Job Server can move data in batch or real-time mode and uses distributed query optimization, multithreading, in-memory caching, in-memory data transformations, and parallel processing to deliver high data throughput and scalability.
Figure 4: Data Services Architecture – Job Server
While designing a job, you can run it from the Designer. In your production environment, the Job Server runs jobs triggered by a scheduler or by a real-time service managed by the Data Services Access Server. In production environments, you can balance job loads by creating a Job Server Group (multiple Job Servers), which executes jobs according to overall system load.

Data Services provides distributed processing capabilities through the Server Groups. A Server Group is a collection of Job Servers that each reside on different Data Services server computers. Each Data Services server can contribute one, and only one, Job Server to a specific Server Group. Each Job Server collects resource utilization information for its computer. This information is utilized by Data Services to determine where a job, data flow, or sub-data flow (depending on the distribution level specified) should be executed.

The Data Services Engines

When Data Services jobs are executed, the Job Server starts Data Services engine processes to perform data extraction, transformation, and movement. Data Services engine processes use parallel processing and in-memory data transformations to deliver high data throughput and scalability.

The Data Services Cleansing Packages
Figure 5: Data Services Architecture – Cleansing Packages
The Data Quality Cleansing Package files provide referential data for the Data Cleanse and Address Cleanse transforms to use when parsing, standardizing, and cleansing name and address data. Cleansing Packages enhance the ability of Data Cleanse to accurately process various forms of global data by including language-specific reference data and parsing rules. Directories provide information on addresses from postal authorities; cleansing package files are used to identify, parse, and standardize data such as names, titles, and firm data. Dictionaries also contain acronym, match standard, gender, capitalization, and address information.

The Data Services Management Console
The Data Services Management Console provides access to these features:
•	Administrator: Administer Data Services resources, including:
	– Scheduling, monitoring, and executing batch jobs
	– Configuring, starting, and stopping real-time services
	– Configuring Job Server, Access Server, and repository usage
	– Configuring and managing adapters
	– Managing users
	– Publishing batch jobs and real-time services via web services
	– Reporting on metadata
•	Auto Documentation: View, analyze, and print graphical representations of all objects as depicted in Data Services Designer, including their relationships, properties, and more.
•	Data Validation: Evaluate the reliability of your target data based on the validation rules you create in your Data Services batch jobs to quickly review, assess, and identify potential inconsistencies or errors in source data.
•	Impact and Lineage Analysis: Analyze end-to-end impact and lineage for Data Services tables and columns, and SAP BusinessObjects Business Intelligence platform objects such as universes, business views, and reports.
•	Operational Dashboard: View dashboards of status and performance execution statistics of Data Services jobs for one or more repositories over a given time period.
•	Data Quality Reports: Use data quality reports to view and export SAP Crystal Reports for batch and real-time jobs that include statistics-generating transforms. Report types include job summaries, transform-specific reports, and transform group reports. To generate reports for Match, US Regulatory Address Cleanse, and Global Address Cleanse transforms, you must enable the Generate report data option in the Transform Editor.
Other Data Services Tools

There are also several tools to assist you in managing your Data Services installation.
The Data Services Repository Manager allows you to create, upgrade, and check the versions of local, central, and profiler repositories. The Data Services Server Manager allows you to add, delete, or edit the properties of Job Servers. It is automatically installed on each computer on which you install a Job Server. Use the Server Manager to define links between Job Servers and repositories. You can link multiple Job Servers on different machines to a single repository (for load balancing) or each Job Server to multiple repositories (with one default) to support individual repositories (for example, separating test and production environments). The License Manager displays the Data Services components for which you currently have a license. The Metadata Integrator allows Data Services to seamlessly share metadata with SAP BusinessObjects Intelligence products. Run the Metadata Integrator to collect metadata into the Data Services repository for Business Views and Universes used by SAP Crystal Reports, Desktop Intelligence documents, and Web Intelligence documents. Note: These other tools are not discussed in this course.
Defining Data Services Objects

Data Services provides you with a variety of objects to use when you are building your data integration and data quality applications.
Figure 6: Data Services Object Types
Data Services Objects
In Data Services, all entities you add, define, modify, or work with are objects. Some of the most frequently used objects are:
•	Projects
•	Jobs
•	Work flows
•	Data flows
•	Transforms
•	Scripts
This diagram shows some common objects.
Figure 7: Data Services Objects
All objects have options, properties, and classes. Each can be modified to change the behavior of the object. Options control the object. For example, to set up a connection to a database, the database name is an option for the connection. Properties describe the object. For example, the name and creation date describe what the object is used for and when it became active. Attributes are properties used to locate and organize objects. Classes define how an object can be used. Every object is either reusable or single-use. Single-use objects appear only as components of other objects. They operate only in the context in which they were created. You cannot copy single-use objects.
A reusable object has a single definition and all calls to the object refer to that definition. If you change the definition of the object in one place, and then save the object, the change is reflected in all other calls to the object. Most objects created in Data Services are available for reuse. After you define and save a reusable object, Data Services stores the definition in the repository. You can then reuse the definition as necessary by creating calls to it. For example, a data flow within a project is a reusable object. Multiple jobs, such as a weekly load job and a daily load job, can call the same data flow. If this data flow is changed, both jobs call the new version of the data flow. You can edit reusable objects at any time independent of the current open project. For example, if you open a new project, you can open a data flow and edit it. However, the changes you make to the data flow are not stored until you save them.

Defining the Relationship Between Objects

Jobs are composed of work flows and/or data flows:
•	A work flow is the incorporation of several data flows into a sequence.
•	A data flow is a process that transforms source data into target data.
Figure 8: Data Services Object Relationships
A work flow orders data flows and the operations that support them. It also defines the interdependencies between data flows. For example, if one target table depends on values from other tables, you can use the work flow to specify the order in which you want Data Services to populate the tables. You can also use work flows to define strategies for handling errors that occur during project execution, or to define conditions for running sections of a project.

A data flow defines the basic task that Data Services accomplishes, which involves moving data from one or more sources to one or more target tables or files. You define data flows by identifying the sources from which to extract data, the transformations the data should undergo, and targets.

Defining Projects and Jobs

A project is the highest-level object in Designer. Projects provide a way to organize the other objects you create in Designer. A job is the smallest unit of work that you can schedule independently for execution. A project is a single-use object that allows you to group jobs. For example, you can use a project to group jobs that have schedules that depend on one another or that you want to monitor together. Projects have these characteristics:
•	Projects are listed in the Local Object Library.
•	Only one project can be open at a time.
•	Projects cannot be shared among multiple users.
The objects in a project appear hierarchically in the project area. If a plus sign (+) appears next to an object, you can expand it to view the lower-level objects contained in the object. Data Services displays the contents as both names and icons in the project area hierarchy and in the workspace. Jobs must be associated with a project before they can be executed in the project area of Designer.

Using Work Flows

Jobs with data flows can be developed without using work flows. However, one should consider nesting data flows inside of work flows by default. This practice can provide various benefits. Always using work flows makes jobs more adaptable to additional development and/or specification changes. For instance, if a job initially consists of four data flows that are to run sequentially, they could be set up without work flows. But what if specification changes require that they be merged into another job instead? The developer would have to replicate their sequence correctly in the other job. If these had been initially added to a work flow, the developer could then have simply
copied that work flow into the correct position within the new job. There would be no need to learn, copy, and verify the previous sequence. The change can be made more quickly and with greater accuracy.

Even if there is one data flow per work flow, there are benefits to adaptability. Initially, it may have been decided that recovery units are not important, the expectation being that if the job fails, the whole process could simply be rerun. However, as data volumes tend to increase, it may be determined that a full reprocessing is too time-consuming. The job may then be changed to incorporate work flows to benefit from recovery units that bypass reprocessing of successful steps. However, these changes can be complex and can consume more time than allotted for in a project plan. It also opens up the possibility that units of recovery are not properly defined. Setting these up during initial development, when the full analysis of the nature of the processing is at hand, is preferred.

Note: This course focuses on creating batch jobs using database Datastores and file formats.
Using the Data Services Designer

The Data Services Designer interface allows you to plan and organize your data integration and data quality jobs in a visual way. Most of the components of Data Services can be programmed with this interface.

Describing the Designer Window

The Data Services Designer interface consists of a single application window and several embedded supporting windows. The application window contains the menu bar, toolbar, Local Object Library, project area, tool palette, and workspace.

Using the Local Object Library

The Local Object Library gives you access to the object types listed in the table below. The table shows the tab on which the object type appears in the Local Object Library and describes the Data Services context in which you can use each type of object.

You can import objects to and export objects from your Local Object Library as a file. Importing objects from a file overwrites existing objects with the same names in the destination Local Object Library. Whole repositories can be exported in either .atl or .xml format. Using the .xml file format can make repository content easier for you to read. It also allows you to export Data Services objects to other products.

Using the Tool Palette
The tool palette is a separate window that appears by default on the right edge of the Designer workspace. You can move the tool palette anywhere on your screen or dock it on any edge of the Designer window. The icons in the tool palette allow you to create new objects in the workspace. Icons are disabled when they are not valid entries for the diagram open in the workspace. To show the name of each icon, hold the cursor over the icon until the tool tip for the icon appears.

When you create an object from the tool palette, you are creating a new definition of an object. If a new object is reusable, it is automatically available in the Local Object Library after you create it. If you select the data flow icon from the tool palette and define a new data flow called DF1, you can later drag that existing data flow from the Local Object Library and add it to another data flow called DF2.

Using the Workspace

When you open a job or any object within a job hierarchy, the workspace becomes active with your selection. The workspace provides a place to manipulate objects and graphically assemble data movement processes. These processes are represented by icons that you drag and drop into a workspace to create a diagram. This diagram is a visual representation of an entire data movement application or some part of a data movement application. You specify the flow of data by connecting objects in the workspace from left to right in the order you want the data to be moved.
Lesson Summary

You should now be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit Summary

You should now be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit 2: Using Data Quality Transforms

Unit Overview

Data Quality transforms are used to parse, standardize, cleanse, enhance, match, and consolidate your records. After completing this lesson, you will be able to:
•	Describe Data Quality transforms
Unit Objectives

After completing this unit, you will be able to:
•	Define Data Quality processes
•	Use Data Quality transforms
•	Configure Data Quality transforms
Unit Contents

Lesson: Data Quality Transforms
Exercise 1: Create Source and Target Datastores
Lesson: Data Quality Transforms

Lesson Overview

SAP BusinessObjects Data Services offers several Data Quality and Address transforms, which allow you to save time and resources in the production of mailings.
Lesson Objectives

After completing this lesson, you will be able to:
•	Define Data Quality processes
•	Use Data Quality transforms
•	Configure Data Quality transforms
Business Example

Undeliverable mail generates a considerable cost to the mailer: wasted effort, wasted materials, and wasted time. Two of the major reasons for undeliverable mail are incorrect addresses and mail recipients moving to a new address. Incorrect addresses are the result of missing apartment or suite numbers, incorrect or missing street numbers, missing box numbers, incorrect street names, and incorrect recipients. On average, in one year in the United States, 46 million people move and over one million new addresses are created. According to the United States Postal Service (USPS) National Postal Forum dated May 17-20, 2009, the reason "Customer has Moved" is responsible for over 75% of undeliverable mail.

Consider one mailing of 100,000 addresses sent at the full rate. The postage for a 1-ounce piece of mail is 44¢, and the average automation discount is 5¢. A customer would save $5,000 in postal service discounts with address cleansing. If a customer also deduplicates the list and the list has 2% duplicates, they could save an additional $780 on postage. If the piece costs 50¢ to create, the customer could save an additional $1,000 in creation/printing costs. Imagine if this were a monthly or weekly billing!
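The arithmetic behind these savings is easy to reproduce. The Python sketch below simply recomputes the figures taken from the scenario above:

```python
# Reproduce the mailing-cost example above.
pieces = 100_000             # addresses in the mailing
full_rate = 0.44             # postage for a 1-ounce piece
automation_discount = 0.05   # average per-piece automation discount
duplicate_rate = 0.02        # share of duplicate records in the list
creation_cost = 0.50         # cost to create/print one piece

postal_savings = pieces * automation_discount                   # $5,000
duplicates = int(pieces * duplicate_rate)                       # 2,000 pieces
postage_saved = duplicates * (full_rate - automation_discount)  # $780
creation_saved = duplicates * creation_cost                     # $1,000

print(f"Automation discounts:    ${postal_savings:,.0f}")
print(f"Postage saved by dedupe: ${postage_saved:,.0f}")
print(f"Creation/printing saved: ${creation_saved:,.0f}")
```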
Describing the Data Quality Framework

The term "Data Quality" is used often, but what does it mean?
Figure 9: What Is Data Quality?
The most common name and address data quality errors are seen below.
Figure 10: What Are Common Data Quality Errors?
Before you can implement an effective data quality project, you must first understand the data quality framework, which is a continuous cycle of activities that starts with measuring:

1. Measuring
Figure 11: Data Quality – Assessment
Start the process by using a data profiling application to quantify the number and types of defects in your data.
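Inside Data Services this step is handled by the Data Profiler. The hypothetical Python sketch below only illustrates the kind of measurement involved: counting missing values and summarizing the character patterns found in a column.

```python
import re
from collections import Counter

def profile_column(values):
    """Count empty values and summarize character patterns in a column."""
    empties = sum(1 for v in values if v is None or str(v).strip() == "")
    # Reduce each value to a pattern: digits become 9, letters become A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        for v in values
        if v is not None and str(v).strip()
    )
    return {"rows": len(values), "empties": empties,
            "top_patterns": patterns.most_common(3)}

# Inconsistent postal codes show up as divergent patterns.
print(profile_column(["54494", "10030-2342", "5449A", "", None]))
# {'rows': 5, 'empties': 2,
#  'top_patterns': [('99999', 1), ('99999-9999', 1), ('9999A', 1)]}
```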
2. Analyzing

Figure 12: Data Quality – Measure and Analyze
Once the data has been measured, assess the nature and cause of data defects. This information enables you to inform stakeholders of your findings and develop a data quality strategy.

3. Parsing
Figure 13: Data Quality – Cleansing
After the data has been assessed, identify and isolate the individual elements that make up the data.

4. Standardizing
Figure 14: Data Cleansing – Parse, Standardize, and Correct
Next, normalize and format the parsed data according to business rules and third-party referential data.

5. Cleansing

After standardization, verify and cleanse the data according to business rules and third-party referential data.

6. Enhancing
Figure 15: Data Quality – Enhancement
Append additional information that may increase the value of the data based on referential data.
Figure 16: Data Enhancement 1
An example of data enhancement can be seen in the graphic below.
Figure 17: Data Enhancement 2
7. Matching

Once the data has been fully cleansed, perform the matching process to identify duplicate records within multiple tables or databases. Matching after cleansing increases the likelihood of discovering overlaps.

8. Consolidating

Once matches have been identified, consolidate the data to populate a best record in a match set or update all records in a match set.

9. Continuous Monitoring

Use data profiling software to measure ongoing data quality scores and provide alerts when data scores fall outside of defined thresholds.

Where Is Data Quality Used?

Data quality is needed throughout an organization and should be used everywhere that data is managed. This includes not only batch processing of data extracted from a current system, but also transactional processing at the point of entry to the system.
Figure 18: Where Does Data Quality Fit?
The following application areas can benefit from the implementation of data quality:

•	Data Warehousing/Business Intelligence: Reporting; Analysis; Metrics; Business intelligence
•	Enterprise Resource Planning/Supply Chain: Supplier management
•	Customer Relationship Management (Call Centers, SFA): Direct mail/customer communication; Campaign management; Loyalty program management; Cross and up sales; Lifetime value; 360-degree view
•	Consumer and Corporate Householding: Multi-level customer, supplier, and patient databases; Cross selling for expanded offerings; Identification of multiple organizations as a single company
•	Fraud Detection/Compliance: BASEL II; Sarbanes-Oxley Act
•	Distribution Projects: Distribution of goods; Publishers; Printers; Service bureaus
•	Business Process Automation: Real-time decisions; Order entry; Web forms; Self service
•	Mergers/Acquisitions/Divestitures: Market bounce/EU
Defining Data Quality Processes

You can use Data Services to improve the quality of your data by using Data Quality transforms to parse, standardize, cleanse, enhance, and match records.

Parsing Data

Parsing identifies data and breaks it down into its component parts. It rearranges data elements in a single field or moves multiple data elements from a single data field to multiple discrete fields. For example, a name can be parsed into new fields such as prename, first name, last name, and postname.
Input Record:
	Mr. Dan R. Smith, Jr., CPA
	Account Mgr.
	Jones Inc.
	Dept. of Accounting
	PO Box 567
	Biron, WI
	54494

Output Fields:
	Prename: Mr.
	First Name: Dan
	Middle Name: R.
	Last Name: Smith
	Maturity Postname: Jr.
	Other Postname: CPA
	Title: Account Mgr.
	Firm: Jones Inc.
	Department: Dept. of Accounting
	Locality: Biron
	Region: WI
	Primary Number: 567
	Postal Code: 54494
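To make the mechanics concrete, here is a deliberately simplified Python sketch of the same idea. It is not the Data Cleanse transform, which relies on cleansing packages and far richer rules, and the prename and maturity word lists are illustrative assumptions:

```python
PRENAMES = {"mr.", "mrs.", "ms.", "dr."}   # illustrative subset only
MATURITY = {"jr.", "sr.", "ii", "iii"}     # illustrative subset only

def parse_name(text):
    """Break a single name field into discrete output fields."""
    parsed = {}
    tokens = [t.strip(",") for t in text.split()]
    if tokens and tokens[0].lower() in PRENAMES:
        parsed["Prename"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in MATURITY:
        parsed["Maturity Postname"] = tokens.pop()
    if tokens:
        parsed["First Name"] = tokens[0]
        parsed["Last Name"] = tokens[-1]
        if len(tokens) > 2:
            parsed["Middle Name"] = " ".join(tokens[1:-1])
    return parsed

print(parse_name("Mr. Dan R. Smith, Jr."))
# {'Prename': 'Mr.', 'Maturity Postname': 'Jr.', 'First Name': 'Dan',
#  'Last Name': 'Smith', 'Middle Name': 'R.'}
```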
Standardizing Data

Even the most stringent data entry controls can permit data with an unacceptable level of variability to enter your data store. To make your records more consistent, you can use data quality to standardize customer and operational data. Formats, abbreviations, acronyms, punctuation, greetings, casing, order, and pattern matching are examples of elements you can control.

Input Record:
	Purchase order: PO345654
	Purchase date: 3-01-02
	Part number: R20113245500
	Part desc: 3" diameter, 1 @ 52 ft

Output Fields:
	Purchase order: 34-5654
	Purchase date: 03-01-2002
	Part number: R2-0113-245500
	Diameter: 3"
	Length: 52'
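The rules behind such standardization can be pictured as small formatting functions. The Python sketch below is illustrative only; the function names and rules are invented for the example, and the actual transform applies configurable business rules and referential data:

```python
import re

def standardize_part_number(value):
    """Reformat a collapsed part number as XX-XXXX-XXXXXX."""
    chars = re.sub(r"\W", "", value)
    return f"{chars[:2]}-{chars[2:6]}-{chars[6:]}"

def standardize_date(value):
    """Normalize M-DD-YY style dates to MM-DD-YYYY (assumes 20xx)."""
    month, day, year = re.split(r"[-/]", value)
    if len(year) == 2:
        year = "20" + year
    return f"{int(month):02d}-{int(day):02d}-{year}"

print(standardize_part_number("R20113245500"))  # R2-0113-245500
print(standardize_date("3-01-02"))              # 03-01-2002
```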
Cleansing Data
Data with incorrect elements is known as "dirty data". Cleansing dirty data involves correcting it and adding missing elements, such as directions and primary types in addresses. Cleansing can occur on a wide variety of data. Depending on the type of data, you can remove or correct incorrect data using sophisticated algorithms and rules in conjunction with referential data. For example, you can use address information obtained from a national postal service to correct address data.
Mark Kessler 117–138 St W Harlem NY 10030
Output Records Name:
Mark Kessler
Address:
117 Odell Clark Pl
Locality:
New York
Region:
NY
Postal Code:
10030-2342
Enhancing Data

Data can increase in value when additional related data is attached to it. Data Quality can:

•	Provide full international postal codes. For example, ZIP+4 can be appended to the primary five-digit postal code for US addresses. When these additional values are used on mailing labels, they assist postal services to route the mail more quickly and efficiently.
•	Append other valuable information obtained from the directory services. For example, you can attach assignment, delivery, quality, geocensus, and geospatial codes and information. In addition, you can append match name standards, gender codes, and customized greetings to also increase the value of that record.
•	Assign longitude and latitude information to records. By assigning latitude and longitude information to records with a physical mailing address, your data immediately becomes more valuable. Address interpolation allows you to look at trends in your data based on these geographic demographics. Marketing initiatives by geography (for example, targeted marketing, predictive modeling) become immediately more feasible.
•	Append other geospatial information used for census, marketing, and insurance. Information codes can be appended for the following: US Census Bureau Minor Civil Division (MCD) or Census County Division (CCD) codes; Federal Information Processing Standard (FIPS) place, state, and country codes; US Census Tract and Block Numbering Area (BNA) codes; Core-Based Statistical Area (CBSA) codes; Metropolitan Statistical Area (MSA) codes; and Section codes for US Census information. Some of these codes can be used to match to demographic coding databases for census purposes, insurance rating territories, and so on for US address data.

The record below is an example of geocoding information that has been appended to a record containing an address:
Input Record:
	Margaret Smith-Kline, Ph.D.
	Future Electronics
	101 Avenue of the Americas
	New York, NY 10013-1933
	(222) 922-9922

Appended Information:
	Address Latitude: 40.723175
	Address Longitude: -74.004970
	Centroid Latitude: 40.723195
	Centroid Longitude: -74.004977
	Match quality: Highest quality address
	FIPS State Code: 36 New York
	FIPS County Code: 061 New York
	FIPS Place Code: 51000 New York
	MCD Code: 44919
	CBSA Code: 35620
	Metro Code: 5600
	Section Code: 0051001012
A number of enhancements are available for your data:

Status and Error Coding

Status and error codes indicate what has changed or is incorrect with an address. They enable you to identify issues with the data and determine the best way to resolve them.

Enhanced Line of Travel (eLOT)

Enhanced Line of Travel (eLOT) is available for the USA Regulatory Address Cleanse transform only. It is used for sorting to achieve mailing discounts. The original line of travel (LOT) narrowed the mail carrier's delivery route to the block face level (ZIP+4 level) by discerning whether an address resided on the odd or even side of a street or thoroughfare. eLOT narrows the mail carrier's delivery route walk sequence to the house (delivery point) level. This allows mailings to be sorted on a more precise level.

Geocoding
Geocodes provide latitude and longitude processing to a roof or block level. Geocoding assigns this information to records with a physical mailing address. It enables you to look at trends in your data based on these geographic demographics.

Delivery Point Validation (DPV)

Delivery Point Validation (DPV) determines if the address is deliverable. With DPV, you can identify addresses that are undeliverable as addressed and whether an address is a Commercial Mail Receiving Agency (CMRA). DPV can be useful in the following areas:
•	Mailing: Screen out undeliverable-as-addressed (UAA) mail and reduce mailing costs.
•	Information quality: Verify an address down to the individual house, suite, or apartment, rather than block face, to increase the data's level of accuracy.
•	Increased assignment rate: Using DPV tiebreak mode to resolve a tie when other tie-breaking methods are not conclusive may increase assignment rates.
•	Preventing mail-order fraud: Verify valid delivery addresses and Commercial Mail Receiving Agencies (CMRA), and eliminate shipping of merchandise to individuals who place fraudulent orders.
LACS Link (Rural Route Conversion)

LACS Link is available for USA records with the USA Regulatory Address Cleanse transform only and is required for CASS certification. It converts emergency addresses. LACS Link updates addresses when the physical address does not move but the address has changed, such as when the municipality changes rural route addresses to street-name addresses. Rural route conversions make it easier for police, fire, ambulance, and postal personnel to locate a rural address. LACS Link also converts addresses when streets are renamed or post office boxes renumbered.

Residential Delivery Indicator (RDI)

The Residential Delivery Indicator (RDI) feature is available in the USA Regulatory Address Cleanse transform. RDI determines whether a given address is for a residence. This matters because, unlike the United States Postal Service (USPS), some delivery services charge more for residential deliveries. You can use RDI if you are processing your data for CASS certification or if you are processing in a noncertified mode. In addition, RDI does not require that you use DPV processing.

Matching and Consolidating Data
Matching is used once the data has been parsed, standardized, cleansed, and enhanced to determine whether records refer to the same entity, such as a customer. Once matches are identified, consolidation salvages data from records in a match set to create a “best” record or post data to all the records in a match set based on the priority of the records. Input Record Ms Margaret Smith–Kline Ph.D. Future Electronics
Consolidated Record Name:
Ms. Margaret Smith-Kline Ph.D.
Company name:
Future Electronics Co. LLC
SSN:
001-12-4367
Hire date:
5/23/2003
Address:
101 Avenue of the Americas, New York NY 10013-1933
Phone:
(222) 922-9922
Email:
maggie.kline@future_electronics.com
101 Avenue of the Americas New York NY 10013–1933 maggie.kline@future_electronics.com May 23, 2003 Maggie Smith Future Electronics Co. LLC 101 6th Ave. Manhattan, NY 10012 maggie.kline@future_electronics.com 001–12–4367 Ms. Peg Kline Future Elect. Co. 101 6th Ave.
2011
© 2011 SAP AG. All rights reserved.
35
Unit 2: Using Data Quality Transforms
BODS30
New York NY 10013 01–12–4367 (222) 922–9922 5/23/03
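The flow from match set to best record can be sketched in a few lines of Python. This is only a schematic: difflib stands in for the Match transform's tunable, per-field comparison rules, and the 0.7 threshold and the priority values are invented for the example.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Crude string similarity standing in for real match criteria."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    {"priority": 1, "name": "Ms. Margaret Smith-Kline Ph.D.",
     "firm": "Future Electronics", "ssn": None, "phone": None},
    {"priority": 2, "name": "Maggie Smith",
     "firm": "Future Electronics Co. LLC", "ssn": "001-12-4367",
     "phone": None},
    {"priority": 3, "name": "Ms. Peg Kline",
     "firm": "Future Elect. Co.", "ssn": None,
     "phone": "(222) 922-9922"},
]

# Records whose firm names are similar enough form one match set.
anchor = records[0]
match_set = [r for r in records if score(r["firm"], anchor["firm"]) > 0.7]

# Consolidate: walk the set in priority order and keep the first
# non-empty value seen for each field, salvaging a "best" record.
best = {}
for record in sorted(match_set, key=lambda r: r["priority"]):
    for field, value in record.items():
        if value is not None and field not in best:
            best[field] = value
print(best)
```

Run as-is, the highest-priority record supplies the name while the SSN and phone number are salvaged from the lower-priority duplicates, mirroring the consolidated record shown above.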
Describing the Components that Support Data Quality

Data Quality transforms are supported by several unique components in the architecture of Data Services.

Country-specific engines process Australian, Canadian, Japanese, and USA address data. An EMEA engine processes European data, and the Global Address engine processes the additional ISO countries.

The Address Server is used to process European addresses using the Global Address Cleanse transform. It provides access to detailed address line information for most European countries. Note that the Address Server is resource-intensive and should be disabled until required. It takes time to initialize, which may impact performance when it is first started.

Cleansing Packages are one of the four types of repositories. They enhance the ability of Data Services to accurately process various forms of global data by including language-specific reference data and parsing rules. Cleansing packages are used in conjunction with rule files to identify and parse name, title, and firm data. They contain entries for words and phrases. Each entry specifies how the word or phrase might be used. You can tailor a cleansing package to better suit your needs.

Directories are used to define how to cleanse address data. Optional USA directories are available for specialized address data. The US Postal Service requires that the software disable itself when a user attempts to use expired US directories.
Describing Data Quality Transforms

Data Quality transforms are used to improve the quality of your data. They are used to parse, standardize, cleanse, enhance, match, and consolidate data. The following Data Quality transforms are available on the Transforms tab of the Local Object Library:
•	Associate: Works downstream from Match transforms to provide a way to combine, or associate, the match results by using the Match transform-generated Group Number fields.
•	Country ID: Parses input data and then identifies the country of destination for each record.
•	Data Cleanse: Identifies, parses, standardizes, and enhances name, title, and firm data, phone numbers, Social Security numbers, dates, and e-mail addresses. Assigns gender, adds prenames, generates Match standards, and creates personalized greetings. Can also be used for international, operational, and product data.
•	Global Address Cleanse: Identifies, parses, validates, and corrects global address data, such as primary number, primary name, primary type, directional, secondary identifier, and secondary number.
•	Global Suggestion List: Offers a way to complete and populate addresses with minimal data, and can make suggestions for possible matches. Is ideal for use in call center and other transactional environments to cleanse and standardize data at the point of entry.
•	Match: Matches data based on user-defined business rules.
•	USA Regulatory Address Cleanse: Identifies, parses, validates, and corrects USA address data according to the USA Coding Accuracy Support System (CASS). Outputs many useful codes and creates USPS Form 3553. Can also run in non-certified mode.
•	User Defined: Provides custom processing in a data flow using the full Python scripting language. User-Defined transforms should only be used for those tasks that cannot be accomplished with other transforms or with the Data Services scripting language in the Query transform.

	Note: The User-Defined transform lacks scalability and performance compared to other transforms and the Data Services scripting language in the Query transform. Use the User-Defined transform only for functions that extend beyond these capabilities.
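To give a sense of what such custom processing looks like, the sketch below shows per-record logic of the kind one might embed in a User-Defined transform. The record access shown (a plain Python dict) is a simplification made for illustration; the transform exposes its own record and collection objects rather than this interface:

```python
def process_record(record):
    """Flag records whose phone number carries a premium-rate prefix.

    `record` is a plain dict here; inside the actual transform you
    would read and write fields through its record API instead.
    """
    phone = (record.get("PHONE") or "").replace("-", "").replace(" ", "")
    record["PREMIUM_FLAG"] = "Y" if phone.startswith("900") else "N"
    return record

print(process_record({"NAME": "Dan Smith", "PHONE": "900-555-0199"}))
# {'NAME': 'Dan Smith', 'PHONE': '900-555-0199', 'PREMIUM_FLAG': 'Y'}
```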
Using Transform Configurations

Transform configurations are customized versions of base Data Quality transforms. They use preconfigured input fields, output fields, and options that include best practices, and can be used in multiple data flows. Once you place an instance of the transform configuration in a data flow, you can override the preset defaults. You can also create your own transform configurations, either by replicating an existing transform configuration or creating a new one from scratch.

Note: Transform configurations are only available for Data Quality transforms as part of the Data Services platform. This feature does not apply to Data Integrator or Platform transforms.
Using the Transform Editor

When you are working with Data Quality transforms, the parameters area of the Transform Editor provides access to some unique functionality. For each tab, you can choose to filter to show all available options, only those that are currently in use, or those that are designated as "best practice" for that particular transform configuration.

Note: When you create your own transform configurations, you can specify which fields and options are best practice to take advantage of these filters.
•	Input tab: Displays the available field names that are recognized by the transform. Mapping input fields to field names that the transform recognizes tells the transform how to process that field.
•	Options tab: Contains business rules that determine how the transform processes your data. Each transform has a different set of available options. If you change an option value from its default value, a green triangle appears next to the option name to indicate that you made an override.
•	Output tab: Displays the field names that can be output by the transform. Data cleansing transforms can generate fields in addition to the input fields that the transform processes, so you can output many fields.
Configuring Data Quality Transforms

The primary difference between configuring Data Quality transforms and configuring Platform transforms is that Data Quality transforms have passthrough columns, as opposed to columns that are all used by the transform, as is the case with Platform transforms. You can map passthrough, input, and output columns in the Transform Editor by dragging the columns from one pane to another.

Passthrough columns are not affected by the transform, but are available to other transforms downstream. To map a passthrough column, click and drag it from the input schema area to the output schema area.

Input columns are used by the transform to generate parsed, standardized, and cleansed output. To map an input column, click and drag it from the input schema area to the appropriate field in the Input tab. Columns for which a mapping type has been identified are mapped automatically.

Output columns contain parsed, standardized, and cleansed data from the transform. To map a column to the output schema, select the check box on the Output tab. To replace an existing field with the new data, drag the column from the Output tab to the column in the output schema area and select Remap Column from the popup menu.
Exercise 1: Create Source and Target Datastores

Exercise Objectives

After completing this exercise, you will be able to:
•	Create Datastores and import metadata for the Alpha Acquisitions, Beta Businesses, Delta, HR Data Mart, and Omega databases
Business Example

You have been hired as a Data Services designer for Alpha Acquisitions. Alpha has recently acquired Beta Businesses, an organization that develops and sells software products and related services. In an effort to consolidate and organize the data, and simplify the reporting process for the growing company, the Omega data warehouse is being constructed to merge the data for both organizations, and a separate data mart is being developed for reporting on Human Resources data. You also have access to a database for staging purposes called Delta. To start the development process, you must create datastores and import the metadata for all of these data sources.
Task: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1.	Log in to the Data Services Designer.
2.	In your Local Object Library, create a new source Datastore for the BETA Businesses database.
3.	In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
4.	In your Local Object Library, create a new Datastore for the DELTA staging database.
Solution 1: Create Source and Target Datastores Task: Create Datastores and import metadata for the Beta Businesses and Delta databases. 1.
2.
Log in to the Data Services Designer. a)
From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b)
In the dialog box, enter your assigned User ID.
c)
Enter your password, which is the same as your User ID.
d)
Choose the Log on button.
e)
In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f)
Choose OK.
In your Local Object Library, create a new source Datastore for the BETA Businesses database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options: Field
Value
Datastore name
BETA
Datastore type
Database
Database type
Microsoft SQL Server
Database version
Microsoft SQL Server 2005
Database server name
Supplied by the Instructor
Database name
BETA
User name
sourceuser (lower case)
Password
sourcepass (lower case)
Continued on next page
42
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Data Quality Transforms
      Choose OK to finish.

   c) In the Local Object Library, right-click the BETA datastore to select the Open option from the context menu. The tables from the BETA database are displayed in the Metadata window on the right.
   d) Import the metadata for the BETA database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • source.addrcodes
      • source.categories
      • source.country
      • source.customer
      • source.employees
      • source.orderdetails
      • source.orders
      • source.products
      • source.region
      • source.suppliers
      • source.usa_customers

      Note: This database is a read-only database.

   e) View the data for the USA_Customers table and confirm that Jane Hartley from Planview Inc. is the first customer record by right-clicking the table in the Local Object Library and choosing the View data option.

3. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DQSTANDALONE
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DQSTANDALONE
      User name              sourceuser (lower case)
      Password               sourcepass (lower case)

      Choose OK to finish.

   c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
   d) Import the metadata for the DQSTANDALONE database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • AddressCleansed_USA
      • DataCleansed_USA
      • DataCleansed2_USA
      • Match_Target
      • Match_Target2
      • Match_Target3
      • Match_Target4

      Note: This database is a read-only database.

4. In your Local Object Library, create a new Datastore for the DELTA staging database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DELTA
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DELTA## (where ## is the group number assigned by your instructor)
      User name              student## or consult your instructor
      Password               student## or consult your instructor

      Choose OK to finish.

   c) You do not have to import any metadata.

      Note: This database is a read/write database and will be used as the target for your template tables.
Lesson Summary
You should now be able to:
• Define Data Quality processes
• Use Data Quality transforms
• Configure Data Quality transforms
Unit Summary
You should now be able to:
• Define Data Quality processes
• Use Data Quality transforms
• Configure Data Quality transforms
Unit 3: Using Address Cleanse Transforms

Unit Overview
This unit describes the address cleansing capabilities of Data Services. After completing this unit, you will be able to:
• Describe the Address Cleanse transforms

Unit Objectives
After completing this unit, you will be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data

Unit Contents
Lesson: Using the Address Cleanse Transforms
Exercise 2: Use the Address Cleanse Transform
Lesson: Using the Address Cleanse Transforms

Lesson Overview
The Address Cleanse transforms parse, correct, complete, and standardize address components. Address Cleanse contains engines to process Australian, Canadian, Japanese, and USA address data. An EMEA engine processes European data, and the Global Address engine processes the additional International Organization for Standardization (ISO) countries. For example, postal codes can be corrected and added for some countries with the Global Address Cleanse and USA Regulatory Address Cleanse transforms.
Lesson Objectives
After completing this lesson, you will be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data
Business Example
Undeliverable mail generates a considerable cost to the mailer: wasted effort, wasted materials, and wasted time. Two of the major reasons for undeliverable mail are incorrect addresses and mail recipients moving to a new address. Incorrect addresses are the result of missing apartment or suite numbers, incorrect or missing street numbers, missing box numbers, incorrect street names, and incorrect recipients. On average, in one year in the United States, 46 million people move and over 1,000,000 new addresses are created. According to the United States Postal Service (USPS) National Postal Forum of May 17-20, 2009, the reason "Customer has Moved" is responsible for over 75% of undeliverable mail. Consider one mailing of 100,000 addresses sent at the full rate. The postage for a 1 ounce piece of mail is 44¢, and the average automation discount is 5¢, so a customer would save $5,000 in postal service discounts with address cleansing. If the customer also deduplicates the list and the list has 2% duplicates, they could save an additional $780 on postage. If each piece costs 50¢ to create, the customer could save an additional $1,000 in creation and printing costs. Imagine if this were a monthly or weekly billing!
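The arithmetic behind these savings can be laid out explicitly. The following Python sketch reproduces the example's figures; the rates are the example's 2009-era values, not current USPS rates:

    # Worked version of the mailing-cost example above.
    pieces = 100_000              # addresses in one mailing
    full_rate = 0.44              # postage per 1 oz piece, full rate
    automation_discount = 0.05    # average automation discount per piece
    duplicate_rate = 0.02         # share of the list that is duplicated
    creation_cost = 0.50          # cost to create/print one piece

    # Address cleansing qualifies the whole mailing for the automation discount.
    postal_savings = pieces * automation_discount                   # $5,000.00

    # De-duplication removes pieces that would have been mailed at the
    # discounted rate, and that no longer need to be created at all.
    duplicates = int(pieces * duplicate_rate)                       # 2,000 pieces
    postage_saved = duplicates * (full_rate - automation_discount)  # $780.00
    creation_saved = duplicates * creation_cost                     # $1,000.00

    print(f"Postal discount savings: ${postal_savings:,.2f}")
    print(f"Duplicate postage saved: ${postage_saved:,.2f}")
    print(f"Creation/printing saved: ${creation_saved:,.2f}")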
Understanding the Business Need for Address Cleanse Transforms
Figure 19: Why Cleanse Addresses?
There are two primary reasons for cleansing your address data:

1. You can reduce costs from misdirected mail. Compliance with a defined set of standards can enable your organization to attain certification that allows you to take advantage of discounts offered by many national postal organizations, such as the United States Postal Service.

2. You can properly prepare your data for matching. When comparing two addresses to see if they match, the smallest data entry error could keep you from finding matches. You must have standardized data before you look for matches. Address cleansing can parse an address down into its smallest component parts (for example, separating the house number from the street name) so that your matching process can compare individual items and improve match results.
Figure 20: Standardization: USPS Address Preferences
The various address components need to be parsed before address cleansing can take place. For example:
Figure 21: Parsing Address Components
Address cleansing tasks include:

• Parsing address components (such as locality, region, and postal code) and generating discrete fields for improved match results. The level of parsing and correction varies by country, and the correction depends on the amount of input address information available in the record. (A sketch of this idea follows this list.)
• Standardizing address fields based on business rule options set in the transform, correcting the format of address components, and changing the way the address line looks. For example, Data Services can add or strip punctuation and abbreviate or spell out address components.
• Verifying that the locality, region, and postal code agree with one another. If a record contains only a locality and region, Data Services can add the postal code and vice versa, depending on the country.
• Appending other information, such as an address quality measurement indicating whether the region and postal code agree with the corrected address (for USA, Canada, and Australia) and demographic information (if the GeoCensus option is purchased).
• Identifying undeliverable addresses, such as vacant lots and condemned buildings (USA records only).
• Assigning diagnostic codes to help determine why addresses are not assigned or how they are corrected.
• Generating address-related reports.
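To make the parsing idea concrete, here is a minimal Python sketch of component parsing for a simple US-style address line. It is an illustration only, not the Data Services parser, which is driven by postal directories and country-specific rules rather than a single pattern:

    import re

    # Toy parser for simple US-style address lines; the group names echo
    # the discrete output fields mentioned in this lesson.
    ADDRESS_PATTERN = re.compile(
        r"^(?P<primary_number>\d+)\s+"            # house number, e.g. 100
        r"(?P<primary_name>[A-Za-z ]+?)\s+"       # street name, e.g. Main
        r"(?P<primary_type>St|Ave|Rd|Blvd|Dr)"    # street type abbreviation
        r"(?:\s+(?:Apt|Ste|Unit)\s+(?P<unit_number>\w+))?$"  # optional unit
    )

    match = ADDRESS_PATTERN.match("100 Main St Apt 3")
    if match:
        print(match.groupdict())
        # {'primary_number': '100', 'primary_name': 'Main',
        #  'primary_type': 'St', 'unit_number': '3'}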
Preparing your Input Data for Address Cleanse Transforms
Figure 22: Preparing Input Data
You can cleanse address data that is stored in discrete, multiline, and multiline hybrid address line formats; before setting up the cleansing process, you must identify which of these formats your data uses. The discrete format is a database record format in which pieces of data appear in exactly the same field arrangement in every record. For this reason, this option is the best performing format.
Figure 23: Discrete Line Format
Note: The discrete line formats are not recommended for processing multi-country databases. They are more appropriate for use with USA, Canadian, and Australian addresses.
Fields      Address1                     Address2       Address3
Firm        Micro Elektronic Ges.m.b.H                  Dupong S.A.
Address1    Baumschulengasse 250         Nybrovej 120   225 rue de Bragance
Locality1   Wein                         Lyngby         Luxembourg
Region1
Postcode    1010                         2800           1449
Country     Germany                      Denmark        Luxembourg
The multiline format is a database record format in which address data is not consistently located in the same arrangement in all records. That is, data items float among fields. For example, field one can contain the name for one record, the company for another, and the address for another.
Figure 24: Multiline Format
Fields      Address1                   Address2                      Address3
Multiline1  Lawrence Robinson          Peter Grant
Multiline2  Information Services Plc   United Car Rental, Ltd        Gran Via de Carles S.A
Multiline3  Cornwall House             3150 S. Larkin Rd., Suite 10  Ibarra Campillo
Multiline4  56 Henley Road             Etobicoke ON M9C4T9           48010 Bilbao
Multiline5
Multiline6  London
Multiline7  NW2Y 5DM
Multiline8
Country     United Kingdom             Canada                        Spain
The multiline hybrid format is a record format that contains multiline and discrete fields.
Figure 25: Multiline Hybrid Format
Fields      Address1          Address2                  Address3
Multiline1  Jonas Lancaster
Multiline2  Woodworks Ltd     Gran Via de Carles S.A.   Akzo Chemicals
Multiline3  Wetherby House    Ibarra Cambillo 16-3-A    Moreelsepk 24
Multiline4  20 Oakley Road
Multiline5
Multiline6
Multiline7
Locality1                     Bilbao
Locality2   Harlow
Region1     Essex
Postcode    CM19 5AE          48010
Lastline                                                3511 EP Utrecht
Country     United Kingdom    Spain                     Netherlands
Defining the Address Cleanse Transforms

The Address Cleanse transforms identify, parse, validate, standardize, and correct address data. An extensive number of new fields are generated by the Address Cleanse transforms. Address type, primary type, primary number, unit number, locality, region, and postal code are just a few of the commonly used output fields. If you perform both address cleansing and data cleansing, the Address Cleanse transform should be placed before the Data Cleanse transform in the data flow. There are four available Address Cleanse transforms:
Global Address Cleanse
  Cleanses your address data from any of the supported countries (not for US certification). You must use one or more of the Global Address Cleanse transform configurations: Australia (AMAS), Canada (SERP), Japan, multicountry, or USA. With this transform, you can create a Statement of Address Accuracy (SERP) report as well as an Australian Address Matching Processing (AMAS) report.

USA Regulatory Address Cleanse
  Cleanses your US address data for certification. You can also use any of the add-on options: Delivery Point Validation (DPV), Enhanced Line of Travel (eLOT), Early Warning System (EWS), Locatable Address Conversion System (LACSLink), Residential Delivery Indicator (RDI), suggestion lists (not for certification), or Z4Change. With this transform, you can create a USPS Form 3553.

Global Suggestion List
  Offers suggestions for possible address matches for your global address data. This transform is used for transactional processing and does not standardize addresses. In a project, use a Country ID transform before this transform.

Country ID
  Identifies the country of destination for the record and outputs an ISO code. Use this transform before the Global Suggestion Lists transform (it is not necessary before the Global Address Cleanse transform or the USA Regulatory Address Cleanse transform).
Note: This course focuses only on the Global Address Cleanse and USA Regulatory Address Cleanse transforms.
Describing the Global Address Cleanse Transform

The Global Address Cleanse transform cleanses your address data from any of the supported countries. You must use one or more of the Global Address Cleanse transform configurations to accomplish certification for Australia (AMAS), Canada (SERP), Japan, or multi-country records. The Global Address Cleanse transform does not support CASS certification or produce a USPS Form 3553. If you want to certify your USA address data, you must use the USA Regulatory Address Cleanse transform, which supports CASS.

Input fields
There are recognized Data Services input fields that you can use in the Global Address Cleanse transform. Input fields may vary based on the engine(s) that you enable:

• Australia (A)
• Canada (C)
• Japan (J)
• EMEA (E)
• Global Address (G)
• USA (U)
See the fields listed on the Input tab in the transform editor to view each field's properties. Input fields tell the engine which data elements to expect; for example, Locality1 tells the engine which column contains the city or locality information.

Options
The Global Address Cleanse transform offers several option groups.

Report and Analysis options
The Generate Report Data option specifies whether to generate report data for this transform:
• Yes: Generates report data for this transform.
• No: Turns off report data generation.
If you do not need to generate reports (during testing, for example), set this option to No to improve performance.

Reference files
Reference files are directories required by the Global Address Cleanse transform to process your data. Set the location for the individual countries based on whether the engine is activated in your job.
The substitution variable for reference files is $$RefFilesAddressCleanse. Use this substitution variable if you have installed the directories in the default reference data directory, that is, if you accepted the default installation directory when you installed Data Services. If you installed to a location that was not the default location, you can change the substitution variable dynamically.

Country ID Options
This required option group specifies whether or not to use Country ID processing.

Country ID Mode
  Specifies whether to always use the specified Country Name or to run Country ID processing. These options are available:
  • Constant: Assumes all of your input data is for the specified Country Name and does not run Country ID processing. Choose this option only if all of your data is from one country. This option may save processing time.
  • Assigned: Runs Country ID processing. Choose this option if the input data is from more than one country, for example, if one of these scenarios applies:
    – You use two or more of the engines and your input addresses contain country data.
    – You use only one engine and your input source contains many addresses that cannot be processed by that engine. Any addresses that cannot be processed are not sent to the engine, and the transform uses the Country_Name option value as the default country.

Country Name
  Specifies the country of destination. These options are available:
  • None: Select this value when the Country ID Mode is set to Assigned and you do not want a default country to be set when the country cannot be identified.
  • Special considerations:
    – If the Country ID Mode is set to Constant, choose the country of destination from the Country Name list. The transform assumes that all of your data is for this country. Note: You cannot choose None if the Country ID Mode is set to Constant.
    – If the Country ID Mode is set to Assigned, choose a country name to be used when Country ID processing cannot identify a country.
    – If Country Name is set to None, the address is sent to the Default engine.

Script Code
  Specifies the ISO four-character script code for your data. Unless you are processing Japanese data, choose LATN.
Engines
Assigns the engines that you want to use with the Global Address Cleanse transform.
The Global Address Cleanse transform must have at least one engine enabled to process your data:
• Yes: Activates the engine for this transform.
• No: Deactivates the engine for this transform.

Specify which engines to use with this transform:
• Australia
• Canada
• EMEA
• Global Address
• Japan
• USA

Country Standardization Options
These options are found under Standardization Options → Country Options. This option group contains the country-specific options for standardizing your address data.

Suggestion List Options
These options apply to both the Global Address Cleanse and USA Regulatory Address Cleanse transforms. This option group specifies which suggestion list components to output:
• Yes: Outputs the component.
• No: Does not output the component.

Note: For the USA Regulatory Address Cleanse transform configuration, these options are listed under Suggestion List Options → Output.
Output fields
There are several output fields that can be used with the Global Address Cleanse transform. Each output field is available based on the engine(s) you enable. Generated_Field_AddrClass and Generated_Field_Class values are available for each field. These values indicate the data component to be output. Similar to the Locality1 input field, the corresponding output field contains the official locality or city name.
Describing the USA Regulatory Address Cleanse Transform

The USA Regulatory Address Cleanse transform identifies, parses, validates, and corrects USA address data according to the US Coding Accuracy Support System (CASS). This transform can create the USPS Form 3553 and output many useful codes for your records. You can also run in a noncertification mode, as well as produce suggestion lists. For a list of reference files used by the USA Regulatory Address Cleanse transform, see the USA Regulatory Address Cleanse Options section of the Transforms Reference of the Data Services Reference Guide. The USA Regulatory Address Cleanse transform contains sample transform configurations that help you set up your data flow. The configurations include all of the required options except input fields. This table lists the transform configurations for the USA Regulatory Address Cleanse transform:
Transform Configuration                            Description
USA Regulatory Address Cleanse                     Cleanses address data according to CASS requirements.
USA Regulatory EWS Address Cleanse                 Cleanses address data according to CASS requirements, with Early Warning System.
USA Regulatory Geo Address Cleanse                 Cleanses address data according to CASS requirements, with GeoCensus.
USA Regulatory Noncertified Address Cleanse        Cleanses address data in noncertified mode.
USA Regulatory Noncertified Geo Address Cleanse    Cleanses address data in noncertified mode, with GeoCensus.
USA Regulatory RDI Address Cleanse                 Cleanses address data according to CASS requirements, with Residential Delivery Indicator.
USA Regulatory Suggestions Address Cleanse         Cleanses address data using the Suggestion List feature.
USA Regulatory Z4 Change Address Cleanse           Cleanses address data according to CASS requirements, with Z4Change.
Note: If an input record has characters outside the Latin1 code page, the USA Regulatory Address Cleanse transform does not process that data. Instead, the mapped input record is sent to the corresponding standardized output field (if applicable). No other output fields are populated for that record. If your Unicode database has valid USA addresses from the Latin1 character set, this transform processes them as normal.

Input/Output
Data Services offers a number of input and output fields for the USA Regulatory Address Cleanse transform.

Options
The USA Regulatory Address Cleanse transform offers several option groups:

Report and Analysis options
  The Generate Report Data option specifies whether to generate report data for this transform.
  • Yes: Generates report data for this transform.
  • No: Turns off report data generation.
  If you do not need to generate reports (during testing, for example), set this option to No to improve performance.

Reference Files options
  Reference files are directories used by the USA Regulatory Address Cleanse transform to correct and standardize your data. It is best to use a substitution variable for the option value, such as $$DIR_PATH_USA. This substitution variable represents the path, and you can change it dynamically.

Transform Performance options
  The Transform Performance option group contains options that affect the performance of DPV, RDI, and LACSLink processing. Set these options if you have DPV, RDI, or LACSLink activated in the transform.

Assignment options
  With this option group, you can choose the add-on features that you want to use during processing.

Standardization options
  This option group contains all of the standardization settings that you need to define for processing USA data. (The options are listed alphabetically.)

Noncertified options
  This option group includes options to process your data without some select CASS certification rules.

CASS Report options
  With this option group, you can add the necessary USPS Form 3553 information as required by the USPS when certifying a mailing.

Suggestion List options
  Set these options to configure how suggestion lists are output.
Working with Specific Types of Address Data

There are several approaches to take when using the Address Cleanse transforms, depending on the type of data that you are cleansing.

USA data for USPS certification
To process your USA address data for CASS certification or for USA suggestion lists, select the USA Regulatory Address Cleanse (URAC) transform configuration that most closely meets the project requirements and modify it to meet your business requirements.

Note: There are also USA Address Cleanse transforms configured for Delivery Point Validation (DPV), Early Warning System (EWS), Enhanced Line of Travel (eLOT), GeoCensus, Locatable Address Conversion System (LACSLink), Residential Delivery Indicator (RDI), and Z4Change.
URAC helps ensure that an address in the United States and its territories is current and deliverable. Delivery Point Validation confirms the existence of an address.
Figure 26: Delivery Point Validation
LACSLink is a process that converts addresses that are changed by local governments.
Figure 27: Locatable Address Conversion System
SuiteLink provides secondary address information for businesses.
Figure 28: SuiteLink
National Change of Address (NCOALink) software provides a secure methodology to interface with the NCOA database from the USPS using name and address matching within the SAP Data Services product. It makes change-of-address information available to mailers to help reduce undeliverable mail pieces before mail enters the mail stream.
Figure 29: National Change of Address (NCOALink)
Data Services 4.0 introduces the use of the DSF2® walk sequence number. Two addresses can have the same Lot Sequence number but be assigned different walk-sequence numbers; houses that share the same delivery unit share the same Lot number. The addresses (postal units) shown in the figure are derived from actual neighborhoods located in Holmen, WI. To aid in mail sorting, eLOT contains an eLOT sequence number field and an ascending/descending code. The eLOT sequence number indicates the first occurrence of delivery made to the add-on range within the carrier route, and the ascending/descending code indicates the approximate delivery order within the sequence number. eLOT processing may be used by mailers to qualify for enhanced carrier route presort discounts.
Figure 30: DSF2® Walk Sequencer
USA data not for USPS certification
To process USA-specific address data, use the USA Regulatory Address Cleanse transform. Although you can use the Global Address Cleanse transform as long as CASS certification is not required, you may experience a performance gain by using the USA Regulatory Address Cleanse transform.

Address data for multiple purposes
When you have two addresses intended for different purposes (for example, a billing address and a shipping address), use two Address Cleanse transform configurations. Include a Case transform to send one type of address to one transform and the other type to the other transform.

Global address data
To cleanse address data from any supported country, use the appropriate Global Address Cleanse transform configuration in your project. Note that the transform configuration for Australia can be used for AMAS certification, and the one for Canada can be used for SERP certification. However, the USA Global Address Cleanse transform configuration cannot be used for USPS certification.

Multi-country address data
Identify which country or countries your address data refers to and select the country-specific Global Address Cleanse transform that most closely fits your needs. You can modify options to process addresses for additional countries if necessary.
Exercise 2: Use the Address Cleanse Transform

Exercise Objectives
After completing this exercise, you will be able to:
• Use the USA Address Cleanse transform configuration to parse, cleanse, and standardize data
• Use the Query transform to create a new column for information code descriptions
• Assign content types for input data to assist the definition of mappings
Business Example
After analyzing the target data for the Beta Businesses, you have decided that the address data requires parsing, cleansing, and standardization. You would also like to add a column that provides descriptive information for the information code values.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

2. If the BETA, DQSTANDALONE, and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   In your Local Object Library, create a new source Datastore for the BETA Businesses database.
3. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.

4. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.

1. In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.
Task 3: Within the Omega project, create a batch job Beta_Customers_AddressCleanse_Job and a data flow Beta_Customers_AddressCleanse_DF containing the USA Address Cleanse transform with a source table and a target template table.

1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

2. In the Omega project, create a new batch job Beta_Customers_AddressCleanse_Job with a new data flow called Beta_Customers_AddressCleanse_DF.

3. In the workspace for Beta_Customers_AddressCleanse_DF, add the USA_Customers table from the BETA datastore as the source object connected to the USA Address Cleanse transform with a target template table AddressCleansed_USA in the DELTA datastore.

Task 4: Configure the USA Address Cleanse transform to parse the names and addresses in the USA_Customers table.

1. Map input columns to output columns for those columns that should be processed by the USA Address Cleanse transform, and output additional columns that parse the original information based on the USA engine.
Task 5: Execute the Beta_Customers_AddressCleanse_Job with the default options after saving all objects.

1. Execute the job with the default execution properties after saving all created objects.

Task 6: Add a Query transform to the data flow to use a lookup function to provide the definition for AC_INFO_CODE values, and execute the batch job again to see the results.

1. Add the Query transform to the workspace between the USA Address Cleanse transform and the target template table AddressCleansed_USA.

2. Use the Functions wizard to create a new lookup expression using the lookup_ext function.

3. Execute the job with the default execution properties after saving all created objects.
Solution 2: Use the Address Cleanse Transform

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
   b) In the dialog box, enter your assigned User ID.
   c) Enter your password, which is the same as your User ID.
   d) Choose the Log on button.
   e) In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
   f) Choose OK.

2. If the BETA, DQSTANDALONE, and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   In your Local Object Library, create a new source Datastore for the BETA Businesses database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         BETA
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          BETA
      User name              sourceuser (lower case)
      Password               sourcepass (lower case)
      Choose OK to finish.

   c) In the Local Object Library, right-click the BETA datastore to select the Open option from the context menu. The tables from the BETA database are displayed in the Metadata window on the right.
   d) Import the metadata for the BETA database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • source.addrcodes
      • source.categories
      • source.country
      • source.customer
      • source.employees
      • source.orderdetails
      • source.orders
      • source.products
      • source.region
      • source.suppliers
      • source.usa_customers

      Note: This database is a read-only database.

   e) View the data for the USA_Customers table and confirm that Jane Hartley from Planview Inc. is the first customer record by right-clicking the table in the Local Object Library and choosing the View data option.

3. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DQSTANDALONE
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DQSTANDALONE
      User name              sourceuser (lower case)
      Password               sourcepass (lower case)

      Choose OK to finish.

   c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
   d) Import the metadata for the DQSTANDALONE database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • AddressCleansed_USA
      • DataCleansed_USA
      • DataCleansed2_USA
      • Match_Target
      • Match_Target2
      • Match_Target3
      • Match_Target4

      Note: This database is a read-only database.
4. In your Local Object Library, create a new Datastore for the DELTA staging database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DELTA
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DELTA## (where ## is the group number assigned by your instructor)
      User name              student## or consult your instructor
      Password               student## or consult your instructor

      Choose OK to finish.

   c) You do not have to import any metadata.

      Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.

1. In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.

   a) In the Local Object Library, select the Datastores tab and open the node for the BETA datastore.
   b) Open the Tables node and right-click the USA_Customers table to choose the Edit option.
   c) Right-click the Cust_Name field to select the Properties option and choose the type Name from the drop-down list for the Content Type associated with the field.
   d) Select OK to close the dialog box.
   e) Close the table metadata by selecting the Back button.
Task 3: Within the Omega project, create a batch job Beta_Customers_AddressCleanse_Job and a data flow Beta_Customers_AddressCleanse_DF containing the USA Address Cleanse transform with a source table and a target template table.

1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   a) From the Project menu, choose the option New → Project.
   b) When the Project New dialog box appears, enter Omega in the Project name field.
   c) Select Create so that the new project appears in the Project area.

2. In the Omega project, create a new batch job Beta_Customers_AddressCleanse_Job with a new data flow called Beta_Customers_AddressCleanse_DF.

   a) In the Project area, right-click the project name and choose New Batch Job from the menu.
   b) Enter the name of the job as Beta_Customers_AddressCleanse_Job.
   c) Press Enter to commit the change.
   d) Open the job Beta_Customers_AddressCleanse_Job by double-clicking it.
   e) Select the Data Flow icon in the Tool Palette.
   f) Select the workspace where you want to add the data flow.
   g) Enter Beta_Customers_AddressCleanse_DF as the name.
   h) Press Enter to commit the change.
   i) Double-click the data flow to open the data flow workspace.

3. In the workspace for Beta_Customers_AddressCleanse_DF, add the USA_Customers table from the BETA datastore as the source object connected to the USA Address Cleanse transform with a target template table AddressCleansed_USA in the DELTA datastore.

   a) In the Local Object Library, select the Datastores tab and then select the USA_Customers table from the BETA datastore.
   b) Select and drag the object to the data flow workspace and, in the context menu, choose the Make Source option.
   c) On the Transforms tab of the Local Object Library, expand the Data Quality → Global Address Cleanse branch to select and drag the USA Address Cleanse transform to the data flow workspace.
   d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
   e) In the Tool Palette, select the icon for a template table and then select in the data flow workspace. Enter AddressCleansed_USA as the table name in the DELTA## datastore.
   f) Connect the transform to the target table by selecting the transform and, while holding down the mouse button, dragging to the target table. Release the mouse button to create the link.
Task 4: Configure the USA Address Cleanse transform to parse the names and addresses in the USA_Customers table.

1. Map input columns to output columns for those columns that should be processed by the USA Address Cleanse transform, and output additional columns that parse the original information based on the USA engine.

   a) In the data flow workspace, double-click the USA Address Cleanse transform to open the transform editor.
   b) In the transform editor for the USA Address Cleanse transform, map all columns from the Input Schema to the Output Schema by selecting all the columns in the Input Schema and dragging them to the Output Schema.
   c) Map these fields from the Schema In to the Input tab of the parameters area:

      Schema In     Input Mapping
      Address1      MULTILINE1
      Address2      MULTILINE2
      City          LOCALITY1
      PostalCode    POSTCODE

      Create the mapping by selecting the fields from the Schema In area and dragging them to the fields in the Input tab.

      Note: The fields Firm, Region, and Country are already mapped, because these content types are relevant to address cleanse transforms.

   d) On the Options tab of the parameters area, ensure that only the USA engine is enabled.

      Note: Other settings may be modified to conform to business rules. In this case, accept the default address options.

   e) On the Output tab, filter the output columns based on best practice by selecting the Best Practice radio button.
   f) Since selection order determines output order, add these output columns in the order specified to the output schema after the columns you have already mapped. Use the check boxes to make your selections.

      GENERATED_FIELD_NAME        GENERATED_FIELD_CLASS   GENERATED_FIELD_ADDRCLASS
      PRIMARY_SECONDARY_ADDRESS   BEST                    DELIVERY
      PRIMARY_SECONDARY_ADDRESS   BEST                    DUAL
      PRIMARY_NUMBER              BEST                    DELIVERY
      PRIMARY_PREFIX1             BEST                    DELIVERY
      PRIMARY_NAME1               BEST                    DELIVERY
      PRIMARY_POSTFIX1            BEST                    DELIVERY
      PRIMARY_TYPE1               BEST                    DELIVERY
      UNIT_NUMBER                 BEST                    DELIVERY
      LOCALITY1_NAME              BEST                    DELIVERY
      REGION1                     BEST                    DELIVERY
      POSTCODE_FULL               BEST                    DELIVERY
      ISO_COUNTRY_CODE_2CHAR      BEST                    DELIVERY
      STATUS_CODE                 NONE                    NONE
      INFO_CODE                   NONE                    NONE
   g) Rename the columns in the output schema by right-clicking each one in turn and selecting the Properties option. Choose OK to close the Properties dialog box.

      ORIGINAL                                            NEW
      PRIMARY_SECONDARY_ADDRESS_BEST_COMPONENT_DELIVERY   AC_PRIM_SEC_ADDRESS
      PRIMARY_SECONDARY_ADDRESS_BEST_COMPONENT_DUAL       AC_PRIM_SEC_ADDRESS_DUAL
      PRIMARY_NUMBER_BEST_COMPONENT_DELIVERY              AC_PRIM_NUM
      PRIMARY_PREFIX1_BEST_COMPONENT_DELIVERY             AC_PREFIX
      PRIMARY_NAME1_BEST_COMPONENT_DELIVERY               AC_PRIM_NAME
      PRIMARY_POSTFIX1_BEST_COMPONENT_DELIVERY            AC_POSTFIX
      PRIMARY_TYPE1_BEST_COMPONENT_DELIVERY               AC_PRIM_TYPE
      UNIT_NUMBER_BEST_COMPONENT_DELIVERY                 AC_UNIT_NUM
      LOCALITY1_NAME_BEST_COMPONENT_DELIVERY              AC_LOCALITY
      REGION1_BEST_COMPONENT_DELIVERY                     AC_REGION
      POSTCODE_FULL_BEST_COMPONENT_DELIVERY               AC_POSTCODE_FULL
      ISO_COUNTRY_CODE_2CHAR_BEST_COMPONENT_DELIVERY      AC_COUNTRY_CODE
      STATUS_CODE_NONE_ASSIGNMENT_INFO_NONE               AC_STATUS_CODE
      INFO_CODE_NONE_ASSIGNMENT_INFO_NONE                 AC_INFO_CODE
Task 5: Execute the Beta_Customers_AddressCleanse_Job with the default options after saving all objects.

1. Execute the job with the default execution properties after saving all created objects.

   a) In the Project area, right-click the Beta_Customers_AddressCleanse_Job and choose Execute from the menu.
   b) Data Services prompts you to save any objects that have not been saved. Choose OK.
   c) The Execution Properties dialog box appears. Choose OK.
   d) Return to the data flow workspace and view the target table data by right-clicking the target table and choosing the View Data option. Notice that the output includes various AC_INFO_CODE values, but there is no output column that defines what these values represent.
Task 6: Add a Query transform to the data flow to use a lookup function to provide the definition for AC_INFO_CODE values, and execute the batch job again to see the results.

1. Add the Query transform to the workspace between the USA Address Cleanse transform and the target template table AddressCleansed_USA.

   a) In the Tool Palette, select the Query transform icon and select the workspace to add a Query template to the data flow.
   b) Right-click the link between the USA Address Cleanse transform and the target template table AddressCleansed_USA to choose the Delete option.
   c) Connect the USA Address Cleanse transform to the Query transform by selecting the USA Address Cleanse transform and, while holding down the mouse button, dragging the cursor to the Query transform. Then release the mouse button.
   d) Connect the Query transform to the target template table by selecting the Query transform and, while holding down the mouse button, dragging the cursor to the target table. Then release the mouse button.
   e) Double-click the Query transform to open the transform editor.
   f) Map all columns from the input schema to the output schema by selecting each field in the input schema and, while holding down the mouse button, dragging it to the appropriate column in the output schema.
   g) Right-click the last column in the output schema to select the Create output column option and then the Below option. Enter AC_INFO_DESC as the name of the new output column, with a data type of varchar and a length of 255. Choose OK.

2. Use the Functions wizard to create a new lookup expression using the lookup_ext function.

   a) In the Mapping tab for the new output column, select the Function button, and in the Select Function dialog box, open the Database Functions category.
   b) From the list of function names, select the lookup_ext function and choose Next.
   c) In the Lookup_ext - Select Parameters dialog box, enter the parameters:

      Field/Option                     Value
      Lookup table                     BETA.SOURCE.ADDRCODES
      Condition:
        Columns in lookup table        Info_Code
        Op.(&)                         =
        Expression                     USA_AddressCleanse.AC_INFO_CODE
      Output:
        Column in lookup table         Info_Desc

      The code is generated:

      lookup_ext([BETA.SOURCE.ADDRCODES, 'PRE_LOAD_CACHE', 'MAX'],
         [Info_Desc], [NULL],
         [Info_Code, '=', USA_AddressCleanse.AC_INFO_CODE])
         SET ("run_as_separate_process"='no')

   d) Select the Back icon to close the editor.
3. Execute the job with the default execution properties after saving all created objects.

   a) In the Project area, right-click the Beta_Customers_AddressCleanse_Job and select Execute from the menu.
   b) Data Services prompts you to save any objects that have not been saved. Select OK.
   c) The Execution Properties dialog box appears. Select OK.
   d) Return to the data flow workspace and view the target table data by right-clicking the target table and choosing the View Data option. The target table now includes descriptions in the newly created AC_INFO_DESC column.
Lesson Summary
You should now be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data
Related Information
• Refer to the Supported Countries section in the Address Cleanse chapter of the Data Services Reference Guide.
• For more information, see Chapter 3 in the Data Services Reference Guide.
Refer to the Supported Countries section in the Address Cleanse chapter of the Data Services Reference Guide For more information, see Chapter 3 in the Data Services Reference Guide.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Unit Summary
Unit Summary
You should now be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data
Unit 4: Using Data Cleanse Transforms

Unit Overview
This unit describes the data cleansing capabilities of Data Services. After completing this unit, you will be able to:
• Describe the Data Cleanse transforms
• Describe the Universal Data Cleanse transform

Unit Objectives
After completing this unit, you will be able to:
• Complete a Data Cleanse transform

Unit Contents
Lesson: Using the Data Cleanse Transform
Exercise 3: Use the Data Cleanse Transform
Lesson: Using the Data Cleanse Transform

Lesson Overview
The Data Cleanse transform identifies and parses names, titles, firm data, phone numbers, Social Security numbers, dates, and e-mail addresses. You can also assign gender codes, add prenames, create personalized greetings, generate match standards, and convert input sources to a standard format. Additionally, you can parse multiple names from individual records so that a separate record can be created for each individual.
Lesson Objectives
After completing this lesson, you will be able to:
• Complete a Data Cleanse transform
Business Example
The Data Cleanse transform focuses on data other than addresses. Often titles, names, phone information, and e-mail addresses are incorrect. To correct these errors, the data must be parsed down to its lowest-level components. This is a preparatory step before standardization can be accomplished.
Understanding How Data Cleanse Parses Data

There are five main steps that the Data Cleanse transform takes while parsing operational data (a sketch of the first step follows this list):

1. Word breaking – breaks the input line down into smaller, more usable pieces. Data Cleanse breaks the input line on white space, punctuation, and alphanumeric transitions.
2. Gathering – recombines words that belong together, such as words that the dictionary lists as a single term. Data Cleanse does not attempt to combine words that have been broken for a custom parser.
3. Tokenization – assigns specific meanings to each of the pieces. Data Cleanse looks up each individual input word in the dictionary. A list of tokens is created using the classifications associated with each word in the dictionary.
4. Rule matching – matches the token meanings against defined rules. Data Cleanse does not match the pattern of specific words against the rules; it matches the pattern of the types, or classifications, of the words.
5. Action item assignment – outputs parsed data based upon matched rules.
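A rough Python sketch of the word-breaking step, splitting on white space, punctuation, and alphanumeric transitions as described above. The real Data Cleanse word breaker and its dictionaries are far richer:

    import re

    # Illustrative word breaker: split on white space and punctuation, then
    # on alphanumeric transitions (for example, "CPA2010" -> "CPA", "2010").
    def break_words(line):
        pieces = []
        for word in re.split(r"[\s,.;]+", line):
            if word:
                pieces.extend(re.findall(r"[A-Za-z]+|\d+", word))
        return pieces

    print(break_words("Dr.Robert Jones,CPA2010"))
    # ['Dr', 'Robert', 'Jones', 'CPA', '2010']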
Understanding the Business Need for Data Cleanse Transforms

As with address cleansing, one of the primary uses of data cleansing is to prepare names, titles, firm data, phone numbers, Social Security numbers, dates, and e-mail addresses for matching. Dictionaries are used with rule files to parse data to its smallest components and standardize it. This reduces variability and increases the possibility of a successful match. In addition, data cleansing can generate match name standards, which provide:

• Alternative spellings for names (for example, Kathy and Cathy)
• Alternatives for ambiguous names (for example, Patrick and Patricia for Pat)
• Names associated with nicknames (for example, Elizabeth for Liz, Beth, or Betsy)
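The following toy Python illustration shows how such standards reduce variability. The table here is hand-built and hypothetical; Data Cleanse derives its match standards from the cleansing package dictionaries:

    # Hypothetical standards table for illustration only.
    MATCH_STANDARDS = {
        "liz": ["elizabeth"],
        "beth": ["elizabeth"],
        "betsy": ["elizabeth"],
        "cathy": ["katherine"],
        "kathy": ["katherine"],
        "pat": ["patrick", "patricia"],  # ambiguous: both alternatives kept
    }

    def standards_for(first_name):
        name = first_name.lower()
        return MATCH_STANDARDS.get(name, [name])

    # "Liz Smith" and "Beth Smith" now share the standard "elizabeth",
    # so a downstream match process can pair the two records.
    print(standards_for("Liz"), standards_for("Beth"))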
This example shows how data cleansing can prepare records for matching.
Figure 31: Business Need for Data Cleanse Transform 1
Data cleansing capabilities include:

• Identifying and isolating a wide variety of data, even if the data is floating in lines.
• Standardizing data to make records more consistent, such as fixing casing, punctuation, and abbreviations.
• Assigning a precise gender code to each name: strong male, strong female, weak male, weak female, or ambiguous.
• Assigning a prename such as Mr., Ms., or Mrs. based on gender codes.
• Creating personalized greetings in formal, casual, and title styles: Dear Mr. Jones, Dear Robert, and Dear Manager. The transform creates a greeting for each person, as well as a dual name greeting for records with two names. (A sketch of prename and greeting assignment follows this list.)
• Creating a separate output record for each person in a record with multiple persons. For example, an input database can contain one record for each customer with multiple contact persons, each of which can be split into a separate record.
Figure 32: Business Need for Data Cleanse Transform 2
Understanding Strategies for the Data Cleanse Transform

How you configure your Data Cleanse transforms depends on the type of data you are cleansing.

Name, title, firm, and firm location data
You can standardize name data and generate discrete standardized fields for prename, first name, middle name, last name, maturity post name, and honorary post name, based on which fields you decide to evaluate to determine whether two records match. For the first name and middle name match standards, you can generate up to six first name match standards and up to six middle name match standards. Even though there is a maximum of six first and middle name match standards, you can only use a maximum of three first name match standards and a maximum of three middle name match standards when matching. The Data Cleanse transform also parses up to six job titles per record, up to two firm names (such as IBM), and up to two firm locations (such as Engineering Dept.). This transform can also convert firm names to accepted acronyms, such as General Motors Corp. to GM.

Social Security number data
Data Cleanse parses US Social Security numbers (SSNs) that are either by themselves or on an input line surrounded by other text. Data Cleanse outputs the individual components of a parsed Social Security number: the entire SSN, the area, the group, and the serial. Data Cleanse parses Social Security numbers in two steps. First, it identifies a potential SSN by looking for any of three digit patterns. Once a pattern is identified, Data Cleanse performs a validity check on the first five digits only. If the number fails validation, it is not output, as it is not considered a valid SSN as defined by the US government.
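The two-step approach can be sketched in Python as follows. The three patterns shown (hyphenated, space-delimited, and undelimited) and the validity rule are simplified assumptions; the real check validates the leading digits against US government issuing data:

    import re

    # Three simplified SSN patterns: 123-45-6789, 123 45 6789, 123456789.
    SSN_PATTERNS = [
        re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b"),
        re.compile(r"\b(\d{3}) (\d{2}) (\d{4})\b"),
        re.compile(r"\b(\d{3})(\d{2})(\d{4})\b"),
    ]

    def parse_ssn(text):
        for pattern in SSN_PATTERNS:
            match = pattern.search(text)
            if match:
                area, group, serial = match.groups()
                # Placeholder validity check on the first five digits only.
                if area != "000" and group != "00":
                    return {"ssn": area + group + serial, "area": area,
                            "group": group, "serial": serial}
        return None  # not identified or failed validation: nothing is output

    print(parse_ssn("Call about account 536-12-4567 today"))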
E-mail data
When Data Cleanse parses input data it recognizes as an e-mail address, it outputs the individual components of the parsed address: the e-mail user name, complete domain name, top domain, second domain, third domain, fourth domain, fifth domain, and host name. You can also verify that an e-mail address is properly formatted and flag the address as belonging to an Internet service provider (ISP). Data Cleanse does not verify whether the domain name is registered, whether an e-mail server is active at that address, whether the user name is registered on that e-mail server, or whether the personal name in the record can be reached at this e-mail address. For example, with the input data expat@london.home.office.city.co.uk, Data Cleanse outputs each element in the following fields:

Output field           Output value
Email                  expat@london.home.office.city.co.uk
Email_User             expat
Email_Domain_All       london.home.office.city.co.uk
Email_Domain_Top       uk
Email_Domain_Second    co
Email_Domain_Third     city
Email_Domain_Fourth    office
Email_Domain_Fifth     home
Email_Domain_Host      london
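The decomposition in the table can be sketched as simple string splitting in Python (the formatting check and ISP flagging are omitted):

    # Sketch of the e-mail decomposition shown in the table above.
    def parse_email(email):
        user, _, domain = email.partition("@")
        labels = domain.split(".")  # ['london', 'home', 'office', 'city', 'co', 'uk']
        fields = {"Email": email, "Email_User": user,
                  "Email_Domain_All": domain, "Email_Domain_Host": labels[0]}
        # Count the domain levels from the right: uk, co, city, office, home.
        levels = ["Top", "Second", "Third", "Fourth", "Fifth"]
        for level, label in zip(levels, reversed(labels)):
            fields["Email_Domain_" + level] = label
        return fields

    print(parse_email("expat@london.home.office.city.co.uk"))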
Phone number data
Data Cleanse can parse both North American (US and Canada) and international phone numbers. When Data Cleanse parses a phone number, it outputs the individual components of the number into the appropriate fields.
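A simplified Python sketch of North American pattern recognition and component output; the patterns Data Cleanse actually looks for are described below:

    import re

    # Simplified patterns: (234) 567-8901, 234-567-8901, and 2345678901.
    NA_PHONE = re.compile(r"\(?(\d{3})\)?[ -]?(\d{3})-?(\d{4})")

    def parse_na_phone(text, delimiter="-"):
        match = NA_PHONE.search(text)
        if not match:
            return None
        area, prefix, line = match.groups()
        return {"area_code": area, "prefix": prefix, "line": line,
                "formatted": delimiter.join((area, prefix, line))}

    print(parse_na_phone("(234) 567-8901"))
    # {'area_code': '234', 'prefix': '567', 'line': '8901',
    #  'formatted': '234-567-8901'}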
Data Cleanse recognizes phone numbers by their pattern and (for non-US numbers) by their country code. For North American phone numbers, it looks for commonly used patterns such as (234) 567-8901, 234-567-8901, and 2345678901. It gives you the option of some reformatting on output (such as your choice of delimiters). Data Cleanse searches for European and Pacific Rim numbers by pattern. The patterns used are defined from the US and require that the country code appear at the beginning of the number. Note that Data Cleanse does not offer any options for reformatting international phone numbers, nor does it cross-compare the phone number to the address to see if the country and city codes in the phone number match the address.

Date data
Data Cleanse can parse up to six dates from your defined record. Data Cleanse identifies the dates in the input, breaks the dates into day, month, and year components, and makes the dates available as output in either the original format (for example, DD-MMM-YY) or a user-selected standard format (for example, MM/DD/YYYY).

International data
By default, Data Cleanse can identify international data presented in multiple formats. There are also several ways that you can use Data Cleanse to identify and manipulate various forms of other international data, including prenames, greetings, and personal identification numbers:

• Customizing greetings and prenames per country. The default prenames and salutations found in the Data Cleanse greetings option group are commonly used in English-speaking nations. For countries where English is not the primary language, you can modify these options to reflect common prenames and salutations.

• Modifying the phone file for other countries. By default, Data Cleanse includes phone number patterns for many countries. However, if you find that you need parsing for a country that is not included, you can modify the international phone file (drlphint.dat) to enable Data Cleanse to detect phone number patterns that follow a different format. New phone number patterns can be added to the international phone file using regular expressions (see the sketch after this list).

• Using personal ID numbers. With a default Data Quality installation, Data Cleanse can identify USA Social Security numbers and separate them into discrete components. If your data includes personal identification numbers that are different from US SSNs, you can use User-Defined Pattern Matching to identify the numbers. Number formats to be identified by User-Defined Pattern Matching can be set up using regular expressions.

• Using cleansing packages. Cleansing packages enhance the ability of Data Cleanse to accurately process various forms of global data by including language-specific reference data and parsing rules. Since cleansing packages are based on the standard Data Cleanse transform, you can use the sample transforms in your projects in the same way you would use Data Cleanse and gain the advantage of enhanced reference data and parsing rules.
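The drlphint.dat file has its own format, which is not reproduced here. The following Python sketch only shows the kind of regular expression that can describe a country-specific phone pattern; the UK-style pattern used is a hypothetical example.

# Conceptual sketch only: illustrates a regex for a country-specific
# phone pattern with the country code at the beginning of the number.
import re

# Country code 44 at the start, then area code and subscriber number.
UK_PHONE = re.compile(r"^\+?44[ -]?\(?0?\d{2,4}\)?[ -]?\d{3,4}[ -]?\d{3,4}$")

for candidate in ["+44 20 7946 0958", "44-20-7946-0958", "(555) 123-4567"]:
    status = "matches" if UK_PHONE.match(candidate) else "does not match"
    print(f"{candidate!r} {status} the UK pattern")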
The Data Cleanse Transform

A default installation of Data Services with the Data Quality transforms includes several basic Data Cleanse transform configurations. A default transform configuration, EnglishNorthAmerica_DataCleanse, is preconfigured to cleanse name, title, firm, date, phone, Social Security number, and e-mail data using English-based data quality rules.
Other Data Cleanse transform configurations are preconfigured to cleanse name, title, firm, date, phone, and e-mail data using data quality rules specific to Dutch, French, German, Italian, Spanish, or Portuguese. These Data Cleanse transforms include all required options except input fields.
Configuring the Data Cleanse Transform

Configure options on the Data Cleanse transform to fit your operational data project. You need to set several options:

• Import the ATL file transferred from Cleansing Package Builder. Importing the ATL file brings in the required information and automatically sets the following options:
  – Cleansing Package
  – Engine
  – Filter Output Fields
  – Input Word Breaker
  – Parser Configuration

• In the input schema, select the input fields that you want to map and drag them to the appropriate fields in the Input tab.
  – Name and firm data can be mapped either to discrete fields or multiline fields.
  – Custom data must be mapped to multiline fields.
  – Phone, date, e-mail, Social Security number, and user-defined pattern data can be mapped either to discrete fields or multiline fields. The corresponding parser must be enabled.

• In the Options tab, select the appropriate option values to determine how Data Cleanse processes your data, including:
  – The Cleansing Package option
  – The Parser_Sequence_Multiline options

• In the Output tab, select the fields that you want to output from the transform. In Cleansing Package Builder, output fields are referred to as attributes. Ensure that you map any attributes (output fields) defined in user-defined patterns in Cleansing Package Builder reference data.
Understanding Dictionaries and Rule Files

Data cleansing in Data Services uses cleansing packages and rule files. The Reference Files section of the Options tab points to the location of supporting dictionary and data files needed for parsing. By default, these files are installed in the LINK_DIR\DataQuality\datacleanse folder. The default substitution parameter for dictionary locations is [$$RefFilesDataCleanse].

Parsing rule files contain rules that define how an element is parsed based on the tokens assigned from the dictionary. Parsing dictionaries in cleansing packages contain entries for words and phrases that define how each might be used. For example, the dictionary indicates that the word "Engineering" can be used in a firm name, such as "Smith Engineering Inc.", or in a job title, such as "VP of Engineering". The dictionary also defines the following:

Type of information    Description
Aliases                Potential matches for a value. For example, Patrick and Patricia are aliases for Pat.
Gender                 The gender assigned to a value. For example, Anne is a feminine given name and Mr. is a masculine prename.
Standards              The proper capitalization or other output formatting for data parsed by Data Cleanse.
All of the necessary dictionaries and rule files are contained in a Cleansing Package which is installed with the Cleansing Package Builder.
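To make these kinds of entries tangible, here is a toy, in-memory model in Python. It is an illustration only; the real dictionary format belongs to the cleansing package and is not reproduced here, and the classification and gender labels are illustrative.

# Illustrative model only: a stand-in for the kind of information a
# parsing dictionary holds for each term.
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    value: str
    classifications: list   # how the term may be used, e.g. FIRM_NAME, JOB_TITLE
    aliases: list = field(default_factory=list)
    gender: str = "UNASSIGNED"
    standard: str = ""       # preferred output form (casing/formatting)

entries = {
    "ENGINEERING": DictionaryEntry("ENGINEERING", ["FIRM_NAME", "JOB_TITLE"],
                                   standard="Engineering"),
    "PAT": DictionaryEntry("PAT", ["GIVEN_NAME"],
                           aliases=["PATRICK", "PATRICIA"], gender="AMBIGUOUS"),
    "MR": DictionaryEntry("MR", ["PRENAME"], gender="MALE", standard="Mr."),
}

# The parser can then ask: in which roles may this token appear?
print(entries["ENGINEERING"].classifications)   # ['FIRM_NAME', 'JOB_TITLE']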
Creating a Custom Cleansing Package in Cleansing Package Builder
Figure 33: The Cleansing Package Builder
The default Data Cleanse cleansing package consists of name and firm data that is not required for parsing operational data. If you use a complete custom dictionary, Data Cleanse does not have to consider all the default name and firm data, and parsing is faster and more accurate.

Based on your analyzed data, you can create new output categories and fields where Data Cleanse can place parsed and standardized data. For data sets where the input data comes in only a few orders, a single output category is sufficient. However, if your data can come in any order, you can use multiple output categories to reduce the number of rules needed.

Cleansing Package-based classifications add meaning to terms when assigned to primary dictionary entries. For example, the primary entry blue might be assigned the classification color, meaning Data Cleanse can identify blue as a color term. Pattern-based classifications are defined using regular expressions and are not assigned to specific terms in the dictionary. Instead, Data Cleanse looks for patterns in the input data that match the specified regular expression (see the sketch below).

To execute successfully, a Data Cleanse job must reference a cleansing package. The cleansing package may be either an SAP-supplied cleansing package or a cleansing package that you have modified and published in the Cleansing Package Builder module of Information Steward.
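The following minimal Python sketch contrasts the two kinds of classification. It is not Data Cleanse internals; the part-number pattern and the term list are hypothetical examples.

# Minimal sketch: dictionary-based classification (a term carries its
# classification) versus pattern-based classification (a regular
# expression recognizes a shape of data).
import re

term_classifications = {"BLUE": "COLOR", "CREW": "FAMILY_NAME"}

# Hypothetical pattern-based classification for a part number like "AB-1234".
pattern_classifications = [(re.compile(r"^[A-Z]{2}-\d{4}$"), "PART_NUMBER")]

def classify(token: str) -> str:
    if token.upper() in term_classifications:        # dictionary lookup first
        return term_classifications[token.upper()]
    for pattern, label in pattern_classifications:   # then try each pattern
        if pattern.match(token):
            return label
    return "UNCLASSIFIED"

for token in ["blue", "AB-1234", "widget"]:
    print(token, "->", classify(token))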
With the introduction of the Cleansing Package Builder in Information Steward, you no longer need to specify individual dictionary, rules, and reference data files. The information formerly contained in those files is now included in the cleansing package. The Cleansing Package option group and the Cleansing Package option have been added to the Data Cleanse transform.
Improving Parsing Results

The Cleansing Package Builder module of Information Steward is required with Data Services 4.0 to modify or customize any type of data. The dictionary menu has been removed from the Data Services Designer menu bar, and Data Cleanse no longer requires a separate cleansing package repository. You can modify the parsing dictionary used by the Data Cleanse transform to improve parsing results for your data.
Figure 34: Refining Data Cleansing Results
Correct specific parsing behavior

You can customize the parsing dictionary to correct specific parsing behavior that you have seen in your output.

Recognize local names

The name data in Data Cleanse's default parsing dictionary, PERSON_FIRM_EN, is based on an analysis of USA residents. As such, the parsing dictionary is broadly useful across the United States. However, you can tailor the dictionary to better suit your data by adding ethnic or regional names. If Data Cleanse does not recognize a specific name, for example, "Jinco Xandru", you can add "Jinco" to the dictionary as a first name and "Xandru" as a last name.
Identify industry-specific jargon

The default parsing dictionary is useful across many industries. You can tailor the dictionary to better suit your own industry by adding special titles, prenames or postnames, standardized abbreviations, or other jargon words. For example, if you process data for the real estate industry, you might add industry-specific postnames such as Certified Residential Specialist (CRS), Accredited Buyer Representative (ABR), or Graduate Realtor Institute (GRI).

Recognize specific phrases

Some words can be used in both firm names and job titles. As a result, Data Cleanse may incorrectly recognize some job titles as firm names. To improve parsing, you can add these job title phrases to the dictionary.

Identify firm names containing personal names

Often a firm name is made up of personal names. As a result, Data Cleanse may incorrectly parse the firm as a personal name. For example, the catalog retailer "J. Crew" may be parsed as a personal name rather than as a firm. To improve parsing, you can add multiple-word firm names to the dictionary. For example, to parse "J. Crew" as a firm rather than as a personal name, you could add "J" and "Crew" to the dictionary with the Firm_Name classification, and "J Crew" with the Firm_Name_Alone classification.

Changes to the dictionary structure require that you upgrade all Data Services 3.2 Data Cleanse dictionaries to cleansing packages of the current version. To execute successfully, a Data Cleanse job must reference a cleansing package, either an SAP-supplied cleansing package or one that you have modified and published in the Cleansing Package Builder module of Information Steward.

Restriction: If you have modified the person and firm dictionary or created a custom dictionary using Universal Data Cleanse, ensure that the Cleansing Package Builder in Information Steward is available before you upgrade. You need Cleansing Package Builder to migrate your dictionary, rules, and reference files to the new cleansing package format.
Improving Casing Results

Cleansing package dictionaries can also be used to adjust casing. If you use mixed case, the general rule is to capitalize the first letter of a word and put the rest of the word in lowercase. However, there are exceptions to that rule, such as McDonald, Ph.D., IBM, NY, and so on.
To handle mixed-case exceptions, Data Cleanse consults secondary information in the dictionary, called standards. The dictionary contains the correct casing of a word and also indicates when that casing should be used. For example, the entry MS is cased differently depending on how it is used: "M.S." is an abbreviation for the honorary postname "Master of Science", while "Ms." is a prename. The dictionary indicates which formatting to use based on the content type.

Most Data Cleanse users find that the default capitalization of words in the dictionary is sufficient for producing good mixed-case results. However, it is impossible for the default dictionary to contain every mixed-case exception. If Data Cleanse does not case a word as you want, you can create a custom standard in the dictionary. For example, "TechTel" is not in the default dictionary, so Data Cleanse capitalizes only the first letter of the word. However, if you add the word "TechTel" to your dictionary with a standard for firm name use, you can achieve the desired mixed-case results:

Input            Standard    Output
TECHTEL, INC.    (none)      Techtel Inc.
TECHTEL, INC.    TechTel     TechTel Inc.
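The idea behind a standards lookup is easy to sketch. The following Python is a deliberate simplification, assuming a plain word-to-casing table, and is not the Data Cleanse casing algorithm.

# Minimal sketch: words with an entry in STANDARDS keep their custom
# casing; everything else is title-cased.
STANDARDS = {"TECHTEL": "TechTel", "MCDONALD": "McDonald", "IBM": "IBM"}

def apply_casing(text: str) -> str:
    words = []
    for word in text.split():
        key = word.strip(",").upper()
        if key in STANDARDS:
            # Re-attach any trailing comma the original word carried.
            suffix = "," if word.endswith(",") else ""
            words.append(STANDARDS[key] + suffix)
        else:
            words.append(word.capitalize())
    return " ".join(words)

print(apply_casing("TECHTEL, INC."))   # TechTel, Inc.
print(apply_casing("MCDONALD IBM"))    # McDonald IBM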
Ranking and Prioritizing Parsing Engines

The Data Cleanse transform can be configured to use only specific parsers, or a specific parser order, when dealing with multiline input. You can change the parser order for a specific multiline input by modifying the corresponding parser sequence option in the Parser_Configuration option group of the Data Cleanse transform. For example, to change the order of parsers for the Multiline1 input field, modify the Parser_Sequence_Multiline1 option.
By default, Data Cleanse parses multiline input using parsers in the following order:

1. User-defined pattern matching
2. E-mail
3. Social Security number
4. North American phone number
5. International phone number
6. Person or firm name

Hint: Data Cleanse parser prioritization options can be modified with the ordered options window. Carefully selecting which parsers to use, and in what order, can be beneficial. Turning off parsers that you do not need significantly improves parsing speed and reduces the chances that your data will be parsed incorrectly.
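The "try parsers in priority order, first success wins" behavior is easy to illustrate. The Python below is a conceptual sketch with toy parsers, not the Data Cleanse engine.

# Minimal sketch: each parser returns a parsed result or None; parsers
# are tried in priority order and the first success wins.
import re

def parse_email(s):    return {"EMAIL": s} if re.fullmatch(r"\S+@\S+\.\S+", s) else None
def parse_ssn(s):      return {"SSN": s} if re.fullmatch(r"\d{3}-\d{2}-\d{4}", s) else None
def parse_na_phone(s): return {"PHONE": s} if re.fullmatch(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}", s) else None

PARSER_SEQUENCE = [parse_email, parse_ssn, parse_na_phone]  # order matters

def parse_multiline(value: str):
    for parser in PARSER_SEQUENCE:
        result = parser(value)
        if result is not None:
            return result
    return {"UNPARSED": value}

print(parse_multiline("expat@example.co.uk"))  # handled by the e-mail parser
print(parse_multiline("234-567-8901"))         # falls through to the phone parser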
Exercise 3: Use the Data Cleanse Transform

Exercise Objectives
After completing this exercise, you will be able to:
• Create columns to identify individual name components in the Data Cleanse transform
• Create match standards to assist in the matching process
Business Example
After analyzing the target data for the Beta Businesses, you have decided that you would like to create new columns from the USA_Customers table to contain the parsed name data and name standards for the match. After the columns are created, you would like to filter the AC_INFO_CODE results according to descriptions.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF as a replication of Beta_Customers_Address_Cleanse_DF. 1.
If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF.
Task 3: Add and configure a Data Cleanse transform to the data flow Beta_Customers_DataCleanse_DF. 1.
In the AddressCleansed_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2.
In the workspace for Beta_Customers_DataCleanse_DF, add the AddressCleansed_USA table from the DQSTANDALONE datastore as the source object connected to the English North America Data Cleanse transform with a target template table DataCleansed_USA in the DELTA datastore.
3.
Configure the Data Cleanse transform in the data flow Beta_Customers_DataCleanse_DF.
Task 4: Because of the additional transform, remap the fields from the input schema to the output schema in the Query transform. 1.
In the Query transform, remap the fields from the input schema to the output schema.
2.
Use the Functions wizard to create a new lookup expression using the lookup_ext function.
Task 5: Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters.
Task 6: In the Beta_Customers_DataCleanse2_Job, modify the Query transform to add a description field DC_GENDER_DESC to further clarify the DC_GENDER_ID field. 1.
Modify the Query transform in the Beta_Customers_DataCleanse2_DF workspace with the addition of the DC_GENDER_DESC field.
Task 7: Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters.
Solution 3: Use the Data Cleanse Transform Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
a)
From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b)
In the dialog box, enter your assigned User ID.
c)
Enter your password, which is the same as your User ID.
d)
Choose the Log on button.
e)
In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f)
Choose OK.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DQSTANDALONE
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DQSTANDALONE
User name               sourceuser (lower case)
Password                sourcepass (lower case)
Choose OK to finish. c)
In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed on the Metadata window on the right.
d)
Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, then right-clicking on them and choosing the option Import from the menu:
• AddressCleansed_USA
• DataCleansed_USA
• DataCleansed2_USA
• Match_Target
• Match_Target2
• Match_Target3
• Match_Target4

Note: This database is a read-only database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DELTA
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DELTA## (where ## is the group number assigned by your Instructor)
User name               student## or consult your instructor
Password                student## or consult your instructor
Choose OK to finish. c)
You do not have to import any metadata. Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF as a replication of Beta_Customers_Address_Cleanse_DF. 1.
If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated. a)
From the Project menu, choose New → Project.
2.
b)
When the Project New dialog box appears, enter Omega in the Project name field.
c)
Select Create so that the new project appears in the Project area.
In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF. a)
In the Project area, right-click on the Omega project to select the option New Batch Job and enter the name Beta_Customers_DataCleanse_Job.
b)
Select the Data Flow icon in the Tool Palette.
c)
Select the workspace where you want to add the data flow.
d)
Enter Beta_Customers_DataCleanse_DF as the name.
e)
Press Enter to commit the change.
f)
Double-click the Beta_Customers_DataCleanse_DF to open the workspace.
Task 3: Add and configure a Data Cleanse transform to the data flow Beta_Customers_DataCleanse_DF. 1.
In the AddressCleansed_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name. a)
In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b)
Open the node Tables and right-click on the table AddressCleansed_USA to choose the option Edit.
c)
Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d)
Choose OK to close the dialog box.
e)
Close the table metadata by selecting the Back button.
2.
3.
In the workspace for Beta_Customers_DataCleanse_DF, add the AddressCleansed_USA table from the DQSTANDALONE datastore as the source object connected to the English North America Data Cleanse transform with a target template table DataCleansed_USA in the DELTA datastore. a)
In the Local Object Library, select the Datastores tab and then select the AddressCleansed_USA table from the DQStandalone datastore.
b)
Select and drag the object to the data flow workspace and in the context menu, choose the option Make Source.
c)
On the Transforms tab of the Local Object Library, expand the Data Quality Data Cleanse branch to select and drag the English North America Data Cleanse transform to the data flow workspace.
d)
Connect the source table to the transform by selecting the source table and while holding down the mouse button, drag to the transform. Release the mouse button to create the link.
e)
In the Tool Palette, select the icon for a Query transform and then select in the data flow workspace.
f)
Connect the English North America Data Cleanse transform to the Query transform by selecting the English North America Data Cleanse transform and while holding down the mouse button, drag to the Query transform. Release the mouse button to create the link.
g)
In the Tool Palette, select the icon for a template table and then select in the data flow workspace. Enter DataCleansed_USA as the table name in the DELTA datastore. Connect the Query transform to the target table by selecting the Query transform and while holding down the mouse button, drag to the target table. Release the mouse button to create the link.
Configure the Data Cleanse transform in the data flow Beta_Customers_DataCleanse_DF. a)
Double-click the Data Cleanse transform to open the transform editor.
b)
In the transform editor, map all columns from the input schema to the output schema except for the AC* columns.
c)
Columns in the Schema In area with defined content types are automatically mapped for this transform. Verify that columns with content types are mapped as follows:
Name             Mapping
EMAIL1           Email
FIRM_LINE1       Firm
NAME_LINE1       Cust_Name
PERSON1_TITLE    Title
PHONE1           Phone
d)
The input column Title is mapped to the output column Title_Line1. Right-click on the output column Person1_Title to choose the option Remove mapping.
e)
The following columns do not have content types defined. Therefore, manually map columns as listed in the following table.
f)
Input Name     Mapping
DATE1          BirthDate
SSN1           ID
TITLE_LINE1    TITLE
On the Options tab in the parameters area, set the following options:

Option Name             Option Value
Date_Delimiter          DASH
Date_Format             YYYY_MM_DD
Filter Output Fields    SHOW_ALL_FIELDS
Press Enter on the keyboard to set these options. g)
On the Output tab, add the following output columns to the output schema, after the columns you have already mapped, in the order specified. All of these fields except SSN are found with the radio button Best practice selected for filtering. Select the checkbox for each column to be added to the output schema.

Parent Component         Generated Field Name      Generated Field Class
FIRM1                    FIRM                      STANDARDIZED
PERSON1                  PRENAME                   STANDARDIZED
PERSON1                  GIVEN_NAME1               STANDARDIZED
PERSON1                  GIVEN_NAME1_MATCH_STD1    STANDARDIZED
PERSON1                  GIVEN_NAME1_MATCH_STD2    STANDARDIZED
PERSON1                  GIVEN_NAME1_MATCH_STD3    STANDARDIZED
PERSON1                  GIVEN_NAME2               STANDARDIZED
PERSON1                  GIVEN_NAME2_MATCH_STD1    STANDARDIZED
PERSON1                  FAMILY_NAME1              STANDARDIZED
PERSON1                  HONORARY_POSTNAME         STANDARDIZED
PERSON1                  MATURITY_POSTNAME         STANDARDIZED
PERSON1                  TITLE                     STANDARDIZED
PERSON1                  PERSON                    STANDARDIZED
PERSON1                  GENDER                    STANDARDIZED
DATE1                    DATE                      STANDARDIZED
EMAIL1                   EMAIL                     STANDARDIZED
NORTH_AMERICAN_PHONE1    NORTH_AMERICAN_PHONE      STANDARDIZED
SSN1                     SSN                       STANDARDIZED

h)
Rename these columns in the output schema as follows, by right-clicking each one to select the option Properties and, after changing the name, choosing OK:

Original                                       New
FIRM1_FIRM_STANDARDIZED                        DC_FIRM
PERSON1_PRENAME_STANDARDIZED                   DC_PRENAME
PERSON1_GIVEN_NAME1_STANDARDIZED               DC_GIVEN_NAME1
PERSON1_GIVEN_NAME1_MATCH_STD1                 DC_GIVEN_NAME1_MATCH_STD1
PERSON1_GIVEN_NAME1_MATCH_STD2                 DC_GIVEN_NAME1_MATCH_STD2
PERSON1_GIVEN_NAME1_MATCH_STD3                 DC_GIVEN_NAME1_MATCH_STD3
PERSON1_GIVEN_NAME2_STANDARDIZED               DC_GIVEN_NAME2
PERSON1_GIVEN_NAME2_MATCH_STD1_STANDARDIZED    DC_GIVEN_NAME2_MATCH_STD1
PERSON1_FAMILY_NAME1_STANDARDIZED              DC_FAMILY_NAME1
PERSON1_HONORARY_POSTNAME_STANDARDIZED         DC_HONORARY_POSTNAME
PERSON1_MATURITY_POSTNAME_STANDARDIZED         DC_MATURITY_POSTNAME
PERSON1_TITLE_STANDARDIZED                     DC_TITLE
PERSON1_PERSON_STANDARDIZED                    DC_PERSON
PERSON1_GENDER_STANDARDIZED                    DC_GENDER
DATE1_DATE_STANDARDIZED                        DC_DATE
EMAIL1_EMAIL_STANDARDIZED                      DC_EMAIL
NORTH_AMERICAN_PHONE1_PHONE_STANDARDIZED       DC_NORTH_AMERICAN_PHONE
SSN1_SSN_STANDARDIZED                          DC_SSN

i)

Map all columns prefixed with AC_ from the input schema to the output schema, following the columns prefixed by DC_, by dragging from the input column to the output column.
j)
Select the Back icon to close the transform editor and return to the data flow work space.
Task 4: Because of the additional transform, remap the fields from the input schema to the output schema in the Query transform. 1.
In the Query transform, remap the fields from the input schema to the output schema. a)
In the data flow work space, double-click the Query transform to open the transform editor.
b)
In the transform editor for the Query transform, remove all columns in the output schema, except the AC_INFO_DESC column created for the address code lookup. Select all relevant output columns. Then right-click to select the option Delete.
c)
Map all of the columns from the input schema to the output schema and insert them above the AC_INFO_DESC column by dragging from the input column to the appropriate output column. Hint: You can select several input columns at one time, either by holding down the CTRL key on the keyboard, or by selecting the first input column you want to move, holding down the SHIFT key, and selecting the last input column you want to move.
2.
Use the Functions wizard to create a new lookup expression using the lookup_ext function. a)
In the Mapping tab for the output column, AC_INFO_DESC, select the Function button and in the Select Function dialog box, open the category of “Database Functions”.
b)
From the list of function names, select the lookup_ext function and select the Next button.
c)
In the Lookup_ext - Select Parameters dialog box, enter the parameters:

Field/Option                         Value
Lookup table                         BETA.SOURCE.ADDRCODES
Condition: Column in lookup table    Info_Code
Condition: Op.(&)                    =
Condition: Expression                EnglishNorthAmerica_DataCleanse.AC_INFO_CODE
Output: Column in lookup table       Info_Desc
This code is generated:

lookup_ext([BETA.SOURCE.ADDRCODES, 'PRE_LOAD_CACHE', 'MAX'], [Info_Desc], [NULL], [Info_Code, '=', EnglishNorthAmerica_DataCleanse.AC_INFO_CODE]) SET ("run_as_separate_process"='no')

d)
Select the Back icon to close the editor.
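Conceptually, the generated lookup_ext call pre-loads the lookup table into a cache and returns the description whose Info_Code equals the record's AC_INFO_CODE. A minimal Python sketch of the same idea follows; the codes and descriptions used are hypothetical sample values, not the actual ADDRCODES contents.

# Conceptual sketch of the lookup_ext call above.
ADDRCODES = [("1000", "Address corrected"), ("3010", "Postcode corrected")]

cache = {code: desc for code, desc in ADDRCODES}   # the 'PRE_LOAD_CACHE' step

def lookup_info_desc(ac_info_code, default=None):
    # Condition: Info_Code = AC_INFO_CODE; output column: Info_Desc
    return cache.get(ac_info_code, default)

print(lookup_info_desc("3010"))   # Postcode corrected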
Task 5: Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters. a)
In the Project area, right-click the batch job Beta_Customers_DataCleanse_Job to select the option Execute. Choose OK to save all objects you have created.
b)
Choose OK to accept the default execution parameters.
c)
In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
Task 6: In the Beta_Customers_DataCleanse2_Job, modify the Query transform to add a description field DC_GENDER_DESC to further clarify the DC_GENDER_ID field. 1.
Modify the Query transform in the Beta_Customers_DataCleanse2_DF workspace with the addition of the DC_GENDER_DESC field. a)
Double-click the Beta_Customers_DataCleanse2_DF to open the workspace.
b)
Double-click the Query transform to open the transform editor.
c)
In the output schema, find and right-click the field DC_GENDER to select the option New Output Column and then the option Insert Below. Call the new field DC_GENDER_DESC, with a datatype of varchar and a length of 1. Choose OK to create the output column. This column translates the DC_GENDER_ID field values into a single-character representation so they can be sorted.
d)
In the Mapping section for the new field, select the Functions button and select the function search_replace under the String functions. Then choose Next.
e)
In the next dialog box, select the Full string radio button and select the value DC_GENDER_ID from the drop-down list as the value of the field Input expression.
f)
g)
Then, in the Search-replace table, enter the following values for the Search value and Replace value:

Search value    Replace value
0               U
1               M
2               M
3               A
4               F
5               F
Leave the Default replace value blank. Click Finish to close the function editor.
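The same full-string search and replace is easy to express in code. A minimal Python sketch of the mapping configured above:

# Mirrors the search_replace configuration: a full-string replacement of
# the DC_GENDER_ID value with a single-character gender designation.
GENDER_DESC = {"0": "U", "1": "M", "2": "M", "3": "A", "4": "F", "5": "F"}

def dc_gender_desc(dc_gender_id: str) -> str:
    # Default replace value is blank, as in the exercise.
    return GENDER_DESC.get(dc_gender_id, "")

for gender_id in ["1", "4", "9"]:
    print(gender_id, "->", repr(dc_gender_desc(gender_id)))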
Task 7: Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters. a)
In the Project area, right-click the batch job Beta_Customers_DataCleanse2_Job to select the option Execute. Choose OK to save all objects you have created.
b)
Choose OK to accept the default execution parameters.
c)
In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data. Check the DC_GENDER_DESC field to see the designations.
Figure 35: Data Cleanse Results
Lesson Summary
You should now be able to:
• Complete a Data Cleanse transform
Unit Summary
You should now be able to:
• Complete a Data Cleanse transform
Unit 5: Matching and Consolidating Data

Unit Overview
This unit describes the matching capabilities of Data Services. After completing this unit, you will be able to:
• Understand match concepts
• Perform post-match processing
• Tailor the match process to your data
Unit Objectives
After completing this unit, you will be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
• Use consumer and corporate householding
• Use associative matching
Unit Contents
Lesson: Using the Match Wizard
  Exercise 4: Use the Match Wizard
Lesson: Configuring the Match Transform
  Exercise 5: Modify Match Criteria using the Match Editor
Lesson: Performing Post-Match Processing
  Exercise 6: Prioritize Matched Records
Lesson: Consolidating Matching Records
  Exercise 7: Create a Best Record
Lesson: Using Advanced Match Strategies
Lesson: Using the Match Wizard
Lesson Overview
Introduction
The process of identifying matching records in Data Services is highly customizable and can accommodate nearly any set of complex business rules.
Lesson Objectives
After completing this lesson, you will be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
Business Example
Realizing that it is easy to create duplicate records, you want to determine whether the records being used as the single version of the truth for customers contain duplicates. You do not want to read through each record individually, so you are looking for a programmatic way to determine whether duplicates exist.
Understanding the Purpose of Matching

The goal of matching is to determine whether records refer to the same entity. This involves evaluating how well the individual fields, or attributes, of records match each other.
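As a concrete illustration (not the Match transform's actual scoring model), the following Python sketch compares two records attribute by attribute using a simple similarity ratio:

# Minimal sketch: compare two records field by field and report a
# similarity per attribute.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

rec1 = {"first_name": "Margaret", "last_name": "Smith-Kline", "postcode": "10013"}
rec2 = {"first_name": "Maggie",   "last_name": "Smith Kline", "postcode": "10013"}

for attr in rec1:
    score = similarity(rec1[attr], rec2[attr])
    print(f"{attr}: {score:.0%}")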
Figure 36: Data Quality Framework – Matching and Consolidate
Once matches have been identified, data from these match groups can be salvaged and posted to form a single best record or posted to all matching records to update them.
Figure 37: Basic Concepts – Matching and Consolidation
Input records:
• Ms Margaret Smith-Kline Ph.D., Future Electronics, 101 Avenue of the Americas, New York NY 10013–1933, [email protected], May 23, 2003, 001–12–4367
• Maggie Smith, Future Electronics Co. LLC, 101 6th Ave., Manhattan, NY 10012, [email protected], 001–12–4367, (222) 922–9922, 5/23/03
• Ms. Peg Kline, Future Elect. Co., 101 6th Ave., New York NY 10013, 001–12–4367, (222) 922–9922, 5/23/03

Consolidated record:
Name: Ms. Margaret Smith-Kline Ph.D.
Company name: Future Electronics Co. LLC
Purchase date: 5/23/2003
Address: 101 Avenue of the Americas, New York, NY 10013–1933
SSN: 001–12–4367
Latitude: 40.722970
Longitude: –74.005035
Fed code: 36061
Phone: (222) 922–9922
Email: [email protected]
Using Other Transforms to Support Matching

There are several transforms that you should use before or after the Match transform to optimize the matching process.
Figure 38: Data Flow with Matching
• Case: Use the Case transform to route data to a particular Match transform, such as sending USA-specific and international data to different transforms. You can also use this transform to route blank records around a Match transform.

• Address Cleanse: Use the Address Cleanse transform to ensure all address data is accurate prior to matching.

• Data Cleanse: Use the Data Cleanse transform to ensure name, firm, title, phone, date, SSN, and e-mail data is parsed and standardized to improve the likelihood of identifying matches. You may also want to include match standards of name or firm data; utilizing name match standards generated from the Data Cleanse transform can help to improve match results.

• Query: Use the Query transform to create additional fields, perform functions, and order data. You may also want to filter out empty records to assist in performance and protect against false matches. Performing search and replace functions can also assist in reducing noise words from matching, such as Inc., Corp., and Ltd.

• Merge: Use the Merge transform to combine data output from multiple Match transforms. This is commonly used to reunite data sets that were split before matching for purposes of routing empty records. It is also used to reunite records from multiple Match transforms so the resulting match groups can be associated.
Understanding the Match Wizard

The easiest way to create a Match transform is to use the Match Wizard. The Match Wizard provides intuitive screens that walk you through the process, based on the strategy you choose.
Figure 39: Develop a Matching Strategy
You can quickly set up match strategies without manually configuring each individual setting required to complete the task. The Match Wizard:
• Builds the necessary transforms to perform the match strategy you choose.
• Applies default values to your match criteria based on the strategy you choose.
• Places the resulting transform on the workspace, connected to the upstream transform you choose.
• Detects the appropriate upstream fields and maps to them automatically.
• Automatically generates report statistics and common output fields.
The Match Wizard provides you with a basic match setup that, in some cases, may require customization to meet the goals of your project. The Match Wizard does not:
• Alter any data that flows through it. To correct non-standard entries or missing data, place the Address Cleanse and Data Cleanse transforms upstream from the match process.
• Connect the generated Match transform to a target.
• Directly allow you to set rule-based or weighted-scoring values for matching. The Match Wizard incorporates a "best practices" standard. You may want to edit option values to conform to your business rules. This can be done by editing the options in the Match transform that is generated by the Match Wizard.
• Directly create prioritization, best record, or candidate selection. These options are set by editing the options in the generated Match transform.
© 2011 SAP AG. All rights reserved.
135
Unit 5: Matching and Consolidating Data
BODS30
1. Select the appropriate match strategy based on the types of information you want to compare. 2. Specify how many match sets to define the number of ways that you would like to identify a match. 3. Define your match levels, if appropriate, to indicate the categories of matching involved. 4. Define your match criteria by selecting the individual columns in your data that will be used for matching, based on your business rules. 5. Define the break keys that will be used to group records that are likely to match by selecting the upstream field and setting the start position and length.
Figure 40: Setting Up Matching
Deciding on a Match Strategy The Match Wizard allows you to complete a Match transform using default strategies with a few clicks. These strategies contain best practice settings for a handful of potential matching situations. For example, do you want to perform a simple match on name and address, or would you prefer to use a householding method to determine residence, family, and individual match levels? Once you have defined your requirements, select the matching strategy that best meets those business rules. The Match Wizard supports these common matching strategies:
136
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Using the Match Wizard
• Simple match: Use this strategy when your matching business rules consist of a single match criterion for identifying relationships in consumer, business, or product data.
• Consumer householding: Use this strategy when your matching business rules consist of multiple levels of consumer relationships, such as residential matches, family matches, and individual matches.
• Corporate householding: Use this strategy when your matching business rules consist of multiple levels of corporate relationships, such as corporate matches, subsidiary matches, and contact matches.
• Multi-national consumer: Use this strategy when your data consists of multiple countries and your matching business rules are different for different countries.
• Identify a person multiple ways: Use this strategy when your matching business rules consist of multiple match criteria for identifying relationships, and you want to find the overlap between all of those definitions.
Defining Match Components

The basic components of matching are:
• Match sets
• Match levels
• Match criteria
• Break groups
Match sets

Match sets let you control how the Match transform matches certain records, segregates records, and matches on records independently. For example, you could choose to match USA records differently than records containing international data.
Each match set can have its own break groups, match levels, match criteria, prioritization, and consolidation. A match set:
• Allows only select data into a given set of match criteria for possible comparison (for example, excluding blank SSNs, international addresses, and so on).
• Permits related match scenarios to be stacked to create a multi-level match set.
• Prepares data so that multiple match sets can be later associated using the Associate transform.
Match levels

A match set contains one or more match levels. A match level is an indicator of what type of matching will occur, such as on individual, family, resident, firm, and so on. A match level refers not to specific criteria, but to the broad category of matching. You can have as many match levels as you want. However, the Match Wizard restricts you to three levels during setup (more can be added later). Each match level in a match set can be defined in a way that is increasingly more restrictive. Multi-level matching feeds only the records that match from match level to match level (for example, resident, family, individual) for comparison. A single match level is defined automatically for the Simple Match strategy. The example of consumer householding shown below depicts residence, family, and individual levels.
Figure 41: Hierarchical Levels of Matching
• Resident: Determines whether two records should be considered members of the same residence. The Match transform compares the address data. A match means that the two records represent members of the same household. Contrast this match type with the family match type, which also compares last name data.
• Family: Determines whether two people should be considered members of the same family. The Match transform compares the last name and the address data. A match means that the two records represent members of the same family.
• Individual: Determines whether two records are for the same person. The Match transform compares the first name, last name, and address data. A match means that the two records represent the same person.
• Firm: Determines whether two records reflect the same firm. This match type involves comparisons of firm and address data. A match means that the two records represent the same firm.
• Firm-Individual: Determines whether two records are for the same person at the same firm. This match type compares the first name, last name, firm name, and address data. A match means that the two records reflect the same person at the same firm.
Alternate forms of match level criteria can also be defined. For example:
• Family-level match criteria based on family (last) name and telephone number.
• Individual-level match criteria based on Social Security Number (SSN) or identification number and full name, or on full name and account number.
• Firm-level match criteria based on firm name and Standard Industrial Classification (SIC) Code, or on firm name and Data Universal Numbering System (DUNS) number.
Match criteria

Match criteria refers to the fields you want to match on. You can use criteria options to specify business rules for matching on each of these fields. Match criteria allow you to control how exact data must be for the data to be considered a match.
For example, you may require first names to be at least 85% similar, but also allow a first name initial to match a spelled-out first name, and allow a first name to match a middle name. A minimal sketch of such rules follows.
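The Python below illustrates the idea of such first-name criteria. It is an illustration only, not the Match transform's actual comparison logic, and the 85% threshold is taken from the example above.

# Minimal sketch: first names count as matching when they are at least
# 85% similar, or when one is an initial of the other.
from difflib import SequenceMatcher

def first_names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    if len(a) == 1 or len(b) == 1:               # allow initial vs. full name
        return a[0] == b[0]
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(first_names_match("Margaret", "Margarete"))  # True, small spelling variation
print(first_names_match("M.", "Margaret"))         # True, initial matches full name
print(first_names_match("Margaret", "Robert"))     # False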
Break groups

Figure 42: Break Group Benefits and Considerations
Controlling the number of record comparisons in the matching process is important for performance. Break groups limit the number of comparisons performed during the matching process, because matching is only considered within break groups, not between them. Break groups are established by defining criteria called a break key. Defining an appropriate break key can save valuable processing time by preventing widely divergent data from being compared. Break keys should group records that would most likely contain matches. Fields commonly used for creating break groups are postcodes, account or identification numbers, or the first two positions of a street name. For example, when matching on address data, it is common to use the first three digits of a postcode as the break key. Thus, only records that have the same first three digits of a postcode become members of a break group. In the diagram below, with a break key of the first three digits of the postal code, records in A1 would be compared to records in A2, but never to records in B1 or B2.
Figure 43: Setting Up Break Keys
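The grouping the figure illustrates can be sketched in a few lines of Python. This is a conceptual illustration with made-up records, not the Match transform itself.

# Minimal sketch of break groups: records are grouped by a break key
# (here the first three digits of the postcode) and comparisons happen
# only within a group, never across groups.
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "name": "M. Smith", "postcode": "10013"},
    {"id": 2, "name": "Maggie Smith", "postcode": "10012"},
    {"id": 3, "name": "P. Kline", "postcode": "54601"},
]

groups = defaultdict(list)
for rec in records:
    groups[rec["postcode"][:3]].append(rec)   # break key: first 3 digits

for key, members in groups.items():
    for a, b in combinations(members, 2):     # compare only within the group
        print(f"Break group {key}: compare record {a['id']} with record {b['id']}")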
Once you see how break keys control the number of records that are matched, it is easier to follow the match process.
Figure 44: Break Keys and Matching
To set up matching using the Match Wizard:

1. In your data flow, right-click the transform that you would like to add the Match transform after, and select Run Match Wizard.
2. Select the required match selection type.
3. Set the Match Set according to your business rules.
4. Set the criteria for Match Set Name according to your business rules.
5. Define the break keys by selecting the upstream field and setting the start position and length.
6. Connect the Match transform to the target table.
In your data flow, right-click the transform you would like to the Match transform after, and select Run Match Wizard. Select the required match selection type. Set the Match Set according to your business rules. Set the criteria for Match Set Name according to your business rules. Define the break keys by selecting the upstream field, setting the start position, and length. Connect the Match transform to the target table.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Using the Match Wizard
Exercise 4: Use the Match Wizard

Exercise Objectives
After completing this exercise, you will be able to:
• Complete a simple match for full names on the Beta Businesses USA_Customers table
Business Example
You want to determine if there are duplicate records in the Beta Businesses USA_Customers table. You decide to match according to name and address.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases. 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: You want to determine if there are duplicate records in the Beta Businesses USA_Customers table. You decide to match according to name and address by setting up a batch job Beta_Customers_Match_Job containing the data flow Beta_Customers_Match_DF by using the Match Wizard. 1.
If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
In the Omega project, create a new batch job Beta_Customers_Match_Job with a new data flow called Beta_Customers_Match_DF.
Task 3: In the data flow Beta_Customers_Match_DF, configure the Match Wizard in the Query transform. 1.
In the DataCleansed2_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2.
In the workspace for Beta_Customers_Match_DF, add the DataCleansed2_USA table from the DQSTANDALONE datastore as the source object connected to the Query transform with a target template table Match_Target_USA in the DELTA datastore.
Task 4: Use the Match Wizard in the Query transform to set up a Simple match. 1.
Transfer all input columns from the source table to the output columns in the Query transform.
2.
Configure the Match Wizard in the Query transform.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target. 1.
Configure a Match_NameAddr transform to the data flow.
2.
Configure a new target table for the data flow.
Task 6: Execute the batch job Beta_Customers_Match_Job with the default parameters. 1.
Execute the batch job Beta_Customers_Match_Job with the default parameters.
Solution 4: Use the Match Wizard Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases. 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
a)
From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b)
In the dialog box, enter your assigned User ID.
c)
Enter your password, which is the same as your User ID.
d)
Choose the Log on button.
e)
In the list of repositories, select your repository DSREPO## where ## is the number portion of your User ID.
f)
Choose OK.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DQSTANDALONE
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DQSTANDALONE
User name               sourceuser (lower case)
Password                sourcepass (lower case)
Choose OK to finish. c)
In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed on the Metadata window on the right.
d)
Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, then right-clicking on them and choosing the option Import from the menu:
• AddressCleansed_USA
• DataCleansed_USA
• DataCleansed2_USA
• Match_Target
• Match_Target2
• Match_Target3
• Match_Target4

Note: This database is a read-only database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DELTA
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DELTA## (where ## is the number assigned by your Instructor)
User name               student## or consult your instructor
Password                student## or consult your instructor
Click OK to finish. c)
You do not have to import any metadata. Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: You want to determine if there are duplicate records in the Beta Businesses USA_Customers table. You decide to match according to name and address by setting up a batch job Beta_Customers_Match_Job containing the data flow Beta_Customers_Match_DF, using the Match Wizard.

1. If the Omega project already exists, continue to the next step. If it does not exist, create a new project called Omega.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
a) From the Project menu, choose New → Project.
b) When the Project New dialog box appears, enter Omega in the Project name field.
c) Choose Create so that the new project appears in the Project area.

2. In the Omega project, create a new batch job Beta_Customers_Match_Job with a new data flow called Beta_Customers_Match_DF.
a) In the Project area, right-click the project name and choose New Batch Job from the menu.
b) Enter the name of the job as Beta_Customers_Match_Job.
c) Press Enter to commit the change.
d) Select the Data Flow icon in the Tool Palette.
e) Right-click in the data flow workspace and choose Add New → Dataflow.
f) Enter Beta_Customers_Match_DF as the name.
g) Press Enter to commit the change.
h) Double-click the Beta_Customers_Match_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match_DF, configure the Match Wizard in the Query transform.

1. In the DataCleansed2_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b) Open the Tables node and right-click the table DataCleansed2_USA to choose the option Edit.
c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d) Choose OK to close the dialog box.
e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match_DF, add the DataCleansed2_USA table from the DQSTANDALONE datastore as the source object connected to the Query transform, with a target template table Match_Target in the DELTA datastore.
a) In the Local Object Library, select the Datastores tab and then select the DataCleansed2_USA table from the DQSTANDALONE datastore.
b) Select and drag the object to the data flow workspace and, in the context menu, choose the option Make Source.
c) In the Tool Palette, select the icon for the Query transform and click in the data flow workspace to add the Query transform.
d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
Task 4: Use the Match Wizard in the Query transform to set up a Simple match.

1. Transfer all input columns from the source table to the output columns in the Query transform.
a) Double-click the Query transform to open the transform editor.
b) In the transform editor, map all columns from the input schema to the output schema columns: select all the fields listed in the Schema In window and drag them to the Query object in the Schema Out window.
c) Select the Back icon to close the Query transform editor.

2. Configure the Match Wizard in the Query transform.
a) In the data flow Beta_Customers_Match_DF workspace, right-click the Query transform and select the option Run Match Wizard.
b) On the Match Selection Type window, select the option for a Simple match selection with only one way to identify an individual. Then choose Next.
c) Set the name for Match Set Name to NameAddr.
d) Specify that the first criterion is Full Name by selecting the field Full Name from the drop-down list for the field Match On. Use the value Word similarity for the field Compare Using. Choose Enter.
e) Enter the second criterion as Address and use the value Word similarity for the field Compare Using. Select Enter. Then choose the Next button.
f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as shown below.

   Criteria                          Field
   PERSON1_FAMILY_NAME1              DC_FAMILY_NAME
   PERSON1_GIVEN_NAME1               DC_GIVEN_NAME1
   PERSON1_GIVEN_NAME1_MATCH_STD1    DC_GIVEN_NAME1_MATCH_STD1
   PERSON1_GIVEN_NAME1_MATCH_STD2    DC_GIVEN_NAME1_MATCH_STD2
   PERSON1_GIVEN_NAME1_MATCH_STD3    DC_GIVEN_NAME1_MATCH_STD3
   PERSON1_GIVEN_NAME2               DC_GIVEN_NAME2
   PERSON1_GIVEN_NAME2_MATCH_STD1    DC_GIVEN_NAME2_MATCH_STD1
   ADDRESS_PRIMARY_NAME              AC_PRIM_NAME
   ADDRESS_PRIMARY_NUMBER            AC_PRIM_NUM
   ADDRESS_SECONDARY_NUMBER          AC_UNIT_NUM
Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If one of these fields is not mapped, map it by selecting the Field entry and choosing the field from the drop-down list. Then select the Next button.
g) On the next window, define the break key by selecting AC_Postcode_Full as the Upstream field from the drop-down list. Set the start position to 1 and the length to 5.
   Hint: You may need to click the More... entry in the drop-down list to find this field.
h) Choose Finish and close the Match Wizard configuration to return to the data flow workspace.
Task 5: Configure a Match_NameAddr transform in the data flow, connecting it to a new target template table Match_Target.

1. Configure a Match_NameAddr transform in the data flow.
a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
b) Double-click the Match_NameAddr transform to open the transform editor.
c) Rename the newly generated match columns as follows:

   Original Column Name                       New Column Name
   NameAddr_Level1_GROUP_NUMBER               MC_GROUP_NUMBER
   NameAddr_Level1_MATCH_SCORE                MC_MATCH_SCORE
   NameAddr_Level1_GroupStats1_GROUP_COUNT    MC_GROUP_COUNT
   NameAddr_Level1_GroupStats1_GROUP_RANK     MC_GROUP_RANK

d) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.
   Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
2. Configure a new target table for the data flow.
a) From the Tool Palette, select the icon for a template table and click in the data flow workspace to insert the new target template table. Enter Match_Target for the name and select the DELTA datastore from the drop-down list. Choose OK to create the template table.
b) Connect the Match_NameAddr transform to the target template table by selecting the Match_NameAddr transform and, while holding down the mouse button, dragging to the target template table; release the mouse button to create the link.
Task 6: Execute the batch job Beta_Customers_Match_Job with the default parameters.

1. Execute the batch job Beta_Customers_Match_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match_Job and select the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data. Notice that there are 9 match groups. Matched records for Gloria Peterson appear to be valid matches because they share the same name, address, and SSN values. However, matched records for A G Olson and Andrew Olson have different social security numbers and are therefore invalid matches. It is apparent that the match criteria need adjustment to include matching on SSN.
Lesson Summary
You should now be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
Lesson: Configuring the Match Transform

Lesson Overview
Once the basic configuration of your Match transform has been completed using the Match Wizard, you can further refine the process using the Match Editor.
Lesson Objectives
After completing this lesson, you will be able to:
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
Business Example
While matching can take place on the parsed components of data records, there are many variant forms, spellings, abbreviations, and blank values that can get in the way of determining whether two records are really a match. You want to learn how to configure the Match transform to more accurately determine whether records match by setting certain options in the configuration of match criteria. You need to understand how to control the number of record comparisons. You also want to know how to determine the best possible matching method: rule-based, weighted-scoring, or combination matching.
Understanding the Match Editor
Although the Match Wizard does a lot of the work for matching, there are some things that you must configure yourself. The basic match setup created by the Match Wizard may require customization to meet the goals of your project. Using the Match Editor, you can edit the options in the transform generated by the Match Wizard, or modify and manually configure Match transforms from scratch. This enables you to set up and modify match strategies in a single screen.
Figure 45: Data Services Match Capabilities
In the Match Editor, you can make adjustments to all aspects of the matching process, including transform-level options, group forming operations, match criteria, and post-match processing.
Figure 46: Match Functionality Highlights
The Match Editor:
• Enables you to set rule-based or weighted-scoring values for matching.
• Provides access to prioritization, best record, and candidate selection operations.
Explaining the Match Transform
The Match transform performs matching based on the business rules you define. The transform sends matching and unique records on to the next transform in the data flow. You can edit the transform-level options for the Match transform to specify the reporting settings and how the process is run.
Controlling Record Comparisons
When configuring the Match transform, you can use the group forming operations to control the number of match comparisons for a more efficient comparison process. Group forming options include the following:

Break groups
Use fields to group together records that have a better chance of matching, to reduce the number of comparisons the Match transform needs to make. Records are compared only within break groups.

Candidate selection
Use SQL to extract a subset of records from the database that are needed for comparison. This is primarily used to speed processing in a real-time match job.

Group prioritization
Use priorities and penalties to order records in data collections before you compare them. This helps to ensure that your best records are driving the comparison process.
To create break keys:
1. In the Match Editor, add a Break Groups operation to the Group Forming option group.
2. In the Break key table, add a row.
3. Select a field in the field column that you want to use as a break key.
4. Select the start position and length (number of characters) you want used in the break key. You can use negative numbers to signify that you want to count a given number of characters from the end of the field.
5. Add more rows and fields as necessary.
6. Order your rows by selecting a row and clicking the Move Up and Move Down buttons. Ordering your rows ensures that the fields are used in the right order in the break key.
7. Choose OK.
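To make the effect of break groups concrete, here is a minimal Python sketch; it is not Data Services code, and the record layout and the 5-character postcode break key are illustrative assumptions only.

from collections import defaultdict

def build_break_groups(records, field="postcode", start=1, length=5):
    # Group records on a substring of one field; only records that share
    # a break key are ever compared with each other.
    groups = defaultdict(list)
    for rec in records:
        key = str(rec.get(field, ""))[start - 1:start - 1 + length]
        groups[key].append(rec)
    return groups

records = [
    {"name": "Gloria Peterson", "postcode": "54601-4023"},
    {"name": "G Peterson", "postcode": "54601"},
    {"name": "Andrew Olson", "postcode": "10027"},
]
for key, members in sorted(build_break_groups(records).items()):
    print(key, [r["name"] for r in members])

Instead of comparing every record with every other record, comparisons are made only inside each group, which is what makes breaking such an effective way to reduce the comparison workload.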
Understanding Matching Methods
Matching methods use a similarity score to define how well two records match. A similarity score is a percentage that represents how closely two records resemble each other, based on your business rules. This score is calculated internally by the application when records are compared.

Whether the application considers the records a match depends on the match and no match scores you define in your match criteria. If the similarity score is lower than or equal to the no match score, the two records definitely do not match. If the score is higher than the no match score but lower than the match score, the result of the comparison is inconclusive. If the score is higher than or equal to the match score, the two records are considered a match.

The following is a simple example of how similarity scores are determined. The last names are compared for length and letter by letter in each position to determine the similarity score. Here, only the first comparison is considered a match, because it is the only comparison where the similarity score meets or exceeds the match score.
Figure 47: Simple Matching
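As a plain illustration of the thresholds described above, the following Python sketch classifies a single criterion comparison; the threshold values are assumptions, not product defaults.

def classify(similarity, match_score=95, no_match_score=74):
    # Return "match", "no match", or "inconclusive" for one criterion.
    if similarity <= no_match_score:
        return "no match"
    if similarity >= match_score:
        return "match"
    return "inconclusive"  # defer to the next criterion or weighted scoring

for sim in (100, 95, 80, 60):
    print(sim, classify(sim))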
There are three matching methods to consider when setting up your match criteria in the Match transform:
• Rule-based matching
• Weighted-scoring matching
• Combination matching
Rule-Based Matching
With rule-based matching, you rely solely on your match and no match scores to determine matches within a criterion. The following example shows how to set up this method in the Match transform:
Figure 48: Rule-based Matching
By entering a value of 101 in the match score for every criterion except the last, the First Name and Last Name criteria can never determine a match, because two fields cannot be more than 100 percent alike. By setting the match and no match scores for the E-mail criterion with a one-point difference, any comparison that reaches this last criterion must either be a match or a no match.
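A hedged Python sketch of this cascade follows; the criterion names and scores only loosely mirror the example in Figure 48 and are assumptions rather than the exact transform configuration.

RULES = [
    # (criterion, match score, no match score)
    ("first_name", 101, 74),
    ("last_name", 101, 74),
    ("email", 80, 79),  # one-point gap: this criterion always decides
]

def rule_based(similarities):
    for criterion, match_score, no_match_score in RULES:
        sim = similarities[criterion]
        if sim <= no_match_score:
            return "no match"
        if sim >= match_score:  # unreachable while the match score is 101
            return "match"
    return "inconclusive"

print(rule_based({"first_name": 90, "last_name": 88, "email": 85}))  # match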
Weighted-scoring matching
In weighted-scoring matching, you assign different weights to individual criteria by specifying a contribution value. The higher the value, the more weight that criterion carries in determining matches. Fields that are more likely to determine a match should be assigned more weight. For example, an SSN or account number may be assigned a higher weighted value than an e-mail address. The contribution values across all criteria must total 100.

The Match transform generates the contribution score for each criterion by multiplying the contribution value by the similarity score. These individual contribution scores are then added to get the total contribution score. Matches are determined by comparing the total contribution score with the weighted match score. If the total contribution score is equal to or greater than the weighted match score, the records are considered a match. If the total contribution score is less than the weighted match score, the records are considered a no match. You can set the weighted match score in the Weighted Match Score option of the Match Level Editor.

When you set up weighted scoring, the no match score must be set to -1 and the match score must be set to 101. These values ensure that no individual criterion can determine either a match or a no match on its own. In this example, the contribution value for the E-mail criterion gives it the most importance.
Figure 49: Weighted-scoring Matching
Note: The contribution score = actual similarity score × contribution value.
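The arithmetic can be sketched in Python as follows; the criteria, contribution values, and weighted match score are assumptions for illustration.

WEIGHTS = {"ssn": 50, "last_name": 30, "email": 20}  # must total 100
WEIGHTED_MATCH_SCORE = 90

def total_contribution(similarities):
    # contribution score = similarity score x contribution value
    return sum(similarities[c] * w for c, w in WEIGHTS.items()) / 100

sims = {"ssn": 100, "last_name": 90, "email": 75}
total = total_contribution(sims)  # 50 + 27 + 15 = 92.0
print(total, "match" if total >= WEIGHTED_MATCH_SCORE else "no match")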
Combination matching
In combination matching, the rule-based and weighted-scoring methods are combined. A no match can be determined by the similarity score of any criterion being equal to or lower than its no match score. However, a match determination can be made only when a no match has not been determined and the total contribution score is greater than or equal to the weighted match score. A match can never be based on any individual criterion, only on the weighted score of all of them, provided that a no match determination has not been made.
Figure 50: Combination Matching
The contribution score = actual similarity score × contribution value.
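A hedged Python sketch of combination matching follows; the criterion names, no match scores, contributions, and weighted match score are illustrative assumptions.

CRITERIA = {
    # criterion: (no match score, contribution to weighted score)
    "ssn": (99, 50),
    "last_name": (74, 30),
    "email": (74, 20),
}
WEIGHTED_MATCH_SCORE = 90

def combination_match(similarities):
    total = 0.0
    for criterion, (no_match, contribution) in CRITERIA.items():
        sim = similarities[criterion]
        if sim <= no_match:  # any single criterion can veto the pair
            return "no match"
        total += sim * contribution / 100
    return "match" if total >= WEIGHTED_MATCH_SCORE else "no match"

print(combination_match({"ssn": 100, "last_name": 90, "email": 80}))  # match
print(combination_match({"ssn": 60, "last_name": 99, "email": 99}))   # no match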
Using Match Levels
Match levels are used to increase the granularity of criteria. Only matching records from the previous level are passed to a subsequent level. There are several person options to configure for each level. These person options dictate how name data is treated for all columns that relate to an individual. The following person options are available:

Number of names that must match
With this option, you can control how matching is performed on match keys with more than one name (for example, comparing "John and Mary Smith" to "Dave and Mary Smith"). Choose whether only one or both persons must match for the records to match.

Match on hyphenated family name
With this option, you can control how matching is performed if a family name (last name) field contains a hyphenated family name (for example, comparing "Smith-Jones" to "Jones"). Choose whether one or both of the names must match.

Compare Given_Name1 to Given_Name2
With this option, you can compare a first name to a middle name. This option is useful for identifying matches of individuals who use their middle name rather than their first name.

   Record #   First name   Middle name   Last name   Address
   170        Leo          Thomas        Smith       225 Pushbutton Drive
   198        Tom                        Smith       225 Pushbutton Drive
Ignore family name when female
With this option, you can control whether the family name is used for matching purposes when the gender for the record is female. This technique assists in dealing with situations where an individual's married name differs from her maiden name.
Using Compare Tables
Compare tables are sets of rules that define which records to compare, acting as an additional way to create break groups. You use your logical source values to determine which records are compared and which are not. By using compare tables, you can compare records within sources, across sources, or a combination of both. Compare tables are configured when defining match levels.
Defining Match Criteria
Match criteria contain the set of rules used to determine matches. Each rule is based on a match key, which is the portion of a field used for record comparison. For example, match criteria may include various standardized firm, name, address, phone, e-mail, and SSN fields as match keys. A standardized firm may be contained within a single field, whereas a given name may incorporate several match standards. These fields must be mapped as input to the Match transform; they are logically grouped and listed in the Match Editor for ease of configuration.
Match keys must be selected for each match level in a match set. They are processed sequentially for rule-based and combination matching, so it is important that you organize your match criteria keys in the order in which you want them to be evaluated. Organizing your criteria randomly may produce unacceptable results.

For each match criterion, you must configure the match, no match, and weighted score options that are appropriate for the match method you use. Other criteria options for each match key enable you to apply business rules for determining how close to exact the data needs to be for records to be considered a match, and how to deal with special data. For example, you may require first names to be at least 85% similar, but also allow a first name initial to match a spelled-out first name, and allow a first name to match a middle name.

There are two types of criteria:
• Predefined criteria are available for fields typically used in matching, such as name, address, firm, phone, identification number, e-mail, and title. Predefined criteria can be modified.
• Custom criteria must be configured for other fields, such as part numbers and proprietary data, for which predefined criteria do not exist.

Match criteria can be adjusted using individual options:

Criteria name
Type a descriptive name for this criterion.

Match score
Specifies the minimum similarity score needed for records to be considered a match based on this criterion. Type a value from 0 to 101. A value of 101 ensures that this criterion alone is not enough to consider two records a match and that you want to consider other criteria in the comparison process. For example, a value of 90 means that you consider this data important enough that if the data in two records is 90% similar or higher, the records are considered a match.

No match score
Specifies the maximum similarity score needed for records to be considered a no match based on this criterion. Type a value from -1 to 100. A value of -1 ensures that this criterion alone is not enough to consider two records a no match and that you want to consider other criteria in the comparison process. For example, a value of 49 means that if the similarity between the data in two records is less than 50%, the records do not match.

Contribution to weighted score
Specifies the contribution value when you use the weighted or combination scoring methods. If no single criterion decides a match or no match, the contribution score is calculated by multiplying the similarity score by the contribution value. Enter a value between 0 and 100.

One field blank operation
Specifies whether to use this criterion if one record's field is populated and the other record's field is blank.
Eval: The value entered in the One field blank score option is used as the similarity score for this criterion.
Ignore: This criterion is ignored in the comparison process, and its contribution to the weighted score is proportionally distributed among the remaining criteria. This negates any impact its contribution score may have had.
Be sure to enter a value in the One field blank score option if you set this option to Eval.

One field blank score
Specifies the similarity score to use if one of the fields is blank and the One field blank operation is set to Eval. Type a value from 0 to 100.
Both fields blank operation
Specifies whether to use this criterion when both records' fields for this match key criterion are blank.
Eval: The value entered in the Both fields blank score option is used as the similarity score for this criterion.
Ignore: This criterion is ignored in the comparison process, and its contribution to the weighted score is proportionally distributed among the remaining criteria. This negates any impact its contribution score may have had.

Both fields blank score
Specifies the similarity score to use if both fields are blank and the Both fields blank operation is set to Eval. Enter a value from 1 to 100, depending on the matching method you use.

Compare algorithm
Specifies how to handle fields where more than one word commonly exists.
Field: The transform compares the entire field's data as a single word. This algorithm is more efficient and should be used on fields that typically contain just one word, such as a first name field.
Word: The transform first parses the data into words and then compares the words. This algorithm is less efficient than the Field algorithm, but does a better job comparing data that typically has more than one word in it, such as firm data. Many criteria options require this option to be set to Word.

Initials adjustment score
Use this option to allow matching initials to whole words. For example, "International Health Providers" can be matched to "IHP".

Substring adjustment score
Use this option to allow matching longer strings to shorter strings. For example, the string "Mayfield Painting and Sand Blasting" can match "Mayfield Painting".
Abbreviation adjustment score
Use this option to allow matching whole words to abbreviations. For example, "International Health Providers" can be matched to "Intl Health Providers". This option controls matching whole words to abbreviations in which long words are abbreviated by removing letters: the first letter of the abbreviation matches the first letter of the longer word, and all remaining letters of the abbreviation appear in the longer word in the same order as in the abbreviation. Enter a value from 0 to 100:
• A value of 0 (zero) disables this option.
• A value greater than 0 enables this option.
• A value of 100 means you want an abbreviation of a longer word to be considered a perfect match.
Note: For this option to work, you must set the Compare algorithm option to Word.

Ext abbreviation adjustment score
Extended abbreviation matching offers functionality that handles situations not covered by the Initials adjustment score, Substring adjustment score, and Abbreviation adjustment score options. For example, you might encounter the following situations:
• Suppose you have localities in your data such as La Crosse and New York. However, you also have these same localities listed as LaCrosse and NewYork (without spaces). Under normal matching, you cannot designate these (La Crosse/LaCrosse and New York/NewYork) as matching 100%; the spaces prevent this. (These would be 94 and 93 percent matching.)
• Suppose you have Metropolitan Life and MetLife (an abbreviation and combination of Metropolitan Life) in your data. The Abbreviation adjustment score option cannot detect the combination of the two words.
If you are concerned about either of these cases in your data, use the Ext abbreviation adjustment score option. The score you set tunes your similarity score to consider these types of abbreviations and combinations in your data. The adjustment score adds a penalty for the non-matched part of the words; the higher the score, the lower the penalty. A score of 100 means no penalty, and a score of 0 means maximum penalty. Here are some examples:

   String 1   String 2            Sim score (Adj = 0)   Sim score (Adj = 50)   Sim score (Adj = 100)
   MetLife    Metropolitan Life   58                    79                     100
   MetLife    Met Life            93                    96                     100
   MetLife    MetropolitanLife    60                    60                     60

Note: In the third string comparison, extended abbreviation scoring was not needed or used, because both strings compared are each one word.

Use in weighted score if greater than
The similarity score value you enter here determines whether to use weighted scoring when a criterion's score is greater than the value you specify for this option.
Use in weighted score if less than or equal
The similarity score value you enter here determines whether to use weighted scoring when a criterion's score is less than or equal to the value you specify for this option.

Check for transposed characters
Specifies whether the match score should be adjusted for any transposed characters encountered.
Yes: The transform deducts half as many points for transposed characters as it deducts for other non-matching characters. For example, comparing Smith to Simth: characters 2 and 3 are transposed; percentage alike: 90%.
No: The transform handles transposed characters the same way it handles any other non-matching characters. For example, comparing Smith to Simth: characters 2 and 3 are transposed; percentage alike: 80%.

Numeric words match exactly
Specifies how to match data that contains both numbers and letters. This option applies to address data and custom data, such as a part number. Matching numerically involves two steps:
1. The string is first broken into words. The word breaking is performed on all punctuation and spacing, and then the words are assigned a numeric attribute. A numeric word is any word that contains at least one number from 0 to 9. For example, 4L is considered a numeric word, whereas FourL is not.
2. Numeric matching is performed according to the option setting that you choose:
• None: Numeric words do not need to match exactly to be considered a match.
• Any_Position: Numeric words do not need to be in the same position in two different strings to be considered a match. For example:
  - Street comparison: "4932 Main St # 101" and "# 101 4932 Main St" are considered a match.
  - Part description: "ACCU 1.4L 29BAR" and "ACCU 29BAR 1.4L" are considered a match.
• Any_Position_Consider_Punctuation: This value behaves the same as the Any_Position value; however, the Match transform takes the position of the decimal separator (comma or period) within the numeric words into consideration, so decimal numbers are not broken apart. For example, the string 123.456 is considered a single numeric word rather than two numeric words. For example:
  - Part description: "ACCU 29BAR 1.4L" and "ACCU 1.4L 29BAR" are considered a match.
  - Part description: "ACCU 1.4L 29BAR" and "ACCU 29BAR 1,4L" are not considered a match, because the decimal indicator between the 1 and the 4 differs in the two strings.
  - Financial data: "25,435" and "25.435" are not considered a match.
• Any_Position_Ignore_Punctuation: This value is similar to Any_Position_Consider_Punctuation, except that decimal separators do not impact the matching process. For example:
  - Part description: "ACCU 1,4L 29BAR" and "ACCU 29BAR 1.4L" are considered a match even though there is a different decimal indicator between the 1 and the 4.
• Same_Position: Numeric words must match exactly and be in the same position in the string to be considered a match. For example, 608-782-5000 will match 608-782-5000, but it will not match 782-608-5000.
Note: For this option to work, you must set the Compare algorithm option to Word.

Field compare length
Specifies the number of characters in the field to compare.

Remove punctuation
Specifies whether to remove punctuation from your data to help provide more accurate matches. Be aware of the following:
• This option is valid for English-language, Latin1 data only.
• This option is ignored for all other Match engine values.
Yes: Removes punctuation. No: Keeps punctuation in your data.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overwrites that setting. Before setting this option, be sure that there are no other criteria using this same field.

Convert to upper case
Specifies whether to convert all data to uppercase for matching purposes only. Be aware of the following:
• This option is valid for English-language, Latin1 data only.
• This option is ignored for all other Match engine options.
Yes: Converts the data to uppercase where appropriate. No: Leaves the data intact.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overwrites that setting. Before setting this option, be sure that there are no other criteria using this same field.

Convert diacritical characters
Specifies whether to include diacritical characters in the matching process. Be aware of the following:
• This option is valid for all Match engine options.
• This option works best when using the Latin1 engine. For example, if you are processing Japanese data, you may have some Latin1 data mixed in with it. In these cases, you will be able to convert diacritical characters.
Yes: Converts diacritical characters to the closest English ASCII equivalent for matching purposes. For example, ä converts to a.
No: Preserves diacritical characters during the match process. For example, ä is not identical to a.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overrides that setting. Before setting this option, be sure that there are no other criteria using this same field.

Convert text to numbers
Specifies whether numbers represented as text (one, two, three, ...) should be converted to numbers. If you choose Yes, they are converted in cardinal (one > 1) or ordinal (first > 1st) format. Be aware that this option is valid for all Match engines except the Other_Non_Latin1 engine.
Yes: Converts numbers represented as text to numbers. No: Leaves any numerical text intact.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overwrites that setting. Before setting this option, be sure that there are no other criteria using this same field.
To add a match criterion:
1. In the Option Explorer of the Match Editor, right-click the appropriate match level.
2. Select Add Criteria.
3. In the Match Criteria list, select the criterion that best represents the data that you want to match on. If you do not find what you are looking for, select Custom criteria. The Match criteria field mapping and Criteria options tables are activated.
4. In the Match criteria field mapping table, select an input field that contains the data you want to match on for this criterion.
5. In the Criteria options table, enter a name for your criterion in the Criteria name option.
   Note: You can keep the default name for predefined criteria, but you should enter a meaningful criteria name if you chose a Custom criterion.
6. Configure the remaining criteria options. Be sure to set the Match score and No match score, because these are required options. For example:

   Criteria option                               Value
   Match score                                   101
   No match score                                77
   Compare algorithm                             Word
   Use in weighted score if greater than         78
   Use in weighted score if less than or equal   78

7. To order your criteria, in the Options Explorer of the Match Editor (or the Match Table), select a criterion and use the Move Up or Move Down buttons as necessary.
Using Blank Field Options
In your business rules, you can control how the Match transform treats field comparisons when one or both of the fields compared are blank. If you choose to evaluate when one or both of the fields is blank, the Match transform scores the comparison using the score you specify. If you choose to ignore the blank fields, the score for this field rule does not contribute to the overall weighted score for the record comparison. The Match transform redistributes the contribution allotted for this field to the other criteria and recalculates.

To help you decide what score to enter, determine whether you want the Match transform to consider a blank field 0 percent similar to a populated field or another blank field, 100 percent similar, or somewhere in between. Your answer depends on which field you are comparing. Giving a blank field a high score might be appropriate if you are matching on a first or middle name or a company name, for example.

The following examples demonstrate how setting these blank matching options can affect the overall scoring of records.

Example 1
The One field blank operation for the Given_Name1 field is set to Ignore:
Figure 51: Blank Field Options 1
Example 2
The One field blank operation for the Given_Name1 field is set to Eval, with the One field blank score set to 0:
Figure 52: Blank Field Options 2
Example 3
The One field blank operation for the Given_Name1 field is set to Eval, with the One field blank score set to 100:
Figure 53: Blank Field Options 3
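The redistribution behavior can be sketched in Python; the criteria and contribution values below are assumptions chosen to echo the three examples above.

CONTRIB = {"given_name1": 20, "family_name": 40, "address": 40}  # totals 100

def weighted_score(similarities, blank, operation="Ignore", blank_score=0):
    # similarities: criterion -> score; blank: set of blank criteria.
    active, sims = dict(CONTRIB), dict(similarities)
    for criterion in blank:
        if operation == "Ignore":
            active.pop(criterion)  # drop it and renormalize below
        else:  # "Eval"
            sims[criterion] = blank_score
    scale = 100 / sum(active.values())  # proportional redistribution
    return sum(sims[c] * w * scale for c, w in active.items()) / 100

sims = {"given_name1": 0, "family_name": 100, "address": 100}
print(weighted_score(sims, {"given_name1"}, "Ignore"))     # 100.0
print(weighted_score(sims, {"given_name1"}, "Eval", 0))    # 80.0
print(weighted_score(sims, {"given_name1"}, "Eval", 100))  # 100.0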
Using Match Criteria Options
The majority of your data standardization should take place in the address cleansing and Data Cleanse transforms. However, there are a few preprocessing options specific to the match process that can provide more accurate matching, and these can be defined in the Match Editor. These options include removing punctuation, converting to upper case, converting diacritical characters, and converting text to numbers.
Matching on Phonetic Data
There are instances where using phonetic data as a criterion can produce more matches than matching on other criteria, such as name or firm data. For example, the names Smith and Smythe are only 72% similar when you match based on the name field, but are a 100% match when you match phonetically.

   Name     Comparison score
   Smith    72% similar
   Smythe

   Name     Phonetic key (primary)   Comparison score
   Smith    SMO                      100% similar
   Smythe   SMO
To match on phonetic data, use the Double Metaphone or Soundex functions to populate a field, and use that field for creating break groups or as a criterion in matching. If you intend to match on phonetic keys, set up the criteria options as follows:

   Option                            Setting
   Compare algorithm                 Field
   Check for transposed characters   No
   Initials adjustment score         0
   Substring adjustment score        0
   Abbreviation adjustment score     0

Set your match score options as follows:

   Option           Value
   Match score      100
   No match score   99

If you are matching on multiple criteria, including a phonetic criterion, place the phonetic criterion first in the order of criteria and set your match score options as follows:

   Option           Value
   Match score      101
   No match score   99
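For illustration only, here is a minimal pure-Python Soundex sketch; Data Services supplies its own Soundex and Double Metaphone functions, so this is not the product implementation.

def soundex(name):
    codes = {"bfpv": "1", "cgjkq": "2", "sxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        return next((d for letters, d in codes.items() if ch in letters), "")
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result, last = name[0].upper(), code(name[0])
    for ch in name[1:]:
        digit = code(ch)
        if digit and digit != last:
            result += digit
        if ch not in "hw":  # h and w do not reset the previous code
            last = digit
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smythe"))  # S530 S530: a 100% phonetic match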
Remember that, when you break, records that have no value are not in the same group as records that have a value (unless you set up matching on blank fields). For example, consider the following two input records:

   Title   First name   Last name   Address                         Postal code
   Mr                   Johnson     100 Main Street, La Crosse WI   54601
           Scott        Johnson     100 Main Street, La Crosse WI   54601

After these records are processed by the Data Cleanse transform, the first record has an empty first name field and, therefore, an empty phonetic key. This means there cannot be a match if you are creating break groups on the phonetic key. If you are not creating break groups, there cannot be a match if you are not blank matching.

The length that you assign to a phonetic function output is important. Consider the following example:

   First name (last name)   Key
   S (Johnson)              S
   Scott (Johnson)          SKT
Suppose these two records represent the same person. If you break on more than one character, these records are in different break groups, and therefore will not be compared.
Matching Unicode Data
Unicode matching lets you process any non-Latin1 Unicode data, with special processing for Chinese, Japanese, Korean, and Taiwanese (CJKT) data. For example, the Match transform will:
• Consider half-width and full-width characters to be equal.
• Consider native script numerals and Arabic numerals to be equal. It can interpret numbers that are written in native script; this can be controlled with the Convert text to numbers option.
• Include variations for popular, personal, and firm name characters in the referential data.
• Consider firm words, such as Corporation or Limited, to be equal to their variations (Corp. or Ltd.) during the matching comparison process. To find the abbreviations, the transform uses native script variations of the English alphabets during firm name matching.
• Ignore commonly used optional markers for province, city, district, and so on, in address data comparison.
• Intelligently handle variations in a building marker.

With Japanese data, the Match transform will:
• Consider block data markers, such as chome and banchi, to be equal to those used with hyphenated data.
• Consider words with or without Okurigana to be equal in address data.
• Consider variations of no marker, ga marker, and so on, to be equal.
• Consider variations of a hyphen or dashed line to be equal.

The Unicode match functionality does not:
• Perform conversions of simplified and traditional Chinese data.
• Compare different scripts, such as Kana to Kanji, or Chinese to English.

The Match transform provides some data normalization options to prepare your data for matching. These options are located in the Field (Match Input) option group. Before sending Unicode data into the matching process, you must first separate the data by country into separate match data flows. The Match Wizard can do this for you when you use the multi-national strategy.
To configure the Match transform for Unicode matching:
1. Use a Case transform to route your data to a Match transform that handles that type of data.
2. Open the AddressJapan_MatchBatch Match transform configuration and save it with a different name.
3. Set the Match engine option in the transform options option group to a value that reflects the type of data being processed. This option is set to Japanese in the match_starter_unicode sample.
4. Set any preprocessing options in the Match Criteria Editor. For example:
   • When possible, use criteria for parsed components of address, firm, and name data, such as Primary_Name or Person1_Family_Name1.
   • If you have parsed address, firm, or name data that does not have a corresponding criterion, use the Address_Data1-5, Firm_Data1-3, and Name_Data1-3 criteria.
   • For all other data that does not have a corresponding criterion, use the Custom criteria.
Additional Match Features
Typically, matching implies comparing two strings from an alphabetical perspective: we try to find out how similar two strings are when they are compared character by character, and calculate a similarity score. But this is not a good approach if the real meaning of the characters is more important. The semantics of the data, its real meaning, must also be considered. With proximity matching, Data Services interprets geographical, numeric, and date data as such and calculates similarity based on how close the values are.

What is proximity matching? It is the ability to match records based on their proximity, instead of comparing the string representation of the data. Data Services provides several types:

Geographical proximity
Finds duplicate records based on geographic proximity, using latitude and longitude information. This is not driving distance, but geographic distance, computed with the Haversine distance algorithm using WGS 84 (GPS) coordinates.
Figure 54: Geographical Proximity Matching
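A hedged Python sketch of the Haversine calculation just mentioned follows; the coordinates and the 0.5 km duplicate threshold are illustrative assumptions.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle (not driving) distance between two WGS 84 points, in km.
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

d = haversine_km(48.7758, 9.1829, 48.7761, 9.1854)
print(f"{d:.3f} km ->", "match" if d <= 0.5 else "no match")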
Numeric proximity
This method finds duplicates based on the numerical closeness of data, based either on numbers or on dates:
• Numeric proximity – finds duplicates based on the numerical closeness of data.
• Date proximity – finds duplicates based on date ranges.
Numeric proximity – Find duplicates based on numerical closeness of data Date proximity – Find duplicate based on date ranges.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Configuring the Match Transform
Figure 55: Numerical and Date Proximity Matching
Comparing Compound Family Names
This is an extension of the substring adjustment score option, in the sense that it loosens some of the prerequisites. Not only can it find new matches, it can also boost the score for some existing matches. The Approximate substring option works only on one field; if the family names are in separate fields, they need to be concatenated into one field.

It may be useful to know how adjustment scores are calculated, although it is not necessary for everyone to understand. Say there are two strings, "International" and "Intl.". Without any adjustment score:
score = (number of identical characters × 100) / total number of characters = (8 × 100) / 17 = 47.
Now if the adjustment score is 80, it is applied to the characters that are not identical:
score = (number of identical characters × 100 + number of non-identical characters × 80) / total number of characters = (8 × 100 + 9 × 80) / 17 = 89.
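The worked example above can be reproduced with a short Python sketch; the character counts are taken as given in the text, and how the transform counts identical characters internally is not detailed here.

def adjusted_similarity(identical, nonidentical, adjustment):
    total = identical + nonidentical
    return (identical * 100 + nonidentical * adjustment) / total

# "International" vs "Intl.": 8 identical, 9 non-identical characters.
print(round(adjusted_similarity(8, 9, 0)))   # 47 (no adjustment credit)
print(round(adjusted_similarity(8, 9, 80)))  # 89 (adjustment score of 80)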
Figure 56: Comparing Compound Family Names – Approximate Substring Matching
The Match Wizard previously supported only name, address, e-mail, and phone type fields when setting up a match strategy, and did not support non-party data such as product data or financial data. Now you can set up custom fields in the match strategy in the wizard. Custom fields are only used with the Simple Match strategy; the rest of the strategies are for party data (for example, customers or suppliers) only.
Figure 57: Custom Fields in Match Wizard
Exercise 5: Modify Match Criteria using the Match Editor

Exercise Objectives
After completing this exercise, you will be able to:
• Refine match criteria so that invalid results are not returned for a match on full names on the USA_Customers table.
Business Example
You have performed a simple match by full name on the Beta Businesses USA_Customers table. You have determined that matched records for at least one individual have different SSNs and are, therefore, invalid matches. You decide to further refine the match so that these invalid matches are eliminated.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
2. If the DQSTANDALONE and DELTA datastores already exist, continue to the next step. If they do not exist, create them according to the following directions.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: Create a new batch job Beta_Customers_Match2_Job with a data flow Beta_Customers_Match2_DF.

1. If the Omega project already exists, continue to the next step. If it does not exist, create a new project called Omega.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
2. In the Omega project, create a new batch job Beta_Customers_Match2_Job with a new data flow called Beta_Customers_Match2_DF.
Task 3: In the data flow Beta_Customers_Match2_DF, configure the Match Wizard in the Query transform.

1. In the Match_Target table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2. In the workspace for Beta_Customers_Match2_DF, add the Match_Target table from the DQSTANDALONE datastore as the source object connected to the Query transform.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.

1. Configure the Match Wizard in the Query transform.
Task 5: Configure a Match_NameAddr transform in the data flow, connecting it to a new target template table Match_Target.

1. Configure a Match_NameAddr transform in the data flow.
2. Add a new matching criterion based on the new input field DC_SSN.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no match. You decide to use the social security number.

1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criterion.
3. Adjust the values of the Address Secondary Number match criterion according to the following table:

   Option                    Value
   No match score            84
   One field blank score     85
   Both fields blank score   100

4. Adjust the values of the Person1 Given Name 1 match criterion according to the following table:

   Option                     Value
   Match score                101
   One field blank score      0
   Both fields blank score    0
   Initials adjustment score  90

5. Adjust the values of the Person1 Given Name 2 match criterion according to the following table:

   Option                     Value
   Match score                80
   No match score             79
   One field blank score      90
   Both fields blank score    100
   Initials adjustment score  100

6. Move the new criterion for Social Security Number 1 to be the first match criterion.
Task 7: Execute the batch job Beta_Customers_Match2_Job with the default parameters.

1. Configure a new target table for the data flow.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
Task 8: Adjust the match criteria in the Match transform to ignore the values for Social Security Number 1 when both record field values are blank. Execute the batch job Beta_Customers_Match2_Job again with the default parameters.

1. Adjust the match criteria in the Match transform for the data flow Beta_Customers_Match2_DF to ignore records when their social security number values are blank.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
Solution 5: Modify Match Criteria using the Match Editor

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b) In the dialog box, enter your assigned User ID.
c) Enter your password, which is the same as your User ID.
d) Choose the Log on button.
e) In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f) Choose OK.
2. If the DQSTANDALONE and DELTA datastores already exist, continue to the next step. If they do not exist, create them according to the following directions.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
a) In the Local Object Library, select the Datastores tab, right-click in the Datastores workspace, and select the New option from the context menu.
b) In the resulting dialog box, use the following options:

   Field                  Value
   Datastore name         DQSTANDALONE
   Datastore type         Database
   Database type          Microsoft SQL Server
   Database version       Microsoft SQL Server 2005
   Database server name   Supplied by the Instructor
   Database name          DQSTANDALONE
   User name              sourceuser (lower case)
   Password               sourcepass (lower case)

   Click OK to finish.
c) In the Local Object Library, right-click the DQSTANDALONE datastore and select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
d) Import the metadata for the DQSTANDALONE database source tables by selecting all of the following tables, right-clicking them, and choosing the option Import from the menu:
   • AddressCleansed_USA
   • DataCleansed_USA
   • DataCleansed2_USA
   • Match_Target
   • Match_Target2
   • Match_Target3
   • Match_Target4
   Note: This database is a read-only database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
a) In the Local Object Library, select the Datastores tab, right-click in the Datastores workspace, and select the New option from the context menu.
b) In the resulting dialog box, use the following options:

   Field                  Value
   Datastore name         DELTA
   Datastore type         Database
   Database type          Microsoft SQL Server
   Database version       Microsoft SQL Server 2005
   Database server name   Supplied by the Instructor
   Database name          DELTA## (where ## is the group number assigned by your Instructor)
   User name              student## or consult your instructor
   Password               student## or consult your instructor

   Choose OK to finish.
c) You do not have to import any metadata.
   Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: Create a new batch job Beta_Customers_Match2_Job with a data flow Beta_Customers_Match2_DF.

1. If the Omega project already exists, continue to the next step. If it does not exist, create a new project called Omega.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
a) From the Project menu, choose New → Project.
b) When the Project New dialog box appears, enter Omega in the Project name field.
c) Choose Create so that the new project appears in the Project area.

2. In the Omega project, create a new batch job Beta_Customers_Match2_Job with a new data flow called Beta_Customers_Match2_DF.
a) In the Project area, right-click the project name and choose New Batch Job from the menu.
b) Enter the name of the job as Beta_Customers_Match2_Job.
c) Press Enter to commit the change.
d) Select the Data Flow icon in the Tool Palette.
e) Select the workspace where you want to add the data flow.
f) Enter Beta_Customers_Match2_DF as the name.
g) Press Enter to commit the change.
h) Double-click the Beta_Customers_Match2_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match2_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b) Open the node Tables and right-click on the table Match_Target to choose the option Edit.
c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d) Choose OK to close the dialog box.
e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match2_DF, add the Match_Target table from the DQSTANDALONE datastore as the source object connected to the Query transform.
a) In the Local Object Library, select the Datastores tab and then select the Match_Target table from the DQSTANDALONE datastore.
b) Select and drag the object to the data flow workspace and, in the context menu, choose the option Make Source.
c) In the Tool Palette, select the icon for a Query transform and then select in the data flow workspace.
d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
e) Double-click the Query transform to open the transform editor.
f) In the transform editor, map all columns from the input schema to the output schema columns. Do this by selecting all the fields in the Schema In window and then dragging them to the Query object in the Schema Out window. Use the Back icon to close the Query transform editor.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
a) In the data flow Beta_Customers_Match2_DF work space, right-click on the Query transform to select the option Run Match Wizard.
b) On the Match Selection Type window, select the option for a simple match selection with only one way to identify an individual. Then choose Next.
c) Set the name for Match Set Name to NameAddr.
d) Specify that the first criterion is Full Name by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter.
e) Enter the second criterion as Address by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter. Then choose Next.
f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as seen below.

Criteria                          Field
PERSON1_FAMILY_NAME1              DC_FAMILY_NAME
PERSON1_GIVEN_NAME1               DC_GIVEN_NAME1
PERSON1_GIVEN_NAME1_MATCH_STD1    DC_GIVEN_NAME1_MATCH_STD1
PERSON1_GIVEN_NAME1_MATCH_STD2    DC_GIVEN_NAME1_MATCH_STD2
PERSON1_GIVEN_NAME1_MATCH_STD3    DC_GIVEN_NAME1_MATCH_STD3
PERSON1_GIVEN_NAME2               DC_GIVEN_NAME2
PERSON1_GIVEN_NAME2_MATCH_STD1    DC_GIVEN_NAME2_MATCH_STD1
ADDRESS_PRIMARY_NAME              AC_PRIM_NAME
ADDRESS_PRIMARY_NUMBER            AC_PRIM_NUM
ADDRESS_SECONDARY_NUMBER          AC_UNIT_NUM
Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If the fields are not mapped, select the field name in the Field column by using the drop-down list.

Hint: You may need to select More... in the field to see more field entries.

Then choose Next.
g) On the next window, define the break key by selecting POSTALCODE as the Upstream field. Set the start position to 1 and the length to 5.
h) Close the Match Wizard configuration to return to the data flow work space.
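The break key defined in step g) is what keeps the comparison workload manageable: only records that share the first five characters of the postal code are placed in the same break group and compared with one another. As a rough plain-Python sketch of this grouping (the POSTALCODE field name is taken from step g); everything else is illustrative):

from collections import defaultdict

def break_groups(records, start=1, length=5):
    # Group records on a substring of the postal code; only records in
    # the same break group are ever compared during matching.
    groups = defaultdict(list)
    for rec in records:
        key = rec['POSTALCODE'][start - 1:start - 1 + length]  # GUI positions are 1-based
        groups[key].append(rec)
    return list(groups.values())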
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target2.
1. Configure a Match_NameAddr transform to the data flow.
a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
b) Double-click the Match_NameAddr transform to open the transform editor.
c) Delete the following fields in the Schema Out window by selecting them, right-clicking, and choosing Delete.

Output Column
MC_GROUP_NUMBER
MC_MATCH_SCORE
MC_GROUP_COUNT
MC_GROUP_RANK
d) Rename the newly generated match columns as follows:

Original Column Name                       New Column Name
NameAddr_Level1_GROUP_NUMBER               MC_GROUP_NUMBER
NameAddr_Level1_MATCH_SCORE                MC_MATCH_SCORE
NameAddr_Level1_GroupStats1_GROUP_COUNT    MC_GROUP_COUNT
NameAddr_Level1_GroupStats1_GROUP_RANK     MC_GROUP_RANK

e) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.

Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
f) Add the DC_SSN column from the input schema to the Input tab in the parameters area to create a new entry.

Hint: You may need to select the field by selecting the blue arrow icon.
2. Add a new matching criterion based on the new input field DC_SSN.
a) In the transform editor, go to the Options tab and select the button Edit Options... to invoke the Match Editor.
b) In the Transform Options area, expand the nodes Matching → Level1 → Match Criteria if they are not already expanded.

Notice that the match method is rule-based. The order of evaluation is important: the decision is made on the last match key unless a no-match is determined based on other keys. The key that is more likely to determine a match or no-match should be evaluated first. In this case, it is the social security number.

Make a note of the current Match Score and No Match Score for each criterion. Open the Person1_Given_Name2 match criteria and view the values for the options. To view the options, select the criteria and then select the Options tab.
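As a rough plain-Python sketch of that evaluation order (this is an illustration, not the actual matching engine; the compare functions and score values are assumptions):

def rule_based_match(criteria, a, b):
    # criteria: ordered list of (compare_fn, match_score, no_match_score);
    # compare_fn(a, b) returns a similarity score from 0 to 100.
    for compare, match_score, no_match_score in criteria[:-1]:
        if compare(a, b) <= no_match_score:
            return False                    # earlier keys can only force a no-match
    compare, match_score, no_match_score = criteria[-1]
    return compare(a, b) >= match_score     # the last key makes the decision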
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
a) In the Match Editor, select the node Match Criteria and choose the Add Criteria button.
b) Enter Social Security Number 1 as the Criteria Name.
c) Select the tab Criteria Fields and open the node for Person. Scroll down to find and select the criteria field SOCIAL_SECURITY_NUMBER1.
d) In the Criteria field mapping section, from the drop-down list, choose the field DC_SSN as the entry for the field Input Field Mapped Name.
e) For the SSN criteria, select the Options tab and enter the value of 101 for the field Match score and 99 for the field No match score.
f) For the SSN criteria, enter the value of Field Similarity for the field Compare data using:.
g) Select the Apply button.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criteria.
a) Select the Address Primary Name match and go to the Options tab.
b) Set the value of the No match score field to 79.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
Option                   Value
No match score           84
One field blank score    85
Both field blank score   100

a) Select the Address Secondary Number match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

Option                     Value
Match score                101
One field blank score      0
Both field blank score     0
Initial adjustment score   90

a) Select the Person1 Given Name 1 match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:

Option                     Value
Match score                80
No match score             79
One field blank score      90
Both field blank score     100
Initial adjustment score   100
a) Select the Person1 Given Name 2 match and go to the Options tab.
b) Change the value of the Compare data using: field to Word Similarity using the drop-down list.
c) Adjust the values of the match criteria according to the table above.
6. Move the new Social Security Number 1 criteria to be the first match criteria.
a) In the Transformation Options hierarchy, select the SSN match criteria under Level 1.
b) Use the Move Up button underneath Transformation Options to place the SSN match criteria in the first position.
c) Choose OK to close the Match Editor.
d) Choose Back to close the transform editor.
Task 7: Execute the batch job Beta_Customers_Match2_Job with the default parameters.
1. Configure a new target table for the data flow.
a) From the Tool Palette, select the icon for a template table and select in the data flow work space to insert the new target template table. Enter Match_Target2 for the name and select the DELTA datastore from the drop-down list. Choose OK to create the template table.
b) Connect the Match_NameAddr transform to the target template table by selecting the Match_NameAddr transform and, while holding down the mouse button, dragging to the target template table. Release the mouse button to create the link.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match2_Job to select the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data.

Notice that there are two match groups. Records for A G Olson and Andrew Olson no longer match. However, records without social security numbers are not identified as matches in cases where they should be. The match criteria need to be adjusted again.
Task 8: Adjust the match criteria in the Match transform to ignore the values for Social Security Number 1 when both record field values are blank. Execute the batch job Beta_Customers_Match2_Job again with the default parameters.
1. Adjust the match criteria in the Match transform for the data flow Beta_Customers_Match2_DF to ignore records when their social security number values are blank.
a) Open the work space for the data flow Beta_Customers_Match2_DF.
b) Double-click the Match transform to open the transform editor.
c) Go to the Options tab and start the Match Editor.
d) For the Social Security Number 1 match criteria, set the option to ignore the values when both fields are blank. Find the field Both field blank operation and, using the drop-down list, choose the value IGNORE. If there are values for One field blank score and Both field blank score, delete them.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match2_Job to choose the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data.

Notice that the matches for Thomas Davenport are correctly identified. The records for Ervin Scott Benson and Scott Benson are identified as a match due to a high-level match setting called Compare Given_Name1 to Given_Name2. Similarly, records for Isabel Cruz and Isabel A. Cruz match due to the Match on hyphenated family name setting.
Figure 58: Match Editor Results
Lesson Summary
You should now be able to:
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
Related Information
• For more information about Compare Tables, see the Compare Tables section in the Data Services Technical Manuals.
Lesson: Performing Post-Match Processing

Lesson Overview
Post-match operations give you the opportunity to take action based on the results of your matching setup.
Lesson Objectives
After completing this lesson, you will be able to:
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
Business Example
The matching process can divide records into groups based on break keys. The records in those groups are considered to match to one degree or another. You want to be able to prioritize the records within the groups according to their source. This will help you identify the most important records as a source of data to create a single version of the truth.
Performing Post-Match Processing
Post-match operations give you the opportunity to take action based on the results of your matching setup. The following post-match operations are available:

Group Prioritization
Use the group prioritization operation to order records for processing by other post-match operations. It is used to control duplicates by determining which record is the master and which are subordinates.

Group Statistics
Use group statistics to analyze information about your data collections: the number of records within the collection, the sequential group order number, the group rank, and whether the records in a collection belong to more than one source. Group statistics are essential for generating data for match reports.

Best Record
Use the best record operation to salvage data from matching records and consolidate that data into a single best record, or copy it to all matching records.

Unique ID
Use the unique ID options to assign sequential identification numbers to each new record when adding records to a data warehouse. Unique records from the match are assigned a unique ID, while duplicates share an ID.
Setting Priorities
When records match, they are assigned to a match group. The Match transform can determine which record in a match group is the best, or highest-ranking, record using a group prioritization operation. The highest-ranking record is referred to as the master, and all other records in that match group are subordinates. This information is critical for the best record operations that consolidate these records.
Figure 59: Post-Match Process Prioritization
Record priority is the combination of field priority and blank penalty scores:
• Field priority score uses a field containing numeric values to calculate the priority. For example, if you want to give a higher priority to one source over another, you could assign fewer points to the source with the higher priority.
• Blank penalty score assigns a penalty value of up to 999 when the specified field is blank in a record.
Using best record prioritization
The first option for calculating the priority of a record is to determine priority based on best record. This involves selecting a field that contains a numeric value and using that value as the initial priority for the record, sorted in ascending or descending order. You can combine multiple fields to calculate this priority. If the field you want to use is not a numeric value, you need to convert the content so that it can be used for prioritization. Use the Query transform with the appropriate function to create a new field for the converted content. For example, you can use an IfThenElse function to replace strings with numeric values, or use the Julian function to convert a date to a number.

Using blank penalty prioritization
Alternatively, you can choose to order the records based on the completeness of data and define blank penalties. This feature assigns a lower priority to records in which particular fields are blank. Incorporating a blank penalty is appropriate if you feel that a blank field should not disqualify one record from matching another, but prefer to have the most complete record be your best record. As with best record priority, the blank penalty is a penalty-scoring system. For each blank field, you can assess a penalty of up to 999 points. You can assess the same penalty for each blank field, or assess a higher penalty for fields you consider more important. For example, if you were targeting a mailing to college students who primarily live in apartments or dormitories, you might assess a blank penalty of 5 to the less important fields Given_Name2 and Primary_Postfix. For more important fields, you might assess a blank penalty of 20 to Given_Name1 and Unit_Number. As a result, records would be prioritized as follows:
Figure 60: Blank Penalty Prioritization
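As a plain-Python illustration of the two conversions just mentioned (field names and priority values are assumptions; in the data flow itself you would write this as a Query transform mapping with ifthenelse or julian):

import datetime

def source_priority(source):
    # Mirrors ifthenelse(SOURCE = 'ACCT', 10, 20): a lower number
    # gives the preferred source the higher priority.
    return 10 if source == 'ACCT' else 20

def date_priority(d):
    # Mirrors the Julian conversion: a date becomes a sortable number.
    return d.toordinal()

print(source_priority('ACCT'))                    # 10
print(date_priority(datetime.date(2011, 6, 1)))   # 734289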
Combining best record with blank penalty
The best record priority and blank penalty scores are added together and considered as one score. When defining both priorities, you must determine which you consider most important, and weight each accordingly. For example, if you want to have the best record priority take precedence, set the intervals between the options to a large enough value that the combined blank penalties cannot bridge the gap.
Figure 61: Combining Best Record with Blank Penalty
With this scoring system, a record from the house source always receives priority over a record from a rented source, even if the house record has blank fields. Even though the house record contains five blank fields, it receives only 155 penalty points (100 + 5 + 20 + 5 + 5 + 20), while the record from source A receives 200 penalty points. The house record, therefore, has the lower penalty and the higher priority.

You can manipulate the scores to set priority exactly as you would like. For example, suppose you prefer a rented record containing first name data over a house record without first name data. You could set the first name blank penalty score so that a blank first name field would weigh more heavily than any source membership.
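The combined score can be sketched in a few lines of plain Python (the penalty values are taken from the worked example above; the record layout is illustrative):

def record_penalty(source_penalty, record, blank_penalties):
    # Total penalty = best record (source) penalty plus one penalty per
    # blank field; the record with the lowest total becomes the master.
    blanks = sum(points for field, points in blank_penalties.items()
                 if not record.get(field))
    return source_penalty + blanks

# House record with five blank fields: 100 + (5 + 20 + 5 + 5 + 20) = 155,
# which still beats a complete record from source A at 200 penalty points.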
Generating Group Statistics
Group statistics help you analyze your match groups. They provide information such as the number of records in a match group, the sequential group order number, the group rank, and whether records in a match group belong to more than one source. Group statistics are essential for generating data for match reports.
If you want to generate match reports, you must have a Group Statistics operation included in your Match transform. To get the most accurate data in your reports, make sure that you have defined unique names in the Match transform for the other operations. This helps you better understand which element is producing the data you are looking at. To generate the data you want to see in match reports, you must also set the Generate report statistics option to Yes in the Match transform. By turning on report data generation, you can get information about break groups, which criteria were instrumental in creating a match, and so on. Several match reports are available in the Data Services Management Console to help you analyze your match results:

• Match Contribution
The Match Contribution report provides information on the effect of the individual break groups and individual criteria on the total matching. Evaluating this information is helpful for fine-tuning break keys and match criteria.
• Match Criteria Summary
Data Services generates one Match Criteria Summary report per match set to provide a consolidated view of all key settings and the criteria settings. You can evaluate this information to determine whether adjusting field comparison lengths or criteria settings would be helpful.
• Match Source Statistics Summary
The Match Source Statistics report provides information about duplicates within and across sources.
• Match Duplicate Sample
The Match Duplicate Sample report provides a sample of duplicates in the match results. One report is generated for each Match transform in the job.
Exercise 6: Prioritize Matched Records

Exercise Objectives
After completing this exercise, you will be able to:
• Prioritize records from the Beta Businesses USA_Customers table to adhere to specific business rules.
Business Example
You previously used the English North America Data Cleanse transform to generate a new column (DC_FIRM) that contained cleansed and standardized firm names. By modifying the PERSON_FIRM_EN dictionary, you standardized records so that “Associates” became “Assoc”. Now you want to prioritize records based on their source. You want records with a source value of “ACCT” to take precedence over records with other source values.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: Create a new batch job Beta_Customers_Match3_Job with a data flow Beta_Customers_Match3_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. In the Omega project, create a new batch job Beta_Customers_Match3_Job with a new data flow called Beta_Customers_Match3_DF.
Task 3: In the data flow Beta_Customers_Match3_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target2 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2. In the workspace for Beta_Customers_Match3_DF, add the Match_Target2 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target3.
1. Configure a Match_NameAddr transform to the data flow.
2. Add a new matching criterion based on the new input field DC_SSN.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criteria.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
Option                   Value
No match score           84
One field blank score    85
Both field blank score   100
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

Option                     Value
Match score                101
One field blank score      0
Both field blank score     0
Initial adjustment score   90
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:

Option                     Value
Match score                80
No match score             79
One field blank score      90
Both field blank score     100
Initial adjustment score   100
6. Adjust the match criteria for Social Security Number 1 to ignore records when their social security number values are blank.
7. Move the new Social Security Number 1 criteria to be the first match criteria.
Task 7: Add a new output column Source_Priority to the Query transform, populated by an expression giving a higher priority to the source value ACCT.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
3. Add a new template target table called Match_Target3 in the DELTA datastore, connect it to the Match transform, and execute the job.
Task 9: Execute the batch job Beta_Customers_Match3_Job with the default parameters.
1. Execute the batch job Beta_Customers_Match3_Job with the default parameters.
Solution 6: Prioritize Matched Records

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b) In the dialog box, enter your assigned User ID.
c) Enter your password, which is the same as your User ID.
d) Choose the Log on button.
e) In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f) Choose OK.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b) In the resulting dialog box, use the options:

Field                   Value
Datastore name          DQSTANDALONE
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DQSTANDALONE
User name               sourceuser (lower case)
Password                sourcepass (lower case)
Choose OK to finish.
c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
d) Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, right-clicking on them, and choosing the option Import from the menu.
• AddressCleansed_USA
• DataCleansed_USA
• DataCleansed2_USA
• Match_Target
• Match_Target2
• Match_Target3
• Match_Target4

Note: This database is a read-only database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b) In the resulting dialog box, use the options:

Field                   Value
Datastore name          DELTA
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DELTA## (where ## is the group number assigned by your Instructor)
User name               student## or consult your instructor
Password                student## or consult your instructor
Choose OK to finish.
c) You do not have to import any metadata.

Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: Create a new batch job Beta_Customers_Match3_Job with a data flow Beta_Customers_Match3_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

a) From the Project menu, choose New → Project.
b) When the Project New dialog box appears, enter Omega in the Project name field.
c) Choose Create so that the new project appears in the Project area.
2. In the Omega project, create a new batch job Beta_Customers_Match3_Job with a new data flow called Beta_Customers_Match3_DF.
a) In the Project area, right-click the project name and choose New Batch Job from the menu.
b) Enter the name of the job as Beta_Customers_Match3_Job.
c) Press Enter to commit the change.
d) Select the Data Flow icon in the Tool Palette.
e) Right-click in the workspace where you want to add the data flow and select Add New → Dataflow.
f) Enter Beta_Customers_Match3_DF as the name.
g) Press Enter to commit the change.
h) Double-click the Beta_Customers_Match3_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match3_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target2 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b) Open the node Tables and right-click on the table Match_Target2 to choose the option Edit.
c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d) Choose OK to close the dialog box.
e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match3_DF, add the Match_Target2 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
a) In the Local Object Library, select the Datastores tab and then select the Match_Target2 table from the DQSTANDALONE datastore.
b) Select and drag the object to the data flow workspace and, in the context menu, choose the option Make Source.
c) In the Tool Palette, select the icon for a Query transform and then select in the data flow workspace.
d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
e) Double-click the Query transform to open the transform editor.
f) In the transform editor, map all columns from the input schema to the output schema columns. Do this by selecting all the fields listed in the Schema In window and then dragging them to the Query object in the Schema Out window.
g) Use the Back icon to close the transform editor.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
a) In the data flow Beta_Customers_Match3_DF work space, right-click on the Query transform to select the option Run Match Wizard.
b) On the Match Selection Type window, select the option for a simple match selection with only one way to identify an individual. Then choose Next.
c) Set the name for Match Set Name to NameAddr.
d) Specify that the first criterion is Full Name by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter.
e) Enter the second criterion as Address by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter. Then choose Next.
f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as seen below.

Criteria                          Field
PERSON1_FAMILY_NAME1              DC_FAMILY_NAME
PERSON1_GIVEN_NAME1               DC_GIVEN_NAME1
PERSON1_GIVEN_NAME1_MATCH_STD1    DC_GIVEN_NAME1_MATCH_STD1
PERSON1_GIVEN_NAME1_MATCH_STD2    DC_GIVEN_NAME1_MATCH_STD2
PERSON1_GIVEN_NAME1_MATCH_STD3    DC_GIVEN_NAME1_MATCH_STD3
PERSON1_GIVEN_NAME2               DC_GIVEN_NAME2
PERSON1_GIVEN_NAME2_MATCH_STD1    DC_GIVEN_NAME2_MATCH_STD1
ADDRESS_PRIMARY_NAME              AC_PRIM_NAME
ADDRESS_PRIMARY_NUMBER            AC_PRIM_NUM
ADDRESS_SECONDARY_NUMBER          AC_UNIT_NUM
Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If the fields are not mapped, select the field name in the Field column by using the drop-down list.

Hint: You may need to select More... in the field to see more field entries.

Choose Next.
g) On the next window, define the break key by selecting AC_POSTCODE_FULL as the Upstream field. Set the start position to 1 and the length to 5.
h) Close the Match Wizard configuration to return to the data flow work space.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target3.
1. Configure a Match_NameAddr transform to the data flow.
a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
b) Double-click the Match_NameAddr transform to open the transform editor.
c) Delete the following fields in the Schema Out window by selecting them, right-clicking, and choosing Delete.

Output Column
MC_GROUP_NUMBER
MC_MATCH_SCORE
MC_GROUP_COUNT
MC_GROUP_RANK
NAMEADDR_LEVEL1_GROUP_NUMBER
NAMEADDR_LEVEL1_MATCH_SCORE
NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_COUNT
NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_RANK
d) Rename the newly generated match columns as follows:

Original Column Name                       New Column Name
NameAddr_Level1_GROUP_NUMBER               MC_GROUP_NUMBER
NameAddr_Level1_MATCH_SCORE                MC_MATCH_SCORE
NameAddr_Level1_GroupStats1_GROUP_COUNT    MC_GROUP_COUNT
NameAddr_Level1_GroupStats1_GROUP_RANK     MC_GROUP_RANK
e) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.

Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
f) Add the DC_SSN column from the input schema to the Input tab in the parameters area to create a new entry.
2. Add a new matching criterion based on the new input field DC_SSN.
a) In the transform editor, go to the Options tab and select the button Edit Options... to invoke the Match Editor.
b) In the Transform Options area, expand the nodes Matching → Level1 → Match Criteria if they are not already expanded.

Notice that the match method is rule-based. The order of evaluation is important: the decision is made on the last match key unless a no-match is determined based on other keys. The key that is more likely to determine a match or no-match should be evaluated first. In this case, it is the social security number.

Make a note of the current Match Score and No Match Score for each criterion. Open the Person1_Given_Name2 match criteria and view the values for the options. To view the options, select the criteria and then select the Options tab.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
a) In the Match Editor, select the node Match Criteria and choose the Add Criteria button.
b) Enter Social Security Number 1 as the Criteria Name.
c) Select the tab Criteria Fields and open the node for Person. Scroll down to find and select the criteria field SOCIAL_SECURITY_NUMBER1.
d) From the drop-down list, choose the field DC_SSN as the entry for the field Input Field Mapped Name.
e) For the Social Security Number 1 criteria, select the Options tab and enter the value of 101 for the field Match score and 99 for the field No match score.
f) For the Social Security Number 1 criteria, enter the value of Field Similarity for the field Compare data using:.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criteria.
a) Select the Address Primary Name match and go to the Options tab.
b) Set the value of the No match score field to 79.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
Option                   Value
No match score           84
One field blank score    85
Both field blank score   100

a) Select the Address Secondary Number match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

Option                     Value
Match score                101
One field blank score      0
Both field blank score     0
Initial adjustment score   90

a) Select the Person1 Given Name 1 match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:
Option                     Value
Match score                80
No match score             79
One field blank score      90
Both field blank score     100
Initial adjustment score   100

a) Select the Person1 Given Name 2 match and go to the Options tab.
b) Change the value of the Compare data using: field to Word Similarity using the drop-down list.
c) Adjust the values of the match criteria according to the table above.
6. Adjust the match criteria for Social Security Number 1 to ignore records when their social security number values are blank.
a) For the Social Security Number 1 match criteria, set the option to ignore the values when both fields are blank. Find the field Both field blank operation and, using the drop-down list, choose the value IGNORE. If there are values for One field blank score and Both field blank score, delete them.
7. Move the new Social Security Number 1 criteria to be the first match criteria.
a) In the Transformation Options hierarchy, select the Social Security Number 1 match criteria under Level 1.
b) Use the Move Up button underneath Transformation Options to place the Social Security Number 1 match criteria in the first position.
c) Choose the OK button to close the Match Editor.
d) Use the Back icon in the Tool Bar to close the Match transform editor and return to the data flow workspace.
Task 7: Add a new output column Source_Priority to the Query transform, populated by an expression giving a higher priority to the source value ACCT.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
a) Double-click the Query transform to open the transform editor.
b) In the Schema Out area, right-click on the Source output column to select the option New output column and select Insert below. Enter Source_Priority as the name with a data type of int.
c) Select the output column Source_Priority and go to the Mapping tab. Select the Functions button, select the category Miscellaneous Functions, select the function IfThenElse, and choose Next.
d) For the “If” part of the function, choose the ... button to open the Smart Editor. Select the Data tab and open the node on the table Match_Target2. Scroll down to find and select the field Source.
e) Right-click on the field Source and choose Enter. In the Smart Editor, type in an equal sign = followed by the string 'ACCT'. Choose OK to return.
f) For the “Then” part of the function, type in 10.
g) For the “Else” part of the function, type in 20 and then select OK. The code should be as follows:

ifthenelse(MATCH_TARGET2.SOURCE = 'ACCT', 10, 20)
h) Choose Back to close the transform editor and return to the data flow work space.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
a) Double-click the Match transform to open the transform editor.
b) In the transform editor for the Match transform, add the Source_Priority column from the Schema In window to the Input tab in the parameters area to create a new entry.
c) Go to the Options tab and start the Match Editor.
d) In the Transform Options area on the left, select the node for Post Match Processing and then choose the Add Operation → Group Prioritization option.
e) In the Group Prioritization Editor, enter Priority as the Prioritization name. Then add these values on the Priority Order tab in the Priority fields area:

Input Field       Field Order
Source_Priority   Ascending
DC_SSN            Ascending
DC_Given_Name2    Ascending
AC_Unit_Num       Ascending

At this point, you have configured prioritization based on source. This assures that the records with a source value of ACCT take precedence over those with any other value.
f) For the prioritization group Priority, go to the Record Completeness tab and make the following settings:

Option                                    Value
Orders records on completeness of data   Select the checkbox
Define priority and penalty fields       Select the radio button
Record priority field                    SOURCE_PRIORITY
Choose the Apply button.
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
a) In the Group Prioritization Editor, go to the Blank penalty score area and set the Blank Penalty score for these fields:

Input Field       Blank Penalty
DC_SSN            1
DC_Given_Name2    1
AC_Unit_Num       1
Hint: You may need to choose OK to return to the Options tab and then select the Edit Options button a second time to permit these entries to appear.
b) In the navigation area on the left, select the node for Post Match Processing, select the node Priority, and use the Move Up button to move Priority before GroupStats1.
c) Choose OK to confirm your changes and close the Match Editor.
d) On the Output tab of the Match transform, select the check box for the field NameAddr_Level1_Priority_PRIORITY_VALUE. This outputs the field.
e) In the Schema Out pane, right-click on the output column NameAddr_Level1_Priority_PRIORITY_VALUE to select the option Properties and rename the output column to MC_PRIORITY. Choose OK.
f) Choose the Back button to close the transform editor and return to the data flow work space.
3. Add a new template target table called Match_Target3 in the DELTA datastore, connect it to the Match transform, and execute the job.
a) From the Tool Palette, select the icon for a template table and then select in the data flow workspace to insert the table. Use Match_Target3 as the name and DELTA as the datastore. Choose OK to create the table.
b) Connect the Match transform to the new target template table by selecting the Match transform, holding down the mouse button, dragging to the template table, and releasing the mouse button to create the link.
Task 9: Execute the batch job Beta_Customers_Match3_Job with the default parameters.
1. Execute the batch job Beta_Customers_Match3_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match3_Job to choose the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data.

Notice that the master match record for Gloria Peterson now has an MC_Priority of 12 and the subordinate has a score of 22. Recall that the master record is the record with the lowest priority value: priority and score are inversely proportional. Score (12) = Source_Priority (10) + DC_Given_Name2 blank penalty (1) + AC_Unit_Num blank penalty (1) + DC_SSN blank penalty (0).
Figure 62: Prioritizing Match Results
Lesson Summary
You should now be able to:
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
Related Information
• For more information about these individual reports, see the Data Services Management Console: Metadata Reports Guide.
Lesson: Consolidating Matching Records

Lesson Overview
Consolidation eliminates, filters, or combines duplicate records using configurable rules. Consolidation can build a “best record”, allowing the best field selection based on your priorities: source, frequency, completeness, recency, and so on. It can also build reference keys to track individual records and their associations across multiple databases.
Lesson Objectives
After completing this lesson, you will be able to:
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
Business Example
The Match/Consolidate functionality in Data Services helps drive critical business decisions by revealing complex and multidimensional relationships in the enterprise data. It helps ensure that customer relationships are complete and accurate, enabling you to produce more effective targeted campaigns and provide better customer service. Within Business Intelligence, Enterprise, or Middleware applications, Match/Consolidate’s ability to handle a variety of customer and business data elements will improve duplicate detection coverage, allowing you to leverage multiple data sources for your analytical or operational needs.
Consolidating Matching Records to Form a Best Record
A key component in most data consolidation efforts is salvaging data from records within match groups and consolidating or posting that data to a best record or to all matching records. Best record operations enable you to update information in record fields based on their membership in a match group, their priority, and the completeness of the records within that match group.
Figure 63: Match to Consolidation
With the best record operation, you can:
• Salvage useful data from matching records before discarding them. For example, when checking a driver license file against your main source file, you might pick up gender or date-of-birth data to add to your master record.
• Post updated data, such as the most recent phone number, to all of the records in a match group. You can choose to post data to the master record, to all the subordinate members of the match group, or to all members of the match group.
The operations you set up in the Best Record operation always start with the highest priority member of the match group (the master) and work their way down to the last subordinate, one at a time. This ensures that data can be salvaged from the higher-priority record to the lower-priority record.
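A rough plain-Python sketch of that walk (the record layout and the strategy/post callbacks are assumptions, not the actual Best Record interface):

def post_best_record(match_group, strategy, post):
    # match_group: records already sorted by priority, master first.
    # strategy(source, destination) decides whether to take action;
    # post(source, destination) copies the salvaged data.
    for source, destination in zip(match_group, match_group[1:]):
        if strategy(source, destination):
            post(source, destination)   # e.g. fill in a missing phone number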
Best record strategies act as a filter for taking action on other fields. There are several strategies to assist in setting up the best record operation quickly and easily:
• Date
Select a date field and determine the best record based on the oldest or most recent date.
• Length
Select a string field and determine the best record based on the shortest or longest string of data in the field.
• Non Blank
Select any field and determine the best record based on the completeness of data in the field.
• Priority Number
Select a numeric field and determine the best record based on the highest or lowest number.
• Priority String
Select a string field and determine the best record based on ascending or descending alphabetic order of the data.
• Custom
Base your strategy entirely on custom Python code. This opens the Python Expression editor, where you create the code.
If none of these strategies fits your project needs, create a custom best record operation using custom Python code.

Note: The Best Record Summary report shows statistics about best record processing, indicating configuration settings and the results of the posting. This information can be used to assist in fine-tuning configuration settings.
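For example, a custom Python strategy might prefer whichever record carries the longer, presumably more complete, phone number. The sketch below shows only the decision logic in plain Python; the actual Python Expression editor exposes its own record objects and API, so the function signature and field name here are assumptions:

def keep_source_value(src_record, dst_record):
    # Return True when the source (higher-priority) record's PHONE value
    # should be posted to the destination record.
    src = (src_record.get('PHONE') or '').strip()
    dst = (dst_record.get('PHONE') or '').strip()
    return len(src) > len(dst)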
Generating a Unique Identification Number
A unique ID refers to a field within your data that contains a unique value that is associated with a record or group of records. A unique ID is to data what a social security number (SSN) is to a person. It creates and tracks data relationships between multiple jobs. With the Unique ID operation, you can set your own starting ID during the first execution. For each subsequent execution, the unique ID is the
next sequential value based on the existing highest unique ID. You can set your own starting ID for new key generation or have it dynamically assigned based on existing data by determining where the highest unique ID from the previous run ended.
Figure 64: Unique Identification Numbers
When the unique ID operation combines the update source information with the master database information to form one source of match group information, the following options are available:

• Assign
The Assign option assigns a new ID to unique records that need one or assigns a new ID to all members of a group that do not have an ID. In addition, the assign operation copies an existing ID if a member of a destination match group already has an ID. Each record is assigned a value (see the sketch after this list):
– Records in a match group where one record had an input unique ID share the value with other records in the match group that had no input value. The first value encountered is shared. Order affects this: place a group prioritization post-match operation prior to the unique ID operation if you have a priority field that can be sequenced using ASCII order.
– Records in a match group where two or more records had different unique ID input values each keep their input value.
– Records that do not have an input unique ID value have the next available ID assigned to them.
• Combine
The Combine option combines the IDs of a match group if more than one ID is represented. For example, this would occur because John Smith and Mary Higgins are on the update source, but they are also listed in the master database. Each record is assigned a value:
– Records in a match group where two or more records have different unique ID input values share the first input value. Order affects this: place a group prioritization post-match operation prior to the unique ID operation if you have a priority field that can be sequenced using ASCII order.
– Records in a match group where the first record has a blank unique ID value while later records have a value all have blank output unique ID values.
– Records that do not have an input unique ID value have a blank unique ID on output, even if a prior record in the match group has an input unique ID value.
• AssignCombine
The AssignCombine option performs both an assign and a combine operation. Each record is assigned a value:
– Records that do not have an input unique ID value and are not found to match another record containing an input unique ID value have the next available ID assigned to them. These are “add” records that could be unique records, or could be matches but not to another record that had previously been assigned a unique ID value.
– Records in a match group where one or more records had an input unique ID with the same or different values share the first value encountered with all other records in the match group. Order affects this: place a group prioritization post-match operation prior to the unique ID operation if you have a priority field that can be sequenced using ASCII order.
• Delete
The Delete option deletes unique IDs from records that no longer need them, provided that they are not protected from being deleted. If you are using a file and are recycling IDs, this ID is added to the file. The output for the unique ID depends on whether an input field in that record has a value that indicates that the ID is protected:
–
234
If the unique ID protected field is not mapped as an input field, the unique ID on output is blank for every record, regardless of the fact that an ID came in or not. If the unique ID protected field is mapped as an input field, a value of N means that the record's input data is deleted in the output unique ID.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Consolidating Matching Records
–
•
If the unique ID protected field is mapped as an input field, a value other than N means that the record's input data is retained in the output unique ID field.
Split Changes a split group's unique records so that the records that do not belong to the same match group have a different ID. The record with the group's highest priority keeps its unique ID. The rest are assigned new unique IDs. Examples of this are when an adult child moves from his or her family home and establishes their own household or in the case of a divorce. –
–
–
Records in a match group where two or more records had different unique ID input values or blank values each retain their input value, filled or blank respective to the record. Records that do not have an input unique ID value and do not match any record with an input unique ID value record have a blank unique ID on output. Records that come in with the same input unique ID value that no longer are found as matches have the first record output with the input value and the subsequent record is assigned a new unique ID value.
Let us say you have two distinct tables of data and there is a unique identifier that relates a row in table A to a row in table B as the same customer. J. Smith is in table A and J. Smith is in table B. These are really two different people, yet you have the same unique ID assigned to both of them as though they were the same person. Now you do a more comprehensive match process between these two tables, and the fact is uncovered that these two rows are not duplicates at all, yet they currently have the same unique ID. The Split option splits these two rows into two different data collections, keeping the existing unique ID in one of them, and assigning a new unique ID to the other.
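To make the Assign rules concrete, the following minimal Python sketch mimics the behavior described above on records that already carry a match group number. It is an illustration of the rules only, not the Match transform's actual implementation; the record layout and function name are assumptions made for the example.

    def assign_unique_ids(records, next_id):
        """Mimic the Assign operation: records is a list of dicts with a
        'group' key (match group number, or None for a unique record) and a
        'uid' key (input unique ID, or None if the record arrived without one).
        This is an illustrative sketch, not the product's implementation."""
        # The first ID encountered in each match group is the one that is
        # shared, which is why a group prioritization operation is placed
        # before the unique ID operation.
        group_uid = {}
        for rec in records:
            if rec['group'] is not None and rec['uid'] is not None:
                group_uid.setdefault(rec['group'], rec['uid'])

        for rec in records:
            if rec['uid'] is not None:
                continue  # records with different input IDs each keep their own
            shared = group_uid.get(rec['group'])
            if shared is not None:
                rec['uid'] = shared  # share the first ID found in the match group
            else:
                rec['uid'] = next_id  # next available ID for new or unique records
                next_id += 1
        return next_id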
Assigning Unique IDs

When specifying which numbers to use for unique IDs, there are several methods available:

• You can use a file of your own to assign a sequential number to records, beginning at whatever number you want, provided it is a positive integer greater than 0.
• You can manually enter a starting unique ID value.
• You can send the starting unique ID through a field in your data source created using the Query transform. The starting unique ID is passed to the Match transform before the first new unique ID is requested. If no unique ID is received, the starting number defaults to 1. Use caution with the Field option: the field that you use must contain the unique ID value you want to begin the sequential numbering with. This means that each record you process must contain this field, and every record must have the same value in this field.
• You can write unique IDs that are dropped during delete processing back to a file to be used later.
• You can recycle your own IDs by entering them in a file using the XML tag <R></R>. For example:

   <R>477</R>
   <R>599</R>
   <R>814</R>
To assign unique IDs using a file:
1. In the Unique ID option group, select the Value from file option.
2. In the File option, set the file name and path. This file must be an XML file and must adhere to the following structure:

   <UniqueIdSession>
   <CurrentUniqueId>477</CurrentUniqueId>
   </UniqueIdSession>
To assign IDs using a constant:
1. In the Unique ID option group, select the Constant value option.
2. Set the Starting value option to the desired ID value.
To assign IDs using a field:
1. In the Unique ID option group, select the Field option.
2. In the Starting unique ID field option, select the field that contains the starting unique ID value.
To recycle unique IDs:
1. In the Unique ID option group, set the Processing operation option to Delete.
2. In the Unique ID option group, select the Value from file option.
3. In the File option, set the file name and path. This is the same file that you might use for assigning a beginning ID number.
4. Set the Recycle unique IDs option to Yes.
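As a rough illustration of how recycling can interact with sequential assignment, the sketch below hands out recycled IDs before drawing new ones. The state dictionary stands in for the contents of the unique ID file, and the reuse-before-increment order is an assumption made for the example, not a statement about the product's file handling.

    def next_unique_id(state):
        """state mirrors the unique ID file: a current high-water mark plus
        any recycled IDs written back by a Delete run (illustrative only)."""
        if state['recycled']:
            return state['recycled'].pop(0)  # reuse a recycled ID first
        state['current'] += 1                # otherwise issue the next new ID
        return state['current']

    state = {'current': 850, 'recycled': [477, 599, 814]}
    print([next_unique_id(state) for _ in range(5)])  # [477, 599, 814, 851, 852]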
Exercise 7: Create a Best Record

Exercise Objectives
After completing this exercise, you will be able to:
• Create a best record from the Beta Businesses USA_Customers table by using the DC_PHONE and DC_EMAIL columns.
Business Example
You have prioritized records so that ACCT records take precedence over PRSP records. You now want to define a best record action for match duplicates.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: Create a new batch job Beta_Customers_Match4_Job with a data flow Beta_Customers_Match4_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. In the Omega project, create a new batch job Beta_Customers_Match4_Job with a new data flow called Beta_Customers_Match4_DF.
Task 3: In the data flow Beta_Customers_Match4_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target3 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2. In the workspace for Beta_Customers_Match4_DF, add the Match_Target3 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.

Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target4.
1. Configure a Match_NameAddr transform to the data flow.
2. Add a new matching criterion based on the new input field DC_SSN.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion for Social_Security_Number1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
2. Adjust the value of the No match score field to 79 for the criteria Address Primary Name Match.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:

   Option                  Value
   No match score          84
   One field blank score   85
   Both field blank score  100

4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

   Option                    Value
   Match score               101
   One field blank score     0
   Both field blank score    0
   Initial adjustment score  90

5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:

   Option                    Value
   Match score               80
   No match score            79
   One field blank score     90
   Both field blank score    100
   Initial adjustment score  100
6. Adjust the match criteria for Social_Security_Number1 to ignore records when their social security number values are blank.
7. Move the new SSN criterion so that it is the first match criterion.
Task 7: Add a new output column Source_Priority to the Query transform populated by an expression giving a higher priority to the source value Acct.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field match Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
Task 9: Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
1. Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
Task 10: Create a new target template table, connect it to the Match transform, and execute the batch job Beta_Customers_Match4_Job with the default parameters.
1. Create a new template table called Match_Target4 in the DELTA datastore.
2. Execute the batch job Beta_Customers_Match4_Job with the default parameters.
Solution 7: Create a Best Record

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
   b) In the dialog box, enter your assigned User ID.
   c) Enter your password, which is the same as your User ID.
   d) Select the Log on button.
   e) In the list of repositories, select your repository DSREPO## where ## is the number portion of your User ID.
   f) Select the OK button.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                 Value
      Datastore name        DQSTANDALONE
      Datastore type        Database
      Database type         Microsoft SQL Server
      Database version      Microsoft SQL Server 2005
      Database server name  Supplied by the Instructor
      Database name         DQSTANDALONE
      User name             sourceuser (lower case)
      Password              sourcepass (lower case)

      Choose OK to finish.
   c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
   d) Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, right-clicking on them, and choosing the option Import from the menu:
      • AddressCleansed_USA
      • DataCleansed_USA
      • DataCleansed2_USA
      • Match_Target
      • Match_Target2
      • Match_Target3
      • Match_Target4
      Note: This database is a read-only database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                 Value
      Datastore name        DELTA
      Datastore type        Database
      Database type         Microsoft SQL Server
      Database version      Microsoft SQL Server 2005
      Database server name  Supplied by the Instructor
      Database name         DELTA## (where ## is the group number assigned by your Instructor)
      User name             student## or consult your instructor
      Password              student## or consult your instructor

      Choose OK to finish.
   c) You do not have to import any metadata.
      Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: Create a new batch job Beta_Customers_Match4_Job with a data flow Beta_Customers_Match4_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   a) From the Project menu, choose New → Project.
   b) When the Project New dialog box appears, enter Omega in the Project name field.
   c) Choose Create so that the new project appears in the Project area.
2. In the Omega project, create a new batch job Beta_Customers_Match4_Job with a new data flow called Beta_Customers_Match4_DF.
   a) In the Project area, right-click the project name and choose New Batch Job from the menu.
   b) Enter the name of the job as Beta_Customers_Match4_Job.
   c) Choose Enter to commit the change.
   d) Select the Data Flow icon in the Tool Palette.
   e) Right-click in the workspace where you want to add the data flow and choose Add New → Dataflow.
   f) Enter Beta_Customers_Match4_DF as the name.
   g) Choose Enter to commit the change.
   h) Double-click the Beta_Customers_Match4_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match4_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target3 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
   a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
   b) Open the node Tables and right-click on the table Match_Target3 to choose the option Edit.
   c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
   d) Choose OK to close the dialog box.
   e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match4_DF, add the Match_Target3 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
   a) In the Local Object Library, select the Datastores tab and then select the Match_Target3 table from the DQSTANDALONE datastore.
   b) Select and drag the object to the data flow workspace and in the context menu, choose the option Make Source.
   c) In the Tool Palette, select the icon for a Query transform and then click in the data flow workspace to add the Query transform.
   d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
   e) Double-click the Query transform to open the transform editor.
   f) In the transform editor, map all columns from the input schema to the output schema columns. Do this by selecting all the fields listed in the Schema In window and then dragging them to the Query object in the Schema Out window.
   g) Delete the following field in the Schema Out window by selecting it, right-clicking, and selecting Delete:
      SOURCE_PRIORITY
   h) Use the Back icon to close the transform editor.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
   a) In the data flow Beta_Customers_Match4_DF workspace, right-click on the Query transform to select the option Run Match Wizard.
   b) On the Match Selection Type window, select the option for a simple match selection with only one way to identify an individual. Then choose the Next button.
   c) Set the name for Match Set Name to NameAddr.
   d) Specify that the first criterion is Full Name by selecting the field from the drop-down list in the field Match On. Use the value Word similarity for the field Compare Using. Choose Enter.
   e) Enter the second criterion, Address, by selecting the field from the drop-down list in the field Match On. Use the value Word similarity for the field Compare Using. Choose Enter. Then choose Next.
   f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as seen below:

      Criteria                        Field
      PERSON1_FAMILY_NAME1            DC_FAMILY_NAME
      PERSON1_GIVEN_NAME1             DC_GIVEN_NAME1
      PERSON1_GIVEN_NAME1_MATCH_STD1  DC_GIVEN_NAME1_MATCH_STD1
      PERSON1_GIVEN_NAME1_MATCH_STD2  DC_GIVEN_NAME1_MATCH_STD2
      PERSON1_GIVEN_NAME1_MATCH_STD3  DC_GIVEN_NAME1_MATCH_STD3
      PERSON1_GIVEN_NAME2             DC_GIVEN_NAME2
      PERSON1_GIVEN_NAME2_MATCH_STD1  DC_GIVEN_NAME2_MATCH_STD1
      ADDRESS_PRIMARY_NAME            AC_PRIM_NAME
      ADDRESS_PRIMARY_NUMBER          AC_PRIM_NUM
      ADDRESS_SECONDARY_NUMBER        AC_UNIT_NUM

      Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If the fields are not mapped, select the field name in the Field column by using the drop-down list.
      Hint: You may need to select the entry More... to see more field entries.
      Choose Next.
   g) On the next window, define the break key by selecting AC_POSTCODE_FULL as the Upstream field. Set the start position to 1 and the length to 5.
   h) Close the Match Wizard configuration to return to the data flow workspace.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target4.
1. Configure a Match_NameAddr transform to the data flow.
   a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
   b) Double-click the Match_NameAddr transform to open the transform editor.
   c) Delete the following fields in the Schema Out window by selecting them, right-clicking, and choosing Delete:
      MC_GROUP_NUMBER
      MC_MATCH_SCORE
      MC_GROUP_COUNT
      MC_GROUP_RANK
      NAMEADDR_LEVEL1_GROUP_NUMBER
      NAMEADDR_LEVEL1_MATCH_SCORE
      NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_COUNT
      NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_RANK
      MC_PRIORITY
      DC_PHONE
      DC_EMAIL
   d) Rename the newly generated match columns as follows:

      Original Column Name                     New Column Name
      NameAddr_Level1_GROUP_NUMBER             MC_GROUP_NUMBER
      NameAddr_Level1_MATCH_SCORE              MC_MATCH_SCORE
      NameAddr_Level1_GroupStats1_GROUP_COUNT  MC_GROUP_COUNT
      NameAddr_Level1_GroupStats1_GROUP_RANK   MC_GROUP_RANK

   e) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.
      Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
   f) Add the DC_SSN column from the input schema to the Input tab in the parameters area to create a new entry.
2. Add a new matching criterion based on the new input field DC_SSN.
   a) In the transform editor, go to the Options tab and choose the button Edit Options... to invoke the Match Editor.
   b) In the Transform Options area, expand the nodes Matching → Level1 → Match Criteria if they are not already expanded. Notice that the match method is rule-based. The order of evaluation is important: the decision is made on the last match key unless a no-match is determined based on other keys, so the key that is most likely to determine a match or no-match should be evaluated first. In this case, it is the social security number. Make a note of the current Match Score and No Match Score for each criterion. Open the Person1_Given_Name2 match criteria and view the values for the options. To view the options, select the criteria and then select the Options tab.
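The evaluation order just described can be pictured with a small sketch. This is one simplified reading of the rule-based method as this lesson describes it (any earlier criterion can force a no-match; the match decision itself falls to the last criterion), not SAP's implementation; the criterion names and scores are taken from this exercise.

    def rule_based_decision(similarities, criteria):
        """similarities: criterion name -> similarity score (0-100).
        criteria: ordered list of (name, match_score, no_match_score).
        Illustrative sketch of the rule-based evaluation order only."""
        *earlier, (last, match_score, _) = criteria
        for name, _m, no_match in earlier:
            if similarities[name] <= no_match:
                return False          # an earlier key forces a no-match
        return similarities[last] >= match_score  # the last key decides

    # SSN first with scores 101/99: a differing SSN (similarity <= 99) forces
    # a no-match, while identical SSNs simply pass on to the next criterion.
    criteria = [('SSN', 101, 99), ('GivenName2', 80, 79)]
    print(rule_based_decision({'SSN': 100, 'GivenName2': 85}, criteria))  # True
    print(rule_based_decision({'SSN': 40, 'GivenName2': 85}, criteria))   # False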
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion for Social_Security_Number1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
   a) In the Match Editor, select the node Match Criteria and choose the Add Criteria button.
   b) Enter Social Security Number 1 as the Criteria Name.
   c) Select the tab Criteria Fields and open the node for Person. Scroll down to find and select the criteria field SOCIAL_SECURITY_NUMBER1.
   d) From the drop-down list, choose the field DC_SSN as the entry for the field Input Field Mapped Name.
   e) For the Social_Security_Number1 criteria, select the Options tab and enter the value of 101 for the field Match score and 99 for the field No match score.
   f) For the Social_Security_Number1 criteria, enter the value Field Similarity for the field Compare data using:.
2. Adjust the value of the No match score field to 79 for the criteria Address Primary Name Match.
   a) Select the Address Primary Name Match and go to the Options tab.
   b) Set the value of the No match score field to 79.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
   Option                  Value
   No match score          84
   One field blank score   85
   Both field blank score  100

   a) Select the Address Secondary Number match and go to the Options tab.
   b) Adjust the values of the match criteria according to the table above.
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

   Option                    Value
   Match score               101
   One field blank score     0
   Both field blank score    0
   Initial adjustment score  90

   a) Select the Person1 Given Name 1 match and go to the Options tab.
   b) Adjust the values of the match criteria according to the table above.
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:
   Option                    Value
   Match score               80
   No match score            79
   One field blank score     90
   Both field blank score    100
   Initial adjustment score  100

   a) Select the Person1 Given Name 2 match and go to the Options tab.
   b) Change the value of the Compare data using: field to Word Similarity using the drop-down list.
   c) Adjust the values of the match criteria according to the table above.
6. Adjust the match criteria for Social_Security_Number1 to ignore records when their social security number values are blank.
   a) For the Social_Security_Number1 match criteria, set the option to ignore the values when both fields are blank. Find the field Both field blank operation and, using the drop-down list, choose the value IGNORE. If there is a value for One field blank score and Both fields blank score, delete it.
7. Move the new SSN criterion so that it is the first match criterion.
   a) In the Transformation Options hierarchy, select the Social_Security_Number1 match criteria under Level 1.
   b) Choose the Move Up button underneath Transformation Options to place the SSN match criteria in the first position.
   c) Choose OK to close the Match Editor.
   d) Use the Back icon in the Tool Bar to close the Match transform editor and return to the data flow workspace.
Task 7: Add a new output column Source_Priority to the Query transform populated by an expression giving a higher priority to the source value Acct.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
   a) Double-click the Query transform to open the transform editor.
   b) In the Schema Out area, right-click on the Source output column to select the option New output column. Enter Source_Priority as the name with a data type of int. Then select Insert below to create the new output column.
   c) Select the output column Source_Priority and go to the Mapping tab. Choose the Functions button, select the category Miscellaneous Functions, select the function IfThenElse, and choose Next.
   d) For the "If" part of the function, type in the string 'ACCT'.
   e) For the "Then" part of the function, type in 10.
   f) For the "Else" part of the function, type in 20 and then choose OK. The code should be as follows:
      ifthenelse(MATCH_TARGET3.SOURCE = 'ACCT', 10, 20)
   g) Choose Back to close the transform editor and return to the data flow workspace.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field match Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
   a) Double-click the Match transform to open the transform editor.
   b) In the transform editor for the Match transform, add the Source_Priority column from the Schema In window to the Input tab in the parameters area to create a new entry.
   c) Go to the Options tab and start the Match Editor.
   d) In the navigation area on the left, open the node for Post Match Processing and choose Add Operation → Group Prioritization.
   e) In the Group Prioritization Editor, enter Priority as the Prioritization name. Then add these values on the Priority Order tab in the Priority fields area:

      Input Field      Field Order
      Source_Priority  Ascending
      DC_SSN           Ascending
      DC_Given_Name2   Ascending
      AC_Unit_Num      Ascending

      At this point, you have configured prioritization based on source. This ensures that records with a source value of ACCT take precedence over those with any other value.
   f) For the prioritization group Priority, go to the Record Completeness tab and make the following settings:

      Option                                   Value
      Orders records on completeness of data   Select the checkbox
      Define priority and penalty fields       Select the radio button
      Record priority field                    SOURCE_PRIORITY
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
   a) In the Group Prioritization Editor, go to the Blank penalty score area and set the Blank Penalty score for these fields:

      Input Field     Blank Penalty
      DC_SSN          1
      DC_Given_Name2  1
      AC_Unit_Num     1

      Hint: You may need to choose the OK button to return to the Options tab and then choose the Edit Options button a second time to permit these entries to appear.
   b) In the navigation area on the left, open the node for Post Match Processing, select the node Priority, and use the Move Up button to move Priority before GroupStats1.
   c) Choose OK to confirm your changes and close the Match Editor.
   d) On the Output tab of the Match transform, select the check box for the field NameAddr_Level1_Priority_PRIORITY_VALUE. This outputs the field.
   e) In the Schema Out pane, right-click on the output column NameAddr_Level1_Priority_PRIORITY_VALUE to select the option Properties and rename the output column to MC_PRIORITY. Select the OK button to return to the Options tab.
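Conceptually, the prioritization just configured behaves like a sort key per record: source priority first, then a completeness penalty for blank fields. The sketch below is only an illustration of that idea using the values from this exercise, not how Data Services computes it internally.

    BLANK_PENALTY = {'DC_SSN': 1, 'DC_GIVEN_NAME2': 1, 'AC_UNIT_NUM': 1}

    def priority_key(record):
        # Lower tuples sort first: SOURCE_PRIORITY 10 (ACCT) beats 20 (PRSP);
        # ties are broken by completeness (fewer blank penalized fields wins).
        penalty = sum(p for field, p in BLANK_PENALTY.items() if not record.get(field))
        return (record['SOURCE_PRIORITY'], penalty)

    def rank_match_group(group):
        """Return the match group ordered so the first record is the master."""
        return sorted(group, key=priority_key)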
Task 9: Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
1. Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
   a) Double-click the Match transform to open the transform editor.
   b) Go to the Input schema, select the input columns DC_Phone and DC_Email, and drag them to the Input tab in that order.
   c) Go to the Options tab and start the Match Editor.
   d) In the navigation area on the left, select the node for Post Match Processing, select the Add Operation button, and select the option Best Record.
   e) In the Best Record Editor, enter these values:

      Option Name                     Option Value
      Best Record Name                Phone
      Best Record Strategy            LENGTH
      Strategy Priority               LONGEST
      Strategy Field                  DC_PHONE
      Posting Destination             ALL
      Post only once per destination  YES

   f) In the Best record action fields section, use the drop-down list to select DC_PHONE as the entry for both the Source Field and Destination Field.
   g) On the Options tab, choose Edit Options.
   h) In the navigation area on the left, select the node for Post Match Processing, choose Add Operation, and select the option Best Record.
   i) In the Best Record Editor, enter these values:

      Option Name                     Option Value
      Best Record Name                Email
      Best Record Strategy            NON_BLANK
      Strategy Field                  DC_EMAIL
      Posting Destination             MASTER
      Post only once per destination  YES

      In the Best record action fields section, use the drop-down list to select DC_EMAIL as the entry for both the Source Field and Destination Field.
   j) Move the Phone rule above GroupStats1 by selecting Phone and using the Move Up button.
   k) Move the Email rule above GroupStats1 by selecting Email and using the Move Up button.
   l) Choose OK to confirm your changes and close the Match Editor.
   m) Remap the DC_Phone and DC_Email columns from the Output tab to the Output schema by selecting the checkboxes in the Output tab.
      Note: In order for the Best Record processing to function properly, fields must be remapped from the Output tab to the Output schema, not from the Input schema.
   n) Use the Back button in the Toolbar to close the transform editor and return to the data flow workspace.
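To see what the two best record operations just configured are doing, the sketch below applies the same two strategies to one match group: LENGTH/LONGEST posted to ALL records for the phone, and NON_BLANK posted to the MASTER only for the e-mail. It is a plain-Python illustration under those assumptions, not the transform's implementation; the convention that the master is the first record in the group is also an assumption for the example.

    def post_best_phone(group):
        """LENGTH/LONGEST strategy, posting destination ALL (sketch only)."""
        best = max((rec.get('DC_PHONE') or '' for rec in group), key=len)
        if best:
            for rec in group:
                rec['DC_PHONE'] = best       # longest phone posted everywhere

    def post_best_email(group):
        """NON_BLANK strategy, posting destination MASTER (sketch only)."""
        master, subordinates = group[0], group[1:]
        if not master.get('DC_EMAIL'):
            value = next((r['DC_EMAIL'] for r in subordinates if r.get('DC_EMAIL')), None)
            if value:
                master['DC_EMAIL'] = value   # first non-blank subordinate value

This mirrors the verification at the end of the exercise: a master record missing phone and e-mail data (Gloria Peterson) gains values from its subordinates, while a group already populated with identical phones and non-blank e-mails (Thomas Davenport) is left unchanged.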
Task 10: Create a new target template table, connect it to the Match transform, and execute the batch job Beta_Customers_Match4_Job with the default parameters.
1. Create a new template table called Match_Target4 in the DELTA datastore.
   a) From the Tool Palette, select the icon for the template table and click inside the data flow workspace. Enter the name Match_Target4 in the DELTA datastore.
   b) Choose OK to create the table.
   c) Connect the Match transform to the template table by selecting the Match transform, holding down the mouse button, and dragging to the template table. Once the template table is selected, release the mouse button to create the link.
2. Execute the batch job Beta_Customers_Match4_Job with the default parameters.
   a) In the Project area, right-click the batch job Beta_Customers_Match4_Job to select the option Execute. Choose OK to save all objects you have created.
   b) Choose OK to accept the default execution parameters.
   c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
   d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data. Examine the DC_Phone and DC_Email columns for the match duplicates of Thomas Davenport and Gloria Peterson to verify that the Best Record action was correctly applied. Initially, the master record for Gloria Peterson did not have phone and e-mail data; after the Best Record action, the phone and e-mail values for the master and subordinate records are identical. Initially, the matching records for Thomas Davenport had the same phone number and different e-mail addresses. After the Best Record operation, these values remain unchanged: the phone numbers are identical, and none of Thomas Davenport's records were missing e-mail data.
Figure 65: Best Record Results
Lesson Summary
You should now be able to:
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
Lesson: Using Advanced Match Strategies

Lesson Overview
Householding and associative matching are advanced match strategies available in Data Quality.
Lesson Objectives
After completing this lesson, you will be able to:
• Use consumer and corporate householding
• Use associative matching
Business Example
In examining the records in the Beta Businesses USA_Customers table, you find multiple levels of consumer (for example, individual and family accounts). You also find multiple levels of company (for example, corporation, subsidiary, and contact levels). You want to know how you can match records among these multiple levels.
Defining Householding
Householding is a process in which organizations hierarchically structure their data to obtain context-sensitive views of customers, suppliers, patients, and so on. The initial requirement is to provide clear and concise marketing data by identifying all individuals who reside at the same physical residence, hence the name householding. This enables execution of efficient and cost-saving campaigns at any level of granularity. In Master Data Management (MDM), where master data is shared by several disparate groups within an organization, it is crucial to have a hierarchical customer view. These hierarchies, or households, can be constructed to any number of levels.

From the standpoint of running the business, having data linked together properly is essential. Customer service can provide a complete 360-degree view of the customer base, providing an improved Customer Relationship Management (CRM) perspective. For example, sales can better understand the true cost of sale and cost of acquisition, and management can get a clear understanding of the key financial costs related to the various groups.
A householding strategy should be applied whenever matching business rules consist of multiple levels of consumer or corporate relationships (for example, individual and family accounts). For a corporation, householding may be applied at corporation, subsidiary, and contact levels.
Using Consumer and Corporate Householding
Consumer householding is used to identify when multiple individuals in a family or group live together in the same residence. When setting up consumer householding, there are Match transform configurations available to match on the following criteria:
• For situations where individuals sharing a residence may not belong to the same family or share a family name, you can use the ConsumerHouseholdResInd_MatchBatch configuration. This allows you to match on two criteria: residence based on similar address data, and individual based on similar name data.
• For situations where individuals sharing a residence are expected to share a family name, you can use the ConsumerHouseholdResFamInd_MatchBatch configuration. This allows you to match on three criteria: residence based on similar address data, family based on similar family name data, and individual based on similar given name and post name data.
In the sample output below, notice that there are four records that match at the residence level. Of those four records, two records match at the individual level (Elizabeth and Liz).
Figure 66: Advanced Match Strategies 1–Householding
Corporate householding is used when your matching business rules consist of multiple levels of corporate relationships, such as corporate matches, subsidiary matches, and contact matches. For these records, you can use the CorporateHouseholdFirmInd_MatchBatch configuration. This allows you to match on two criteria: corporation based on similar firm and address data, and contact based on similar name data.
Using Associative Matching
Associative matching is the process where records are compared between match sets. If a record belongs to more than one match set, all records for these match sets become members of a new associated match group. The objective of this type of matching is to uncover matches that cannot be found based on a single match criterion. Use associative matching any time you want to uncover hidden relationships. Associative matching is commonly used to identify people who may have multiple addresses for marketing and mailing efforts. For example, college students have one address during the academic school year and another permanent address.

The resulting groups of previous Match transforms can be combined in the Associate transform. This transform uses the generated group data from previous match processes and determines the overlapping results, such as name and address with name and SSN, as a single ID. This is commonly referred to as association matching.

The Associate transform accepts a group number field, generated by the Match transforms, for each match result that will be combined. The transform can then output a new associated group number. The Associate transform can operate either on all the input records or on one data collection at a time. The latter is needed for transactional support.
Figure 67: Advanced Matching Strategies 2 – Associative
In the sample output above, notice the matches for Isabel Garcia-Cruz. There are three records in match group 1 (based on name and address matching), two records each in match groups 2 and 3 (based on name and email), and two records in match group 4 (based on name and phone). Since there is an overlap between the name and address and name and email match groups for Isabel Garcia-Cruz, all records in those match groups are considered a match. Since there is an overlap between the name and email and name and phone match groups for Isabel Garcia-Cruz, all records in these match groups are also considered a match. Finally, since match group 2 (based on name and email) matches match group 1 (based on name and address), as well as match group 4 (based on name and phone), all records of these match sets can be associated to form one large associated match group. Without associative matching, the largest match group for Isabel Garcia-Cruz had three members. By performing associative matching, three hidden relationships were uncovered and a match set of six records was found.
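The group-merging step the Associate transform performs can be pictured as a connected-components problem: any record that carries group numbers from two match sets links those groups. The following minimal union-find sketch illustrates that idea; the record layout, field names, and function name are assumptions for the example, not the transform's API.

    def associate(records, group_fields):
        """records: list of dicts; group_fields: one group-number field per
        upstream match set (value None when the record matched nothing there).
        Illustrative sketch of association matching, not SAP's implementation."""
        parent = {}

        def find(x):
            while parent.setdefault(x, x) != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        # A record carrying group numbers from two match sets links those groups.
        for rec in records:
            groups = [(f, rec[f]) for f in group_fields if rec.get(f) is not None]
            for a, b in zip(groups, groups[1:]):
                parent[find(a)] = find(b)

        # Emit one associated group number per connected component.
        labels = {}
        for rec in records:
            groups = [(f, rec[f]) for f in group_fields if rec.get(f) is not None]
            if groups:
                rec['ASSOC_GROUP'] = labels.setdefault(find(groups[0]), len(labels) + 1)
            else:
                rec['ASSOC_GROUP'] = None
        return records

Applied to the Isabel Garcia-Cruz example, the overlaps between the name-and-address, name-and-email, and name-and-phone groups would place all six records in one connected component, and therefore in one associated match group.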
Lesson Summary
You should now be able to:
• Use consumer and corporate householding
• Use associative matching
Unit Summary
You should now be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
• Use consumer and corporate householding
• Use associative matching
Course Summary
You should now be able to:
• Complete an Address Cleanse transform
• Work with global address data
• Complete a Data Cleanse transform
• Set up matching using the Match Wizard
• Consolidate matching records to form a best record
Feedback
SAP AG has made every effort in the preparation of this course to ensure the accuracy and completeness of the materials. If you have any corrections or suggestions for improvement, please record them in the appropriate place in the course evaluation.