BODS30: Data Services - Data Quality Management
SAP BusinessObjects Data Services
Participant Handbook
Course Version: 96
Course Duration: 2 Day(s)
Material Number: 50104426
An SAP course - use it to learn, reference it for work
Copyright

Copyright © 2011 SAP AG. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.
Trademarks

•	Microsoft®, WINDOWS®, NT®, EXCEL®, Word®, PowerPoint® and SQL Server® are registered trademarks of Microsoft Corporation.
•	IBM®, DB2®, OS/2®, DB2/6000®, Parallel Sysplex®, MVS/ESA®, RS/6000®, AIX®, S/390®, AS/400®, OS/390®, and OS/400® are registered trademarks of IBM Corporation.
•	ORACLE® is a registered trademark of ORACLE Corporation.
•	INFORMIX®-OnLine for SAP and INFORMIX® Dynamic Server™ are registered trademarks of Informix Software Incorporated.
•	UNIX®, X/Open®, OSF/1®, and Motif® are registered trademarks of the Open Group.
•	Citrix®, the Citrix logo, ICA®, Program Neighborhood®, MetaFrame®, WinFrame®, VideoFrame®, MultiWin® and other Citrix product names referenced herein are trademarks of Citrix Systems, Inc.
•	HTML, DHTML, XML, XHTML are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.
•	JAVA® is a registered trademark of Sun Microsystems, Inc.
•	JAVASCRIPT® is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.
•	SAP, SAP Logo, R/2, RIVA, R/3, SAP ArchiveLink, SAP Business Workflow, WebFlow, SAP EarlyWatch, BAPI, SAPPHIRE, Management Cockpit, mySAP.com Logo and mySAP.com are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other products mentioned are trademarks or registered trademarks of their respective companies.
Disclaimer

THESE MATERIALS ARE PROVIDED BY SAP ON AN "AS IS" BASIS, AND SAP EXPRESSLY DISCLAIMS ANY AND ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WITH RESPECT TO THESE MATERIALS AND THE SERVICE, INFORMATION, TEXT, GRAPHICS, LINKS, OR ANY OTHER MATERIALS AND PRODUCTS CONTAINED HEREIN. IN NO EVENT SHALL SAP BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR PUNITIVE DAMAGES OF ANY KIND WHATSOEVER, INCLUDING WITHOUT LIMITATION LOST REVENUES OR LOST PROFITS, WHICH MAY RESULT FROM THE USE OF THESE MATERIALS OR INCLUDED SOFTWARE COMPONENTS.
About This Handbook

This handbook is intended to complement the instructor-led presentation of this course and serve as a source of reference. It is not suitable for self-study.
Typographic Conventions

American English is the standard used in this handbook. The following typographic conventions are also used.

Type Style: Description
Example text: Words or characters that appear on the screen. These include field names, screen titles, pushbuttons as well as menu names, paths, and options. Also used for cross-references to other documentation, both internal and external.
Example text: Emphasized words or phrases in body text, titles of graphics, and tables.
EXAMPLE TEXT: Names of elements in the system. These include report names, program names, transaction codes, table names, and individual key words of a programming language, when surrounded by body text, for example SELECT and INCLUDE.
Example text: Screen output. This includes file and directory names and their paths, messages, names of variables and parameters, and passages of the source text of a program.
Example text: Exact user entry. These are words and characters that you enter in the system exactly as they appear in the documentation.
<Example text>: Variable user entry. Pointed brackets indicate that you replace these words and characters with appropriate entries.
Icons in Body Text

The following icons are used in this handbook. Their meanings, in order, are:
•	For more information, tips, or background
•	Note or further explanation of a previous point
•	Exception or caution
•	Procedures
•	Indicates that the item is displayed in the instructor's presentation
Contents

Course Overview
	Course Goals
	Course Objectives
Unit 1: Describing Data Services
	Defining Data Services
Unit 2: Using Data Quality Transforms
	Data Quality Transforms
Unit 3: Using Address Cleanse Transforms
	Using the Address Cleanse Transforms
Unit 4: Using Data Cleanse Transforms
	Using the Data Cleanse Transform
Unit 5: Matching and Consolidating Data
	Using the Match Wizard
	Configuring the Match Transform
	Performing Post-Match Processing
	Consolidating Matching Records
	Using Advanced Match Strategies
Course Overview

SAP BusinessObjects Data Quality enables you to parse, cleanse, standardize, consolidate, and enhance records. In this two-day course, you will learn about cleansing address and firm data, and about matching and consolidating records. As a business benefit, being able to create efficient data quality projects lets you use the transformed data to help improve operational and supply chain efficiencies, enhance customer relationships, create new revenue opportunities, and optimize return on investment from enterprise applications.
Target Audience

This course is intended for the following audiences:
•	Consultants and project team members responsible for implementing data quality projects
•	Customer users responsible for administering and managing data quality projects
Course Prerequisites

Required Knowledge:
•	Thorough knowledge of ETL (Extraction, Transformation, and Loading) data processes
•	Thorough knowledge of defining source and target metadata in the SAP BusinessObjects Designer
•	Thorough knowledge of creating batch jobs
•	Thorough knowledge of using functions, scripts, and variables
•	Thorough knowledge of using Data Integrator Platform transforms
Course Goals

This course will prepare you to:
•	Use Data Quality transforms to parse, standardize, cleanse, enhance, and match records
•	Consolidate matching records to form a best record
Course Objectives

After completing this course, you will be able to:
•	Complete an Address Cleanse transform
•	Work with global address data
•	Complete a Data Cleanse transform
•	Set up matching using the Match Wizard
•	Consolidate matching records to form a best record
Unit 1: Describing Data Services

Unit Overview

Data Services is a graphical interface for creating and staging jobs for data integration and data quality purposes. After completing this lesson, you will be able to:
•	Describe the purpose of Data Services
•	Describe the Data Services architecture
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit Objectives

After completing this unit, you will be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit Contents

Lesson: Defining Data Services
Lesson: Defining Data Services

Lesson Overview

Data Services is a graphical interface for creating and staging jobs for data integration and data quality purposes.
Lesson Objectives

After completing this lesson, you will be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Business Example

For reporting in SAP NetWeaver Business Warehouse, your company needs data from diverse data sources, such as SAP systems, non-SAP systems, the Internet, and other business applications. You should therefore examine the technologies that SAP NetWeaver BW offers for data acquisition.
Describing Data Services

BusinessObjects Data Services provides a graphical interface that allows you to easily create jobs that extract data from heterogeneous sources, transform that data to meet the business requirements of your organization, and load the data into a single location.

Note: Although Data Services can be used for both real-time and batch jobs, this course covers batch jobs only.

Data Services combines both batch and real-time data movement and management with intelligent caching to provide a single data integration platform for information management from any information source and for any information use.
Figure 1: Data Services Architecture
This unique combination allows you to:
•	Stage data in an operational data store, data warehouse, or data mart.
•	Update staged data in batch or real-time modes.
•	Create a single environment for developing, testing, and deploying the entire data integration platform.
•	Manage a single metadata repository to capture the relationships between different extraction and access methods and provide integrated lineage and impact analysis.
Data Services performs three key functions that can be combined to create a scalable, high-performance data platform. It:
•	Loads Enterprise Resource Planning (ERP) or enterprise application data into an operational data store (ODS) or analytical data warehouse, and updates in batch or real-time modes.
•	Creates routing requests to a data warehouse or ERP system using complex rules.
•	Applies transactions against ERP systems.
Data mapping and transformation can be defined using the Data Services Designer graphical user interface. Data Services automatically generates the appropriate interface calls to access the data in the source system. For most ERP applications, Data Services generates SQL optimized for the specific target database (Oracle, DB2, SQL Server, Informix, and so on). Automatically-generated, optimized code reduces the cost of maintaining data
warehouses and enables you to build data solutions quickly, meeting user requirements faster than other methods (for example, custom-coding, direct-connect calls, or PL/SQL). Data Services can apply data changes in a variety of data formats, including any custom format using a Data Services adapter. Enterprise users can apply data changes against multiple back-office systems singularly or sequentially. By generating calls native to the system in question, Data Services makes it unnecessary to develop and maintain customized code to manage the process. You can also design access intelligence into each transaction by adding flow logic that checks values in a data warehouse or in the transaction itself before posting it to the target ERP system.
The Data Services Architecture

Data Services relies on several unique components to accomplish the data integration and data quality activities required to manage your corporate data. Data Services includes the standard components:
•	Designer
•	Repository
•	Job Server
•	Engines
•	Access Server
•	Adapters
•	Real-time Services
•	Address Server
•	Cleansing Packages, Dictionaries, and Directories
•	Management Console
This diagram illustrates the relationships between these components.

The Data Services Designer

Data Services Designer is a Windows client application used to create, test, and manually execute jobs that transform data and populate a data warehouse. Using Designer, you create data management applications that consist of data mappings, transformations, and control logic.
Figure 2: Data Services Designer Interface
You can create objects that represent data sources, and then drag, drop, and configure them in flow diagrams. Designer allows you to manage metadata stored in a local repository. From the Designer, you can also trigger the Job Server to run your jobs for initial application testing.

The Data Services Repository

The Data Services repository is a set of tables that holds user-created and predefined system objects, source and target metadata, and transformation rules. It is set up on an open client/server platform to facilitate sharing metadata with other enterprise tools. Each repository is stored on an existing Relational Database Management System (RDBMS).
Figure 3: Data Services Repository
There are three types of repositories:
•	A local repository (known in Designer as the Local Object Library) is used by an application designer to store definitions of source and target metadata and Data Services objects.
•	A central repository (known in Designer as the Central Object Library) is an optional component that can be used to support multiuser development. The Central Object Library provides a shared library that allows developers to check objects in and out for development.
•	A profiler repository is used to store information that is used to determine the quality of data.
The Data Services Job Server

Each repository is associated with at least one Data Services Job Server, which retrieves the job from its associated repository and starts the data movement engine. The data movement engine integrates data from multiple heterogeneous sources, performs complex data transformations, and manages extractions and transactions from ERP systems and other sources. The Job Server can move data in batch or real-time mode and uses distributed query optimization, multithreading, in-memory caching, in-memory data transformations, and parallel processing to deliver high data throughput and scalability.
Figure 4: Data Services Architecture – Job Server
While designing a job, you can run it from the Designer. In your production environment, the Job Server runs jobs triggered by a scheduler or by a real-time service managed by the Data Services Access Server. In production environments, you can balance job loads by creating a Job Server Group (multiple Job Servers), which executes jobs according to overall system load.

Data Services provides distributed processing capabilities through the Server Groups. A Server Group is a collection of Job Servers that each reside on different Data Services server computers. Each Data Services server can contribute one, and only one, Job Server to a specific Server Group. Each Job Server collects resource utilization information for its computer. This information is utilized by Data Services to determine where a job, data flow, or sub-data flow (depending on the distribution level specified) should be executed.

The Data Services Engines

When Data Services jobs are executed, the Job Server starts Data Services engine processes to perform data extraction, transformation, and movement. Data Services engine processes use parallel processing and in-memory data transformations to deliver high data throughput and scalability.

The Data Services Cleansing Packages
Figure 5: Data Services Architecture – Cleansing Packages
The Data Quality Cleansing Package files provide referential data for the Data Cleanse and Address Cleanse transforms to use when parsing, standardizing, and cleansing name and address data. Cleansing Packages enhance the ability of Data Cleanse to accurately process various forms of global data by including language-specific reference data and parsing rules. Directories provide information on addresses from postal authorities; cleansing package files are used to identify, parse, and standardize data such as names, titles, and firm data. Dictionaries also contain acronym, match standard, gender, capitalization, and address information.

The Data Services Management Console
The Data Services Management Console provides access to these features:
•	Administrator: Administer Data Services resources, including:
	– Scheduling, monitoring, and executing batch jobs
	– Configuring, starting, and stopping real-time services
	– Configuring Job Server, Access Server, and repository usage
	– Configuring and managing adapters
	– Managing users
	– Publishing batch jobs and real-time services via web services
	– Reporting on metadata
•	Auto Documentation: View, analyze, and print graphical representations of all objects as depicted in Data Services Designer, including their relationships, properties, and more.
•	Data Validation: Evaluate the reliability of your target data based on the validation rules you create in your Data Services batch jobs to quickly review, assess, and identify potential inconsistencies or errors in source data.
•	Impact and Lineage Analysis: Analyze end-to-end impact and lineage for Data Services tables and columns, and SAP BusinessObjects Business Intelligence platform objects such as universes, business views, and reports.
•	Operational Dashboard: View dashboards of status and performance execution statistics of Data Services jobs for one or more repositories over a given time period.
•	Data Quality Reports: Use data quality reports to view and export SAP Crystal Reports for batch and real-time jobs that include statistics-generating transforms. Report types include job summaries, transform-specific reports, and transform group reports. To generate reports for Match, US Regulatory Address Cleanse, and Global Address Cleanse transforms, you must enable the Generate report data option in the Transform Editor.
Other Data Services Tools

There are also several tools to assist you in managing your Data Services installation.
The Data Services Repository Manager allows you to create, upgrade, and check the versions of local, central, and profiler repositories. The Data Services Server Manager allows you to add, delete, or edit the properties of Job Servers. It is automatically installed on each computer on which you install a Job Server. Use the Server Manager to define links between Job Servers and repositories. You can link multiple Job Servers on different machines to a single repository (for load balancing) or each Job Server to multiple repositories (with one default) to support individual repositories (for example, separating test and production environments). The License Manager displays the Data Services components for which you currently have a license. The Metadata Integrator allows Data Services to seamlessly share metadata with SAP BusinessObjects Intelligence products. Run the Metadata Integrator to collect metadata into the Data Services repository for Business Views and Universes used by SAP Crystal Reports, Desktop Intelligence documents, and Web Intelligence documents. Note: These other tools are not discussed in this course.
Defining Data Services Objects

Data Services provides you with a variety of objects to use when you are building your data integration and data quality applications.
Figure 6: Data Services Object Types
Data Services Objects
In Data Services, all entities you add, define, modify, or work with are objects. Some of the most frequently used objects are:
•	Projects
•	Jobs
•	Work flows
•	Data flows
•	Transforms
•	Scripts
This diagram shows some common objects.
Figure 7: Data Services Objects
All objects have options, properties, and classes. Each can be modified to change the behavior of the object. Options control the object. For example, to set up a connection to a database, the database name is an option for the connection. Properties describe the object. For example, the name and creation date describe what the object is used for and when it became active. Attributes are properties used to locate and organize objects. Classes define how an object can be used. Every object is either reusable or single-use. Single-use objects appear only as components of other objects. They operate only in the context in which they were created. You cannot copy single-use objects.
A reusable object has a single definition and all calls to the object refer to that definition. If you change the definition of the object in one place, and then save the object, the change is reflected in all other calls to the object. Most objects created in Data Services are available for reuse. After you define and save a reusable object, Data Services stores the definition in the repository. You can then reuse the definition as necessary by creating calls to it. For example, a data flow within a project is a reusable object. Multiple jobs, such as a weekly load job and a daily load job, can call the same data flow. If this data flow is changed, both jobs call the new version of the data flow. You can edit reusable objects at any time independent of the current open project. For example, if you open a new project, you can open a data flow and edit it. However, the changes you make to the data flow are not stored until you save them.

Defining the Relationship Between Objects

Jobs are composed of work flows and/or data flows:
•	A work flow is the incorporation of several data flows into a sequence.
•	A data flow is a process that transforms source data into target data.
Figure 8: Data Services Object Relationships
A work flow orders data flows and the operations that support them. It also defines the interdependencies between data flows. For example, if one target table depends on values from other tables, you can use the work flow to specify the order in which you want Data Services to populate the tables. You can also use work flows to define strategies for handling errors that occur during project execution, or to define conditions for running sections of a project.

A data flow defines the basic task that Data Services accomplishes, which involves moving data from one or more sources to one or more target tables or files. You define data flows by identifying the sources from which to extract data, the transformations the data should undergo, and targets.

Defining Projects and Jobs

A project is the highest-level object in Designer. Projects provide a way to organize the other objects you create in Designer. A job is the smallest unit of work that you can schedule independently for execution. A project is a single-use object that allows you to group jobs. For example, you can use a project to group jobs that have schedules that depend on one another or that you want to monitor together. Projects have these characteristics:
•	Projects are listed in the Local Object Library.
•	Only one project can be open at a time.
•	Projects cannot be shared among multiple users.
The objects in a project appear hierarchically in the project area. If a plus sign (+) appears next to an object, you can expand it to view the lower-level objects contained in the object. Data Services displays the contents as both names and icons in the project area hierarchy and in the workspace. Jobs must be associated with a project before they can be executed in the project area of Designer.

Using Work Flows

Jobs with data flows can be developed without using work flows. However, one should consider nesting data flows inside of work flows by default. This practice can provide various benefits. Always using work flows makes jobs more adaptable to additional development and/or specification changes. For instance, if a job initially consists of four data flows that are to run sequentially, they could be set up without work flows. But what if specification changes require that they be merged into another job instead? The developer would have to replicate their sequence correctly in the other job. If these had been initially added to a work flow, the developer could then have simply
copied that work flow into the correct position within the new job. There would be no need to learn, copy, and verify the previous sequence. The change can be made more quickly and with greater accuracy.

Even if there is one data flow per work flow, there are benefits to adaptability. Initially, it may have been decided that recovery units are not important, the expectation being that if the job fails, the whole process could simply be rerun. However, as data volumes tend to increase, it may be determined that a full reprocessing is too time-consuming. The job may then be changed to incorporate work flows to benefit from recovery units that bypass reprocessing of successful steps. However, these changes can be complex and can consume more time than allotted for in a project plan. It also opens up the possibility that units of recovery are not properly defined. Setting these up during initial development, when the full analysis of the nature of the processing is at hand, is preferred.

Note: This course focuses on creating batch jobs using database Datastores and file formats.
Using the Data Services Designer

The Data Services Designer interface allows you to plan and organize your data integration and data quality jobs in a visual way. Most of the components of Data Services can be programmed with this interface.

Describing the Designer Window

The Data Services Designer interface consists of a single application window and several embedded supporting windows. The application window contains the menu bar, toolbar, Local Object Library, project area, tool palette, and workspace.

Using the Local Object Library

The Local Object Library gives you access to the object types listed in the table below. The table shows the tab on which the object type appears in the Local Object Library and describes the Data Services context in which you can use each type of object.

You can import objects to and export objects from your Local Object Library as a file. Importing objects from a file overwrites existing objects with the same names in the destination Local Object Library. Whole repositories can be exported in either .atl or .xml format. Using the .xml file format can make repository content easier for you to read. It also allows you to export Data Services objects to other products.

Using the Tool Palette
The tool palette is a separate window that appears by default on the right edge of the Designer workspace. You can move the tool palette anywhere on your screen or dock it on any edge of the Designer window. The icons in the tool palette allow you to create new objects in the workspace. Icons are disabled when they are not valid entries for the diagram open in the workspace. To show the name of each icon, hold the cursor over the icon until the tool tip for the icon appears.

When you create an object from the tool palette, you are creating a new definition of an object. If a new object is reusable, it is automatically available in the Local Object Library after you create it. If you select the data flow icon from the tool palette and define a new data flow called DF1, you can later drag that existing data flow from the Local Object Library and add it to another data flow called DF2.

Using the Workspace

When you open a job or any object within a job hierarchy, the workspace becomes active with your selection. The workspace provides a place to manipulate objects and graphically assemble data movement processes. These processes are represented by icons that you drag and drop into a workspace to create a diagram. This diagram is a visual representation of an entire data movement application or some part of a data movement application. You specify the flow of data by connecting objects in the workspace from left to right in the order you want the data to be moved.
Lesson Summary

You should now be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit Summary

You should now be able to:
•	Define Data Services objects
•	Use the Data Services Designer interface
Unit 2: Using Data Quality Transforms

Unit Overview

Data Quality transforms are used to parse, standardize, cleanse, enhance, match, and consolidate your records. After completing this lesson, you will be able to:
•	Describe Data Quality transforms
Unit Objectives

After completing this unit, you will be able to:
•	Define Data Quality processes
•	Use Data Quality transforms
•	Configure Data Quality transforms
Unit Contents

Lesson: Data Quality Transforms
Exercise 1: Create Source and Target Datastores
Lesson: Data Quality Transforms

Lesson Overview

SAP BusinessObjects Data Services offers several Data Quality and Address transforms, which allow you to save time and resources in the production of mailings.
Lesson Objectives

After completing this lesson, you will be able to:
•	Define Data Quality processes
•	Use Data Quality transforms
•	Configure Data Quality transforms
Business Example

Undeliverable mail generates a considerable cost to the mailer: wasted effort, wasted materials, and wasted time. Two of the major reasons for undeliverable mail are incorrect addresses and mail recipients moving to a new address. Incorrect addresses are the result of missing apartment or suite numbers, incorrect or missing street numbers, missing box numbers, incorrect street names, and incorrect recipients. On average, in one year in the United States, 46 million people move and over one million new addresses are created. According to the United States Postal Service (USPS) National Postal Forum dated May 17-20, 2009, the reason "Customer has Moved" is responsible for over 75% of undeliverable mail.

Consider one mailing of 100,000 addresses sent at the full rate. The postage for a 1-ounce piece of mail is 44¢, and the average automation discount is 5¢. A customer would save $5,000 in postal service discounts with address cleansing. If a customer also deduplicates the list and the list has 2% duplicates, they could save an additional $780 on postage. If the piece costs 50¢ to create, the customer could save an additional $1,000 in creation/printing costs. Imagine if this were a monthly or weekly billing!
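The arithmetic behind these savings is easy to reproduce. The Python sketch below simply recomputes the figures taken from the scenario above:

```python
# Reproduce the mailing-cost example above.
pieces = 100_000             # addresses in the mailing
full_rate = 0.44             # postage for a 1-ounce piece
automation_discount = 0.05   # average per-piece automation discount
duplicate_rate = 0.02        # share of duplicate records in the list
creation_cost = 0.50         # cost to create/print one piece

postal_savings = pieces * automation_discount                   # $5,000
duplicates = int(pieces * duplicate_rate)                       # 2,000 pieces
postage_saved = duplicates * (full_rate - automation_discount)  # $780
creation_saved = duplicates * creation_cost                     # $1,000

print(f"Automation discounts:    ${postal_savings:,.0f}")
print(f"Postage saved by dedupe: ${postage_saved:,.0f}")
print(f"Creation/printing saved: ${creation_saved:,.0f}")
```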
Describing the Data Quality Framework

The term "Data Quality" is used often, but what does it mean?
Figure 9: What Is Data Quality?
The most common name and address data quality errors are seen below.
Figure 10: What Are Common Data Quality Errors?
Before you can implement an effective data quality project, you must first understand the data quality framework, which is a continuous cycle of activities that starts with measuring:

1. Measuring
Figure 11: Data Quality – Assessment
Start the process by using a data profiling application to quantify the number and types of defects in your data.
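Inside Data Services this step is handled by the Data Profiler. The hypothetical Python sketch below only illustrates the kind of measurement involved: counting missing values and summarizing the character patterns found in a column.

```python
import re
from collections import Counter

def profile_column(values):
    """Count empty values and summarize character patterns in a column."""
    empties = sum(1 for v in values if v is None or str(v).strip() == "")
    # Reduce each value to a pattern: digits become 9, letters become A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        for v in values
        if v is not None and str(v).strip()
    )
    return {"rows": len(values), "empties": empties,
            "top_patterns": patterns.most_common(3)}

# Inconsistent postal codes show up as divergent patterns.
print(profile_column(["54494", "10030-2342", "5449A", "", None]))
# {'rows': 5, 'empties': 2,
#  'top_patterns': [('99999', 1), ('99999-9999', 1), ('9999A', 1)]}
```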
2. Analyzing

Figure 12: Data Quality – Measure and Analyze
Once the data has been measured, assess the nature and cause of data defects. This information enables you to inform stakeholders of your findings and develop a data quality strategy.

3. Parsing
Figure 13: Data Quality – Cleansing
After the data has been assessed, identify and isolate the individual elements that make up the data.

4. Standardizing
Figure 14: Data Cleansing – Parse, Standardize, and Correct
Next, normalize and format the parsed data according to business rules and third-party referential data.

5. Cleansing

After standardization, verify and cleanse the data according to business rules and third-party referential data.

6. Enhancing
Figure 15: Data Quality – Enhancement
Append additional information that may increase the value of the data based on referential data.
Figure 16: Data Enhancement 1
An example of data enhancement can be seen in the graphic below.
Figure 17: Data Enhancement 2
7. Matching

Once the data has been fully cleansed, perform the matching process to identify duplicate records within multiple tables or databases. Matching after cleansing increases the likelihood of discovering overlaps.

8. Consolidating

Once matches have been identified, consolidate the data to populate a best record in a match set or update all records in a match set.

9. Continuous Monitoring

Use data profiling software to measure ongoing data quality scores and provide alerts when data scores fall outside of defined thresholds.

Where Is Data Quality Used?

Data quality is needed throughout an organization and should be used everywhere that data is managed. This includes not only batch processing of data extracted from a current system, but also transactional processing at the point of entry to the system.
Figure 18: Where Does Data Quality Fit?
The following application areas can benefit from the implementation of data quality:

•	Data Warehousing/Business Intelligence: Reporting; Analysis; Metrics; Business intelligence
•	Enterprise Resource Planning/Supply Chain: Supplier management
•	Customer Relationship Management (Call Centers, SFA): Direct mail/customer communication; Campaign management; Loyalty program management; Cross and up sales; Lifetime value; 360-degree view
•	Consumer and Corporate Householding: Multi-level customer, supplier, and patient databases; Cross selling for expanded offerings; Identification of multiple organizations as a single company
•	Fraud Detection/Compliance: BASEL II; Sarbanes-Oxley Act
•	Distribution Projects: Distribution of goods; Publishers; Printers; Service bureaus
•	Business Process Automation: Real-time decisions; Order entry; Web forms; Self service
•	Mergers/Acquisitions/Divestitures: Market bounce/EU
Defining Data Quality Processes

You can use Data Services to improve the quality of your data by using Data Quality transforms to parse, standardize, cleanse, enhance, and match records.

Parsing Data

Parsing identifies data and breaks it down into its component parts. It rearranges data elements in a single field or moves multiple data elements from a single data field to multiple discrete fields. For example, a name can be parsed into new fields such as prename, first name, last name, and postname.
Input Record:
	Mr. Dan R. Smith, Jr., CPA
	Account Mgr.
	Jones Inc.
	Dept. of Accounting
	PO Box 567
	Biron, WI
	54494

Output Fields:
	Prename: Mr.
	First Name: Dan
	Middle Name: R.
	Last Name: Smith
	Maturity Postname: Jr.
	Other Postname: CPA
	Title: Account Mgr.
	Firm: Jones Inc.
	Department: Dept. of Accounting
	Locality: Biron
	Region: WI
	Primary Number: 567
	Postal Code: 54494
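To make the mechanics concrete, here is a deliberately simplified Python sketch of the same idea. It is not the Data Cleanse transform, which relies on cleansing packages and far richer rules, and the prename and maturity word lists are illustrative assumptions:

```python
PRENAMES = {"mr.", "mrs.", "ms.", "dr."}   # illustrative subset only
MATURITY = {"jr.", "sr.", "ii", "iii"}     # illustrative subset only

def parse_name(text):
    """Break a single name field into discrete output fields."""
    parsed = {}
    tokens = [t.strip(",") for t in text.split()]
    if tokens and tokens[0].lower() in PRENAMES:
        parsed["Prename"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in MATURITY:
        parsed["Maturity Postname"] = tokens.pop()
    if tokens:
        parsed["First Name"] = tokens[0]
        parsed["Last Name"] = tokens[-1]
        if len(tokens) > 2:
            parsed["Middle Name"] = " ".join(tokens[1:-1])
    return parsed

print(parse_name("Mr. Dan R. Smith, Jr."))
# {'Prename': 'Mr.', 'Maturity Postname': 'Jr.', 'First Name': 'Dan',
#  'Last Name': 'Smith', 'Middle Name': 'R.'}
```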
Standardizing Data

Even the most stringent data entry controls can permit data with an unacceptable level of variability to enter your data store. To make your records more consistent, you can use data quality to standardize customer and operational data. Formats, abbreviations, acronyms, punctuation, greetings, casing, order, and pattern matching are examples of elements you can control.

Input Record:
	Purchase order: PO345654
	Purchase date: 3-01-02
	Part number: R20113245500
	Part desc: 3" diameter, 1 @ 52 ft

Output Fields:
	Purchase order: 34-5654
	Purchase date: 03-01-2002
	Part number: R2-0113-245500
	Diameter: 3"
	Length: 52'
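The rules behind such standardization can be pictured as small formatting functions. The Python sketch below is illustrative only; the function names and rules are invented for the example, and the actual transform applies configurable business rules and referential data:

```python
import re

def standardize_part_number(value):
    """Reformat a collapsed part number as XX-XXXX-XXXXXX."""
    chars = re.sub(r"\W", "", value)
    return f"{chars[:2]}-{chars[2:6]}-{chars[6:]}"

def standardize_date(value):
    """Normalize M-DD-YY style dates to MM-DD-YYYY (assumes 20xx)."""
    month, day, year = re.split(r"[-/]", value)
    if len(year) == 2:
        year = "20" + year
    return f"{int(month):02d}-{int(day):02d}-{year}"

print(standardize_part_number("R20113245500"))  # R2-0113-245500
print(standardize_date("3-01-02"))              # 03-01-2002
```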
Cleansing Data
Data with incorrect elements is known as "dirty data". Cleansing dirty data involves correcting it and adding missing elements, such as directions and primary types in addresses. Cleansing can occur on a wide variety of data. Depending on the type of data, you can remove or correct incorrect data using sophisticated algorithms and rules in conjunction with referential data. For example, you can use address information obtained from a national postal service to correct address data.
Mark Kessler 117–138 St W Harlem NY 10030
Output Records Name:
Mark Kessler
Address:
117 Odell Clark Pl
Locality:
New York
Region:
NY
Postal Code:
10030-2342
Enhancing Data

Data can increase in value when additional related data is attached to it. Data Quality can:

•	Provide full international postal codes. For example, ZIP+4 can be appended to the primary five-digit postal code for US addresses. When these additional values are used on mailing labels, they assist postal services to route the mail more quickly and efficiently.
•	Append other valuable information obtained from the directory services. For example, you can attach assignment, delivery, quality, geocensus, and geospatial codes and information. In addition, you can append match name standards, gender codes, and customized greetings to also increase the value of that record.
•	Assign longitude and latitude information to records. By assigning latitude and longitude information to records with a physical mailing address, your data immediately becomes more valuable. Address interpolation allows you to look at trends in your data based on these geographic demographics. Marketing initiatives by geography (for example, targeted marketing, predictive modeling) become immediately more feasible.
•	Append other geospatial information used for census, marketing, and insurance. Information codes can be appended for the following: US Census Bureau Minor Civil Division (MCD) or Census County Division (CCD) codes; Federal Information Processing Standard (FIPS) place, state, and country codes; US Census Tract and Block Numbering Area (BNA) codes; Core-Based Statistical Area (CBSA) codes; Metropolitan Statistical Area (MSA) codes; and Section codes for US Census information. Some of these codes can be used to match to demographic coding databases for census purposes, insurance rating territories, and so on for US address data.

The record below is an example of geocoding information that has been appended to a record containing an address:
Input Record:
	Margaret Smith-Kline, Ph.D.
	Future Electronics
	101 Avenue of the Americas
	New York, NY 10013-1933
	(222) 922-9922

Appended Information:
	Address Latitude: 40.723175
	Address Longitude: -74.004970
	Centroid Latitude: 40.723195
	Centroid Longitude: -74.004977
	Match quality: Highest quality address
	FIPS State Code: 36 New York
	FIPS County Code: 061 New York
	FIPS Place Code: 51000 New York
	MCD Code: 44919
	CBSA Code: 35620
	Metro Code: 5600
	Section Code: 0051001012
A number of enhancements are available for your data:

Status and Error Coding

Status and error codes indicate what has changed or is incorrect with an address. They enable you to identify issues with the data and determine the best way to resolve them.

Enhanced Line of Travel (eLOT)

Enhanced Line of Travel (eLOT) is available for the USA Regulatory Address Cleanse transform only. It is used for sorting to achieve mailing discounts. The original line of travel (LOT) narrowed the mail carrier's delivery route to the block face level (ZIP+4 level) by discerning whether an address resided on the odd or even side of a street or thoroughfare. eLOT narrows the mail carrier's delivery route walk sequence to the house (delivery point) level. This allows mailings to be sorted on a more precise level.

Geocoding
Geocodes provide latitude and longitude processing to a roof or block level. Geocoding assigns this information to records with a physical mailing address. It enables you to look at trends in your data based on these geographic demographics.

Delivery Point Validation (DPV)

Delivery Point Validation (DPV) determines if the address is deliverable. With DPV, you can identify addresses that are undeliverable as addressed and whether an address is a Commercial Mail Receiving Agency (CMRA). DPV can be useful in the following areas:
•	Mailing: Screen out undeliverable-as-addressed (UAA) mail and reduce mailing costs.
•	Information quality: Verify an address down to the individual house, suite, or apartment, rather than block face, to increase the data's level of accuracy.
•	Increased assignment rate: Using DPV tiebreak mode to resolve a tie when other tie-breaking methods are not conclusive may increase assignment rates.
•	Preventing mail-order fraud: Verify valid delivery addresses and Commercial Mail Receiving Agencies (CMRA), and eliminate shipping of merchandise to individuals who place fraudulent orders.
LACS Link (Rural Route Conversion)

LACS Link is available for USA records with the USA Regulatory Address Cleanse transform only and is required for CASS certification. It converts emergency addresses. LACS Link updates addresses when the physical address does not move but the address has changed, such as when the municipality changes rural route addresses to street-name addresses. Rural route conversions make it easier for police, fire, ambulance, and postal personnel to locate a rural address. LACS Link also converts addresses when streets are renamed or post office boxes renumbered.

Residential Delivery Indicator (RDI)

The Residential Delivery Indicator (RDI) feature is available in the USA Regulatory Address Cleanse transform. RDI determines whether a given address is for a residence. This matters because, unlike the United States Postal Service (USPS), some delivery services charge more for residential deliveries. You can use RDI if you are processing your data for CASS certification or if you are processing in a noncertified mode. In addition, RDI does not require that you use DPV processing.

Matching and Consolidating Data
Matching is used once the data has been parsed, standardized, cleansed, and enhanced to determine whether records refer to the same entity, such as a customer. Once matches are identified, consolidation salvages data from records in a match set to create a “best” record or post data to all the records in a match set based on the priority of the records. Input Record Ms Margaret Smith–Kline Ph.D. Future Electronics
Consolidated Record Name:
Ms. Margaret Smith-Kline Ph.D.
Company name:
Future Electronics Co. LLC
SSN:
001-12-4367
Hire date:
5/23/2003
Address:
101 Avenue of the Americas, New York NY 10013-1933
Phone:
(222) 922-9922
Email:
maggie.kline@future_electronics.com
101 Avenue of the Americas New York NY 10013–1933 maggie.kline@future_electronics.com May 23, 2003 Maggie Smith Future Electronics Co. LLC 101 6th Ave. Manhattan, NY 10012 maggie.kline@future_electronics.com 001–12–4367 Ms. Peg Kline Future Elect. Co. 101 6th Ave.
2011
© 2011 SAP AG. All rights reserved.
35
Unit 2: Using Data Quality Transforms
BODS30
New York NY 10013 01–12–4367 (222) 922–9922 5/23/03
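The flow from match set to best record can be sketched in a few lines of Python. This is only a schematic: difflib stands in for the Match transform's tunable, per-field comparison rules, and the 0.7 threshold and the priority values are invented for the example.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Crude string similarity standing in for real match criteria."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    {"priority": 1, "name": "Ms. Margaret Smith-Kline Ph.D.",
     "firm": "Future Electronics", "ssn": None, "phone": None},
    {"priority": 2, "name": "Maggie Smith",
     "firm": "Future Electronics Co. LLC", "ssn": "001-12-4367",
     "phone": None},
    {"priority": 3, "name": "Ms. Peg Kline",
     "firm": "Future Elect. Co.", "ssn": None,
     "phone": "(222) 922-9922"},
]

# Records whose firm names are similar enough form one match set.
anchor = records[0]
match_set = [r for r in records if score(r["firm"], anchor["firm"]) > 0.7]

# Consolidate: walk the set in priority order and keep the first
# non-empty value seen for each field, salvaging a "best" record.
best = {}
for record in sorted(match_set, key=lambda r: r["priority"]):
    for field, value in record.items():
        if value is not None and field not in best:
            best[field] = value
print(best)
```

Run as-is, the highest-priority record supplies the name while the SSN and phone number are salvaged from the lower-priority duplicates, mirroring the consolidated record shown above.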
Describing the Components that Support Data Quality

Data Quality transforms are supported by several unique components in the architecture of Data Services.

Country-specific engines process Australian, Canadian, Japanese, and USA address data. An EMEA engine processes European data, and the Global Address engine processes the additional ISO countries.

The Address Server is used to process European addresses using the Global Address Cleanse transform. It provides access to detailed address line information for most European countries. Note that the Address Server is resource-intensive and should be disabled until required. It takes time to initialize, which may impact performance when it is first started.

Cleansing Packages are one of the four types of repositories. They enhance the ability of Data Services to accurately process various forms of global data by including language-specific reference data and parsing rules. Cleansing packages are used in conjunction with rule files to identify and parse name, title, and firm data. They contain entries for words and phrases. Each entry specifies how the word or phrase might be used. You can tailor a cleansing package to better suit your needs.

Directories are used to define how to cleanse address data. Optional USA directories are available for specialized address data. The US Postal Service requires that the software disable itself when a user attempts to use expired US directories.
Describing Data Quality Transforms

Data Quality transforms are used to improve the quality of your data. They are used to parse, standardize, cleanse, enhance, match, and consolidate data. The following Data Quality transforms are available on the Transforms tab of the Local Object Library:
•	Associate: Works downstream from Match transforms to provide a way to combine, or associate, the match results by using the Match transform-generated Group Number fields.
•	Country ID: Parses input data and then identifies the country of destination for each record.
•	Data Cleanse: Identifies, parses, standardizes, and enhances name, title, and firm data, phone numbers, Social Security numbers, dates, and e-mail addresses. Assigns gender, adds prenames, generates Match standards, and creates personalized greetings. Can also be used for international, operational, and product data.
•	Global Address Cleanse: Identifies, parses, validates, and corrects global address data, such as primary number, primary name, primary type, directional, secondary identifier, and secondary number.
•	Global Suggestion List: Offers a way to complete and populate addresses with minimal data, and can make suggestions for possible matches. Is ideal for use in call center and other transactional environments to cleanse and standardize data at the point of entry.
•	Match: Matches data based on user-defined business rules.
•	USA Regulatory Address Cleanse: Identifies, parses, validates, and corrects USA address data according to the USA Coding Accuracy Support System (CASS). Outputs many useful codes and creates USPS Form 3553. Can also run in non-certified mode.
•	User Defined: Provides custom processing in a data flow using the full Python scripting language. User-Defined transforms should only be used for those tasks that cannot be accomplished with other transforms or with the Data Services scripting language in the Query transform.

	Note: The User-Defined transform lacks scalability and performance compared to other transforms and the Data Services scripting language in the Query transform. Use the User-Defined transform only for functions that extend beyond these capabilities.
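To give a sense of what such custom processing looks like, the sketch below shows per-record logic of the kind one might embed in a User-Defined transform. The record access shown (a plain Python dict) is a simplification made for illustration; the transform exposes its own record and collection objects rather than this interface:

```python
def process_record(record):
    """Flag records whose phone number carries a premium-rate prefix.

    `record` is a plain dict here; inside the actual transform you
    would read and write fields through its record API instead.
    """
    phone = (record.get("PHONE") or "").replace("-", "").replace(" ", "")
    record["PREMIUM_FLAG"] = "Y" if phone.startswith("900") else "N"
    return record

print(process_record({"NAME": "Dan Smith", "PHONE": "900-555-0199"}))
# {'NAME': 'Dan Smith', 'PHONE': '900-555-0199', 'PREMIUM_FLAG': 'Y'}
```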
Using Transform Configurations

Transform configurations are customized versions of base Data Quality transforms. They use preconfigured input fields, output fields, and options that include best practices, and can be used in multiple data flows. Once you place an instance of the transform configuration in a data flow, you can override the preset defaults. You can also create your own transform configurations, either by replicating an existing transform configuration or creating a new one from scratch.

Note: Transform configurations are only available for Data Quality transforms as part of the Data Services platform. This feature does not apply to Data Integrator or Platform transforms.
Using the Transform Editor

When you are working with Data Quality transforms, the parameters area of the Transform Editor provides access to some unique functionality. For each tab, you can choose to filter to show all available options, only those that are currently in use, or those that are designated as "best practice" for that particular transform configuration.

Note: When you create your own transform configurations, you can specify which fields and options are best practice to take advantage of these filters.
•	Input tab: Displays the available field names that are recognized by the transform. Mapping input fields to field names that the transform recognizes tells the transform how to process that field.
•	Options tab: Contains business rules that determine how the transform processes your data. Each transform has a different set of available options. If you change an option value from its default value, a green triangle appears next to the option name to indicate that you made an override.
•	Output tab: Displays the field names that can be output by the transform. Data cleansing transforms can generate fields in addition to the input fields that the transform processes, so you can output many fields.
Configuring Data Quality Transforms

The primary difference between configuring Data Quality transforms and configuring Platform transforms is that Data Quality transforms have passthrough columns, as opposed to columns that are all used by the transform, as is the case with Platform transforms. You can map passthrough, input, and output columns in the Transform Editor by dragging the columns from one pane to another.

Passthrough columns are not affected by the transform, but are available to other transforms downstream. To map a passthrough column, click and drag it from the input schema area to the output schema area.

Input columns are used by the transform to generate parsed, standardized, and cleansed output. To map an input column, click and drag it from the input schema area to the appropriate field in the Input tab. Columns for which a mapping type has been identified are mapped automatically.

Output columns contain parsed, standardized, and cleansed data from the transform. To map a column to the output schema, select the check box on the Output tab. To replace an existing field with the new data, drag the column from the Output tab to the column in the output schema area and select Remap Column from the popup menu.
Exercise 1: Create Source and Target Datastores

Exercise Objectives

After completing this exercise, you will be able to:
•	Create Datastores and import metadata for the Alpha Acquisitions, Beta Businesses, Delta, HR Data Mart, and Omega databases
Business Example

You have been hired as a Data Services designer for Alpha Acquisitions. Alpha has recently acquired Beta Businesses, an organization that develops and sells software products and related services. In an effort to consolidate and organize the data, and simplify the reporting process for the growing company, the Omega data warehouse is being constructed to merge the data for both organizations, and a separate data mart is being developed for reporting on Human Resources data. You also have access to a database for staging purposes called Delta. To start the development process, you must create datastores and import the metadata for all of these data sources.
Task: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1.	Log in to the Data Services Designer.
2.	In your Local Object Library, create a new source Datastore for the BETA Businesses database.
3.	In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
4.	In your Local Object Library, create a new Datastore for the DELTA staging database.
Solution 1: Create Source and Target Datastores Task: Create Datastores and import metadata for the Beta Businesses and Delta databases. 1.
2.
Log in to the Data Services Designer. a)
From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b)
In the dialog box, enter your assigned User ID.
c)
Enter your password, which is the same as your User ID.
d)
Choose the Log on button.
e)
In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f)
Choose OK.
In your Local Object Library, create a new source Datastore for the BETA Businesses database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options: Field
Value
Datastore name
BETA
Datastore type
Database
Database type
Microsoft SQL Server
Database version
Microsoft SQL Server 2005
Database server name
Supplied by the Instructor
Database name
BETA
User name
sourceuser (lower case)
Password
sourcepass (lower case)
Continued on next page
42
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Data Quality Transforms
      Choose OK to finish.

   c) In the Local Object Library, right-click the BETA datastore to select the Open option from the context menu. The tables from the BETA database are displayed in the Metadata window on the right.
   d) Import the metadata for the BETA database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • source.addrcodes
      • source.categories
      • source.country
      • source.customer
      • source.employees
      • source.orderdetails
      • source.orders
      • source.products
      • source.region
      • source.suppliers
      • source.usa_customers

      Note: This database is a read-only database.

   e) View the data for the USA_Customers table and confirm that Jane Hartley from Planview Inc. is the first customer record by right-clicking the table in the Local Object Library and choosing the View data option.

3. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DQSTANDALONE
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DQSTANDALONE
      User name              sourceuser (lower case)
      Password               sourcepass (lower case)

      Choose OK to finish.

   c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
   d) Import the metadata for the DQSTANDALONE database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • AddressCleansed_USA
      • DataCleansed_USA
      • DataCleansed2_USA
      • Match_Target
      • Match_Target2
      • Match_Target3
      • Match_Target4

      Note: This database is a read-only database.

4. In your Local Object Library, create a new Datastore for the DELTA staging database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DELTA
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DELTA## (where ## is the group number assigned by your instructor)
      User name              student## or consult your instructor
      Password               student## or consult your instructor

      Choose OK to finish.

   c) You do not have to import any metadata.

      Note: This database is a read/write database and will be used as the target for your template tables.
Lesson Summary
You should now be able to:
• Define Data Quality processes
• Use Data Quality transforms
• Configure Data Quality transforms
Unit Summary
You should now be able to:
• Define Data Quality processes
• Use Data Quality transforms
• Configure Data Quality transforms
Unit 3: Using Address Cleanse Transforms

Unit Overview
This unit describes the address cleansing capabilities of Data Services. After completing this unit, you will be able to:
• Describe the Address Cleanse transforms

Unit Objectives
After completing this unit, you will be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data

Unit Contents
Lesson: Using the Address Cleanse Transforms
Exercise 2: Use the Address Cleanse Transform
Lesson: Using the Address Cleanse Transforms

Lesson Overview
The Address Cleanse transforms parse, correct, complete, and standardize address components. Address Cleanse contains engines to process Australian, Canadian, Japanese, and USA address data. An EMEA engine processes European data, and the Global Address engine processes the additional International Organization for Standardization (ISO) countries. For example, postal codes can be corrected and added for some countries with the Global Address Cleanse and USA Regulatory Address Cleanse transforms.
Lesson Objectives
After completing this lesson, you will be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data
Business Example
Undeliverable mail generates a considerable cost to the mailer: wasted effort, wasted materials, and wasted time. Two of the major reasons for undeliverable mail are incorrect addresses and mail recipients moving to a new address. Incorrect addresses are the result of missing apartment or suite numbers, incorrect or missing street numbers, missing box numbers, incorrect street names, and incorrect recipients. On average, in one year in the United States, 46 million people move and over 1,000,000 new addresses are created. According to the United States Postal Service (USPS) National Postal Forum of May 17-20, 2009, the reason "Customer has Moved" is responsible for over 75% of undeliverable mail. Consider one mailing of 100,000 addresses sent at the full rate. The postage for a 1 ounce piece of mail is 44¢, and the average automation discount is 5¢, so a customer would save $5,000 in postal service discounts with address cleansing. If the customer also deduplicates the list and the list has 2% duplicates, they could save an additional $780 on postage. If each piece costs 50¢ to create, the customer could save an additional $1,000 in creation and printing costs. Imagine if this were a monthly or weekly billing!
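The arithmetic behind these savings can be laid out explicitly. The following Python sketch reproduces the example's figures; the rates are the example's 2009-era values, not current USPS rates:

    # Worked version of the mailing-cost example above.
    pieces = 100_000              # addresses in one mailing
    full_rate = 0.44              # postage per 1 oz piece, full rate
    automation_discount = 0.05    # average automation discount per piece
    duplicate_rate = 0.02         # share of the list that is duplicated
    creation_cost = 0.50          # cost to create/print one piece

    # Address cleansing qualifies the whole mailing for the automation discount.
    postal_savings = pieces * automation_discount                   # $5,000.00

    # De-duplication removes pieces that would have been mailed at the
    # discounted rate, and that no longer need to be created at all.
    duplicates = int(pieces * duplicate_rate)                       # 2,000 pieces
    postage_saved = duplicates * (full_rate - automation_discount)  # $780.00
    creation_saved = duplicates * creation_cost                     # $1,000.00

    print(f"Postal discount savings: ${postal_savings:,.2f}")
    print(f"Duplicate postage saved: ${postage_saved:,.2f}")
    print(f"Creation/printing saved: ${creation_saved:,.2f}")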
Understanding the Business Need for Address Cleanse Transforms
Figure 19: Why Cleanse Addresses?
There are two primary reasons for cleansing your address data:

1. You can reduce costs from misdirected mail. Compliance with a defined set of standards can enable your organization to attain certification that allows you to take advantage of discounts offered by many national postal organizations, such as the United States Postal Service.

2. You can properly prepare your data for matching. When comparing two addresses to see if they match, the smallest data entry error could keep you from finding matches. You must have standardized data before you look for matches. Address cleansing can parse an address down into its smallest component parts (for example, separating the house number from the street name) so that your matching process can compare individual items and improve match results.
Figure 20: Standardization: USPS Address Preferences
The various address components need to be parsed before address cleansing can take place. For example:
Figure 21: Parsing Address Components
Address cleansing tasks include:

• Parsing address components (such as locality, region, and postal code) and generating discrete fields for improved match results. The level of parsing and correction varies by country, and the correction depends on the amount of input address information available in the record. (A sketch of this idea follows this list.)
• Standardizing address fields based on business rule options set in the transform, correcting the format of address components, and changing the way the address line looks. For example, Data Services can add or strip punctuation and abbreviate or spell out address components.
• Verifying that the locality, region, and postal code agree with one another. If a record contains only a locality and region, Data Services can add the postal code and vice versa, depending on the country.
• Appending other information, such as an address quality measurement indicating whether the region and postal code agree with the corrected address (for USA, Canada, and Australia) and demographic information (if the GeoCensus option is purchased).
• Identifying undeliverable addresses, such as vacant lots and condemned buildings (USA records only).
• Assigning diagnostic codes to help determine why addresses are not assigned or how they are corrected.
• Generating address-related reports.
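To make the parsing idea concrete, here is a minimal Python sketch of component parsing for a simple US-style address line. It is an illustration only, not the Data Services parser, which is driven by postal directories and country-specific rules rather than a single pattern:

    import re

    # Toy parser for simple US-style address lines; the group names echo
    # the discrete output fields mentioned in this lesson.
    ADDRESS_PATTERN = re.compile(
        r"^(?P<primary_number>\d+)\s+"            # house number, e.g. 100
        r"(?P<primary_name>[A-Za-z ]+?)\s+"       # street name, e.g. Main
        r"(?P<primary_type>St|Ave|Rd|Blvd|Dr)"    # street type abbreviation
        r"(?:\s+(?:Apt|Ste|Unit)\s+(?P<unit_number>\w+))?$"  # optional unit
    )

    match = ADDRESS_PATTERN.match("100 Main St Apt 3")
    if match:
        print(match.groupdict())
        # {'primary_number': '100', 'primary_name': 'Main',
        #  'primary_type': 'St', 'unit_number': '3'}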
Preparing your Input Data for Address Cleanse Transforms
Figure 22: Preparing Input Data
You can cleanse address data that is stored in discrete, multiline, and multiline hybrid address line formats; before setting up the cleansing process, you must identify which of these formats your data uses. The discrete format is a database record format in which pieces of data appear in exactly the same field arrangement in every record. For this reason, this option is the best performing format.
Figure 23: Discrete Line Format
Note: The discrete line formats are not recommended for processing multi-country databases. They are more appropriate for use with USA, Canadian, and Australian addresses.
Fields      Address1                     Address2       Address3
Firm        Micro Elektronic Ges.m.b.H                  Dupong S.A.
Address1    Baumschulengasse 250         Nybrovej 120   225 rue de Bragance
Locality1   Wein                         Lyngby         Luxembourg
Region1
Postcode    1010                         2800           1449
Country     Germany                      Denmark        Luxembourg
The multiline format is a database record format in which address data is not consistently located in the same arrangement in all records. That is, data items float among fields. For example, field one can contain the name for one record, the company for another, and the address for another.
Figure 24: Multiline Format
Fields      Address1                   Address2                      Address3
Multiline1  Lawrence Robinson          Peter Grant
Multiline2  Information Services Plc   United Car Rental, Ltd        Gran Via de Carles S.A
Multiline3  Cornwall House             3150 S. Larkin Rd., Suite 10  Ibarra Campillo
Multiline4  56 Henley Road             Etobicoke ON M9C4T9           48010 Bilbao
Multiline5
Multiline6  London
Multiline7  NW2Y 5DM
Multiline8
Country     United Kingdom             Canada                        Spain
The multiline hybrid format is a record format that contains multiline and discrete fields.
Figure 25: Multiline Hybrid Format
Fields      Address1          Address2                  Address3
Multiline1  Jonas Lancaster
Multiline2  Woodworks Ltd     Gran Via de Carles S.A.   Akzo Chemicals
Multiline3  Wetherby House    Ibarra Cambillo 16-3-A    Moreelsepk 24
Multiline4  20 Oakley Road
Multiline5
Multiline6
Multiline7
Locality1                     Bilbao
Locality2   Harlow
Region1     Essex
Postcode    CM19 5AE          48010
Lastline                                                3511 EP Utrecht
Country     United Kingdom    Spain                     Netherlands
Defining the Address Cleanse Transforms

The Address Cleanse transforms identify, parse, validate, standardize, and correct address data. An extensive number of new fields are generated by the Address Cleanse transforms. Address type, primary type, primary number, unit number, locality, region, and postal code are just a few of the commonly used output fields. If you perform both address cleansing and data cleansing, the Address Cleanse transform should be placed before the Data Cleanse transform in the data flow. There are four available Address Cleanse transforms:
Global Address Cleanse
  Cleanses your address data from any of the supported countries (not for US certification). You must use one or more of the Global Address Cleanse transform configurations: Australia (AMAS), Canada (SERP), Japan, multicountry, or USA. With this transform, you can create a Statement of Address Accuracy (SERP) report as well as an Australian Address Matching Processing (AMAS) report.

USA Regulatory Address Cleanse
  Cleanses your US address data for certification. You can also use any of the add-on options: Delivery Point Validation (DPV), Enhanced Line of Travel (eLOT), Early Warning System (EWS), Locatable Address Conversion System (LACSLink), Residential Delivery Indicator (RDI), suggestion lists (not for certification), or Z4Change. With this transform, you can create a USPS Form 3553.

Global Suggestion List
  Offers suggestions for possible address matches for your global address data. This transform is used for transactional processing and does not standardize addresses. In a project, use a Country ID transform before this transform.

Country ID
  Identifies the country of destination for the record and outputs an ISO code. Use this transform before the Global Suggestion Lists transform (it is not necessary before the Global Address Cleanse transform or the USA Regulatory Address Cleanse transform).
Note: This course focuses only on the Global Address Cleanse and USA Regulatory Address Cleanse transforms.
Describing the Global Address Cleanse Transform

The Global Address Cleanse transform cleanses your address data from any of the supported countries. You must use one or more of the Global Address Cleanse transform configurations to accomplish certification for Australia (AMAS), Canada (SERP), Japan, or multi-country records. The Global Address Cleanse transform does not support CASS certification or produce a USPS Form 3553. If you want to certify your USA address data, you must use the USA Regulatory Address Cleanse transform, which supports CASS.

Input fields
There are recognized Data Services input fields that you can use in the Global Address Cleanse transform. Input fields may vary based on the engine(s) that you enable:

• Australia (A)
• Canada (C)
• Japan (J)
• EMEA (E)
• Global Address (G)
• USA (U)
See the fields listed on the Input tab in the transform editor to view each field's properties. Input fields tell the engine which data elements to expect; for example, Locality1 tells the engine which column contains the city or locality information.

Options
The Global Address Cleanse transform offers several option groups.

Report and Analysis options
The Generate Report Data option specifies whether to generate report data for this transform:
• Yes: Generates report data for this transform.
• No: Turns off report data generation.
If you do not need to generate reports (during testing, for example), set this option to No to improve performance.

Reference files
Reference files are directories required by the Global Address Cleanse transform to process your data. Set the location for the individual countries based on whether the engine is activated in your job.
The substitution variable for reference files is $$RefFilesAddressCleanse. Use this substitution variable if you have installed the directories in the default reference data directory, that is, if you accepted the default installation directory when you installed Data Services. If you installed to a location that was not the default location, you can change the substitution variable dynamically.

Country ID Options
This required option group specifies whether or not to use Country ID processing.

Country ID Mode
  Specifies whether to always use the specified Country Name or to run Country ID processing. These options are available:
  • Constant: Assumes all of your input data is for the specified Country Name and does not run Country ID processing. Choose this option only if all of your data is from one country. This option may save processing time.
  • Assigned: Runs Country ID processing. Choose this option if the input data is from more than one country, for example, if one of these scenarios applies:
    – You use two or more of the engines and your input addresses contain country data.
    – You use only one engine and your input source contains many addresses that cannot be processed by that engine. Any addresses that cannot be processed are not sent to the engine, and the transform uses the Country_Name option value as the default country.

Country Name
  Specifies the country of destination. These options are available:
  • None: Select this value when the Country ID Mode is set to Assigned and you do not want a default country to be set when the country cannot be identified.
  • Special considerations:
    – If the Country ID Mode is set to Constant, choose the country of destination from the Country Name list. The transform assumes that all of your data is for this country. Note: You cannot choose None if the Country ID Mode is set to Constant.
    – If the Country ID Mode is set to Assigned, choose a country name to be used when Country ID processing cannot identify a country.
    – If Country Name is set to None, the address is sent to the Default engine.

Script Code
  Specifies the ISO four-character script code for your data. Unless you are processing Japanese data, choose LATN.
Engines
Assigns the engines that you want to use with the Global Address Cleanse transform.
The Global Address Cleanse transform must have at least one engine enabled to process your data:
• Yes: Activates the engine for this transform.
• No: Deactivates the engine for this transform.

Specify which engines to use with this transform:
• Australia
• Canada
• EMEA
• Global Address
• Japan
• USA

Country Standardization Options
These options are found under Standardization Options → Country Options. This option group contains the country-specific options for standardizing your address data.

Suggestion List Options
These options apply to both the Global Address Cleanse and USA Regulatory Address Cleanse transforms. This option group specifies which suggestion list components to output:
• Yes: Outputs the component.
• No: Does not output the component.

Note: For the USA Regulatory Address Cleanse transform configuration, these options are listed under Suggestion List Options → Output.
Output fields
There are several output fields that can be used with the Global Address Cleanse transform. Each output field is available based on the engine(s) you enable. Generated_Field_AddrClass and Generated_Field_Class values are available for each field. These values indicate the data component to be output. Similar to the Locality1 input field, the corresponding output field contains the official locality or city name.
Describing the USA Regulatory Address Cleanse Transform

The USA Regulatory Address Cleanse transform identifies, parses, validates, and corrects USA address data according to the US Coding Accuracy Support System (CASS). This transform can create the USPS Form 3553 and output many useful codes for your records. You can also run in a noncertification mode, as well as produce suggestion lists. For a list of reference files used by the USA Regulatory Address Cleanse transform, see the USA Regulatory Address Cleanse Options section of the Transforms Reference of the Data Services Reference Guide. The USA Regulatory Address Cleanse transform contains sample transform configurations that help you set up your data flow. The configurations include all of the required options except input fields. This table lists the transform configurations for the USA Regulatory Address Cleanse transform:
Transform Configuration                            Description
USA Regulatory Address Cleanse                     Cleanses address data according to CASS requirements.
USA Regulatory EWS Address Cleanse                 Cleanses address data according to CASS requirements, with Early Warning System.
USA Regulatory Geo Address Cleanse                 Cleanses address data according to CASS requirements, with GeoCensus.
USA Regulatory Noncertified Address Cleanse        Cleanses address data in noncertified mode.
USA Regulatory Noncertified Geo Address Cleanse    Cleanses address data in noncertified mode, with GeoCensus.
USA Regulatory RDI Address Cleanse                 Cleanses address data according to CASS requirements, with Residential Delivery Indicator.
USA Regulatory Suggestions Address Cleanse         Cleanses address data using the Suggestion List feature.
USA Regulatory Z4 Change Address Cleanse           Cleanses address data according to CASS requirements, with Z4Change.
Note: If an input record has characters outside the Latin1 code page, the USA Regulatory Address Cleanse transform does not process that data. Instead, the mapped input record is sent to the corresponding standardized output field (if applicable). No other output fields are populated for that record. If your Unicode database has valid USA addresses from the Latin1 character set, this transform processes them as normal.

Input/Output
Data Services offers a number of input and output fields for the USA Regulatory Address Cleanse transform.

Options
The USA Regulatory Address Cleanse transform offers several option groups:

Report and Analysis options
  The Generate Report Data option specifies whether to generate report data for this transform.
  • Yes: Generates report data for this transform.
  • No: Turns off report data generation.
  If you do not need to generate reports (during testing, for example), set this option to No to improve performance.

Reference Files options
  Reference files are directories used by the USA Regulatory Address Cleanse transform to correct and standardize your data. It is best to use a substitution variable for the option value, such as $$DIR_PATH_USA. This substitution variable represents the path, and you can change it dynamically.

Transform Performance options
  The Transform Performance option group contains options that affect the performance of DPV, RDI, and LACSLink processing. Set these options if you have DPV, RDI, or LACSLink activated in the transform.

Assignment options
  With this option group, you can choose the add-on features that you want to use during processing.

Standardization options
  This option group contains all of the standardization settings that you need to define for processing USA data. (The options are listed alphabetically.)

Noncertified options
  This option group includes options to process your data without some select CASS certification rules.

CASS Report options
  With this option group, you can add the necessary USPS Form 3553 information as required by the USPS when certifying a mailing.

Suggestion List options
  Set these options to configure how suggestion lists are output.
Working with Specific Types of Address Data

There are several approaches to take when using the Address Cleanse transforms, depending on the type of data that you are cleansing.

USA data for USPS certification
To process your USA address data for CASS certification or for USA suggestion lists, select the USA Regulatory Address Cleanse (URAC) transform configuration that most closely meets the project requirements and modify it to meet your business requirements.

Note: There are also USA Address Cleanse transforms configured for Delivery Point Validation (DPV), Early Warning System (EWS), Enhanced Line of Travel (eLOT), GeoCensus, Locatable Address Conversion System (LACSLink), Residential Delivery Indicator (RDI), and Z4Change.
URAC helps ensure that an address in the United States and its territories is current and deliverable. Delivery Point Validation confirms the existence of an address.
Figure 26: Delivery Point Validation
LACSLink is a process that converts addresses that are changed by local governments.
Figure 27: Locatable Address Conversion System
SuiteLink provides secondary address information for businesses.
Figure 28: SuiteLink
National Change of Address (NCOALink) software provides a secure methodology to interface with the NCOA database from the USPS using name and address matching within the SAP Data Services product. It makes change-of-address information available to mailers to help reduce undeliverable mail pieces before mail enters the mail stream.
Figure 29: National Change of Address (NCOALink)
Data Services 4.0 introduces the use of the DSF2® walk sequence number. Two addresses can have the same Lot Sequence number but be assigned different walk-sequence numbers; houses that share the same delivery unit share the same Lot number. The addresses (postal units) shown in the figure are derived from actual neighborhoods located in Holmen, WI. To aid in mail sorting, eLOT contains an eLOT sequence number field and an ascending/descending code. The eLOT sequence number indicates the first occurrence of delivery made to the add-on range within the carrier route, and the ascending/descending code indicates the approximate delivery order within the sequence number. eLOT processing may be used by mailers to qualify for enhanced carrier route presort discounts.
Figure 30: DSF2® Walk Sequencer
USA data not for USPS certification
To process USA-specific address data, use the USA Regulatory Address Cleanse transform. Although you can use the Global Address Cleanse transform as long as CASS certification is not required, you may experience a performance gain by using the USA Regulatory Address Cleanse transform.

Address data for multiple purposes
When you have two addresses intended for different purposes (for example, a billing address and a shipping address), use two Address Cleanse transform configurations. Include a Case transform to send one type of address to one transform and the other type to the other transform.

Global address data
To cleanse address data from any supported country, use the appropriate Global Address Cleanse transform configuration in your project. Note that the transform configuration for Australia can be used for AMAS certification, and the one for Canada can be used for SERP certification. However, the USA Global Address Cleanse transform configuration cannot be used for USPS certification.

Multi-country address data
Identify which country or countries your address data refers to and select the country-specific Global Address Cleanse transform that most closely fits your needs. You can modify options to process addresses for additional countries if necessary.
Exercise 2: Use the Address Cleanse Transform

Exercise Objectives
After completing this exercise, you will be able to:
• Use the USA Address Cleanse transform configuration to parse, cleanse, and standardize data
• Use the Query transform to create a new column for information code descriptions
• Assign content types for input data to assist the definition of mappings
Business Example
After analyzing the target data for the Beta Businesses, you have decided that the address data requires parsing, cleansing, and standardization. You would also like to add a column that provides descriptive information for the information code values.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

2. If the BETA, DQSTANDALONE, and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   In your Local Object Library, create a new source Datastore for the BETA Businesses database.
3. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.

4. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.

1. In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.
Task 3: Within the Omega project, create a batch job Beta_Customers_AddressCleanse_Job and a data flow Beta_Customers_AddressCleanse_DF containing the USA Address Cleanse transform with a source table and a target template table.

1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

2. In the Omega project, create a new batch job Beta_Customers_AddressCleanse_Job with a new data flow called Beta_Customers_AddressCleanse_DF.

3. In the workspace for Beta_Customers_AddressCleanse_DF, add the USA_Customers table from the BETA datastore as the source object connected to the USA Address Cleanse transform with a target template table AddressCleansed_USA in the DELTA datastore.

Task 4: Configure the USA Address Cleanse transform to parse the names and addresses in the USA_Customers table.

1. Map input columns to output columns for those columns that should be processed by the USA Address Cleanse transform, and output additional columns that parse the original information based on the USA engine.
Task 5: Execute the Beta_Customers_AddressCleanse_Job with the default options after saving all objects.

1. Execute the job with the default execution properties after saving all created objects.

Task 6: Add a Query transform to the data flow to use a lookup function to provide the definition for AC_INFO_CODE values, and execute the batch job again to see the results.

1. Add the Query transform to the workspace between the USA Address Cleanse transform and the target template table AddressCleansed_USA.

2. Use the Functions wizard to create a new lookup expression using the lookup_ext function.

3. Execute the job with the default execution properties after saving all created objects.
Solution 2: Use the Address Cleanse Transform

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
   b) In the dialog box, enter your assigned User ID.
   c) Enter your password, which is the same as your User ID.
   d) Choose the Log on button.
   e) In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
   f) Choose OK.

2. If the BETA, DQSTANDALONE, and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   In your Local Object Library, create a new source Datastore for the BETA Businesses database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         BETA
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          BETA
      User name              sourceuser (lower case)
      Password               sourcepass (lower case)
      Choose OK to finish.

   c) In the Local Object Library, right-click the BETA datastore to select the Open option from the context menu. The tables from the BETA database are displayed in the Metadata window on the right.
   d) Import the metadata for the BETA database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • source.addrcodes
      • source.categories
      • source.country
      • source.customer
      • source.employees
      • source.orderdetails
      • source.orders
      • source.products
      • source.region
      • source.suppliers
      • source.usa_customers

      Note: This database is a read-only database.

   e) View the data for the USA_Customers table and confirm that Jane Hartley from Planview Inc. is the first customer record by right-clicking the table in the Local Object Library and choosing the View data option.

3. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DQSTANDALONE
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DQSTANDALONE
      User name              sourceuser (lower case)
      Password               sourcepass (lower case)

      Choose OK to finish.

   c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
   d) Import the metadata for the DQSTANDALONE database source tables by selecting all of the tables, right-clicking them, and choosing the Import option from the menu:
      • AddressCleansed_USA
      • DataCleansed_USA
      • DataCleansed2_USA
      • Match_Target
      • Match_Target2
      • Match_Target3
      • Match_Target4

      Note: This database is a read-only database.
4. In your Local Object Library, create a new Datastore for the DELTA staging database.

   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                  Value
      Datastore name         DELTA
      Datastore type         Database
      Database type          Microsoft SQL Server
      Database version       Microsoft SQL Server 2005
      Database server name   Supplied by the Instructor
      Database name          DELTA## (where ## is the group number assigned by your instructor)
      User name              student## or consult your instructor
      Password               student## or consult your instructor

      Choose OK to finish.

   c) You do not have to import any metadata.

      Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.

1. In the USA_Customers table in the BETA datastore, set the content type for certain fields to Name.

   a) In the Local Object Library, select the Datastores tab and open the node for the BETA datastore.
   b) Open the Tables node and right-click the USA_Customers table to choose the Edit option.
   c) Right-click the Cust_Name field to select the Properties option and choose the type Name from the drop-down list for the Content Type associated with the field.
   d) Select OK to close the dialog box.
   e) Close the table metadata by selecting the Back button.
Task 3: Within the Omega project, create a batch job Beta_Customers_AddressCleanse_Job and a data flow Beta_Customers_AddressCleanse_DF containing the USA Address Cleanse transform with a source table and a target template table.

1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

   a) From the Project menu, choose the option New → Project.
   b) When the Project New dialog box appears, enter Omega in the Project name field.
   c) Select Create so that the new project appears in the Project area.

2. In the Omega project, create a new batch job Beta_Customers_AddressCleanse_Job with a new data flow called Beta_Customers_AddressCleanse_DF.

   a) In the Project area, right-click the project name and choose New Batch Job from the menu.
   b) Enter the name of the job as Beta_Customers_AddressCleanse_Job.
   c) Press Enter to commit the change.
   d) Open the job Beta_Customers_AddressCleanse_Job by double-clicking it.
   e) Select the Data Flow icon in the Tool Palette.
   f) Select the workspace where you want to add the data flow.
   g) Enter Beta_Customers_AddressCleanse_DF as the name.
   h) Press Enter to commit the change.
   i) Double-click the data flow to open the data flow workspace.

3. In the workspace for Beta_Customers_AddressCleanse_DF, add the USA_Customers table from the BETA datastore as the source object connected to the USA Address Cleanse transform with a target template table AddressCleansed_USA in the DELTA datastore.

   a) In the Local Object Library, select the Datastores tab and then select the USA_Customers table from the BETA datastore.
   b) Select and drag the object to the data flow workspace and, in the context menu, choose the Make Source option.
   c) On the Transforms tab of the Local Object Library, expand the Data Quality → Global Address Cleanse branch to select and drag the USA Address Cleanse transform to the data flow workspace.
   d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
   e) In the Tool Palette, select the icon for a template table and then select in the data flow workspace. Enter AddressCleansed_USA as the table name in the DELTA## datastore.
   f) Connect the transform to the target table by selecting the transform and, while holding down the mouse button, dragging to the target table. Release the mouse button to create the link.
Task 4: Configure the USA Address Cleanse transform to parse the names and addresses in the USA_Customers table.

1. Map input columns to output columns for those columns that should be processed by the USA Address Cleanse transform, and output additional columns that parse the original information based on the USA engine.

   a) In the data flow workspace, double-click the USA Address Cleanse transform to open the transform editor.
   b) In the transform editor for the USA Address Cleanse transform, map all columns from the Input Schema to the Output Schema by selecting all the columns in the Input Schema and dragging them to the Output Schema.
   c) Map these fields from the Schema In to the Input tab of the parameters area:

      Schema In     Input Mapping
      Address1      MULTILINE1
      Address2      MULTILINE2
      City          LOCALITY1
      PostalCode    POSTCODE

      Create the mapping by selecting the fields from the Schema In area and dragging them to the fields in the Input tab.

      Note: The fields Firm, Region, and Country are already mapped, because these content types are relevant to address cleanse transforms.

   d) On the Options tab of the parameters area, ensure that only the USA engine is enabled.

      Note: Other settings may be modified to conform to business rules. In this case, accept the default address options.

   e) On the Output tab, filter the output columns based on best practice by selecting the Best Practice radio button.
   f) Since selection order determines output order, add these output columns in the order specified to the output schema after the columns you have already mapped. Use the check boxes to make your selections.

      GENERATED_FIELD_NAME        GENERATED_FIELD_CLASS   GENERATED_FIELD_ADDRCLASS
      PRIMARY_SECONDARY_ADDRESS   BEST                    DELIVERY
      PRIMARY_SECONDARY_ADDRESS   BEST                    DUAL
      PRIMARY_NUMBER              BEST                    DELIVERY
      PRIMARY_PREFIX1             BEST                    DELIVERY
      PRIMARY_NAME1               BEST                    DELIVERY
      PRIMARY_POSTFIX1            BEST                    DELIVERY
      PRIMARY_TYPE1               BEST                    DELIVERY
      UNIT_NUMBER                 BEST                    DELIVERY
      LOCALITY1_NAME              BEST                    DELIVERY
      REGION1                     BEST                    DELIVERY
      POSTCODE_FULL               BEST                    DELIVERY
      ISO_COUNTRY_CODE_2CHAR      BEST                    DELIVERY
      STATUS_CODE                 NONE                    NONE
      INFO_CODE                   NONE                    NONE
   g) Rename the columns in the output schema by right-clicking each one in turn and selecting the Properties option. Choose OK to close the Properties dialog box.

      ORIGINAL                                            NEW
      PRIMARY_SECONDARY_ADDRESS_BEST_COMPONENT_DELIVERY   AC_PRIM_SEC_ADDRESS
      PRIMARY_SECONDARY_ADDRESS_BEST_COMPONENT_DUAL       AC_PRIM_SEC_ADDRESS_DUAL
      PRIMARY_NUMBER_BEST_COMPONENT_DELIVERY              AC_PRIM_NUM
      PRIMARY_PREFIX1_BEST_COMPONENT_DELIVERY             AC_PREFIX
      PRIMARY_NAME1_BEST_COMPONENT_DELIVERY               AC_PRIM_NAME
      PRIMARY_POSTFIX1_BEST_COMPONENT_DELIVERY            AC_POSTFIX
      PRIMARY_TYPE1_BEST_COMPONENT_DELIVERY               AC_PRIM_TYPE
      UNIT_NUMBER_BEST_COMPONENT_DELIVERY                 AC_UNIT_NUM
      LOCALITY1_NAME_BEST_COMPONENT_DELIVERY              AC_LOCALITY
      REGION1_BEST_COMPONENT_DELIVERY                     AC_REGION
      POSTCODE_FULL_BEST_COMPONENT_DELIVERY               AC_POSTCODE_FULL
      ISO_COUNTRY_CODE_2CHAR_BEST_COMPONENT_DELIVERY      AC_COUNTRY_CODE
      STATUS_CODE_NONE_ASSIGNMENT_INFO_NONE               AC_STATUS_CODE
      INFO_CODE_NONE_ASSIGNMENT_INFO_NONE                 AC_INFO_CODE
Task 5: Execute the Beta_Customers_AddressCleanse_Job with the default options after saving all objects.

1. Execute the job with the default execution properties after saving all created objects.

   a) In the Project area, right-click the Beta_Customers_AddressCleanse_Job and choose Execute from the menu.
   b) Data Services prompts you to save any objects that have not been saved. Choose OK.
   c) The Execution Properties dialog box appears. Choose OK.
   d) Return to the data flow workspace and view the target table data by right-clicking the target table and choosing the View Data option. Notice that the output includes various AC_INFO_CODE values, but there is no output column that defines what these values represent.
Task 6: Add a Query transform to the data flow to use a lookup function to provide the definition for AC_INFO_CODE values, and execute the batch job again to see the results.

1. Add the Query transform to the workspace between the USA Address Cleanse transform and the target template table AddressCleansed_USA.

   a) In the Tool Palette, select the Query transform icon and select the workspace to add a Query template to the data flow.
   b) Right-click the link between the USA Address Cleanse transform and the target template table AddressCleansed_USA to choose the Delete option.
   c) Connect the USA Address Cleanse transform to the Query transform by selecting the USA Address Cleanse transform and, while holding down the mouse button, dragging the cursor to the Query transform. Then release the mouse button.
   d) Connect the Query transform to the target template table by selecting the Query transform and, while holding down the mouse button, dragging the cursor to the target table. Then release the mouse button.
   e) Double-click the Query transform to open the transform editor.
   f) Map all columns from the input schema to the output schema by selecting each field in the input schema and, while holding down the mouse button, dragging it to the appropriate column in the output schema.
   g) Right-click the last column in the output schema to select the Create output column option and then the Below option. Enter AC_INFO_DESC as the name of the new output column, with a data type of varchar and a length of 255. Choose OK.

2. Use the Functions wizard to create a new lookup expression using the lookup_ext function.

   a) In the Mapping tab for the new output column, select the Function button, and in the Select Function dialog box, open the Database Functions category.
   b) From the list of function names, select the lookup_ext function and choose Next.
   c) In the Lookup_ext - Select Parameters dialog box, enter the parameters:

      Field/Option                     Value
      Lookup table                     BETA.SOURCE.ADDRCODES
      Condition:
        Columns in lookup table        Info_Code
        Op.(&)                         =
        Expression                     USA_AddressCleanse.AC_INFO_CODE
      Output:
        Column in lookup table         Info_Desc

      The code is generated:

      lookup_ext([BETA.SOURCE.ADDRCODES, 'PRE_LOAD_CACHE', 'MAX'],
         [Info_Desc], [NULL],
         [Info_Code, '=', USA_AddressCleanse.AC_INFO_CODE])
         SET ("run_as_separate_process"='no')

   d) Select the Back icon to close the editor.
3. Execute the job with the default execution properties after saving all created objects.

   a) In the Project area, right-click the Beta_Customers_AddressCleanse_Job and select Execute from the menu.
   b) Data Services prompts you to save any objects that have not been saved. Select OK.
   c) The Execution Properties dialog box appears. Select OK.
   d) Return to the data flow workspace and view the target table data by right-clicking the target table and choosing the View Data option. The target table now includes descriptions in the newly created AC_INFO_DESC column.
Lesson Summary
You should now be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data
Related Information
• Refer to the Supported Countries section in the Address Cleanse chapter of the Data Services Reference Guide.
• For more information, see Chapter 3 in the Data Services Reference Guide.
Refer to the Supported Countries section in the Address Cleanse chapter of the Data Services Reference Guide For more information, see Chapter 3 in the Data Services Reference Guide.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Unit Summary
Unit Summary
You should now be able to:
• Prepare your input data for Address Cleanse transforms
• Use the Address Cleanse transforms
• Work with global address data
• Work with transactional address data
Unit 4: Using Data Cleanse Transforms

Unit Overview
This unit describes the data cleansing capabilities of Data Services. After completing this unit, you will be able to:
• Describe the Data Cleanse transforms
• Describe the Universal Data Cleanse transform

Unit Objectives
After completing this unit, you will be able to:
• Complete a Data Cleanse transform

Unit Contents
Lesson: Using the Data Cleanse Transform
Exercise 3: Use the Data Cleanse Transform
Lesson: Using the Data Cleanse Transform

Lesson Overview
The Data Cleanse transform identifies and parses names, titles, firm data, phone numbers, Social Security numbers, dates, and e-mail addresses. You can also assign gender codes, add prenames, create personalized greetings, generate match standards, and convert input sources to a standard format. Additionally, you can parse multiple names from individual records so that a separate record can be created for each individual.
Lesson Objectives
After completing this lesson, you will be able to:
• Complete a Data Cleanse transform
Business Example
The Data Cleanse transform focuses on data other than addresses. Often titles, names, phone information, and e-mail addresses are incorrect. To correct these errors, the data must be parsed down to its lowest-level components. This is a preparatory step before standardization can be accomplished.
Understanding How Data Cleanse Parses Data

There are five main steps that the Data Cleanse transform takes while parsing operational data (a sketch of the first step follows this list):

1. Word breaking – breaks the input line down into smaller, more usable pieces. Data Cleanse breaks the input line on white space, punctuation, and alphanumeric transitions.
2. Gathering – recombines words that belong together, such as words that the dictionary lists as a single term. Data Cleanse does not attempt to combine words that have been broken for a custom parser.
3. Tokenization – assigns specific meanings to each of the pieces. Data Cleanse looks up each individual input word in the dictionary. A list of tokens is created using the classifications associated with each word in the dictionary.
4. Rule matching – matches the token meanings against defined rules. Data Cleanse does not match the pattern of specific words against the rules; it matches the pattern of the types, or classifications, of the words.
5. Action item assignment – outputs parsed data based upon matched rules.
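A rough Python sketch of the word-breaking step, splitting on white space, punctuation, and alphanumeric transitions as described above. The real Data Cleanse word breaker and its dictionaries are far richer:

    import re

    # Illustrative word breaker: split on white space and punctuation, then
    # on alphanumeric transitions (for example, "CPA2010" -> "CPA", "2010").
    def break_words(line):
        pieces = []
        for word in re.split(r"[\s,.;]+", line):
            if word:
                pieces.extend(re.findall(r"[A-Za-z]+|\d+", word))
        return pieces

    print(break_words("Dr.Robert Jones,CPA2010"))
    # ['Dr', 'Robert', 'Jones', 'CPA', '2010']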
Understanding the Business Need for Data Cleanse Transforms

As with address cleansing, one of the primary uses of data cleansing is to prepare names, titles, firm data, phone numbers, Social Security numbers, dates, and e-mail addresses for matching. Dictionaries are used with rule files to parse data to its smallest components and standardize it. This reduces variability and increases the possibility of a successful match. In addition, data cleansing can generate match name standards, which provide:

• Alternative spellings for names (for example, Kathy and Cathy)
• Alternatives for ambiguous names (for example, Patrick and Patricia for Pat)
• Names associated with nicknames (for example, Elizabeth for Liz, Beth, or Betsy)
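The following toy Python illustration shows how such standards reduce variability. The table here is hand-built and hypothetical; Data Cleanse derives its match standards from the cleansing package dictionaries:

    # Hypothetical standards table for illustration only.
    MATCH_STANDARDS = {
        "liz": ["elizabeth"],
        "beth": ["elizabeth"],
        "betsy": ["elizabeth"],
        "cathy": ["katherine"],
        "kathy": ["katherine"],
        "pat": ["patrick", "patricia"],  # ambiguous: both alternatives kept
    }

    def standards_for(first_name):
        name = first_name.lower()
        return MATCH_STANDARDS.get(name, [name])

    # "Liz Smith" and "Beth Smith" now share the standard "elizabeth",
    # so a downstream match process can pair the two records.
    print(standards_for("Liz"), standards_for("Beth"))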
This example shows how data cleansing can prepare records for matching.
Figure 31: Business Need for Data Cleanse Transform 1
Data cleansing capabilities include:

• Identifying and isolating a wide variety of data, even if the data is floating in lines.
• Standardizing data to make records more consistent, such as fixing casing, punctuation, and abbreviations.
• Assigning a precise gender code to each name: strong male, strong female, weak male, weak female, or ambiguous.
• Assigning a prename such as Mr., Ms., or Mrs. based on gender codes.
• Creating personalized greetings in formal, casual, and title styles: Dear Mr. Jones, Dear Robert, and Dear Manager. The transform creates a greeting for each person, as well as a dual name greeting for records with two names. (A sketch of prename and greeting assignment follows this list.)
• Creating a separate output record for each person in a record with multiple persons. For example, an input database can contain one record for each customer with multiple contact persons, each of which can be split into a separate record.
Figure 32: Business Need for Data Cleanse Transform 2
Understanding Strategies for the Data Cleanse Transform

How you configure your Data Cleanse transforms depends on the type of data you are cleansing.

Name, title, firm, and firm location data
You can standardize name data and generate discrete standardized fields for prename, first name, middle name, last name, maturity post name, and honorary post name, based on which fields you decide to evaluate to determine whether two records match. For the first name and middle name match standards, you can generate up to six first name match standards and up to six middle name match standards. Even though there is a maximum of six first and middle name match standards, you can only use a maximum of three first name match standards and a maximum of three middle name match standards when matching. The Data Cleanse transform also parses up to six job titles per record, up to two firm names (such as IBM), and up to two firm locations (such as Engineering Dept.). This transform can also convert firm names to accepted acronyms, such as General Motors Corp. to GM.

Social Security number data
Data Cleanse parses US Social Security numbers (SSNs) that are either by themselves or on an input line surrounded by other text. Data Cleanse outputs the individual components of a parsed Social Security number: the entire SSN, the area, the group, and the serial. Data Cleanse parses Social Security numbers in two steps. First, it identifies a potential SSN by looking for any of three digit patterns. Once a pattern is identified, Data Cleanse performs a validity check on the first five digits only. If the number fails validation, it is not output, as it is not considered a valid SSN as defined by the US government.
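The two-step approach can be sketched in Python as follows. The three patterns shown (hyphenated, space-delimited, and undelimited) and the validity rule are simplified assumptions; the real check validates the leading digits against US government issuing data:

    import re

    # Three simplified SSN patterns: 123-45-6789, 123 45 6789, 123456789.
    SSN_PATTERNS = [
        re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b"),
        re.compile(r"\b(\d{3}) (\d{2}) (\d{4})\b"),
        re.compile(r"\b(\d{3})(\d{2})(\d{4})\b"),
    ]

    def parse_ssn(text):
        for pattern in SSN_PATTERNS:
            match = pattern.search(text)
            if match:
                area, group, serial = match.groups()
                # Placeholder validity check on the first five digits only.
                if area != "000" and group != "00":
                    return {"ssn": area + group + serial, "area": area,
                            "group": group, "serial": serial}
        return None  # not identified or failed validation: nothing is output

    print(parse_ssn("Call about account 536-12-4567 today"))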
E-mail data
When Data Cleanse parses input data it recognizes as an e-mail address, it outputs the individual components of the parsed address: the e-mail user name, complete domain name, top domain, second domain, third domain, fourth domain, fifth domain, and host name. You can also verify that an e-mail address is properly formatted and flag the address as belonging to an Internet service provider (ISP). Data Cleanse does not verify whether the domain name is registered, whether an e-mail server is active at that address, whether the user name is registered on that e-mail server, or whether the personal name in the record can be reached at this e-mail address. For example, with the input data expat@london.home.office.city.co.uk, Data Cleanse outputs each element in the following fields:

Output field           Output value
Email                  expat@london.home.office.city.co.uk
Email_User             expat
Email_Domain_All       london.home.office.city.co.uk
Email_Domain_Top       uk
Email_Domain_Second    co
Email_Domain_Third     city
Email_Domain_Fourth    office
Email_Domain_Fifth     home
Email_Domain_Host      london
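The decomposition in the table can be sketched as simple string splitting in Python (the formatting check and ISP flagging are omitted):

    # Sketch of the e-mail decomposition shown in the table above.
    def parse_email(email):
        user, _, domain = email.partition("@")
        labels = domain.split(".")  # ['london', 'home', 'office', 'city', 'co', 'uk']
        fields = {"Email": email, "Email_User": user,
                  "Email_Domain_All": domain, "Email_Domain_Host": labels[0]}
        # Count the domain levels from the right: uk, co, city, office, home.
        levels = ["Top", "Second", "Third", "Fourth", "Fifth"]
        for level, label in zip(levels, reversed(labels)):
            fields["Email_Domain_" + level] = label
        return fields

    print(parse_email("expat@london.home.office.city.co.uk"))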
Phone number data
Data Cleanse can parse both North American (US and Canada) and international phone numbers. When Data Cleanse parses a phone number, it outputs the individual components of the number into the appropriate fields.
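A simplified Python sketch of North American pattern recognition and component output; the patterns Data Cleanse actually looks for are described below:

    import re

    # Simplified patterns: (234) 567-8901, 234-567-8901, and 2345678901.
    NA_PHONE = re.compile(r"\(?(\d{3})\)?[ -]?(\d{3})-?(\d{4})")

    def parse_na_phone(text, delimiter="-"):
        match = NA_PHONE.search(text)
        if not match:
            return None
        area, prefix, line = match.groups()
        return {"area_code": area, "prefix": prefix, "line": line,
                "formatted": delimiter.join((area, prefix, line))}

    print(parse_na_phone("(234) 567-8901"))
    # {'area_code': '234', 'prefix': '567', 'line': '8901',
    #  'formatted': '234-567-8901'}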
Data Cleanse recognizes phone numbers by their pattern and (for non-US numbers) by their country code. For North American phone numbers, it looks for commonly used patterns such as (234) 567-8901, 234-567-8901, and 2345678901. It gives you the option of some reformatting on output (such as your choice of delimiters). Data Cleanse searches for European and Pacific Rim numbers by pattern. The patterns used are defined from the US and require that the country code appear at the beginning of the number. Note that Data Cleanse does not offer any options for reformatting international phone numbers, nor does it cross-compare the phone number to the address to see if the country and city codes in the phone number match the address.

Date data
Data Cleanse can parse up to six dates from your defined record. Data Cleanse identifies the dates in the input, breaks the dates into day, month, and year components, and makes the dates available as output in either the original format (for example, DD-MMM-YY) or a user-selected standard format (for example, MM/DD/YYYY).

International data
By default, Data Cleanse can identify international data presented in multiple formats. There are also several ways that you can use Data Cleanse to identify and manipulate various forms of other international data, including prenames, greetings, and personal identification numbers:

• Customizing greetings and prenames per country. The default prenames and salutations found in the Data Cleanse greetings option group are commonly used in English-speaking nations. For countries where English is not the primary language, you can modify these options to reflect common prenames and salutations.

• Modifying the phone file for other countries. By default, Data Cleanse includes phone number patterns for many countries. However, if you find that you need parsing for a country that is not included, you can modify the international phone file (drlphint.dat) to enable Data Cleanse to detect phone number patterns that follow a different format. New phone number patterns can be added to the international phone file using regular expressions (see the sketch after this list).

• Using personal ID numbers. With a default Data Quality installation, Data Cleanse can identify USA Social Security numbers and separate them into discrete components. If your data includes personal identification numbers that are different from US SSNs, you can use User-Defined Pattern Matching to identify the numbers. Number formats to be identified by User-Defined Pattern Matching can be set up using regular expressions.

• Using cleansing packages. Cleansing packages enhance the ability of Data Cleanse to accurately process various forms of global data by including language-specific reference data and parsing rules. Since cleansing packages are based on the standard Data Cleanse transform, you can use the sample transforms in your projects in the same way you would use Data Cleanse and gain the advantage of enhanced reference data and parsing rules.
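The drlphint.dat file has its own format, which is not reproduced here. The following Python sketch only shows the kind of regular expression that can describe a country-specific phone pattern; the UK-style pattern used is a hypothetical example.

# Conceptual sketch only: illustrates a regex for a country-specific
# phone pattern with the country code at the beginning of the number.
import re

# Country code 44 at the start, then area code and subscriber number.
UK_PHONE = re.compile(r"^\+?44[ -]?\(?0?\d{2,4}\)?[ -]?\d{3,4}[ -]?\d{3,4}$")

for candidate in ["+44 20 7946 0958", "44-20-7946-0958", "(555) 123-4567"]:
    status = "matches" if UK_PHONE.match(candidate) else "does not match"
    print(f"{candidate!r} {status} the UK pattern")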
The Data Cleanse Transform

A default installation of Data Services with the Data Quality transforms includes several basic Data Cleanse transform configurations. A default transform configuration, EnglishNorthAmerica_DataCleanse, is preconfigured to cleanse name, title, firm, date, phone, Social Security number, and e-mail data using English-based data quality rules.
Other Data Cleanse transform configurations are preconfigured to cleanse name, title, firm, date, phone, and e-mail data using data quality rules specific to Dutch, French, German, Italian, Spanish, or Portuguese. These Data Cleanse transforms include all required options except input fields.
Configuring the Data Cleanse Transform

Configure options on the Data Cleanse transform to fit your operational data project. You need to set several options:

• Import the ATL file transferred from Cleansing Package Builder. Importing the ATL file brings in the required information and automatically sets the following options:
  – Cleansing Package
  – Engine
  – Filter Output Fields
  – Input Word Breaker
  – Parser Configuration

• In the input schema, select the input fields that you want to map and drag them to the appropriate fields in the Input tab.
  – Name and firm data can be mapped either to discrete fields or multiline fields.
  – Custom data must be mapped to multiline fields.
  – Phone, date, e-mail, Social Security number, and user-defined pattern data can be mapped either to discrete fields or multiline fields. The corresponding parser must be enabled.

• In the Options tab, select the appropriate option values to determine how Data Cleanse processes your data, including:
  – The Cleansing Package option
  – The Parser_Sequence_Multiline options

• In the Output tab, select the fields that you want to output from the transform. In Cleansing Package Builder, output fields are referred to as attributes. Ensure that you map any attributes (output fields) defined in user-defined patterns in Cleansing Package Builder reference data.
Understanding Dictionaries and Rule Files

Data cleansing in Data Services uses cleansing packages and rule files. The Reference Files section of the Options tab points to the location of supporting dictionary and data files needed for parsing. By default, these files are installed in the LINK_DIR\DataQuality\datacleanse folder. The default substitution parameter for dictionary locations is [$$RefFilesDataCleanse].

Parsing rule files contain rules that define how an element is parsed based on the tokens assigned from the dictionary. Parsing dictionaries in cleansing packages contain entries for words and phrases that define how each might be used. For example, the dictionary indicates that the word "Engineering" can be used in a firm name, such as "Smith Engineering Inc.", or in a job title, such as "VP of Engineering". The dictionary also defines the following:

Type of information    Description
Aliases                Potential matches for a value. For example, Patrick and Patricia are aliases for Pat.
Gender                 The gender assigned to a value. For example, Anne is a feminine given name and Mr. is a masculine prename.
Standards              The proper capitalization or other output formatting for data parsed by Data Cleanse.
All of the necessary dictionaries and rule files are contained in a Cleansing Package which is installed with the Cleansing Package Builder.
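To make these kinds of entries tangible, here is a toy, in-memory model in Python. It is an illustration only; the real dictionary format belongs to the cleansing package and is not reproduced here, and the classification and gender labels are illustrative.

# Illustrative model only: a stand-in for the kind of information a
# parsing dictionary holds for each term.
from dataclasses import dataclass, field

@dataclass
class DictionaryEntry:
    value: str
    classifications: list   # how the term may be used, e.g. FIRM_NAME, JOB_TITLE
    aliases: list = field(default_factory=list)
    gender: str = "UNASSIGNED"
    standard: str = ""       # preferred output form (casing/formatting)

entries = {
    "ENGINEERING": DictionaryEntry("ENGINEERING", ["FIRM_NAME", "JOB_TITLE"],
                                   standard="Engineering"),
    "PAT": DictionaryEntry("PAT", ["GIVEN_NAME"],
                           aliases=["PATRICK", "PATRICIA"], gender="AMBIGUOUS"),
    "MR": DictionaryEntry("MR", ["PRENAME"], gender="MALE", standard="Mr."),
}

# The parser can then ask: in which roles may this token appear?
print(entries["ENGINEERING"].classifications)   # ['FIRM_NAME', 'JOB_TITLE']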
Creating a Custom Cleansing Package in Cleansing Package Builder
Figure 33: The Cleansing Package Builder
The default Data Cleanse cleansing package consists of name and firm data that is not required for parsing operational data. If you use a complete custom dictionary, Data Cleanse does not have to consider all the default name and firm data, and parsing is faster and more accurate.

Based on your analyzed data, you can create new output categories and fields where Data Cleanse can place parsed and standardized data. For data sets where the input data comes in only a few orders, a single output category is sufficient. However, if your data can come in any order, you can use multiple output categories to reduce the number of rules needed.

Cleansing Package-based classifications add meaning to terms when assigned to primary dictionary entries. For example, the primary entry blue might be assigned the classification color, meaning Data Cleanse can identify blue as a color term. Pattern-based classifications are defined using regular expressions and are not assigned to specific terms in the dictionary. Instead, Data Cleanse looks for patterns in the input data that match the specified regular expression (see the sketch below).

To execute successfully, a Data Cleanse job must reference a cleansing package. The cleansing package may be either an SAP-supplied cleansing package or a cleansing package that you have modified and published in the Cleansing Package Builder module of Information Steward.
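The following minimal Python sketch contrasts the two kinds of classification. It is not Data Cleanse internals; the part-number pattern and the term list are hypothetical examples.

# Minimal sketch: dictionary-based classification (a term carries its
# classification) versus pattern-based classification (a regular
# expression recognizes a shape of data).
import re

term_classifications = {"BLUE": "COLOR", "CREW": "FAMILY_NAME"}

# Hypothetical pattern-based classification for a part number like "AB-1234".
pattern_classifications = [(re.compile(r"^[A-Z]{2}-\d{4}$"), "PART_NUMBER")]

def classify(token: str) -> str:
    if token.upper() in term_classifications:        # dictionary lookup first
        return term_classifications[token.upper()]
    for pattern, label in pattern_classifications:   # then try each pattern
        if pattern.match(token):
            return label
    return "UNCLASSIFIED"

for token in ["blue", "AB-1234", "widget"]:
    print(token, "->", classify(token))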
With the introduction of the Cleansing Package Builder in Information Steward, you no longer need to specify individual dictionary, rules, and reference data files. The information formerly contained in those files is now included in the cleansing package. The Cleansing Package option group and the Cleansing Package option have been added to the Data Cleanse transform.
Improving Parsing Results

The Cleansing Package Builder module of Information Steward is required with Data Services 4.0 to modify or customize any type of data. The dictionary menu has been removed from the Data Services Designer menu bar, and Data Cleanse no longer requires a separate cleansing package repository. You can modify the parsing dictionary used by the Data Cleanse transform to improve parsing results for your data.
Figure 34: Refining Data Cleansing Results
Correct specific parsing behavior

You can customize the parsing dictionary to correct specific parsing behavior that you have seen in your output.

Recognize local names

The name data in Data Cleanse's default parsing dictionary, PERSON_FIRM_EN, is based on an analysis of USA residents. As such, the parsing dictionary is broadly useful across the United States. However, you can tailor the dictionary to better suit your data by adding ethnic or regional names. If Data Cleanse does not recognize a specific name, for example, "Jinco Xandru", you can add "Jinco" to the dictionary as a first name and "Xandru" as a last name.
Identify industry-specific jargon

The default parsing dictionary is useful across many industries. You can tailor the dictionary to better suit your own industry by adding special titles, prenames or postnames, standardized abbreviations, or other jargon words. For example, if you process data for the real estate industry, you might add industry-specific postnames such as Certified Residential Specialist (CRS), Accredited Buyer Representative (ABR), or Graduate Realtor Institute (GRI).

Recognize specific phrases

Some words can be used in both firm names and job titles. As a result, Data Cleanse may incorrectly recognize some job titles as firm names. To improve parsing, you can add these job title phrases to the dictionary.

Identify firm names containing personal names

Often a firm name is made up of personal names. As a result, Data Cleanse may incorrectly parse the firm as a personal name. For example, the catalog retailer "J. Crew" may be parsed as a personal name rather than as a firm. To improve parsing, you can add multiple-word firm names to the dictionary. For example, to parse "J. Crew" as a firm rather than as a personal name, you could add "J" and "Crew" to the dictionary with the Firm_Name classification, and "J Crew" with the Firm_Name_Alone classification.

Changes to the dictionary structure require that you upgrade all Data Services 3.2 Data Cleanse dictionaries to cleansing packages of the current version. To execute successfully, a Data Cleanse job must reference a cleansing package, either an SAP-supplied cleansing package or one that you have modified and published in the Cleansing Package Builder module of Information Steward.

Restriction: If you have modified the person and firm dictionary or created a custom dictionary using Universal Data Cleanse, ensure that the Cleansing Package Builder in Information Steward is available before you upgrade. You need Cleansing Package Builder to migrate your dictionary, rules, and reference files to the new cleansing package format.
Improving Casing Results

Cleansing package dictionaries can also be used to adjust casing. If you use mixed case, the general rule is to capitalize the first letter of a word and put the rest of the word in lowercase. However, there are exceptions to that rule, such as McDonald, Ph.D., IBM, NY, and so on.
To handle mixed-case exceptions, Data Cleanse consults secondary information in the dictionary, called standards. The dictionary contains the correct casing of a word and also indicates when that casing should be used. For example, the entry MS is cased differently depending on how it is used: "M.S." is an abbreviation for the honorary postname "Master of Science", while "Ms." is a prename. The dictionary indicates which formatting to use based on the content type.

Most Data Cleanse users find that the default capitalization of words in the dictionary is sufficient for producing good mixed-case results. However, it is impossible for the default dictionary to contain every mixed-case exception. If Data Cleanse does not case a word as you want, you can create a custom standard in the dictionary. For example, "TechTel" is not in the default dictionary, so Data Cleanse capitalizes only the first letter of the word. However, if you add the word "TechTel" to your dictionary with a standard for firm name use, you can achieve the desired mixed-case results:

Input            Standard    Output
TECHTEL, INC.    (none)      Techtel Inc.
TECHTEL, INC.    TechTel     TechTel Inc.
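The idea behind a standards lookup is easy to sketch. The following Python is a deliberate simplification, assuming a plain word-to-casing table, and is not the Data Cleanse casing algorithm.

# Minimal sketch: words with an entry in STANDARDS keep their custom
# casing; everything else is title-cased.
STANDARDS = {"TECHTEL": "TechTel", "MCDONALD": "McDonald", "IBM": "IBM"}

def apply_casing(text: str) -> str:
    words = []
    for word in text.split():
        key = word.strip(",").upper()
        if key in STANDARDS:
            # Re-attach any trailing comma the original word carried.
            suffix = "," if word.endswith(",") else ""
            words.append(STANDARDS[key] + suffix)
        else:
            words.append(word.capitalize())
    return " ".join(words)

print(apply_casing("TECHTEL, INC."))   # TechTel, Inc.
print(apply_casing("MCDONALD IBM"))    # McDonald IBM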
Ranking and Prioritizing Parsing Engines

The Data Cleanse transform can be configured to use only specific parsers, or a specific parser order, when dealing with multiline input. You can change the parser order for a specific multiline input by modifying the corresponding parser sequence option in the Parser_Configuration option group of the Data Cleanse transform. For example, to change the order of parsers for the Multiline1 input field, modify the Parser_Sequence_Multiline1 option.
By default, Data Cleanse parses multiline input using parsers in the following order:

1. User-defined pattern matching
2. E-mail
3. Social Security number
4. North American phone number
5. International phone number
6. Person or firm name

Hint: Data Cleanse parser prioritization options can be modified with the ordered options window. Carefully selecting which parsers to use, and in what order, can be beneficial. Turning off parsers that you do not need significantly improves parsing speed and reduces the chances that your data will be parsed incorrectly.
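The "try parsers in priority order, first success wins" behavior is easy to illustrate. The Python below is a conceptual sketch with toy parsers, not the Data Cleanse engine.

# Minimal sketch: each parser returns a parsed result or None; parsers
# are tried in priority order and the first success wins.
import re

def parse_email(s):    return {"EMAIL": s} if re.fullmatch(r"\S+@\S+\.\S+", s) else None
def parse_ssn(s):      return {"SSN": s} if re.fullmatch(r"\d{3}-\d{2}-\d{4}", s) else None
def parse_na_phone(s): return {"PHONE": s} if re.fullmatch(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}", s) else None

PARSER_SEQUENCE = [parse_email, parse_ssn, parse_na_phone]  # order matters

def parse_multiline(value: str):
    for parser in PARSER_SEQUENCE:
        result = parser(value)
        if result is not None:
            return result
    return {"UNPARSED": value}

print(parse_multiline("expat@example.co.uk"))  # handled by the e-mail parser
print(parse_multiline("234-567-8901"))         # falls through to the phone parser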
Exercise 3: Use the Data Cleanse Transform

Exercise Objectives
After completing this exercise, you will be able to:
• Create columns to identify individual name components in the Data Cleanse transform
• Create match standards to assist in the matching process
Business Example
After analyzing the target data for the Beta Businesses, you have decided that you would like to create new columns from the USA_Customers table to contain the parsed name data and name standards for the match. After the columns are created, you would like to filter the AC_INFO_CODE results according to descriptions.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF as a replication of Beta_Customers_Address_Cleanse_DF. 1.
If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF.
Task 3: Add and configure a Data Cleanse transform to the data flow Beta_Customers_DataCleanse_DF. 1.
In the AddressCleansed_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2.
In the workspace for Beta_Customers_DataCleanse_DF, add the AddressCleansed_USA table from the DQSTANDALONE datastore as the source object connected to the English North America Data Cleanse transform with a target template table DataCleansed_USA in the DELTA datastore.
3.
Configure the Data Cleanse transform in the data flow Beta_Customers_DataCleanse_DF.
Task 4: Because of the additional transform, remap the fields from the input schema to the output schema in the Query transform. 1.
In the Query transform, remap the fields from the input schema to the output schema.
2.
Use the Functions wizard to create a new lookup expression using the lookup_ext function.
Task 5: Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters.
Task 6: In the Beta_Customers_DataCleanse2_Job, modify the Query transform to add a description field DC_GENDER_DESC to further clarify the DC_GENDER_ID field. 1.
Modify the Query transform in the Beta_Customers_DataCleanse2_DF workspace with the addition of the DC_GENDER_DESC field.
Task 7: Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters.
Solution 3: Use the Data Cleanse Transform Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
a)
From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b)
In the dialog box, enter your assigned User ID.
c)
Enter your password, which is the same as your User ID.
d)
Choose the Log on button.
e)
In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f)
Choose OK.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DQSTANDALONE
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DQSTANDALONE
User name               sourceuser (lower case)
Password                sourcepass (lower case)
Choose OK to finish. c)
In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed on the Metadata window on the right.
d)
Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, then right-clicking on them and choosing the option Import from the menu:
• AddressCleansed_USA
• DataCleansed_USA
• DataCleansed2_USA
• Match_Target
• Match_Target2
• Match_Target3
• Match_Target4

Note: This database is a read-only database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DELTA
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DELTA## (where ## is the group number assigned by your Instructor)
User name               student## or consult your instructor
Password                student## or consult your instructor
Choose OK to finish. c)
You do not have to import any metadata. Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF as a replication of Beta_Customers_Address_Cleanse_DF. 1.
If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated. a)
From the Project menu, choose New → Project.
2.
b)
When the Project New dialog box appears, enter Omega in the Project name field.
c)
Select Create so that the new project appears in the Project area.
In the Omega project, create a new batch job Beta_Customers_DataCleanse_Job containing a data flow Beta_Customers_DataCleanse_DF. a)
In the Project area, right-click on the Omega project to select the option New Batch Job and enter the name Beta_Customers_DataCleanse_Job.
b)
Select the Data Flow icon in the Tool Palette.
c)
Select the workspace where you want to add the data flow.
d)
Enter Beta_Customers_DataCleanse_DF as the name.
e)
Press Enter to commit the change.
f)
Double-click the Beta_Customers_DataCleanse_DF to open the workspace.
Task 3: Add and configure a Data Cleanse transform to the data flow Beta_Customers_DataCleanse_DF. 1.
In the AddressCleansed_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name. a)
In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b)
Open the node Tables and right-click on the table AddressCleansed_USA to choose the option Edit.
c)
Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d)
Choose OK to close the dialog box.
e)
Close the table metadata by selecting the Back button.
2.
3.
In the workspace for Beta_Customers_DataCleanse_DF, add the AddressCleansed_USA table from the DQSTANDALONE datastore as the source object connected to the English North America Data Cleanse transform with a target template table DataCleansed_USA in the DELTA datastore. a)
In the Local Object Library, select the Datastores tab and then select the AddressCleansed_USA table from the DQStandalone datastore.
b)
Select and drag the object to the data flow workspace and in the context menu, choose the option Make Source.
c)
On the Transforms tab of the Local Object Library, expand the Data Quality Data Cleanse branch to select and drag the English North America Data Cleanse transform to the data flow workspace.
d)
Connect the source table to the transform by selecting the source table and while holding down the mouse button, drag to the transform. Release the mouse button to create the link.
e)
In the Tool Palette, select the icon for a Query transform and then select in the data flow workspace.
f)
Connect the English North America Data Cleanse transform to the Query transform by selecting the English North America Data Cleanse transform and while holding down the mouse button, drag to the Query transform. Release the mouse button to create the link.
g)
In the Tool Palette, select the icon for a template table and then select in the data flow workspace. Enter DataCleansed_USA as the table name in the DELTA datastore. Connect the Query transform to the target table by selecting the Query transform and while holding down the mouse button, drag to the target table. Release the mouse button to create the link.
Configure the Data Cleanse transform in the data flow Beta_Customers_DataCleanse_DF. a)
Double-click the Data Cleanse transform to open the transform editor.
b)
In the transform editor, map all columns from the input schema to the output schema except for the AC* columns.
c)
Columns in the Schema In area with defined content types are automatically mapped for this transform. Verify that columns with content types are mapped as follows:
Name             Mapping
EMAIL1           Email
FIRM_LINE1       Firm
NAME_LINE1       Cust_Name
PERSON1_TITLE    Title
PHONE1           Phone
d)
The input column Title is mapped to the output column Title_Line1. Right-click on the output column Person1_Title to choose the option Remove mapping.
e)
The following columns do not have content types defined. Therefore, manually map columns as listed in the following table.
f)
Input Name     Mapping
DATE1          BirthDate
SSN1           ID
TITLE_LINE1    TITLE
On the Options tab in the parameters area, set the following options:

Option Name             Option Value
Date_Delimiter          DASH
Date_Format             YYYY_MM_DD
Filter Output Fields    SHOW_ALL_FIELDS
Press Enter on the keyboard to set these options. g)
On the Output tab, add the following output columns to the output schema, after the columns you have already mapped, in the order specified. All of these fields except SSN are found with the radio button Best practice selected for filtering. Select the checkbox for each column to be added to the output schema.

Parent Component         Generated Field Name      Generated Field Class
FIRM1                    FIRM                      STANDARDIZED
PERSON1                  PRENAME                   STANDARDIZED
PERSON1                  GIVEN_NAME1               STANDARDIZED
PERSON1                  GIVEN_NAME1_MATCH_STD1    STANDARDIZED
PERSON1                  GIVEN_NAME1_MATCH_STD2    STANDARDIZED
PERSON1                  GIVEN_NAME1_MATCH_STD3    STANDARDIZED
PERSON1                  GIVEN_NAME2               STANDARDIZED
PERSON1                  GIVEN_NAME2_MATCH_STD1    STANDARDIZED
PERSON1                  FAMILY_NAME1              STANDARDIZED
PERSON1                  HONORARY_POSTNAME         STANDARDIZED
PERSON1                  MATURITY_POSTNAME         STANDARDIZED
PERSON1                  TITLE                     STANDARDIZED
PERSON1                  PERSON                    STANDARDIZED
PERSON1                  GENDER                    STANDARDIZED
DATE1                    DATE                      STANDARDIZED
EMAIL1                   EMAIL                     STANDARDIZED
NORTH_AMERICAN_PHONE1    NORTH_AMERICAN_PHONE      STANDARDIZED
SSN1                     SSN                       STANDARDIZED

h)
Rename these columns in the output schema as follows, by right-clicking each one to select the option Properties and, after changing the name, choosing OK:

Original                                       New
FIRM1_FIRM_STANDARDIZED                        DC_FIRM
PERSON1_PRENAME_STANDARDIZED                   DC_PRENAME
PERSON1_GIVEN_NAME1_STANDARDIZED               DC_GIVEN_NAME1
PERSON1_GIVEN_NAME1_MATCH_STD1                 DC_GIVEN_NAME1_MATCH_STD1
PERSON1_GIVEN_NAME1_MATCH_STD2                 DC_GIVEN_NAME1_MATCH_STD2
PERSON1_GIVEN_NAME1_MATCH_STD3                 DC_GIVEN_NAME1_MATCH_STD3
PERSON1_GIVEN_NAME2_STANDARDIZED               DC_GIVEN_NAME2
PERSON1_GIVEN_NAME2_MATCH_STD1_STANDARDIZED    DC_GIVEN_NAME2_MATCH_STD1
PERSON1_FAMILY_NAME1_STANDARDIZED              DC_FAMILY_NAME1
PERSON1_HONORARY_POSTNAME_STANDARDIZED         DC_HONORARY_POSTNAME
PERSON1_MATURITY_POSTNAME_STANDARDIZED         DC_MATURITY_POSTNAME
PERSON1_TITLE_STANDARDIZED                     DC_TITLE
PERSON1_PERSON_STANDARDIZED                    DC_PERSON
PERSON1_GENDER_STANDARDIZED                    DC_GENDER
DATE1_DATE_STANDARDIZED                        DC_DATE
EMAIL1_EMAIL_STANDARDIZED                      DC_EMAIL
NORTH_AMERICAN_PHONE1_PHONE_STANDARDIZED       DC_NORTH_AMERICAN_PHONE
SSN1_SSN_STANDARDIZED                          DC_SSN

i)

Map all columns prefixed with AC_ from the input schema to the output schema, following the columns prefixed by DC_, by dragging from the input column to the output column.
j)
Select the Back icon to close the transform editor and return to the data flow work space.
Task 4: Because of the additional transform, remap the fields from the input schema to the output schema in the Query transform. 1.
In the Query transform, remap the fields from the input schema to the output schema. a)
In the data flow work space, double-click the Query transform to open the transform editor.
b)
In the transform editor for the Query transform, remove all columns in the output schema, except the AC_INFO_DESC column created for the address code lookup. Select all relevant output columns. Then right-click to select the option Delete.
c)
Map all of the columns from the input schema to the output schema and insert them above the AC_INFO_DESC column by dragging from the input column to the appropriate output column. Hint: You can select several input columns at one time, either by holding down the CTRL key on the keyboard, or by selecting the first input column you want to move, holding down the SHIFT key, and selecting the last input column you want to move.
2.
Use the Functions wizard to create a new lookup expression using the lookup_ext function. a)
In the Mapping tab for the output column, AC_INFO_DESC, select the Function button and in the Select Function dialog box, open the category of “Database Functions”.
b)
From the list of function names, select the lookup_ext function and select the Next button.
c)
In the Lookup_ext - Select Parameters dialog box, enter the parameters:

Field/Option                         Value
Lookup table                         BETA.SOURCE.ADDRCODES
Condition: Column in lookup table    Info_Code
Condition: Op.(&)                    =
Condition: Expression                EnglishNorthAmerica_DataCleanse.AC_INFO_CODE
Output: Column in lookup table       Info_Desc
This code is generated:

lookup_ext([BETA.SOURCE.ADDRCODES, 'PRE_LOAD_CACHE', 'MAX'], [Info_Desc], [NULL], [Info_Code, '=', EnglishNorthAmerica_DataCleanse.AC_INFO_CODE]) SET ("run_as_separate_process"='no')

d)
Select the Back icon to close the editor.
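Conceptually, the generated lookup_ext call pre-loads the lookup table into a cache and returns the description whose Info_Code equals the record's AC_INFO_CODE. A minimal Python sketch of the same idea follows; the codes and descriptions used are hypothetical sample values, not the actual ADDRCODES contents.

# Conceptual sketch of the lookup_ext call above.
ADDRCODES = [("1000", "Address corrected"), ("3010", "Postcode corrected")]

cache = {code: desc for code, desc in ADDRCODES}   # the 'PRE_LOAD_CACHE' step

def lookup_info_desc(ac_info_code, default=None):
    # Condition: Info_Code = AC_INFO_CODE; output column: Info_Desc
    return cache.get(ac_info_code, default)

print(lookup_info_desc("3010"))   # Postcode corrected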
Task 5: Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse_Job with the default parameters. a)
In the Project area, right-click the batch job Beta_Customers_DataCleanse_Job to select the option Execute. Choose OK to save all objects you have created.
b)
Choose OK to accept the default execution parameters.
c)
In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
Task 6: In the Beta_Customers_DataCleanse2_Job, modify the Query transform to add a description field DC_GENDER_DESC to further clarify the DC_GENDER_ID field. 1.
Modify the Query transform in the Beta_Customers_DataCleanse2_DF workspace with the addition of the DC_GENDER_DESC field. a)
Double-click the Beta_Customers_DataCleanse2_DF to open the workspace.
b)
Double-click the Query transform to open the transform editor.
c)
In the output schema, find and right-click the field DC_GENDER to select the option New Output Column and then the option Insert Below. Call the new field DC_GENDER_DESC, with a datatype of varchar and a length of 1. Choose OK to create the output column. This column translates the DC_GENDER_ID field values into a single-character representation so they can be sorted.
d)
In the Mapping section for the new field, select the Functions button and select the function search_replace under the String functions. Then choose Next.
e)
In the next dialog box, select the Full string radio button and select the value DC_GENDER_ID from the drop-down list as the value of the field Input expression.
f)
g)
Then, in the Search-replace table, enter the following values for the Search value and Replace value:

Search value    Replace value
0               U
1               M
2               M
3               A
4               F
5               F
Leave the Default replace value blank. Click Finish to close the function editor.
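The same full-string search and replace is easy to express in code. A minimal Python sketch of the mapping configured above:

# Mirrors the search_replace configuration: a full-string replacement of
# the DC_GENDER_ID value with a single-character gender designation.
GENDER_DESC = {"0": "U", "1": "M", "2": "M", "3": "A", "4": "F", "5": "F"}

def dc_gender_desc(dc_gender_id: str) -> str:
    # Default replace value is blank, as in the exercise.
    return GENDER_DESC.get(dc_gender_id, "")

for gender_id in ["1", "4", "9"]:
    print(gender_id, "->", repr(dc_gender_desc(gender_id)))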
Task 7: Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters. 1.
Save all objects and execute the batch job Beta_Customers_DataCleanse2_Job with the default parameters. a)
In the Project area, right-click the batch job Beta_Customers_DataCleanse2_Job to select the option Execute. Choose OK to save all objects you have created.
b)
Choose OK to accept the default execution parameters.
c)
In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data. Check the DC_GENDER_DESC field to see the designations.
Figure 35: Data Cleanse Results
Lesson Summary
You should now be able to:
• Complete a Data Cleanse transform
Unit Summary
You should now be able to:
• Complete a Data Cleanse transform
Unit 5: Matching and Consolidating Data

Unit Overview
This unit describes the matching capabilities of Data Services. After completing this unit, you will be able to:
• Understand match concepts
• Perform post-match processing
• Tailor the match process to your data
Unit Objectives
After completing this unit, you will be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
• Use consumer and corporate householding
• Use associative matching
Unit Contents
Lesson: Using the Match Wizard
  Exercise 4: Use the Match Wizard
Lesson: Configuring the Match Transform
  Exercise 5: Modify Match Criteria using the Match Editor
Lesson: Performing Post-Match Processing
  Exercise 6: Prioritize Matched Records
Lesson: Consolidating Matching Records
  Exercise 7: Create a Best Record
Lesson: Using Advanced Match Strategies
Lesson: Using the Match Wizard
Lesson Overview
Introduction
The process of identifying matching records in Data Services is highly customizable and can accommodate nearly any set of complex business rules.
Lesson Objectives
After completing this lesson, you will be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
Business Example
Realizing that it is easy to create duplicate records, you want to determine whether the records being used as the single version of the truth for customers contain duplicates. You do not want to read through each record individually, so you are looking for a programmatic way to determine whether duplicates exist.
Understanding the Purpose of Matching

The goal of matching is to determine whether records refer to the same entity. This involves evaluating how well the individual fields, or attributes, of records match each other.
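As a concrete illustration (not the Match transform's actual scoring model), the following Python sketch compares two records attribute by attribute using a simple similarity ratio:

# Minimal sketch: compare two records field by field and report a
# similarity per attribute.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

rec1 = {"first_name": "Margaret", "last_name": "Smith-Kline", "postcode": "10013"}
rec2 = {"first_name": "Maggie",   "last_name": "Smith Kline", "postcode": "10013"}

for attr in rec1:
    score = similarity(rec1[attr], rec2[attr])
    print(f"{attr}: {score:.0%}")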
Figure 36: Data Quality Framework – Matching and Consolidate
Once matches have been identified, data from these match groups can be salvaged and posted to form a single best record or posted to all matching records to update them.
Figure 37: Basic Concepts – Matching and Consolidation
Input records:
• Ms Margaret Smith-Kline Ph.D., Future Electronics, 101 Avenue of the Americas, New York NY 10013–1933, [email protected], May 23, 2003, 001–12–4367
• Maggie Smith, Future Electronics Co. LLC, 101 6th Ave., Manhattan, NY 10012, [email protected], 001–12–4367, (222) 922–9922, 5/23/03
• Ms. Peg Kline, Future Elect. Co., 101 6th Ave., New York NY 10013, 001–12–4367, (222) 922–9922, 5/23/03

Consolidated record:
Name: Ms. Margaret Smith-Kline Ph.D.
Company name: Future Electronics Co. LLC
Purchase date: 5/23/2003
Address: 101 Avenue of the Americas, New York, NY 10013–1933
SSN: 001–12–4367
Latitude: 40.722970
Longitude: –74.005035
Fed code: 36061
Phone: (222) 922–9922
Email: [email protected]
Using Other Transforms to Support Matching

There are several transforms that you should use before or after the Match transform to optimize the matching process.
Figure 38: Data Flow with Matching
• Case: Use the Case transform to route data to a particular Match transform, such as sending USA-specific and international data to different transforms. You can also use this transform to route blank records around a Match transform.

• Address Cleanse: Use the Address Cleanse transform to ensure all address data is accurate prior to matching.

• Data Cleanse: Use the Data Cleanse transform to ensure name, firm, title, phone, date, SSN, and e-mail data is parsed and standardized to improve the likelihood of identifying matches. You may also want to include match standards of name or firm data; utilizing name match standards generated from the Data Cleanse transform can help to improve match results.

• Query: Use the Query transform to create additional fields, perform functions, and order data. You may also want to filter out empty records to assist in performance and protect against false matches. Performing search and replace functions can also assist in reducing noise words from matching, such as Inc., Corp., and Ltd.

• Merge: Use the Merge transform to combine data output from multiple Match transforms. This is commonly used to reunite data sets that were split before matching for purposes of routing empty records. It is also used to reunite records from multiple Match transforms so the resulting match groups can be associated.
Understanding the Match Wizard

The easiest way to create a Match transform is to use the Match Wizard. The Match Wizard provides intuitive screens that walk you through the process, based on the strategy you choose.
Figure 39: Develop a Matching Strategy
You can quickly set up match strategies without manually configuring each individual setting required to complete the task. The Match Wizard:
• Builds the necessary transforms to perform the match strategy you choose.
• Applies default values to your match criteria based on the strategy you choose.
• Places the resulting transform on the workspace, connected to the upstream transform you choose.
• Detects the appropriate upstream fields and maps to them automatically.
• Automatically generates report statistics and common output fields.
The Match Wizard provides you with a basic match setup that, in some cases, may require customization to meet the goals of your project. The Match Wizard does not:
• Alter any data that flows through it. To correct non-standard entries or missing data, place the Address Cleanse and Data Cleanse transforms upstream from the match process.
• Connect the generated Match transform to a target.
• Directly allow you to set rule-based or weighted-scoring values for matching. The Match Wizard incorporates a "best practices" standard. You may want to edit option values to conform to your business rules. This can be done by editing the options in the Match transform that is generated by the Match Wizard.
• Directly create prioritization, best record, or candidate selection. These options are set by editing the options in the generated Match transform.
© 2011 SAP AG. All rights reserved.
135
Unit 5: Matching and Consolidating Data
BODS30
1. Select the appropriate match strategy based on the types of information you want to compare. 2. Specify how many match sets to define the number of ways that you would like to identify a match. 3. Define your match levels, if appropriate, to indicate the categories of matching involved. 4. Define your match criteria by selecting the individual columns in your data that will be used for matching, based on your business rules. 5. Define the break keys that will be used to group records that are likely to match by selecting the upstream field and setting the start position and length.
Figure 40: Setting Up Matching
Deciding on a Match Strategy The Match Wizard allows you to complete a Match transform using default strategies with a few clicks. These strategies contain best practice settings for a handful of potential matching situations. For example, do you want to perform a simple match on name and address, or would you prefer to use a householding method to determine residence, family, and individual match levels? Once you have defined your requirements, select the matching strategy that best meets those business rules. The Match Wizard supports these common matching strategies:
136
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Using the Match Wizard
• Simple match: Use this strategy when your matching business rules consist of a single match criterion for identifying relationships in consumer, business, or product data.
• Consumer householding: Use this strategy when your matching business rules consist of multiple levels of consumer relationships, such as residential matches, family matches, and individual matches.
• Corporate householding: Use this strategy when your matching business rules consist of multiple levels of corporate relationships, such as corporate matches, subsidiary matches, and contact matches.
• Multi-national consumer: Use this strategy when your data consists of multiple countries and your matching business rules are different for different countries.
• Identify a person multiple ways: Use this strategy when your matching business rules consist of multiple match criteria for identifying relationships, and you want to find the overlap between all of those definitions.
Defining Match Components

The basic components of matching are:
• Match sets
• Match levels
• Match criteria
• Break groups
Match sets

Match sets let you control how the Match transform matches certain records, segregates records, and matches on records independently. For example, you could choose to match USA records differently than records containing international data.
Each match set can have its own break groups, match levels, match criteria, prioritization, and consolidation. A match set:
• Allows only select data into a given set of match criteria for possible comparison (for example, excluding blank SSNs, international addresses, and so on).
• Permits related match scenarios to be stacked to create a multi-level match set.
• Prepares data so that multiple match sets can be later associated using the Associate transform.
Match levels

A match set contains one or more match levels. A match level is an indicator of what type of matching will occur, such as on individual, family, resident, firm, and so on. A match level refers not to specific criteria, but to the broad category of matching. You can have as many match levels as you want. However, the Match Wizard restricts you to three levels during setup (more can be added later). Each match level in a match set can be defined in a way that is increasingly more restrictive. Multi-level matching feeds only the records that match from match level to match level (for example, resident, family, individual) for comparison. A single match level is defined automatically for the Simple Match strategy. The example of consumer householding shown below depicts residence, family, and individual levels.
Figure 41: Hierarchical Levels of Matching
• Resident: Determines whether two records should be considered members of the same residence. The Match transform compares the address data. A match means that the two records represent members of the same household. Contrast this match type with the family match type, which also compares last name data.
• Family: Determines whether two people should be considered members of the same family. The Match transform compares the last name and the address data. A match means that the two records represent members of the same family.
• Individual: Determines whether two records are for the same person. The Match transform compares the first name, last name, and address data. A match means that the two records represent the same person.
• Firm: Determines whether two records reflect the same firm. This match type involves comparisons of firm and address data. A match means that the two records represent the same firm.
• Firm-Individual: Determines whether two records are for the same person at the same firm. This match type compares the first name, last name, firm name, and address data. A match means that the two records reflect the same person at the same firm.
Alternate forms of match level criteria can also be defined. For example:
• Family-level match criteria based on family (last) name and telephone number.
• Individual-level match criteria based on Social Security Number (SSN) or identification number and full name, or on full name and account number.
• Firm-level match criteria based on firm name and Standard Industrial Classification (SIC) Code, or on firm name and Data Universal Numbering System (DUNS) number.
Match criteria

Match criteria refers to the fields you want to match on. You can use criteria options to specify business rules for matching on each of these fields. Match criteria allow you to control how exact data must be for the data to be considered a match.
For example, you may require first names to be at least 85% similar, but also allow a first name initial to match a spelled-out first name, and allow a first name to match a middle name. A minimal sketch of such rules follows.
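The Python below illustrates the idea of such first-name criteria. It is an illustration only, not the Match transform's actual comparison logic, and the 85% threshold is taken from the example above.

# Minimal sketch: first names count as matching when they are at least
# 85% similar, or when one is an initial of the other.
from difflib import SequenceMatcher

def first_names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    if len(a) == 1 or len(b) == 1:               # allow initial vs. full name
        return a[0] == b[0]
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(first_names_match("Margaret", "Margarete"))  # True, small spelling variation
print(first_names_match("M.", "Margaret"))         # True, initial matches full name
print(first_names_match("Margaret", "Robert"))     # False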
Break groups

Figure 42: Break Group Benefits and Considerations
Controlling the number of record comparisons in the matching process is important for performance. Break groups limit the number of comparisons performed during the matching process, because matching is only considered within break groups, not between them. Break groups are established by defining criteria called a break key. Defining an appropriate break key can save valuable processing time by preventing widely divergent data from being compared. Break keys should group records that would most likely contain matches. Fields commonly used for creating break groups are postcodes, account or identification numbers, or the first two positions of a street name. For example, when matching on address data, it is common to use the first three digits of a postcode as the break key. Thus, only records that have the same first three digits of a postcode become members of a break group. In the diagram below, with a break key of the first three digits of the postal code, records in A1 would be compared to records in A2, but never to records in B1 or B2.
Figure 43: Setting Up Break Keys
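The grouping the figure illustrates can be sketched in a few lines of Python. This is a conceptual illustration with made-up records, not the Match transform itself.

# Minimal sketch of break groups: records are grouped by a break key
# (here the first three digits of the postcode) and comparisons happen
# only within a group, never across groups.
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "name": "M. Smith", "postcode": "10013"},
    {"id": 2, "name": "Maggie Smith", "postcode": "10012"},
    {"id": 3, "name": "P. Kline", "postcode": "54601"},
]

groups = defaultdict(list)
for rec in records:
    groups[rec["postcode"][:3]].append(rec)   # break key: first 3 digits

for key, members in groups.items():
    for a, b in combinations(members, 2):     # compare only within the group
        print(f"Break group {key}: compare record {a['id']} with record {b['id']}")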
Once you see how break keys control the number of records that are matched, it is easier to follow the match process.
Figure 44: Break Keys and Matching
To set up matching using the Match Wizard:

1. In your data flow, right-click the transform that you would like to add the Match transform after, and select Run Match Wizard.
2. Select the required match selection type.
3. Set the Match Set according to your business rules.
4. Set the criteria for Match Set Name according to your business rules.
5. Define the break keys by selecting the upstream field and setting the start position and length.
6. Connect the Match transform to the target table.
In your data flow, right-click the transform you would like to the Match transform after, and select Run Match Wizard. Select the required match selection type. Set the Match Set according to your business rules. Set the criteria for Match Set Name according to your business rules. Define the break keys by selecting the upstream field, setting the start position, and length. Connect the Match transform to the target table.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Using the Match Wizard
Exercise 4: Use the Match Wizard

Exercise Objectives
After completing this exercise, you will be able to:
• Complete a simple match for full names on the Beta Businesses USA_Customers table
Business Example
You want to determine if there are duplicate records in the Beta Businesses USA_Customers table. You decide to match according to name and address.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases. 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated. In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: You want to determine if there are duplicate records in the Beta Businesses USA_Customers table. You decide to match according to name and address by setting up a batch job Beta_Customers_Match_Job containing the data flow Beta_Customers_Match_DF by using the Match Wizard. 1.
If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
In the Omega project, create a new batch job Beta_Customers_Match_Job with a new data flow called Beta_Customers_Match_DF.
Task 3: In the data flow Beta_Customers_Match_DF, configure the Match Wizard in the Query transform. 1.
In the DataCleansed2_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2.
In the workspace for Beta_Customers_Match_DF, add the DataCleansed2_USA table from the DQSTANDALONE datastore as the source object connected to the Query transform with a target template table Match_Target_USA in the DELTA datastore.
Task 4: Use the Match Wizard in the Query transform to set up a Simple match. 1.
Transfer all input columns from the source table to the output columns in the Query transform.
2.
Configure the Match Wizard in the Query transform.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target. 1.
Configure a Match_NameAddr transform to the data flow.
2.
Configure a new target table for the data flow.
Task 6: Execute the batch job Beta_Customers_Match_Job with the default parameters. 1.
Execute the batch job Beta_Customers_Match_Job with the default parameters.
Solution 4: Use the Match Wizard Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases. 1.
Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2.
a)
From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b)
In the dialog box, enter your assigned User ID.
c)
Enter your password, which is the same as your User ID.
d)
Choose the Log on button.
e)
In the list of repositories, select your repository DSREPO## where ## is the number portion of your User ID.
f)
Choose OK.
If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions. Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DQSTANDALONE
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DQSTANDALONE
User name               sourceuser (lower case)
Password                sourcepass (lower case)
Choose OK to finish. c)
In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed on the Metadata window on the right.
d)
Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, then right-clicking on them and choosing the option Import from the menu:
• AddressCleansed_USA
• DataCleansed_USA
• DataCleansed2_USA
• Match_Target
• Match_Target2
• Match_Target3
• Match_Target4

Note: This database is a read-only database.
3.
In your Local Object Library, create a new Datastore for the DELTA staging database. a)
In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b)
In the resulting dialog box, use the options:

Field                   Value
Datastore name          DELTA
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DELTA## (where ## is the number assigned by your Instructor)
User name               student## or consult your instructor
Password                student## or consult your instructor
Click OK to finish. c)
You do not have to import any metadata. Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: You want to determine if there are duplicate records in the Beta Businesses USA_Customers table. You decide to match according to name and address by setting up a batch job Beta_Customers_Match_Job containing the data flow Beta_Customers_Match_DF, using the Match Wizard.

1. If the Omega project already exists, continue to the next step. If it does not exist, create a new project called Omega.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
a) From the Project menu, choose New → Project.
b) When the Project New dialog box appears, enter Omega in the Project name field.
c) Choose Create so that the new project appears in the Project area.

2. In the Omega project, create a new batch job Beta_Customers_Match_Job with a new data flow called Beta_Customers_Match_DF.
a) In the Project area, right-click the project name and choose New Batch Job from the menu.
b) Enter the name of the job as Beta_Customers_Match_Job.
c) Press Enter to commit the change.
d) Select the Data Flow icon in the Tool Palette.
e) Right-click in the data flow workspace and choose Add New → Dataflow.
f) Enter Beta_Customers_Match_DF as the name.
g) Press Enter to commit the change.
h) Double-click the Beta_Customers_Match_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match_DF, configure the Match Wizard in the Query transform.

1. In the DataCleansed2_USA table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b) Open the Tables node and right-click the table DataCleansed2_USA to choose the option Edit.
c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d) Choose OK to close the dialog box.
e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match_DF, add the DataCleansed2_USA table from the DQSTANDALONE datastore as the source object connected to the Query transform, with a target template table Match_Target in the DELTA datastore.
a) In the Local Object Library, select the Datastores tab and then select the DataCleansed2_USA table from the DQSTANDALONE datastore.
b) Select and drag the object to the data flow workspace and, in the context menu, choose the option Make Source.
c) In the Tool Palette, select the icon for the Query transform and click in the data flow workspace to add the Query transform.
d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
Task 4: Use the Match Wizard in the Query transform to set up a Simple match.

1. Transfer all input columns from the source table to the output columns in the Query transform.
a) Double-click the Query transform to open the transform editor.
b) In the transform editor, map all columns from the input schema to the output schema columns: select all the fields listed in the Schema In window and drag them to the Query object in the Schema Out window.
c) Select the Back icon to close the Query transform editor.

2. Configure the Match Wizard in the Query transform.
a) In the data flow Beta_Customers_Match_DF workspace, right-click the Query transform and select the option Run Match Wizard.
b) On the Match Selection Type window, select the option for a Simple match selection with only one way to identify an individual. Then choose Next.
c) Set the name for Match Set Name to NameAddr.
d) Specify that the first criterion is Full Name by selecting the field Full Name from the drop-down list for the field Match On. Use the value Word similarity for the field Compare Using. Choose Enter.
e) Enter the second criterion as Address and use the value Word similarity for the field Compare Using. Select Enter. Then choose the Next button.
f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as shown below.

   Criteria                          Field
   PERSON1_FAMILY_NAME1              DC_FAMILY_NAME
   PERSON1_GIVEN_NAME1               DC_GIVEN_NAME1
   PERSON1_GIVEN_NAME1_MATCH_STD1    DC_GIVEN_NAME1_MATCH_STD1
   PERSON1_GIVEN_NAME1_MATCH_STD2    DC_GIVEN_NAME1_MATCH_STD2
   PERSON1_GIVEN_NAME1_MATCH_STD3    DC_GIVEN_NAME1_MATCH_STD3
   PERSON1_GIVEN_NAME2               DC_GIVEN_NAME2
   PERSON1_GIVEN_NAME2_MATCH_STD1    DC_GIVEN_NAME2_MATCH_STD1
   ADDRESS_PRIMARY_NAME              AC_PRIM_NAME
   ADDRESS_PRIMARY_NUMBER            AC_PRIM_NUM
   ADDRESS_SECONDARY_NUMBER          AC_UNIT_NUM
Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If one of these fields is not mapped, map it by selecting the Field entry and choosing the field from the drop-down list. Then select the Next button.
g) On the next window, define the break key by selecting AC_Postcode_Full as the Upstream field from the drop-down list. Set the start position to 1 and the length to 5.
   Hint: You may need to click the More... entry in the drop-down list to find this field.
h) Choose Finish and close the Match Wizard configuration to return to the data flow workspace.
Task 5: Configure a Match_NameAddr transform in the data flow, connecting it to a new target template table Match_Target.

1. Configure a Match_NameAddr transform in the data flow.
a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
b) Double-click the Match_NameAddr transform to open the transform editor.
c) Rename the newly generated match columns as follows:

   Original Column Name                       New Column Name
   NameAddr_Level1_GROUP_NUMBER               MC_GROUP_NUMBER
   NameAddr_Level1_MATCH_SCORE                MC_MATCH_SCORE
   NameAddr_Level1_GroupStats1_GROUP_COUNT    MC_GROUP_COUNT
   NameAddr_Level1_GroupStats1_GROUP_RANK     MC_GROUP_RANK

d) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.
   Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
2. Configure a new target table for the data flow.
a) From the Tool Palette, select the icon for a template table and click in the data flow workspace to insert the new target template table. Enter Match_Target for the name and select the DELTA datastore from the drop-down list. Choose OK to create the template table.
b) Connect the Match_NameAddr transform to the target template table by selecting the Match_NameAddr transform and, while holding down the mouse button, dragging to the target template table; release the mouse button to create the link.
Task 6: Execute the batch job Beta_Customers_Match_Job with the default parameters.

1. Execute the batch job Beta_Customers_Match_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match_Job and select the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data. Notice that there are 9 match groups. Matched records for Gloria Peterson appear to be valid matches because they share the same name, address, and SSN values. However, matched records for A G Olson and Andrew Olson have different social security numbers and are therefore invalid matches. It is apparent that the match criteria need adjustment to include matching on SSN.
Lesson Summary
You should now be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
Lesson: Configuring the Match Transform

Lesson Overview
Once the basic configuration of your Match transform has been completed using the Match Wizard, you can further refine the process using the Match Editor.
Lesson Objectives
After completing this lesson, you will be able to:
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
Business Example
While matching can take place on the parsed components of data records, there are many variant forms, spellings, abbreviations, and blank values that can get in the way of determining whether two records are really a match. You want to learn how to configure the Match transform to more accurately determine whether records match by setting certain options in the configuration of match criteria. You need to understand how to control the number of record comparisons. You also want to know how to determine the best possible matching method: rule-based, weighted-scoring, or combination matching.
Understanding the Match Editor
Although the Match Wizard does a lot of the work for matching, there are some things that you must configure yourself. The basic match setup created by the Match Wizard may require customization to meet the goals of your project. Using the Match Editor, you can edit the options in the transform generated by the Match Wizard, or modify and manually configure Match transforms from scratch. This enables you to set up and modify match strategies in a single screen.
Figure 45: Data Services Match Capabilities
In the Match Editor, you can make adjustments to all aspects of the matching process, including transform-level options, group forming operations, match criteria, and post-match processing.
Figure 46: Match Functionality Highlights
The Match Editor:
• Enables you to set rule-based or weighted-scoring values for matching.
• Provides access to prioritization, best record, and candidate selection operations.
Explaining the Match Transform
The Match transform performs matching based on the business rules you define. The transform sends matching and unique records on to the next transform in the data flow. You can edit the transform-level options for the Match transform to specify the reporting settings and how the process is run.
Controlling Record Comparisons
When configuring the Match transform, you can use the group forming operations to control the number of match comparisons for a more efficient comparison process. Group forming options include the following:

Break groups
Use fields to group together records that have a better chance of matching, to reduce the number of comparisons the Match transform needs to make. Records are compared only within break groups.

Candidate selection
Use SQL to extract a subset of records from the database that are needed for comparison. This is primarily used to speed processing in a real-time match job.

Group prioritization
Use priorities and penalties to order records in data collections before you compare them. This helps to ensure that your best records are driving the comparison process.
To create break keys:
1. In the Match Editor, add a Break Groups operation to the Group Forming option group.
2. In the Break key table, add a row.
3. Select a field in the field column that you want to use as a break key.
4. Select the start position and length (number of characters) you want used in the break key. You can use negative numbers to signify that you want to count a given number of characters from the end of the field.
5. Add more rows and fields as necessary.
6. Order your rows by selecting a row and clicking the Move Up and Move Down buttons. Ordering your rows ensures that the fields are used in the right order in the break key.
7. Choose OK.
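To make the effect of break groups concrete, here is a minimal Python sketch; it is not Data Services code, and the record layout and the 5-character postcode break key are illustrative assumptions only.

from collections import defaultdict

def build_break_groups(records, field="postcode", start=1, length=5):
    # Group records on a substring of one field; only records that share
    # a break key are ever compared with each other.
    groups = defaultdict(list)
    for rec in records:
        key = str(rec.get(field, ""))[start - 1:start - 1 + length]
        groups[key].append(rec)
    return groups

records = [
    {"name": "Gloria Peterson", "postcode": "54601-4023"},
    {"name": "G Peterson", "postcode": "54601"},
    {"name": "Andrew Olson", "postcode": "10027"},
]
for key, members in sorted(build_break_groups(records).items()):
    print(key, [r["name"] for r in members])

Instead of comparing every record with every other record, comparisons are made only inside each group, which is what makes breaking such an effective way to reduce the comparison workload.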
Understanding Matching Methods
Matching methods use a similarity score to define how well two records match. A similarity score is a percentage that represents how closely two records resemble each other, based on your business rules. This score is calculated internally by the application when records are compared.

Whether the application considers the records a match depends on the match and no match scores you define in your match criteria. If the similarity score is lower than or equal to the no match score, the two records definitely do not match. If the score is higher than the no match score but lower than the match score, the result of the comparison is inconclusive. If the score is higher than or equal to the match score, the two records are considered a match.

The following is a simple example of how similarity scores are determined. The last names are compared for length and letter by letter in each position to determine the similarity score. Here, only the first comparison is considered a match, because it is the only comparison where the similarity score meets or exceeds the match score.
Figure 47: Simple Matching
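As a plain illustration of the thresholds described above, the following Python sketch classifies a single criterion comparison; the threshold values are assumptions, not product defaults.

def classify(similarity, match_score=95, no_match_score=74):
    # Return "match", "no match", or "inconclusive" for one criterion.
    if similarity <= no_match_score:
        return "no match"
    if similarity >= match_score:
        return "match"
    return "inconclusive"  # defer to the next criterion or weighted scoring

for sim in (100, 95, 80, 60):
    print(sim, classify(sim))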
There are three matching methods to consider when setting up your match criteria in the Match transform:
• Rule-based matching
• Weighted-scoring matching
• Combination matching
Rule-Based Matching
With rule-based matching, you rely solely on your match and no match scores to determine matches within a criterion. The following example shows how to set up this method in the Match transform:
Figure 48: Rule-based Matching
By entering a value of 101 in the match score for every criterion except the last, the First Name and Last Name criteria can never determine a match, because two fields cannot be more than 100 percent alike. By setting the match and no match scores for the E-mail criterion with a one-point difference, any comparison that reaches this last criterion must either be a match or a no match.
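A hedged Python sketch of this cascade follows; the criterion names and scores only loosely mirror the example in Figure 48 and are assumptions rather than the exact transform configuration.

RULES = [
    # (criterion, match score, no match score)
    ("first_name", 101, 74),
    ("last_name", 101, 74),
    ("email", 80, 79),  # one-point gap: this criterion always decides
]

def rule_based(similarities):
    for criterion, match_score, no_match_score in RULES:
        sim = similarities[criterion]
        if sim <= no_match_score:
            return "no match"
        if sim >= match_score:  # unreachable while the match score is 101
            return "match"
    return "inconclusive"

print(rule_based({"first_name": 90, "last_name": 88, "email": 85}))  # match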
Weighted-scoring matching
In weighted-scoring matching, you assign different weights to individual criteria by specifying a contribution value. The higher the value, the more weight that criterion carries in determining matches. Fields that are more likely to determine a match should be assigned more weight. For example, an SSN or account number may be assigned a higher weighted value than an e-mail address. The contribution values across all criteria must total 100.

The Match transform generates the contribution score for each criterion by multiplying the contribution value by the similarity score. These individual contribution scores are then added to get the total contribution score. Matches are determined by comparing the total contribution score with the weighted match score. If the total contribution score is equal to or greater than the weighted match score, the records are considered a match. If the total contribution score is less than the weighted match score, the records are considered a no match. You can set the weighted match score in the Weighted Match Score option of the Match Level Editor.

When you set up weighted scoring, the no match score must be set to -1 and the match score must be set to 101. These values ensure that no individual criterion can determine either a match or a no match on its own. In this example, the contribution value for the E-mail criterion gives it the most importance.
Figure 49: Weighted-scoring Matching
Note: The contribution score = actual similarity score × contribution value.
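The arithmetic can be sketched in Python as follows; the criteria, contribution values, and weighted match score are assumptions for illustration.

WEIGHTS = {"ssn": 50, "last_name": 30, "email": 20}  # must total 100
WEIGHTED_MATCH_SCORE = 90

def total_contribution(similarities):
    # contribution score = similarity score x contribution value
    return sum(similarities[c] * w for c, w in WEIGHTS.items()) / 100

sims = {"ssn": 100, "last_name": 90, "email": 75}
total = total_contribution(sims)  # 50 + 27 + 15 = 92.0
print(total, "match" if total >= WEIGHTED_MATCH_SCORE else "no match")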
Combination matching
In combination matching, the rule-based and weighted-scoring methods are combined. A no match can be determined by the similarity score of any criterion being equal to or lower than its no match score. However, a match determination can be made only when a no match has not been determined and the total contribution score is greater than or equal to the weighted match score. A match can never be based on any individual criterion, only on the weighted score of all of them, provided that a no match determination has not been made.
Figure 50: Combination Matching
The contribution score = actual similarity score × contribution value.
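A hedged Python sketch of combination matching follows; the criterion names, no match scores, contributions, and weighted match score are illustrative assumptions.

CRITERIA = {
    # criterion: (no match score, contribution to weighted score)
    "ssn": (99, 50),
    "last_name": (74, 30),
    "email": (74, 20),
}
WEIGHTED_MATCH_SCORE = 90

def combination_match(similarities):
    total = 0.0
    for criterion, (no_match, contribution) in CRITERIA.items():
        sim = similarities[criterion]
        if sim <= no_match:  # any single criterion can veto the pair
            return "no match"
        total += sim * contribution / 100
    return "match" if total >= WEIGHTED_MATCH_SCORE else "no match"

print(combination_match({"ssn": 100, "last_name": 90, "email": 80}))  # match
print(combination_match({"ssn": 60, "last_name": 99, "email": 99}))   # no match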
Using Match Levels
Match levels are used to increase the granularity of criteria. Only matching records from the previous level are passed to a subsequent level. There are several person options to configure for each level. These person options dictate how name data is treated for all columns that relate to an individual. The following person options are available:

Number of names that must match
With this option, you can control how matching is performed on match keys with more than one name (for example, comparing "John and Mary Smith" to "Dave and Mary Smith"). Choose whether only one or both persons must match for the records to match.

Match on hyphenated family name
With this option, you can control how matching is performed if a family name (last name) field contains a hyphenated family name (for example, comparing "Smith-Jones" to "Jones"). Choose whether one or both of the names must match.

Compare Given_Name1 to Given_Name2
With this option, you can compare a first name to a middle name. This option is useful for identifying matches of individuals who use their middle name rather than their first name.

   Record #   First name   Middle name   Last name   Address
   170        Leo          Thomas        Smith       225 Pushbutton Drive
   198        Tom                        Smith       225 Pushbutton Drive
Ignore family name when female
With this option, you can control whether the family name is used for matching purposes when the gender for the record is female. This technique assists in dealing with situations where an individual's married name differs from her maiden name.
Using Compare Tables
Compare tables are sets of rules that define which records to compare, acting as an additional way to create break groups. You use your logical source values to determine which records are compared and which are not. By using compare tables, you can compare records within sources, across sources, or a combination of both. Compare tables are configured when defining match levels.
Defining Match Criteria
Match criteria contain the set of rules used to determine matches. Each rule is based on a match key, which is the portion of a field used for record comparison. For example, match criteria may include various standardized firm, name, address, phone, e-mail, and SSN fields as match keys. A standardized firm may be contained within a single field, whereas a given name may incorporate several match standards. These fields must be mapped as input to the Match transform; they are logically grouped and listed in the Match Editor for ease of configuration.
Match keys must be selected for each match level in a match set. They are processed sequentially for rule-based and combination matching, so it is important that you organize your match criteria keys in the order in which you want them to be evaluated. Organizing your criteria randomly may produce unacceptable results.

For each match criterion, you must configure the match, no match, and weighted score options that are appropriate for the match method you use. Other criteria options for each match key enable you to apply business rules for determining how close to exact the data needs to be for records to be considered a match, and how to deal with special data. For example, you may require first names to be at least 85% similar, but also allow a first name initial to match a spelled-out first name, and allow a first name to match a middle name.

There are two types of criteria:
• Predefined criteria are available for fields typically used in matching, such as name, address, firm, phone, identification number, e-mail, and title. Predefined criteria can be modified.
• Custom criteria must be configured for other fields, such as part numbers and proprietary data, for which predefined criteria do not exist.

Match criteria can be adjusted using individual options:

Criteria name
Type a descriptive name for this criterion.

Match score
Specifies the minimum similarity score needed for records to be considered a match based on this criterion. Type a value from 0 to 101. A value of 101 ensures that this criterion alone is not enough to consider two records a match and that you want to consider other criteria in the comparison process. For example, a value of 90 means that you consider this data important enough that if the data in two records is 90% similar or higher, the records are considered a match.

No match score
Specifies the maximum similarity score needed for records to be considered a no match based on this criterion. Type a value from -1 to 100. A value of -1 ensures that this criterion alone is not enough to consider two records a no match and that you want to consider other criteria in the comparison process. For example, a value of 49 means that if the similarity between the data in two records is less than 50%, the records do not match.

Contribution to weighted score
Specifies the contribution value when you use the weighted or combination scoring methods. If no single criterion decides a match or no match, the contribution score is calculated by multiplying the similarity score by the contribution value. Enter a value between 0 and 100.

One field blank operation
Specifies whether to use this criterion if one record's field is populated and the other record's field is blank.
Eval: The value entered in the One field blank score option is used as the similarity score for this criterion.
Ignore: This criterion is ignored in the comparison process, and its contribution to the weighted score is proportionally distributed among the remaining criteria. This negates any impact its contribution score may have had.
Be sure to enter a value in the One field blank score option if you set this option to Eval.

One field blank score
Specifies the similarity score to use if one of the fields is blank and the One field blank operation is set to Eval. Type a value from 0 to 100.
Both fields blank operation
Specifies whether to use this criterion when both records' fields for this match key criterion are blank.
Eval: The value entered in the Both fields blank score option is used as the similarity score for this criterion.
Ignore: This criterion is ignored in the comparison process, and its contribution to the weighted score is proportionally distributed among the remaining criteria. This negates any impact its contribution score may have had.

Both fields blank score
Specifies the similarity score to use if both fields are blank and the Both fields blank operation is set to Eval. Enter a value from 1 to 100, depending on the matching method you use.

Compare algorithm
Specifies how to handle fields where more than one word commonly exists.
Field: The transform compares the entire field's data as a single word. This algorithm is more efficient and should be used on fields that typically contain just one word, such as a first name field.
Word: The transform first parses the data into words and then compares the words. This algorithm is less efficient than the Field algorithm, but does a better job comparing data that typically has more than one word in it, such as firm data. Many criteria options require this option to be set to Word.

Initials adjustment score
Use this option to allow matching initials to whole words. For example, "International Health Providers" can be matched to "IHP".

Substring adjustment score
Use this option to allow matching longer strings to shorter strings. For example, the string "Mayfield Painting and Sand Blasting" can match "Mayfield Painting".
Abbreviation adjustment score
Use this option to allow matching whole words to abbreviations. For example, "International Health Providers" can be matched to "Intl Health Providers". This option controls matching whole words to abbreviations in which long words are abbreviated by removing letters: the first letter of the abbreviation matches the first letter of the longer word, and all remaining letters of the abbreviation appear in the longer word in the same order as in the abbreviation. Enter a value from 0 to 100:
• A value of 0 (zero) disables this option.
• A value greater than 0 enables this option.
• A value of 100 means you want an abbreviation of a longer word to be considered a perfect match.
Note: For this option to work, you must set the Compare algorithm option to Word.

Ext abbreviation adjustment score
Extended abbreviation matching offers functionality that handles situations not covered by the Initials adjustment score, Substring adjustment score, and Abbreviation adjustment score options. For example, you might encounter the following situations:
• Suppose you have localities in your data such as La Crosse and New York. However, you also have these same localities listed as LaCrosse and NewYork (without spaces). Under normal matching, you cannot designate these (La Crosse/LaCrosse and New York/NewYork) as matching 100%; the spaces prevent this. (These would be 94 and 93 percent matching.)
• Suppose you have Metropolitan Life and MetLife (an abbreviation and combination of Metropolitan Life) in your data. The Abbreviation adjustment score option cannot detect the combination of the two words.
If you are concerned about either of these cases in your data, use the Ext abbreviation adjustment score option. The score you set tunes your similarity score to consider these types of abbreviations and combinations in your data. The adjustment score adds a penalty for the non-matched part of the words; the higher the score, the lower the penalty. A score of 100 means no penalty, and a score of 0 means maximum penalty. Here are some examples:

   String 1   String 2            Sim score (Adj = 0)   Sim score (Adj = 50)   Sim score (Adj = 100)
   MetLife    Metropolitan Life   58                    79                     100
   MetLife    Met Life            93                    96                     100
   MetLife    MetropolitanLife    60                    60                     60

Note: In the third string comparison, extended abbreviation scoring was not needed or used, because both strings compared are each one word.

Use in weighted score if greater than
The similarity score value you enter here determines whether to use weighted scoring when a criterion's score is greater than the value you specify for this option.
Use in weighted score if less than or equal
The similarity score value you enter here determines whether to use weighted scoring when a criterion's score is less than or equal to the value you specify for this option.

Check for transposed characters
Specifies whether the match score should be adjusted for any transposed characters encountered.
Yes: The transform deducts half as many points for transposed characters as it deducts for other non-matching characters. For example, comparing Smith to Simth: characters 2 and 3 are transposed; percentage alike: 90%.
No: The transform handles transposed characters the same way it handles any other non-matching characters. For example, comparing Smith to Simth: characters 2 and 3 are transposed; percentage alike: 80%.

Numeric words match exactly
Specifies how to match data that contains both numbers and letters. This option applies to address data and custom data, such as a part number. Matching numerically involves two steps:
1. The string is first broken into words. The word breaking is performed on all punctuation and spacing, and then the words are assigned a numeric attribute. A numeric word is any word that contains at least one number from 0 to 9. For example, 4L is considered a numeric word, whereas FourL is not.
2. Numeric matching is performed according to the option setting that you choose:
• None: Numeric words do not need to match exactly to be considered a match.
• Any_Position: Numeric words do not need to be in the same position in two different strings to be considered a match. For example:
  - Street comparison: "4932 Main St # 101" and "# 101 4932 Main St" are considered a match.
  - Part description: "ACCU 1.4L 29BAR" and "ACCU 29BAR 1.4L" are considered a match.
• Any_Position_Consider_Punctuation: This value behaves the same as the Any_Position value; however, the Match transform takes the position of the decimal separator (comma or period) within the numeric words into consideration, so decimal numbers are not broken apart. For example, the string 123.456 is considered a single numeric word rather than two numeric words. For example:
  - Part description: "ACCU 29BAR 1.4L" and "ACCU 1.4L 29BAR" are considered a match.
  - Part description: "ACCU 1.4L 29BAR" and "ACCU 29BAR 1,4L" are not considered a match, because the decimal indicator between the 1 and the 4 differs in the two strings.
  - Financial data: "25,435" and "25.435" are not considered a match.
• Any_Position_Ignore_Punctuation: This value is similar to Any_Position_Consider_Punctuation, except that decimal separators do not impact the matching process. For example:
  - Part description: "ACCU 1,4L 29BAR" and "ACCU 29BAR 1.4L" are considered a match even though there is a different decimal indicator between the 1 and the 4.
• Same_Position: Numeric words must match exactly and be in the same position in the string to be considered a match. For example, 608-782-5000 will match 608-782-5000, but it will not match 782-608-5000.
Note: For this option to work, you must set the Compare algorithm option to Word.

Field compare length
Specifies the number of characters in the field to compare.

Remove punctuation
Specifies whether to remove punctuation from your data to help provide more accurate matches. Be aware of the following:
• This option is valid for English-language, Latin1 data only.
• This option is ignored for all other Match engine values.
Yes: Removes punctuation. No: Keeps punctuation in your data.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overwrites that setting. Before setting this option, be sure that there are no other criteria using this same field.

Convert to upper case
Specifies whether to convert all data to uppercase for matching purposes only. Be aware of the following:
• This option is valid for English-language, Latin1 data only.
• This option is ignored for all other Match engine options.
Yes: Converts the data to uppercase where appropriate. No: Leaves the data intact.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overwrites that setting. Before setting this option, be sure that there are no other criteria using this same field.

Convert diacritical characters
Specifies whether to include diacritical characters in the matching process. Be aware of the following:
• This option is valid for all Match engine options.
• This option works best when using the Latin1 engine. For example, if you are processing Japanese data, you may have some Latin1 data mixed in with it. In these cases, you will be able to convert diacritical characters.
Yes: Converts diacritical characters to the closest English ASCII equivalent for matching purposes. For example, ä converts to a.
No: Preserves diacritical characters during the match process. For example, ä is not identical to a.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overrides that setting. Before setting this option, be sure that there are no other criteria using this same field.

Convert text to numbers
Specifies whether numbers represented as text (one, two, three, ...) should be converted to numbers. If you choose Yes, they are converted in cardinal (one > 1) or ordinal (first > 1st) format. Be aware that this option is valid for all Match engines except the Other_Non_Latin1 engine.
Yes: Converts numbers represented as text to numbers. No: Leaves any numerical text intact.
Note: This option works on the mapped input field used in this and other criteria. If you set this option differently than it is set in another criterion using the same field, it overwrites that setting. Before setting this option, be sure that there are no other criteria using this same field.
To add a match criterion:
1. In the Option Explorer of the Match Editor, right-click the appropriate match level.
2. Select Add Criteria.
3. In the Match Criteria list, select the criterion that best represents the data that you want to match on. If you do not find what you are looking for, select Custom criteria. The Match criteria field mapping and Criteria options tables are activated.
4. In the Match criteria field mapping table, select an input field that contains the data you want to match on for this criterion.
5. In the Criteria options table, enter a name for your criterion in the Criteria name option.
   Note: You can keep the default name for predefined criteria, but you should enter a meaningful criteria name if you chose a Custom criterion.
6. Configure the remaining criteria options. Be sure to set the Match score and No match score, because these are required options. For example:

   Criteria option                               Value
   Match score                                   101
   No match score                                77
   Compare algorithm                             Word
   Use in weighted score if greater than         78
   Use in weighted score if less than or equal   78

7. To order your criteria, in the Options Explorer of the Match Editor (or the Match Table), select a criterion and use the Move Up or Move Down buttons as necessary.
Using Blank Field Options
In your business rules, you can control how the Match transform treats field comparisons when one or both of the fields compared are blank. If you choose to evaluate when one or both of the fields is blank, the Match transform scores the comparison using the score you specify. If you choose to ignore the blank fields, the score for this field rule does not contribute to the overall weighted score for the record comparison. The Match transform redistributes the contribution allotted for this field to the other criteria and recalculates.

To help you decide what score to enter, determine whether you want the Match transform to consider a blank field 0 percent similar to a populated field or another blank field, 100 percent similar, or somewhere in between. Your answer depends on which field you are comparing. Giving a blank field a high score might be appropriate if you are matching on a first or middle name or a company name, for example.

The following examples demonstrate how setting these blank matching options can affect the overall scoring of records.

Example 1
The One field blank operation for the Given_Name1 field is set to Ignore:
Figure 51: Blank Field Options 1
Example 2
The One field blank operation for the Given_Name1 field is set to Eval, with the One field blank score set to 0:
Figure 52: Blank Field Options 2
Example 3
The One field blank operation for the Given_Name1 field is set to Eval, with the One field blank score set to 100:
Figure 53: Blank Field Options 3
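The redistribution behavior can be sketched in Python; the criteria and contribution values below are assumptions chosen to echo the three examples above.

CONTRIB = {"given_name1": 20, "family_name": 40, "address": 40}  # totals 100

def weighted_score(similarities, blank, operation="Ignore", blank_score=0):
    # similarities: criterion -> score; blank: set of blank criteria.
    active, sims = dict(CONTRIB), dict(similarities)
    for criterion in blank:
        if operation == "Ignore":
            active.pop(criterion)  # drop it and renormalize below
        else:  # "Eval"
            sims[criterion] = blank_score
    scale = 100 / sum(active.values())  # proportional redistribution
    return sum(sims[c] * w * scale for c, w in active.items()) / 100

sims = {"given_name1": 0, "family_name": 100, "address": 100}
print(weighted_score(sims, {"given_name1"}, "Ignore"))     # 100.0
print(weighted_score(sims, {"given_name1"}, "Eval", 0))    # 80.0
print(weighted_score(sims, {"given_name1"}, "Eval", 100))  # 100.0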
Using Match Criteria Options
The majority of your data standardization should take place in the address cleansing and Data Cleanse transforms. However, there are a few preprocessing options specific to the match process that can provide more accurate matching, and these can be defined in the Match Editor. These options include removing punctuation, converting to upper case, converting diacritical characters, and converting text to numbers.
Matching on Phonetic Data
There are instances where using phonetic data as a criterion can produce more matches than matching on other criteria, such as name or firm data. For example, the names Smith and Smythe are only 72% similar when you match based on the name field, but are a 100% match when you match phonetically.

   Name     Comparison score
   Smith    72% similar
   Smythe

   Name     Phonetic key (primary)   Comparison score
   Smith    SMO                      100% similar
   Smythe   SMO
To match on phonetic data, use the Double Metaphone or Soundex functions to populate a field, and use that field for creating break groups or as a criterion in matching. If you intend to match on phonetic keys, set up the criteria options as follows:

   Option                            Setting
   Compare algorithm                 Field
   Check for transposed characters   No
   Initials adjustment score         0
   Substring adjustment score        0
   Abbreviation adjustment score     0

Set your match score options as follows:

   Option           Value
   Match score      100
   No match score   99

If you are matching on multiple criteria, including a phonetic criterion, place the phonetic criterion first in the order of criteria and set your match score options as follows:

   Option           Value
   Match score      101
   No match score   99
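For illustration only, here is a minimal pure-Python Soundex sketch; Data Services supplies its own Soundex and Double Metaphone functions, so this is not the product implementation.

def soundex(name):
    codes = {"bfpv": "1", "cgjkq": "2", "sxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        return next((d for letters, d in codes.items() if ch in letters), "")
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result, last = name[0].upper(), code(name[0])
    for ch in name[1:]:
        digit = code(ch)
        if digit and digit != last:
            result += digit
        if ch not in "hw":  # h and w do not reset the previous code
            last = digit
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smythe"))  # S530 S530: a 100% phonetic match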
Remember that, when you break, records that have no value are not in the same group as records that have a value (unless you set up matching on blank fields). For example, consider the following two input records:

   Title   First name   Last name   Address                         Postal code
   Mr                   Johnson     100 Main Street, La Crosse WI   54601
           Scott        Johnson     100 Main Street, La Crosse WI   54601

After these records are processed by the Data Cleanse transform, the first record has an empty first name field and, therefore, an empty phonetic key. This means there cannot be a match if you are creating break groups on the phonetic key. If you are not creating break groups, there cannot be a match if you are not blank matching.

The length that you assign to a phonetic function output is important. Consider the following example:

   First name (last name)   Key
   S (Johnson)              S
   Scott (Johnson)          SKT
Suppose these two records represent the same person. If you break on more than one character, these records are in different break groups, and therefore will not be compared.
Matching Unicode Data
Unicode matching lets you process any non-Latin1 Unicode data, with special processing for Chinese, Japanese, Korean, and Taiwanese (CJKT) data. For example, the Match transform will:
• Consider half-width and full-width characters to be equal.
• Consider native script numerals and Arabic numerals to be equal. It can interpret numbers that are written in native script; this can be controlled with the Convert text to numbers option.
• Include variations for popular, personal, and firm name characters in the referential data.
• Consider firm words, such as Corporation or Limited, to be equal to their variations (Corp. or Ltd.) during the matching comparison process. To find the abbreviations, the transform uses native script variations of the English alphabets during firm name matching.
• Ignore commonly used optional markers for province, city, district, and so on, in address data comparison.
• Intelligently handle variations in a building marker.

With Japanese data, the Match transform will:
• Consider block data markers, such as chome and banchi, to be equal to those used with hyphenated data.
• Consider words with or without Okurigana to be equal in address data.
• Consider variations of no marker, ga marker, and so on, to be equal.
• Consider variations of a hyphen or dashed line to be equal.

The Unicode match functionality does not:
• Perform conversions of simplified and traditional Chinese data.
• Compare different scripts, such as Kana to Kanji, or Chinese to English.

The Match transform provides some data normalization options to prepare your data for matching. These options are located in the Field (Match Input) option group. Before sending Unicode data into the matching process, you must first separate the data by country into separate match data flows. The Match Wizard can do this for you when you use the multi-national strategy.
To configure the Match transform for Unicode matching:
1. Use a Case transform to route your data to a Match transform that handles that type of data.
2. Open the AddressJapan_MatchBatch Match transform configuration and save it with a different name.
3. Set the Match engine option in the transform options option group to a value that reflects the type of data being processed. This option is set to Japanese in the match_starter_unicode sample.
4. Set any preprocessing options in the Match Criteria Editor. For example:
   • When possible, use criteria for parsed components of address, firm, and name data, such as Primary_Name or Person1_Family_Name1.
   • If you have parsed address, firm, or name data that does not have a corresponding criterion, use the Address_Data1-5, Firm_Data1-3, and Name_Data1-3 criteria.
   • For all other data that does not have a corresponding criterion, use the Custom criteria.
Additional Match Features
Typically, matching implies comparing two strings from an alphabetical perspective: we try to find out how similar two strings are when they are compared character by character, and calculate a similarity score. But this is not a good approach if the real meaning of the characters is more important. The semantics of the data, its real meaning, must also be considered. With proximity matching, Data Services interprets geographical, numeric, and date data as such and calculates similarity based on how close the values are.

What is proximity matching? It is the ability to match records based on their proximity, instead of comparing the string representation of the data. Data Services provides several types:

Geographical proximity
Finds duplicate records based on geographic proximity, using latitude and longitude information. This is not driving distance, but geographic distance, computed with the Haversine distance algorithm using WGS 84 (GPS) coordinates.
Figure 54: Geographical Proximity Matching
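A hedged Python sketch of the Haversine calculation just mentioned follows; the coordinates and the 0.5 km duplicate threshold are illustrative assumptions.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle (not driving) distance between two WGS 84 points, in km.
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

d = haversine_km(48.7758, 9.1829, 48.7761, 9.1854)
print(f"{d:.3f} km ->", "match" if d <= 0.5 else "no match")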
Numeric proximity
This method finds duplicates based on the numerical closeness of data, based either on numbers or on dates:
• Numeric proximity – finds duplicates based on the numerical closeness of data.
• Date proximity – finds duplicates based on date ranges.
Numeric proximity – Find duplicates based on numerical closeness of data Date proximity – Find duplicate based on date ranges.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Configuring the Match Transform
Figure 55: Numerical and Date Proximity Matching
Comparing Compound Family Names
This is an extension of the substring adjustment score option, in the sense that it loosens some of the prerequisites. Not only can it find new matches, it can also boost the score for some existing matches. The Approximate substring option works only on one field; if the family names are in separate fields, they need to be concatenated into one field.

It may be useful to know how adjustment scores are calculated, although it is not necessary for everyone to understand. Say there are two strings, "International" and "Intl.". Without any adjustment score:
score = (number of identical characters × 100) / total number of characters = (8 × 100) / 17 = 47.
Now if the adjustment score is 80, it is applied to the characters that are not identical:
score = (number of identical characters × 100 + number of non-identical characters × 80) / total number of characters = (8 × 100 + 9 × 80) / 17 = 89.
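The worked example above can be reproduced with a short Python sketch; the character counts are taken as given in the text, and how the transform counts identical characters internally is not detailed here.

def adjusted_similarity(identical, nonidentical, adjustment):
    total = identical + nonidentical
    return (identical * 100 + nonidentical * adjustment) / total

# "International" vs "Intl.": 8 identical, 9 non-identical characters.
print(round(adjusted_similarity(8, 9, 0)))   # 47 (no adjustment credit)
print(round(adjusted_similarity(8, 9, 80)))  # 89 (adjustment score of 80)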
Figure 56: Comparing Compound Family Names – Approximate Substring Matching
The Match Wizard previously supported only name, address, e-mail, and phone type fields when setting up a match strategy, and did not support non-party data such as product data or financial data. Now you can set up custom fields in the match strategy in the wizard. Custom fields are only used with the Simple Match strategy; the rest of the strategies are for party data (for example, customers or suppliers) only.
Figure 57: Custom Fields in Match Wizard
Exercise 5: Modify Match Criteria using the Match Editor

Exercise Objectives
After completing this exercise, you will be able to:
• Refine match criteria so that invalid results are not returned for a match on full names on the USA_Customers table.
Business Example
You have performed a simple match by full name on the Beta Businesses USA_Customers table. You have determined that matched records for at least one individual have different SSNs and are, therefore, invalid matches. You decide to further refine the match so that these invalid matches are eliminated.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
2. If the DQSTANDALONE and DELTA datastores already exist, continue to the next step. If they do not exist, create them according to the following directions.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: Create a new batch job Beta_Customers_Match2_Job with a data flow Beta_Customers_Match2_DF.

1. If the Omega project already exists, continue to the next step. If it does not exist, create a new project called Omega.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
2. In the Omega project, create a new batch job Beta_Customers_Match2_Job with a new data flow called Beta_Customers_Match2_DF.
Task 3: In the data flow Beta_Customers_Match2_DF, configure the Match Wizard in the Query transform.

1. In the Match_Target table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2. In the workspace for Beta_Customers_Match2_DF, add the Match_Target table from the DQSTANDALONE datastore as the source object connected to the Query transform.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.

1. Configure the Match Wizard in the Query transform.
Task 5: Configure a Match_NameAddr transform in the data flow, connecting it to a new target template table Match_Target.

1. Configure a Match_NameAddr transform in the data flow.
2. Add a new matching criterion based on the new input field DC_SSN.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no match. You decide to use the social security number.

1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criterion.
3. Adjust the values of the Address Secondary Number match criterion according to the following table:

   Option                    Value
   No match score            84
   One field blank score     85
   Both fields blank score   100

4. Adjust the values of the Person1 Given Name 1 match criterion according to the following table:

   Option                     Value
   Match score                101
   One field blank score      0
   Both fields blank score    0
   Initials adjustment score  90

5. Adjust the values of the Person1 Given Name 2 match criterion according to the following table:

   Option                     Value
   Match score                80
   No match score             79
   One field blank score      90
   Both fields blank score    100
   Initials adjustment score  100

6. Move the new criterion for Social Security Number 1 to be the first match criterion.
Task 7: Execute the batch job Beta_Customers_Match2_Job with the default parameters.

1. Configure a new target table for the data flow.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
Task 8: Adjust the match criteria in the Match transform to ignore the values for Social Security Number 1 when both record field values are blank. Execute the batch job Beta_Customers_Match2_Job again with the default parameters.

1. Adjust the match criteria in the Match transform for the data flow Beta_Customers_Match2_DF to ignore records when their social security number values are blank.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
Solution 5: Modify Match Criteria using the Match Editor

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.

1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b) In the dialog box, enter your assigned User ID.
c) Enter your password, which is the same as your User ID.
d) Choose the Log on button.
e) In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f) Choose OK.
2. If the DQSTANDALONE and DELTA datastores already exist, continue to the next step. If they do not exist, create them according to the following directions.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
a) In the Local Object Library, select the Datastores tab, right-click in the Datastores workspace, and select the New option from the context menu.
b) In the resulting dialog box, use the following options:

   Field                  Value
   Datastore name         DQSTANDALONE
   Datastore type         Database
   Database type          Microsoft SQL Server
   Database version       Microsoft SQL Server 2005
   Database server name   Supplied by the Instructor
   Database name          DQSTANDALONE
   User name              sourceuser (lower case)
   Password               sourcepass (lower case)

   Click OK to finish.
c) In the Local Object Library, right-click the DQSTANDALONE datastore and select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
d) Import the metadata for the DQSTANDALONE database source tables by selecting all of the following tables, right-clicking them, and choosing the option Import from the menu:
   • AddressCleansed_USA
   • DataCleansed_USA
   • DataCleansed2_USA
   • Match_Target
   • Match_Target2
   • Match_Target3
   • Match_Target4
   Note: This database is a read-only database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
a) In the Local Object Library, select the Datastores tab, right-click in the Datastores workspace, and select the New option from the context menu.
b) In the resulting dialog box, use the following options:

   Field                  Value
   Datastore name         DELTA
   Datastore type         Database
   Database type          Microsoft SQL Server
   Database version       Microsoft SQL Server 2005
   Database server name   Supplied by the Instructor
   Database name          DELTA## (where ## is the group number assigned by your Instructor)
   User name              student## or consult your instructor
   Password               student## or consult your instructor

   Choose OK to finish.
c) You do not have to import any metadata.
   Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: Create a new batch job Beta_Customers_Match2_Job with a data flow Beta_Customers_Match2_DF.

1. If the Omega project already exists, continue to the next step. If it does not exist, create a new project called Omega.
Note: Because this lesson may be used independently of the other lessons and exercises in this course, this step from the previous exercise has been duplicated.
a) From the Project menu, choose New → Project.
b) When the Project New dialog box appears, enter Omega in the Project name field.
c) Choose Create so that the new project appears in the Project area.

2. In the Omega project, create a new batch job Beta_Customers_Match2_Job with a new data flow called Beta_Customers_Match2_DF.
a) In the Project area, right-click the project name and choose New Batch Job from the menu.
b) Enter the name of the job as Beta_Customers_Match2_Job.
c) Press Enter to commit the change.
d) Select the Data Flow icon in the Tool Palette.
e) Select the workspace where you want to add the data flow.
f) Enter Beta_Customers_Match2_DF as the name.
g) Press Enter to commit the change.
h) Double-click the Beta_Customers_Match2_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match2_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b) Open the node Tables and right-click on the table Match_Target to choose the option Edit.
c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d) Choose OK to close the dialog box.
e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match2_DF, add the Match_Target table from the DQSTANDALONE datastore as the source object connected to the Query transform.
a) In the Local Object Library, select the Datastores tab and then select the Match_Target table from the DQSTANDALONE datastore.
b) Select and drag the object to the data flow workspace and, in the context menu, choose the option Make Source.
c) In the Tool Palette, select the icon for a Query transform and then select in the data flow workspace.
d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
e) Double-click the Query transform to open the transform editor.
f) In the transform editor, map all columns from the input schema to the output schema columns. Do this by selecting all the fields in the Schema In window and then dragging them to the Query object in the Schema Out window. Use the Back icon to close the Query transform editor.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
a) In the data flow Beta_Customers_Match2_DF work space, right-click on the Query transform to select the option Run Match Wizard.
b) On the Match Selection Type window, select the option for a simple match selection with only one way to identify an individual. Then choose Next.
c) Set the name for Match Set Name to NameAddr.
d) Specify that the first criterion is Full Name by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter.
e) Enter the second criterion as Address by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter. Then choose Next.
f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as seen below.

Criteria                          Field
PERSON1_FAMILY_NAME1              DC_FAMILY_NAME
PERSON1_GIVEN_NAME1               DC_GIVEN_NAME1
PERSON1_GIVEN_NAME1_MATCH_STD1    DC_GIVEN_NAME1_MATCH_STD1
PERSON1_GIVEN_NAME1_MATCH_STD2    DC_GIVEN_NAME1_MATCH_STD2
PERSON1_GIVEN_NAME1_MATCH_STD3    DC_GIVEN_NAME1_MATCH_STD3
PERSON1_GIVEN_NAME2               DC_GIVEN_NAME2
PERSON1_GIVEN_NAME2_MATCH_STD1    DC_GIVEN_NAME2_MATCH_STD1
ADDRESS_PRIMARY_NAME              AC_PRIM_NAME
ADDRESS_PRIMARY_NUMBER            AC_PRIM_NUM
ADDRESS_SECONDARY_NUMBER          AC_UNIT_NUM
Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If the fields are not mapped, select the field name in the Field column by using the drop-down list.

Hint: You may need to select More... in the field to see more field entries.

Then choose Next.
g) On the next window, define the break key by selecting POSTALCODE as the Upstream field. Set the start position to 1 and the length to 5.
h) Close the Match Wizard configuration to return to the data flow work space.
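The break key defined in step g) is what keeps the comparison workload manageable: only records that share the first five characters of the postal code are placed in the same break group and compared with one another. As a rough plain-Python sketch of this grouping (the POSTALCODE field name is taken from step g); everything else is illustrative):

from collections import defaultdict

def break_groups(records, start=1, length=5):
    # Group records on a substring of the postal code; only records in
    # the same break group are ever compared during matching.
    groups = defaultdict(list)
    for rec in records:
        key = rec['POSTALCODE'][start - 1:start - 1 + length]  # GUI positions are 1-based
        groups[key].append(rec)
    return list(groups.values())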
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target2.
1. Configure a Match_NameAddr transform to the data flow.
a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
b) Double-click the Match_NameAddr transform to open the transform editor.
c) Delete the following fields in the Schema Out window by selecting them, right-clicking, and choosing Delete.

Output Column
MC_GROUP_NUMBER
MC_MATCH_SCORE
MC_GROUP_COUNT
MC_GROUP_RANK
d) Rename the newly generated match columns as follows:

Original Column Name                       New Column Name
NameAddr_Level1_GROUP_NUMBER               MC_GROUP_NUMBER
NameAddr_Level1_MATCH_SCORE                MC_MATCH_SCORE
NameAddr_Level1_GroupStats1_GROUP_COUNT    MC_GROUP_COUNT
NameAddr_Level1_GroupStats1_GROUP_RANK     MC_GROUP_RANK

e) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.

Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
f) Add the DC_SSN column from the input schema to the Input tab in the parameters area to create a new entry.

Hint: You may need to select the field by selecting the blue arrow icon.
2. Add a new matching criterion based on the new input field DC_SSN.
a) In the transform editor, go to the Options tab and select the button Edit Options... to invoke the Match Editor.
b) In the Transform Options area, expand the nodes Matching → Level1 → Match Criteria if they are not already expanded.

Notice that the match method is rule-based. The order of evaluation is important: the decision is made on the last match key unless a no-match is determined based on other keys. The key that is more likely to determine a match or no-match should be evaluated first. In this case, it is the social security number.

Make a note of the current Match Score and No Match Score for each criterion. Open the Person1_Given_Name2 match criteria and view the values for the options. To view the options, select the criteria and then select the Options tab.
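As a rough plain-Python sketch of that evaluation order (this is an illustration, not the actual matching engine; the compare functions and score values are assumptions):

def rule_based_match(criteria, a, b):
    # criteria: ordered list of (compare_fn, match_score, no_match_score);
    # compare_fn(a, b) returns a similarity score from 0 to 100.
    for compare, match_score, no_match_score in criteria[:-1]:
        if compare(a, b) <= no_match_score:
            return False                    # earlier keys can only force a no-match
    compare, match_score, no_match_score = criteria[-1]
    return compare(a, b) >= match_score     # the last key makes the decision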
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
a) In the Match Editor, select the node Match Criteria and choose the Add Criteria button.
b) Enter Social Security Number 1 as the Criteria Name.
c) Select the tab Criteria Fields and open the node for Person. Scroll down to find and select the criteria field SOCIAL_SECURITY_NUMBER1.
d) In the Criteria field mapping section, from the drop-down list, choose the field DC_SSN as the entry for the field Input Field Mapped Name.
e) For the SSN criteria, select the Options tab and enter the value of 101 for the field Match score and 99 for the field No match score.
f) For the SSN criteria, enter the value of Field Similarity for the field Compare data using:.
g) Select the Apply button.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criteria.
a) Select the Address Primary Name match and go to the Options tab.
b) Set the value of the No match score field to 79.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
Option                   Value
No match score           84
One field blank score    85
Both field blank score   100

a) Select the Address Secondary Number match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

Option                     Value
Match score                101
One field blank score      0
Both field blank score     0
Initial adjustment score   90

a) Select the Person1 Given Name 1 match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:

Option                     Value
Match score                80
No match score             79
One field blank score      90
Both field blank score     100
Initial adjustment score   100
a) Select the Person1 Given Name 2 match and go to the Options tab.
b) Change the value of the Compare data using: field to Word Similarity using the drop-down list.
c) Adjust the values of the match criteria according to the table above.
6. Move the new Social Security Number 1 criteria to be the first match criteria.
a) In the Transformation Options hierarchy, select the SSN match criteria under Level 1.
b) Use the Move Up button underneath Transformation Options to place the SSN match criteria in the first position.
c) Choose OK to close the Match Editor.
d) Choose Back to close the transform editor.
Task 7: Execute the batch job Beta_Customers_Match2_Job with the default parameters.
1. Configure a new target table for the data flow.
a) From the Tool Palette, select the icon for a template table and select in the data flow work space to insert the new target template table. Enter Match_Target2 for the name and select the DELTA datastore from the drop-down list. Choose OK to create the template table.
b) Connect the Match_NameAddr transform to the target template table by selecting the Match_NameAddr transform and, while holding down the mouse button, dragging to the target template table. Release the mouse button to create the link.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match2_Job to select the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data.

Notice that there are two match groups. Records for A G Olson and Andrew Olson no longer match. However, records without social security numbers are not identified as matches in cases where they should be. The match criteria need to be adjusted again.
Task 8: Adjust the match criteria in the Match transform to ignore the values for Social Security Number 1 when both record field values are blank. Execute the batch job Beta_Customers_Match2_Job again with the default parameters.
1. Adjust the match criteria in the Match transform for the data flow Beta_Customers_Match2_DF to ignore records when their social security number values are blank.
a) Open the work space for the data flow Beta_Customers_Match2_DF.
b) Double-click the Match transform to open the transform editor.
c) Go to the Options tab and start the Match Editor.
d) For the Social Security Number 1 match criteria, set the option to ignore the values when both fields are blank. Find the field Both field blank operation and, using the drop-down list, choose the value IGNORE. If there are values for One field blank score and Both field blank score, delete them.
2. Execute the batch job Beta_Customers_Match2_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match2_Job to choose the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data.

Notice that the matches for Thomas Davenport are correctly identified. The records for Ervin Scott Benson and Scott Benson are identified as a match due to a high-level match setting called Compare Given_Name1 to Given_Name2. Similarly, records for Isabel Cruz and Isabel A. Cruz match due to the Match on hyphenated family name setting.
Figure 58: Match Editor Results
Lesson Summary
You should now be able to:
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
Related Information
• For more information about Compare Tables, see the Compare Tables section in the Data Services Technical Manuals.
Lesson: Performing Post-Match Processing

Lesson Overview
Post-match operations give you the opportunity to take action based on the results of your matching setup.
Lesson Objectives
After completing this lesson, you will be able to:
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
Business Example
The matching process can divide records into groups based on break keys. The records in those groups are considered to match to one degree or another. You want to be able to prioritize the records within the groups according to their source. This will help you identify the most important records as a source of data to create a single version of the truth.
Performing Post-Match Processing
Post-match operations give you the opportunity to take action based on the results of your matching setup. The following post-match operations are available:

Group Prioritization
Use the group prioritization operation to order records for processing by other post-match operations. It is used to control duplicates by determining which record is the master and which are subordinates.

Group Statistics
Use group statistics to analyze information about your data collections: the number of records within the collection, the sequential group order number, the group rank, and whether the records in a collection belong to more than one source. Group statistics are essential for generating data for match reports.

Best Record
Use the best record operation to salvage data from matching records and consolidate that data into a single best record, or copy it to all matching records.

Unique ID
Use the unique ID options to assign sequential identification numbers to each new record when adding records to a data warehouse. Unique records from the match are assigned a unique ID, while duplicates share an ID.
Setting Priorities
When records match, they are assigned to a match group. The Match transform can determine which record in a match group is the best, or highest-ranking, record using a group prioritization operation. The highest-ranking record is referred to as the master, and all other records in that match group are subordinates. This information is critical for the best record operations that consolidate these records.
Figure 59: Post-Match Process Prioritization
Record priority is the combination of field priority and blank penalty scores:
• Field priority score uses a field containing numeric values to calculate the priority. For example, if you want to give a higher priority to one source over another, you could assign fewer points to the source with the higher priority.
• Blank penalty score assigns a penalty value of up to 999 when the specified field is blank in a record.
Using best record prioritization
The first option for calculating the priority of a record is to determine priority based on best record. This involves selecting a field that contains a numeric value and using that value as the initial priority for the record, sorted in ascending or descending order. You can combine multiple fields to calculate this priority. If the field you want to use is not a numeric value, you need to convert the content so that it can be used for prioritization. Use the Query transform with the appropriate function to create a new field for the converted content. For example, you can use an IfThenElse function to replace strings with numeric values, or use the Julian function to convert a date to a number.

Using blank penalty prioritization
Alternatively, you can choose to order the records based on the completeness of data and define blank penalties. This feature assigns a lower priority to records in which particular fields are blank. Incorporating a blank penalty is appropriate if you feel that a blank field should not disqualify one record from matching another, but prefer to have the most complete record be your best record. As with best record priority, the blank penalty is a penalty-scoring system. For each blank field, you can assess a penalty of up to 999 points. You can assess the same penalty for each blank field, or assess a higher penalty for fields you consider more important. For example, if you were targeting a mailing to college students who primarily live in apartments or dormitories, you might assess a blank penalty of 5 to the less important fields Given_Name2 and Primary_Postfix. For more important fields, you might assess a blank penalty of 20 to Given_Name1 and Unit_Number. As a result, records would be prioritized as follows:
Figure 60: Blank Penalty Prioritization
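As a plain-Python illustration of the two conversions just mentioned (field names and priority values are assumptions; in the data flow itself you would write this as a Query transform mapping with ifthenelse or julian):

import datetime

def source_priority(source):
    # Mirrors ifthenelse(SOURCE = 'ACCT', 10, 20): a lower number
    # gives the preferred source the higher priority.
    return 10 if source == 'ACCT' else 20

def date_priority(d):
    # Mirrors the Julian conversion: a date becomes a sortable number.
    return d.toordinal()

print(source_priority('ACCT'))                    # 10
print(date_priority(datetime.date(2011, 6, 1)))   # 734289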
Combining best record with blank penalty
The best record priority and blank penalty scores are added together and considered as one score. When defining both priorities, you must determine which you consider most important, and weight each accordingly. For example, if you want to have the best record priority take precedence, set the intervals between the options to a large enough value that the combined blank penalties cannot bridge the gap.
Figure 61: Combining Best Record with Blank Penalty
With this scoring system, a record from the house source always receives priority over a record from a rented source, even if the house record has blank fields. Even though the house record contains five blank fields, it receives only 155 penalty points (100 + 5 + 20 + 5 + 5 + 20), while the record from source A receives 200 penalty points. The house record, therefore, has the lower penalty and the higher priority.

You can manipulate the scores to set priority exactly as you would like. For example, suppose you prefer a rented record containing first name data over a house record without first name data. You could set the first name blank penalty score so that a blank first name field would weigh more heavily than any source membership.
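The combined score can be sketched in a few lines of plain Python (the penalty values are taken from the worked example above; the record layout is illustrative):

def record_penalty(source_penalty, record, blank_penalties):
    # Total penalty = best record (source) penalty plus one penalty per
    # blank field; the record with the lowest total becomes the master.
    blanks = sum(points for field, points in blank_penalties.items()
                 if not record.get(field))
    return source_penalty + blanks

# House record with five blank fields: 100 + (5 + 20 + 5 + 5 + 20) = 155,
# which still beats a complete record from source A at 200 penalty points.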
Generating Group Statistics
Group statistics help you analyze your match groups. They provide information such as the number of records in a match group, the sequential group order number, the group rank, and whether records in a match group belong to more than one source. Group statistics are essential for generating data for match reports.
If you want to generate match reports, you must have a Group Statistics operation included in your Match transform. To get the most accurate data in your reports, make sure that you have defined unique names in the Match transform for the other operations. This helps you better understand which element is producing the data you are looking at. To generate the data you want to see in match reports, you must also set the Generate report statistics option to Yes in the Match transform. By turning on report data generation, you can get information about break groups, which criteria were instrumental in creating a match, and so on. Several match reports are available in the Data Services Management Console to help you analyze your match results:

• Match Contribution
The Match Contribution report provides information on the effect of the individual break groups and individual criteria on the total matching. Evaluating this information is helpful for fine-tuning break keys and match criteria.
• Match Criteria Summary
Data Services generates one Match Criteria Summary report per match set to provide a consolidated view of all key settings and the criteria settings. You can evaluate this information to determine whether adjusting field comparison lengths or criteria settings would be helpful.
• Match Source Statistics Summary
The Match Source Statistics report provides information about duplicates within and across sources.
• Match Duplicate Sample
The Match Duplicate Sample report provides a sample of duplicates in the match results. One report is generated for each Match transform in the job.
Exercise 6: Prioritize Matched Records

Exercise Objectives
After completing this exercise, you will be able to:
• Prioritize records from the Beta Businesses USA_Customers table to adhere to specific business rules.
Business Example
You previously used the English North America Data Cleanse transform to generate a new column (DC_FIRM) that contained cleansed and standardized firm names. By modifying the PERSON_FIRM_EN dictionary, you standardized records so that “Associates” became “Assoc”. Now you want to prioritize records based on their source. You want records with a source value of “ACCT” to take precedence over records with other source values.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: Create a new batch job Beta_Customers_Match3_Job with a data flow Beta_Customers_Match3_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. In the Omega project, create a new batch job Beta_Customers_Match3_Job with a new data flow called Beta_Customers_Match3_DF.
Task 3: In the data flow Beta_Customers_Match3_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target2 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2. In the workspace for Beta_Customers_Match3_DF, add the Match_Target2 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target3.
1. Configure a Match_NameAddr transform to the data flow.
2. Add a new matching criterion based on the new input field DC_SSN.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criteria.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
Option                   Value
No match score           84
One field blank score    85
Both field blank score   100
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

Option                     Value
Match score                101
One field blank score      0
Both field blank score     0
Initial adjustment score   90
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:

Option                     Value
Match score                80
No match score             79
One field blank score      90
Both field blank score     100
Initial adjustment score   100
6. Adjust the match criteria for Social Security Number 1 to ignore records when their social security number values are blank.
7. Move the new Social Security Number 1 criteria to be the first match criteria.
Task 7: Add a new output column Source_Priority to the Query transform, populated by an expression giving a higher priority to the source value ACCT.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
3. Add a new template target table called Match_Target3 in the DELTA datastore, connect it to the Match transform, and execute the job.
Task 9: Execute the batch job Beta_Customers_Match3_Job with the default parameters.
1. Execute the batch job Beta_Customers_Match3_Job with the default parameters.
Solution 6: Prioritize Matched Records

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
b) In the dialog box, enter your assigned User ID.
c) Enter your password, which is the same as your User ID.
d) Choose the Log on button.
e) In the list of repositories, select your repository DSREPO##, where ## is the number portion of your User ID.
f) Choose OK.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b) In the resulting dialog box, use the options:

Field                   Value
Datastore name          DQSTANDALONE
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DQSTANDALONE
User name               sourceuser (lower case)
Password                sourcepass (lower case)
Choose OK to finish.
c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
d) Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, right-clicking on them, and choosing the option Import from the menu.
• AddressCleansed_USA
• DataCleansed_USA
• DataCleansed2_USA
• Match_Target
• Match_Target2
• Match_Target3
• Match_Target4

Note: This database is a read-only database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
b) In the resulting dialog box, use the options:

Field                   Value
Datastore name          DELTA
Datastore type          Database
Database type           Microsoft SQL Server
Database version        Microsoft SQL Server 2005
Database server name    Supplied by the Instructor
Database name           DELTA## (where ## is the group number assigned by your Instructor)
User name               student## or consult your instructor
Password                student## or consult your instructor
Choose OK to finish.
c) You do not have to import any metadata.

Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: Create a new batch job Beta_Customers_Match3_Job with a data flow Beta_Customers_Match3_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.

Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.

a) From the Project menu, choose New → Project.
b) When the Project New dialog box appears, enter Omega in the Project name field.
c) Choose Create so that the new project appears in the Project area.
2. In the Omega project, create a new batch job Beta_Customers_Match3_Job with a new data flow called Beta_Customers_Match3_DF.
a) In the Project area, right-click the project name and choose New Batch Job from the menu.
b) Enter the name of the job as Beta_Customers_Match3_Job.
c) Press Enter to commit the change.
d) Select the Data Flow icon in the Tool Palette.
e) Right-click in the workspace where you want to add the data flow and select Add New → Dataflow.
f) Enter Beta_Customers_Match3_DF as the name.
g) Press Enter to commit the change.
h) Double-click the Beta_Customers_Match3_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match3_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target2 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
b) Open the node Tables and right-click on the table Match_Target2 to choose the option Edit.
c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
d) Choose OK to close the dialog box.
e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match3_DF, add the Match_Target2 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
a) In the Local Object Library, select the Datastores tab and then select the Match_Target2 table from the DQSTANDALONE datastore.
b) Select and drag the object to the data flow workspace and, in the context menu, choose the option Make Source.
c) In the Tool Palette, select the icon for a Query transform and then select in the data flow workspace.
d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
e) Double-click the Query transform to open the transform editor.
f) In the transform editor, map all columns from the input schema to the output schema columns. Do this by selecting all the fields listed in the Schema In window and then dragging them to the Query object in the Schema Out window.
g) Use the Back icon to close the transform editor.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
a) In the data flow Beta_Customers_Match3_DF work space, right-click on the Query transform to select the option Run Match Wizard.
b) On the Match Selection Type window, select the option for a simple match selection with only one way to identify an individual. Then choose Next.
c) Set the name for Match Set Name to NameAddr.
d) Specify that the first criterion is Full Name by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter.
e) Enter the second criterion as Address by selecting the field from the drop-down list in the Match On field. Use the value Word similarity for the field Compare Using. Choose Enter. Then choose Next.
f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as seen below.

Criteria                          Field
PERSON1_FAMILY_NAME1              DC_FAMILY_NAME
PERSON1_GIVEN_NAME1               DC_GIVEN_NAME1
PERSON1_GIVEN_NAME1_MATCH_STD1    DC_GIVEN_NAME1_MATCH_STD1
PERSON1_GIVEN_NAME1_MATCH_STD2    DC_GIVEN_NAME1_MATCH_STD2
PERSON1_GIVEN_NAME1_MATCH_STD3    DC_GIVEN_NAME1_MATCH_STD3
PERSON1_GIVEN_NAME2               DC_GIVEN_NAME2
PERSON1_GIVEN_NAME2_MATCH_STD1    DC_GIVEN_NAME2_MATCH_STD1
ADDRESS_PRIMARY_NAME              AC_PRIM_NAME
ADDRESS_PRIMARY_NUMBER            AC_PRIM_NUM
ADDRESS_SECONDARY_NUMBER          AC_UNIT_NUM
Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If the fields are not mapped, select the field name in the Field column by using the drop-down list.

Hint: You may need to select More... in the field to see more field entries.

Choose Next.
g) On the next window, define the break key by selecting AC_POSTCODE_FULL as the Upstream field. Set the start position to 1 and the length to 5.
h) Close the Match Wizard configuration to return to the data flow work space.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target3.
1. Configure a Match_NameAddr transform to the data flow.
a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
b) Double-click the Match_NameAddr transform to open the transform editor.
c) Delete the following fields in the Schema Out window by selecting them, right-clicking, and choosing Delete.

Output Column
MC_GROUP_NUMBER
MC_MATCH_SCORE
MC_GROUP_COUNT
MC_GROUP_RANK
NAMEADDR_LEVEL1_GROUP_NUMBER
NAMEADDR_LEVEL1_MATCH_SCORE
NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_COUNT
NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_RANK
d) Rename the newly generated match columns as follows:

Original Column Name                       New Column Name
NameAddr_Level1_GROUP_NUMBER               MC_GROUP_NUMBER
NameAddr_Level1_MATCH_SCORE                MC_MATCH_SCORE
NameAddr_Level1_GroupStats1_GROUP_COUNT    MC_GROUP_COUNT
NameAddr_Level1_GroupStats1_GROUP_RANK     MC_GROUP_RANK
e) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.

Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
f) Add the DC_SSN column from the input schema to the Input tab in the parameters area to create a new entry.
2. Add a new matching criterion based on the new input field DC_SSN.
a) In the transform editor, go to the Options tab and select the button Edit Options... to invoke the Match Editor.
b) In the Transform Options area, expand the nodes Matching → Level1 → Match Criteria if they are not already expanded.

Notice that the match method is rule-based. The order of evaluation is important: the decision is made on the last match key unless a no-match is determined based on other keys. The key that is more likely to determine a match or no-match should be evaluated first. In this case, it is the social security number.

Make a note of the current Match Score and No Match Score for each criterion. Open the Person1_Given_Name2 match criteria and view the values for the options. To view the options, select the criteria and then select the Options tab.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion based on Social Security Number 1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
a) In the Match Editor, select the node Match Criteria and choose the Add Criteria button.
b) Enter Social Security Number 1 as the Criteria Name.
c) Select the tab Criteria Fields and open the node for Person. Scroll down to find and select the criteria field SOCIAL_SECURITY_NUMBER1.
d) From the drop-down list, choose the field DC_SSN as the entry for the field Input Field Mapped Name.
e) For the Social Security Number 1 criteria, select the Options tab and enter the value of 101 for the field Match score and 99 for the field No match score.
f) For the Social Security Number 1 criteria, enter the value of Field Similarity for the field Compare data using:.
2. Adjust the value of the No match score field to 79 for the Address Primary Name match criteria.
a) Select the Address Primary Name match and go to the Options tab.
b) Set the value of the No match score field to 79.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
Option                   Value
No match score           84
One field blank score    85
Both field blank score   100

a) Select the Address Secondary Number match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

Option                     Value
Match score                101
One field blank score      0
Both field blank score     0
Initial adjustment score   90

a) Select the Person1 Given Name 1 match and go to the Options tab.
b) Adjust the values of the match criteria according to the table above.
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:
Option                     Value
Match score                80
No match score             79
One field blank score      90
Both field blank score     100
Initial adjustment score   100

a) Select the Person1 Given Name 2 match and go to the Options tab.
b) Change the value of the Compare data using: field to Word Similarity using the drop-down list.
c) Adjust the values of the match criteria according to the table above.
6. Adjust the match criteria for Social Security Number 1 to ignore records when their social security number values are blank.
a) For the Social Security Number 1 match criteria, set the option to ignore the values when both fields are blank. Find the field Both field blank operation and, using the drop-down list, choose the value IGNORE. If there are values for One field blank score and Both field blank score, delete them.
7. Move the new Social Security Number 1 criteria to be the first match criteria.
a) In the Transformation Options hierarchy, select the Social Security Number 1 match criteria under Level 1.
b) Use the Move Up button underneath Transformation Options to place the Social Security Number 1 match criteria in the first position.
c) Choose the OK button to close the Match Editor.
d) Use the Back icon in the Tool Bar to close the Match transform editor and return to the data flow workspace.
Task 7: Add a new output column Source_Priority to the Query transform, populated by an expression giving a higher priority to the source value ACCT.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
a) Double-click the Query transform to open the transform editor.
b) In the Schema Out area, right-click on the Source output column to select the option New output column and select Insert below. Enter Source_Priority as the name with a data type of int.
c) Select the output column Source_Priority and go to the Mapping tab. Select the Functions button, select the category Miscellaneous Functions, select the function IfThenElse, and choose Next.
d) For the “If” part of the function, choose the ... button to open the Smart Editor. Select the Data tab and open the node on the table Match_Target2. Scroll down to find and select the field Source.
e) Right-click on the field Source and choose Enter. In the Smart Editor, type in an equal sign = followed by the string 'ACCT'. Choose OK to return.
f) For the “Then” part of the function, type in 10.
g) For the “Else” part of the function, type in 20 and then select OK. The code should be as follows:

ifthenelse(MATCH_TARGET2.SOURCE = 'ACCT', 10, 20)
h) Choose Back to close the transform editor and return to the data flow work space.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
a) Double-click the Match transform to open the transform editor.
b) In the transform editor for the Match transform, add the Source_Priority column from the Schema In window to the Input tab in the parameters area to create a new entry.
c) Go to the Options tab and start the Match Editor.
d) In the Transform Options area on the left, select the node for Post Match Processing and then choose the Add Operation → Group Prioritization option.
e) In the Group Prioritization Editor, enter Priority as the Prioritization name. Then add these values on the Priority Order tab in the Priority fields area:

Input Field       Field Order
Source_Priority   Ascending
DC_SSN            Ascending
DC_Given_Name2    Ascending
AC_Unit_Num       Ascending

At this point, you have configured prioritization based on source. This assures that the records with a source value of ACCT take precedence over those with any other value.
f) For the prioritization group Priority, go to the Record Completeness tab and make the following settings:

Option                                    Value
Orders records on completeness of data   Select the checkbox
Define priority and penalty fields       Select the radio button
Record priority field                    SOURCE_PRIORITY
Choose the Apply button.
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
a) In the Group Prioritization Editor, go to the Blank penalty score area and set the Blank Penalty score for these fields:

Input Field       Blank Penalty
DC_SSN            1
DC_Given_Name2    1
AC_Unit_Num       1
Hint: You may need to choose OK to return to the Options tab and then select the Edit Options button a second time to permit these entries to appear.
b) In the navigation area on the left, select the node for Post Match Processing, select the node Priority, and use the Move Up button to move Priority before GroupStats1.
c) Choose OK to confirm your changes and close the Match Editor.
d) On the Output tab of the Match transform, select the check box for the field NameAddr_Level1_Priority_PRIORITY_VALUE. This outputs the field.
e) In the Schema Out pane, right-click on the output column NameAddr_Level1_Priority_PRIORITY_VALUE to select the option Properties and rename the output column to MC_PRIORITY. Choose OK.
f) Choose the Back button to close the transform editor and return to the data flow work space.
3. Add a new template target table called Match_Target3 in the DELTA datastore, connect it to the Match transform, and execute the job.
a) From the Tool Palette, select the icon for a template table and then select in the data flow workspace to insert the table. Use Match_Target3 as the name and DELTA as the datastore. Choose OK to create the table.
b) Connect the Match transform to the new target template table by selecting the Match transform, holding down the mouse button, dragging to the template table, and releasing the mouse button to create the link.
Task 9: Execute the batch job Beta_Customers_Match3_Job with the default parameters.
1. Execute the batch job Beta_Customers_Match3_Job with the default parameters.
a) In the Project area, right-click the batch job Beta_Customers_Match3_Job to choose the option Execute. Choose OK to save all objects you have created.
b) Choose OK to accept the default execution parameters.
c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data.

Notice that the master match record for Gloria Peterson now has an MC_Priority of 12 and the subordinate has a score of 22. Recall that the master record is the record with the lowest priority value: priority and score are inversely proportional. Score (12) = Source_Priority (10) + DC_Given_Name2 blank penalty (1) + AC_Unit_Num blank penalty (1) + DC_SSN blank penalty (0).
Figure 62: Prioritizing Match Results
Lesson Summary
You should now be able to:
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
Related Information
• For more information about these individual reports, see the Data Services Management Console: Metadata Reports Guide.
Lesson: Consolidating Matching Records

Lesson Overview
Consolidation eliminates, filters, or combines duplicate records using configurable rules. Consolidation can build a “best record”, allowing the best field selection based on your priorities: source, frequency, completeness, recency, and so on. It can also build reference keys to track individual records and their associations across multiple databases.
Lesson Objectives
After completing this lesson, you will be able to:
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
Business Example
The Match/Consolidate functionality in Data Services helps drive critical business decisions by revealing complex and multidimensional relationships in the enterprise data. It helps ensure that customer relationships are complete and accurate, enabling you to produce more effective targeted campaigns and provide better customer service. Within Business Intelligence, Enterprise, or Middleware applications, Match/Consolidate’s ability to handle a variety of customer and business data elements will improve duplicate detection coverage, allowing you to leverage multiple data sources for your analytical or operational needs.
Consolidating Matching Records to Form a Best Record
A key component in most data consolidation efforts is salvaging data from records within match groups and consolidating or posting that data to a best record or to all matching records. Best record operations enable you to update information in record fields based on their membership in a match group, their priority, and the completeness of the records within that match group.
Figure 63: Match to Consolidation
With the best record operation, you can:
• Salvage useful data from matching records before discarding them. For example, when checking a driver license file against your main source file, you might pick up gender or date-of-birth data to add to your master record.
• Post updated data, such as the most recent phone number, to all of the records in a match group. You can choose to post data to the master record, to all the subordinate members of the match group, or to all members of the match group.
The operations you set up in the Best Record operation always start with the highest priority member of the match group (the master) and work their way down to the last subordinate, one at a time. This ensures that data can be salvaged from the higher-priority record to the lower-priority record.
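A rough plain-Python sketch of that walk (the record layout and the strategy/post callbacks are assumptions, not the actual Best Record interface):

def post_best_record(match_group, strategy, post):
    # match_group: records already sorted by priority, master first.
    # strategy(source, destination) decides whether to take action;
    # post(source, destination) copies the salvaged data.
    for source, destination in zip(match_group, match_group[1:]):
        if strategy(source, destination):
            post(source, destination)   # e.g. fill in a missing phone number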
Best record strategies act as a filter for taking action on other fields. There are several strategies to assist in setting up the best record operation quickly and easily:
• Date
Select a date field and determine the best record based on the oldest or most recent date.
• Length
Select a string field and determine the best record based on the shortest or longest string of data in the field.
• Non Blank
Select any field and determine the best record based on the completeness of data in the field.
• Priority Number
Select a numeric field and determine the best record based on the highest or lowest number.
• Priority String
Select a string field and determine the best record based on ascending or descending alphabetic order of the data.
• Custom
Base your strategy entirely on custom Python code. This opens the Python Expression editor, where you create the code.
If none of these strategies fits your project needs, create a custom best record operation using custom Python code.

Note: The Best Record Summary report shows statistics about best record processing, indicating configuration settings and the results of the posting. This information can be used to assist in fine-tuning configuration settings.
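For example, a custom Python strategy might prefer whichever record carries the longer, presumably more complete, phone number. The sketch below shows only the decision logic in plain Python; the actual Python Expression editor exposes its own record objects and API, so the function signature and field name here are assumptions:

def keep_source_value(src_record, dst_record):
    # Return True when the source (higher-priority) record's PHONE value
    # should be posted to the destination record.
    src = (src_record.get('PHONE') or '').strip()
    dst = (dst_record.get('PHONE') or '').strip()
    return len(src) > len(dst)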
Generating a Unique Identification Number
A unique ID refers to a field within your data that contains a unique value that is associated with a record or group of records. A unique ID is to data what a social security number (SSN) is to a person. It creates and tracks data relationships between multiple jobs. With the Unique ID operation, you can set your own starting ID during the first execution. For each subsequent execution, the unique ID is the
next sequential value based on the existing highest unique ID. You can set your own starting ID for new key generation or have it dynamically assigned based on existing data by determining where the highest unique ID from the previous run ended.
Figure 64: Unique Identification Numbers
When the unique ID operation combines the update source information with the master database information to form one source of match group information, the following options are available:

• Assign
The Assign option assigns a new ID to unique records that need one or assigns a new ID to all members of a group that do not have an ID. In addition, the assign operation copies an existing ID if a member of a destination match group already has an ID. Each record is assigned a value (see the sketch after this list):
– Records in a match group where one record had an input unique ID share the value with other records in the match group that had no input value. The first value encountered is shared. Order affects this: place a group prioritization post-match operation prior to the unique ID operation if you have a priority field that can be sequenced using ASCII order.
– Records in a match group where two or more records had different unique ID input values each keep their input value.
– Records that do not have an input unique ID value have the next available ID assigned to them.
• Combine
The Combine option combines the IDs of a match group if more than one ID is represented. For example, this would occur because John Smith and Mary Higgins are on the update source, but they are also listed in the master database. Each record is assigned a value:
– Records in a match group where two or more records have different unique ID input values share the first input value. Order affects this: place a group prioritization post-match operation prior to the unique ID operation if you have a priority field that can be sequenced using ASCII order.
– Records in a match group where the first record has a blank unique ID value while later records have a value all have blank output unique ID values.
– Records that do not have an input unique ID value have a blank unique ID on output, even if a prior record in the match group has an input unique ID value.
• AssignCombine
The AssignCombine option performs both an assign and a combine operation. Each record is assigned a value:
– Records that do not have an input unique ID value and are not found to match another record containing an input unique ID value have the next available ID assigned to them. These are “add” records that could be unique records, or could be matches but not to another record that had previously been assigned a unique ID value.
– Records in a match group where one or more records had an input unique ID with the same or different values share the first value encountered with all other records in the match group. Order affects this: place a group prioritization post-match operation prior to the unique ID operation if you have a priority field that can be sequenced using ASCII order.
• Delete
The Delete option deletes unique IDs from records that no longer need them, provided that they are not protected from being deleted. If you are using a file and are recycling IDs, this ID is added to the file. The output for the unique ID depends on whether an input field in that record has a value that indicates that the ID is protected:
–
234
If the unique ID protected field is not mapped as an input field, the unique ID on output is blank for every record, regardless of the fact that an ID came in or not. If the unique ID protected field is mapped as an input field, a value of N means that the record's input data is deleted in the output unique ID.
© 2011 SAP AG. All rights reserved.
2011
BODS30
Lesson: Consolidating Matching Records
–
•
If the unique ID protected field is mapped as an input field, a value other than N means that the record's input data is retained in the output unique ID field.
Split Changes a split group's unique records so that the records that do not belong to the same match group have a different ID. The record with the group's highest priority keeps its unique ID. The rest are assigned new unique IDs. Examples of this are when an adult child moves from his or her family home and establishes their own household or in the case of a divorce. –
–
–
Records in a match group where two or more records had different unique ID input values or blank values each retain their input value, filled or blank respective to the record. Records that do not have an input unique ID value and do not match any record with an input unique ID value record have a blank unique ID on output. Records that come in with the same input unique ID value that no longer are found as matches have the first record output with the input value and the subsequent record is assigned a new unique ID value.
Let us say you have two distinct tables of data and there is a unique identifier that relates a row in table A to a row in table B as the same customer. J. Smith is in table A and J. Smith is in table B. These are really two different people, yet you have the same unique ID assigned to both of them as though they were the same person. Now you do a more comprehensive match process between these two tables, and the fact is uncovered that these two rows are not duplicates at all, yet they currently have the same unique ID. The Split option splits these two rows into two different data collections, keeping the existing unique ID in one of them, and assigning a new unique ID to the other.
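To make the Assign rules concrete, the following minimal Python sketch mimics the behavior described above on records that already carry a match group number. It is an illustration of the rules only, not the Match transform's actual implementation; the record layout and function name are assumptions made for the example.

    def assign_unique_ids(records, next_id):
        """Mimic the Assign operation: records is a list of dicts with a
        'group' key (match group number, or None for a unique record) and a
        'uid' key (input unique ID, or None if the record arrived without one).
        This is an illustrative sketch, not the product's implementation."""
        # The first ID encountered in each match group is the one that is
        # shared, which is why a group prioritization operation is placed
        # before the unique ID operation.
        group_uid = {}
        for rec in records:
            if rec['group'] is not None and rec['uid'] is not None:
                group_uid.setdefault(rec['group'], rec['uid'])

        for rec in records:
            if rec['uid'] is not None:
                continue  # records with different input IDs each keep their own
            shared = group_uid.get(rec['group'])
            if shared is not None:
                rec['uid'] = shared  # share the first ID found in the match group
            else:
                rec['uid'] = next_id  # next available ID for new or unique records
                next_id += 1
        return next_id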
Assigning Unique IDs

When specifying which numbers to use for unique IDs, there are several methods available:

• You can use a file of your own to assign a sequential number to records, beginning at whatever number you want, provided it is a positive integer greater than 0.
• You can manually enter a starting unique ID value.
• You can send the starting unique ID through a field in your data source created using the Query transform. The starting unique ID is passed to the Match transform before the first new unique ID is requested. If no unique ID is received, the starting number defaults to 1. Use caution with the Field option: the field that you use must contain the unique ID value you want to begin the sequential numbering with. This means that each record you process must contain this field, and every record must have the same value in this field.
• You can write unique IDs that are dropped during delete processing back to a file to be used later.
• You can recycle your own IDs by entering them in a file using the XML tag <R></R>. For example:

   <R>477</R>
   <R>599</R>
   <R>814</R>
To assign unique IDs using a file:
1. In the Unique ID option group, select the Value from file option.
2. In the File option, set the file name and path. This file must be an XML file and must adhere to the following structure:

   <UniqueIdSession>
   <CurrentUniqueId>477</CurrentUniqueId>
   </UniqueIdSession>
To assign IDs using a constant:
1. In the Unique ID option group, select the Constant value option.
2. Set the Starting value option to the desired ID value.
To assign IDs using a field:
1. In the Unique ID option group, select the Field option.
2. In the Starting unique ID field option, select the field that contains the starting unique ID value.
To recycle unique IDs:
1. In the Unique ID option group, set the Processing operation option to Delete.
2. In the Unique ID option group, select the Value from file option.
3. In the File option, set the file name and path. This is the same file that you might use for assigning a beginning ID number.
4. Set the Recycle unique IDs option to Yes.
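As a rough illustration of how recycling can interact with sequential assignment, the sketch below hands out recycled IDs before drawing new ones. The state dictionary stands in for the contents of the unique ID file, and the reuse-before-increment order is an assumption made for the example, not a statement about the product's file handling.

    def next_unique_id(state):
        """state mirrors the unique ID file: a current high-water mark plus
        any recycled IDs written back by a Delete run (illustrative only)."""
        if state['recycled']:
            return state['recycled'].pop(0)  # reuse a recycled ID first
        state['current'] += 1                # otherwise issue the next new ID
        return state['current']

    state = {'current': 850, 'recycled': [477, 599, 814]}
    print([next_unique_id(state) for _ in range(5)])  # [477, 599, 814, 851, 852]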
Exercise 7: Create a Best Record

Exercise Objectives
After completing this exercise, you will be able to:
• Create a best record from the Beta Businesses USA_Customers table by using the DC_PHONE and DC_EMAIL columns.
Business Example
You have prioritized records so that ACCT records take precedence over PRSP records. You now want to define a best record action for match duplicates.
Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
Task 2: Create a new batch job Beta_Customers_Match4_Job with a data flow Beta_Customers_Match4_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
2. In the Omega project, create a new batch job Beta_Customers_Match4_Job with a new data flow called Beta_Customers_Match4_DF.
Task 3: In the data flow Beta_Customers_Match4_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target3 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
2. In the workspace for Beta_Customers_Match4_DF, add the Match_Target3 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.

Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target4.
1. Configure a Match_NameAddr transform to the data flow.
2. Add a new matching criterion based on the new input field DC_SSN.
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion for Social_Security_Number1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
2. Adjust the value of the No match score field to 79 for the criteria Address Primary Name Match.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:

   Option                  Value
   No match score          84
   One field blank score   85
   Both field blank score  100

4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

   Option                    Value
   Match score               101
   One field blank score     0
   Both field blank score    0
   Initial adjustment score  90

5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:

   Option                    Value
   Match score               80
   No match score            79
   One field blank score     90
   Both field blank score    100
   Initial adjustment score  100
6. Adjust the match criteria for Social_Security_Number1 to ignore records when their social security number values are blank.
7. Move the new SSN criterion so that it is the first match criterion.
Task 7: Add a new output column Source_Priority to the Query transform populated by an expression giving a higher priority to the source value Acct.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field match Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
Task 9: Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
1. Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
Task 10: Create a new target template table, connect it to the Match transform, and execute the batch job Beta_Customers_Match4_Job with the default parameters.
1. Create a new template table called Match_Target4 in the DELTA datastore.
2. Execute the batch job Beta_Customers_Match4_Job with the default parameters.
Solution 7: Create a Best Record

Task 1: Create Datastores and import metadata for the Beta Businesses and Delta databases.
1. Log in to the Data Services Designer. If you have already logged in to the Designer, you may proceed to the next step.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   a) From the Windows Terminal Server (WTS) training environment desktop, choose Start → Programs → SAP Business Objects Data Services 4.0 SP1 → Data Services Designer.
   b) In the dialog box, enter your assigned User ID.
   c) Enter your password, which is the same as your User ID.
   d) Select the Log on button.
   e) In the list of repositories, select your repository DSREPO## where ## is the number portion of your User ID.
   f) Select the OK button.
2. If the DQSTANDALONE and DELTA datastores already exist, please continue to the next step. If they do not exist, then create these datastores according to the following directions.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   In your Local Object Library, create a new Datastore for the DQSTANDALONE database.
   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                 Value
      Datastore name        DQSTANDALONE
      Datastore type        Database
      Database type         Microsoft SQL Server
      Database version      Microsoft SQL Server 2005
      Database server name  Supplied by the Instructor
      Database name         DQSTANDALONE
      User name             sourceuser (lower case)
      Password              sourcepass (lower case)

      Choose OK to finish.
   c) In the Local Object Library, right-click the DQSTANDALONE datastore to select the Open option from the context menu. The tables from the DQSTANDALONE database are displayed in the Metadata window on the right.
   d) Import the metadata for the DQSTANDALONE database source tables by selecting all the tables, right-clicking on them, and choosing the option Import from the menu:
      • AddressCleansed_USA
      • DataCleansed_USA
      • DataCleansed2_USA
      • Match_Target
      • Match_Target2
      • Match_Target3
      • Match_Target4
      Note: This database is a read-only database.
3. In your Local Object Library, create a new Datastore for the DELTA staging database.
   a) In the Local Object Library, select the Datastores tab and right-click in the Datastores workspace to select the New option from the context menu.
   b) In the resulting dialog box, use the options:

      Field                 Value
      Datastore name        DELTA
      Datastore type        Database
      Database type         Microsoft SQL Server
      Database version      Microsoft SQL Server 2005
      Database server name  Supplied by the Instructor
      Database name         DELTA## (where ## is the group number assigned by your Instructor)
      User name             student## or consult your instructor
      Password              student## or consult your instructor

      Choose OK to finish.
   c) You do not have to import any metadata.
      Note: This database is a read/write database and will be used as the target for your template tables.
Task 2: Create a new batch job Beta_Customers_Match4_Job with a data flow Beta_Customers_Match4_DF.
1. If the Omega project already exists, please continue to the next step. If it does not exist, then create a new project called Omega.
   Note: Since this lesson could be used in a context where no other lesson/exercise in this course is being consumed, this step from the previous exercise has been duplicated.
   a) From the Project menu, choose New → Project.
   b) When the Project New dialog box appears, enter Omega in the Project name field.
   c) Choose Create so that the new project appears in the Project area.
2. In the Omega project, create a new batch job Beta_Customers_Match4_Job with a new data flow called Beta_Customers_Match4_DF.
   a) In the Project area, right-click the project name and choose New Batch Job from the menu.
   b) Enter the name of the job as Beta_Customers_Match4_Job.
   c) Choose Enter to commit the change.
   d) Select the Data Flow icon in the Tool Palette.
   e) Right-click in the workspace where you want to add the data flow and choose Add New → Dataflow.
   f) Enter Beta_Customers_Match4_DF as the name.
   g) Choose Enter to commit the change.
   h) Double-click the Beta_Customers_Match4_DF to open the workspace.
Task 3: In the data flow Beta_Customers_Match4_DF, configure the Match Wizard in the Query transform.
1. In the Match_Target3 table in the DQSTANDALONE datastore, set the content type for certain fields to Name.
   a) In the Local Object Library, select the Datastores tab and open the node for the DQSTANDALONE datastore.
   b) Open the node Tables and right-click on the table Match_Target3 to choose the option Edit.
   c) Right-click the field Cust_Name to select the option Properties and choose the type Name from the drop-down list for the Content Type associated with the field.
   d) Choose OK to close the dialog box.
   e) Close the table metadata by selecting the Back button.
2. In the workspace for Beta_Customers_Match4_DF, add the Match_Target3 table from the DQSTANDALONE datastore as the source object connected to the Query transform.
   a) In the Local Object Library, select the Datastores tab and then select the Match_Target3 table from the DQSTANDALONE datastore.
   b) Select and drag the object to the data flow workspace and in the context menu, choose the option Make Source.
   c) In the Tool Palette, select the icon for a Query transform and then click in the data flow workspace to add the Query transform.
   d) Connect the source table to the transform by selecting the source table and, while holding down the mouse button, dragging to the transform. Release the mouse button to create the link.
   e) Double-click the Query transform to open the transform editor.
   f) In the transform editor, map all columns from the input schema to the output schema columns. Do this by selecting all the fields listed in the Schema In window and then dragging them to the Query object in the Schema Out window.
   g) Delete the following field in the Schema Out window by selecting it, right-clicking, and selecting Delete:
      SOURCE_PRIORITY
   h) Use the Back icon to close the transform editor.
Task 4: Use the Match Wizard in the Query transform to set up a simple match.
1. Configure the Match Wizard in the Query transform.
   a) In the data flow Beta_Customers_Match4_DF workspace, right-click on the Query transform to select the option Run Match Wizard.
   b) On the Match Selection Type window, select the option for a simple match selection with only one way to identify an individual. Then choose the Next button.
   c) Set the name for Match Set Name to NameAddr.
   d) Specify that the first criterion is Full Name by selecting the field from the drop-down list in the field Match On. Use the value Word similarity for the field Compare Using. Choose Enter.
   e) Enter the second criterion, Address, by selecting the field from the drop-down list in the field Match On. Use the value Word similarity for the field Compare Using. Choose Enter. Then choose Next.
   f) In the Select Criteria Fields window, confirm that the fields are mapped appropriately to the criteria as seen below:

      Criteria                        Field
      PERSON1_FAMILY_NAME1            DC_FAMILY_NAME
      PERSON1_GIVEN_NAME1             DC_GIVEN_NAME1
      PERSON1_GIVEN_NAME1_MATCH_STD1  DC_GIVEN_NAME1_MATCH_STD1
      PERSON1_GIVEN_NAME1_MATCH_STD2  DC_GIVEN_NAME1_MATCH_STD2
      PERSON1_GIVEN_NAME1_MATCH_STD3  DC_GIVEN_NAME1_MATCH_STD3
      PERSON1_GIVEN_NAME2             DC_GIVEN_NAME2
      PERSON1_GIVEN_NAME2_MATCH_STD1  DC_GIVEN_NAME2_MATCH_STD1
      ADDRESS_PRIMARY_NAME            AC_PRIM_NAME
      ADDRESS_PRIMARY_NUMBER          AC_PRIM_NUM
      ADDRESS_SECONDARY_NUMBER        AC_UNIT_NUM

      Notice that matching is performed on the cleansed and standardized columns, which are the fields that begin with either DC_ or AC_. If the fields are not mapped, select the field name in the Field column by using the drop-down list.
      Hint: You may need to select the entry More... to see more field entries.
      Choose Next.
   g) On the next window, define the break key by selecting AC_POSTCODE_FULL as the Upstream field. Set the start position to 1 and the length to 5.
   h) Close the Match Wizard configuration to return to the data flow workspace.
Task 5: Configure a Match_NameAddr transform to the data flow, connecting it to a new target template table Match_Target4.
1. Configure a Match_NameAddr transform to the data flow.
   a) When you closed the Match Wizard, Data Services automatically inserted the Match_NameAddr transform into the data flow workspace.
   b) Double-click the Match_NameAddr transform to open the transform editor.
   c) Delete the following fields in the Schema Out window by selecting them, right-clicking, and choosing Delete:
      MC_GROUP_NUMBER
      MC_MATCH_SCORE
      MC_GROUP_COUNT
      MC_GROUP_RANK
      NAMEADDR_LEVEL1_GROUP_NUMBER
      NAMEADDR_LEVEL1_MATCH_SCORE
      NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_COUNT
      NAMEADDR_LEVEL1_GROUPSTATS1_GROUP_RANK
      MC_PRIORITY
      DC_PHONE
      DC_EMAIL
   d) Rename the newly generated match columns as follows:

      Original Column Name                     New Column Name
      NameAddr_Level1_GROUP_NUMBER             MC_GROUP_NUMBER
      NameAddr_Level1_MATCH_SCORE              MC_MATCH_SCORE
      NameAddr_Level1_GroupStats1_GROUP_COUNT  MC_GROUP_COUNT
      NameAddr_Level1_GroupStats1_GROUP_RANK   MC_GROUP_RANK

   e) Verify that all columns in the output schema are mapped correctly. Remap any columns that are not mapped.
      Hint: If the column does not have a blue arrow icon, the mapping needs to be reset.
   f) Add the DC_SSN column from the input schema to the Input tab in the parameters area to create a new entry.
2. Add a new matching criterion based on the new input field DC_SSN.
   a) In the transform editor, go to the Options tab and choose the button Edit Options... to invoke the Match Editor.
   b) In the Transform Options area, expand the nodes Matching → Level1 → Match Criteria if they are not already expanded. Notice that the match method is rule-based. The order of evaluation is important: the decision is made on the last match key unless a no-match is determined based on other keys, so the key that is most likely to determine a match or no-match should be evaluated first. In this case, it is the social security number. Make a note of the current Match Score and No Match Score for each criterion. Open the Person1_Given_Name2 match criteria and view the values for the options. To view the options, select the criteria and then select the Options tab.
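The evaluation order just described can be pictured with a small sketch. This is one simplified reading of the rule-based method as this lesson describes it (any earlier criterion can force a no-match; the match decision itself falls to the last criterion), not SAP's implementation; the criterion names and scores are taken from this exercise.

    def rule_based_decision(similarities, criteria):
        """similarities: criterion name -> similarity score (0-100).
        criteria: ordered list of (name, match_score, no_match_score).
        Illustrative sketch of the rule-based evaluation order only."""
        *earlier, (last, match_score, _) = criteria
        for name, _m, no_match in earlier:
            if similarities[name] <= no_match:
                return False          # an earlier key forces a no-match
        return similarities[last] >= match_score  # the last key decides

    # SSN first with scores 101/99: a differing SSN (similarity <= 99) forces
    # a no-match, while identical SSNs simply pass on to the next criterion.
    criteria = [('SSN', 101, 99), ('GivenName2', 80, 79)]
    print(rule_based_decision({'SSN': 100, 'GivenName2': 85}, criteria))  # True
    print(rule_based_decision({'SSN': 40, 'GivenName2': 85}, criteria))   # False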
Task 6: To improve the matching process, you want to learn how to add a criterion that definitively determines whether two records are a match or a no-match. You decide to use the social security number.
1. Add a new criterion for Social_Security_Number1 with a match criterion of Social_Security_Number1 and a mapped field of DC_SSN.
   a) In the Match Editor, select the node Match Criteria and choose the Add Criteria button.
   b) Enter Social Security Number 1 as the Criteria Name.
   c) Select the tab Criteria Fields and open the node for Person. Scroll down to find and select the criteria field SOCIAL_SECURITY_NUMBER1.
   d) From the drop-down list, choose the field DC_SSN as the entry for the field Input Field Mapped Name.
   e) For the Social_Security_Number1 criteria, select the Options tab and enter the value of 101 for the field Match score and 99 for the field No match score.
   f) For the Social_Security_Number1 criteria, enter the value Field Similarity for the field Compare data using:.
2. Adjust the value of the No match score field to 79 for the criteria Address Primary Name Match.
   a) Select the Address Primary Name Match and go to the Options tab.
   b) Set the value of the No match score field to 79.
3. Adjust the values of the Address Secondary Number match criteria according to the following table:
   Option                  Value
   No match score          84
   One field blank score   85
   Both field blank score  100

   a) Select the Address Secondary Number match and go to the Options tab.
   b) Adjust the values of the match criteria according to the table above.
4. Adjust the values of the Person1 Given Name 1 match criteria according to the following table:

   Option                    Value
   Match score               101
   One field blank score     0
   Both field blank score    0
   Initial adjustment score  90

   a) Select the Person1 Given Name 1 match and go to the Options tab.
   b) Adjust the values of the match criteria according to the table above.
5. Adjust the values of the Person1 Given Name 2 match criteria according to the following table:
   Option                    Value
   Match score               80
   No match score            79
   One field blank score     90
   Both field blank score    100
   Initial adjustment score  100

   a) Select the Person1 Given Name 2 match and go to the Options tab.
   b) Change the value of the Compare data using: field to Word Similarity using the drop-down list.
   c) Adjust the values of the match criteria according to the table above.
6. Adjust the match criteria for Social_Security_Number1 to ignore records when their social security number values are blank.
   a) For the Social_Security_Number1 match criteria, set the option to ignore the values when both fields are blank. Find the field Both field blank operation and, using the drop-down list, choose the value IGNORE. If there is a value for One field blank score and Both fields blank score, delete it.
7. Move the new SSN criterion so that it is the first match criterion.
   a) In the Transformation Options hierarchy, select the Social_Security_Number1 match criteria under Level 1.
   b) Choose the Move Up button underneath Transformation Options to place the SSN match criteria in the first position.
   c) Choose OK to close the Match Editor.
   d) Use the Back icon in the Tool Bar to close the Match transform editor and return to the data flow workspace.
Task 7: Add a new output column Source_Priority to the Query transform populated by an expression giving a higher priority to the source value Acct.
1. Add a new output column Source_Priority to the Query transform below the Source output column.
   a) Double-click the Query transform to open the transform editor.
   b) In the Schema Out area, right-click on the Source output column to select the option New output column. Enter Source_Priority as the name with a data type of int. Then select Insert below to create the new output column.
   c) Select the output column Source_Priority and go to the Mapping tab. Choose the Functions button, select the category Miscellaneous Functions, select the function IfThenElse, and choose Next.
   d) For the "If" part of the function, type in the string 'ACCT'.
   e) For the "Then" part of the function, type in 10.
   f) For the "Else" part of the function, type in 20 and then choose OK. The code should be as follows:
      ifthenelse(MATCH_TARGET3.SOURCE = 'ACCT', 10, 20)
   g) Choose Back to close the transform editor and return to the data flow workspace.
Task 8: Add the Source_Priority field to the Match transform. Adjust the Post Match Processing to add a prioritization group Priority with the field match Source_Priority assigned.
1. Add the Source_Priority field to the Match transform.
   a) Double-click the Match transform to open the transform editor.
   b) In the transform editor for the Match transform, add the Source_Priority column from the Schema In window to the Input tab in the parameters area to create a new entry.
   c) Go to the Options tab and start the Match Editor.
   d) In the navigation area on the left, open the node for Post Match Processing and choose Add Operation → Group Prioritization.
   e) In the Group Prioritization Editor, enter Priority as the Prioritization name. Then add these values on the Priority Order tab in the Priority fields area:

      Input Field      Field Order
      Source_Priority  Ascending
      DC_SSN           Ascending
      DC_Given_Name2   Ascending
      AC_Unit_Num      Ascending

      At this point, you have configured prioritization based on source. This ensures that records with a source value of ACCT take precedence over those with any other value.
   f) For the prioritization group Priority, go to the Record Completeness tab and make the following settings:

      Option                                   Value
      Orders records on completeness of data   Select the checkbox
      Define priority and penalty fields       Select the radio button
      Record priority field                    SOURCE_PRIORITY
2. To further prioritize records within the match group based on blank priority, configure the blank penalty scores for the fields DC_SSN, DC_Given_Name2, and AC_Unit_Num to 1 in the Group Prioritization Editor.
   a) In the Group Prioritization Editor, go to the Blank penalty score area and set the Blank Penalty score for these fields:

      Input Field     Blank Penalty
      DC_SSN          1
      DC_Given_Name2  1
      AC_Unit_Num     1

      Hint: You may need to choose the OK button to return to the Options tab and then choose the Edit Options button a second time to permit these entries to appear.
   b) In the navigation area on the left, open the node for Post Match Processing, select the node Priority, and use the Move Up button to move Priority before GroupStats1.
   c) Choose OK to confirm your changes and close the Match Editor.
   d) On the Output tab of the Match transform, select the check box for the field NameAddr_Level1_Priority_PRIORITY_VALUE. This outputs the field.
   e) In the Schema Out pane, right-click on the output column NameAddr_Level1_Priority_PRIORITY_VALUE to select the option Properties and rename the output column to MC_PRIORITY. Select the OK button to return to the Options tab.
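Conceptually, the prioritization just configured behaves like a sort key per record: source priority first, then a completeness penalty for blank fields. The sketch below is only an illustration of that idea using the values from this exercise, not how Data Services computes it internally.

    BLANK_PENALTY = {'DC_SSN': 1, 'DC_GIVEN_NAME2': 1, 'AC_UNIT_NUM': 1}

    def priority_key(record):
        # Lower tuples sort first: SOURCE_PRIORITY 10 (ACCT) beats 20 (PRSP);
        # ties are broken by completeness (fewer blank penalized fields wins).
        penalty = sum(p for field, p in BLANK_PENALTY.items() if not record.get(field))
        return (record['SOURCE_PRIORITY'], penalty)

    def rank_match_group(group):
        """Return the match group ordered so the first record is the master."""
        return sorted(group, key=priority_key)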
Task 9: Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
1. Add the DC_Phone and DC_Email fields to the Match transform Input tab from the Input schema.
   a) Double-click the Match transform to open the transform editor.
   b) Go to the Input schema, select the input columns DC_Phone and DC_Email, and drag them to the Input tab in that order.
   c) Go to the Options tab and start the Match Editor.
   d) In the navigation area on the left, select the node for Post Match Processing, select the Add Operation button, and select the option Best Record.
   e) In the Best Record Editor, enter these values:

      Option Name                     Option Value
      Best Record Name                Phone
      Best Record Strategy            LENGTH
      Strategy Priority               LONGEST
      Strategy Field                  DC_PHONE
      Posting Destination             ALL
      Post only once per destination  YES

   f) In the Best record action fields section, use the drop-down list to select DC_PHONE as the entry for both the Source Field and Destination Field.
   g) On the Options tab, choose Edit Options.
   h) In the navigation area on the left, select the node for Post Match Processing, choose Add Operation, and select the option Best Record.
   i) In the Best Record Editor, enter these values:

      Option Name                     Option Value
      Best Record Name                Email
      Best Record Strategy            NON_BLANK
      Strategy Field                  DC_EMAIL
      Posting Destination             MASTER
      Post only once per destination  YES

      In the Best record action fields section, use the drop-down list to select DC_EMAIL as the entry for both the Source Field and Destination Field.
   j) Move the Phone rule above GroupStats1 by selecting Phone and using the Move Up button.
   k) Move the Email rule above GroupStats1 by selecting Email and using the Move Up button.
   l) Choose OK to confirm your changes and close the Match Editor.
   m) Remap the DC_Phone and DC_Email columns from the Output tab to the Output schema by selecting the checkboxes in the Output tab.
      Note: In order for the Best Record processing to function properly, fields must be remapped from the Output tab to the Output schema, not from the Input schema.
   n) Use the Back button in the Toolbar to close the transform editor and return to the data flow workspace.
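To see what the two best record operations just configured are doing, the sketch below applies the same two strategies to one match group: LENGTH/LONGEST posted to ALL records for the phone, and NON_BLANK posted to the MASTER only for the e-mail. It is a plain-Python illustration under those assumptions, not the transform's implementation; the convention that the master is the first record in the group is also an assumption for the example.

    def post_best_phone(group):
        """LENGTH/LONGEST strategy, posting destination ALL (sketch only)."""
        best = max((rec.get('DC_PHONE') or '' for rec in group), key=len)
        if best:
            for rec in group:
                rec['DC_PHONE'] = best       # longest phone posted everywhere

    def post_best_email(group):
        """NON_BLANK strategy, posting destination MASTER (sketch only)."""
        master, subordinates = group[0], group[1:]
        if not master.get('DC_EMAIL'):
            value = next((r['DC_EMAIL'] for r in subordinates if r.get('DC_EMAIL')), None)
            if value:
                master['DC_EMAIL'] = value   # first non-blank subordinate value

This mirrors the verification at the end of the exercise: a master record missing phone and e-mail data (Gloria Peterson) gains values from its subordinates, while a group already populated with identical phones and non-blank e-mails (Thomas Davenport) is left unchanged.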
Task 10: Create a new target template table, connect it to the Match transform, and execute the batch job Beta_Customers_Match4_Job with the default parameters.
1. Create a new template table called Match_Target4 in the DELTA datastore.
   a) From the Tool Palette, select the icon for the template table and click inside the data flow workspace. Enter the name Match_Target4 in the DELTA datastore.
   b) Choose OK to create the table.
   c) Connect the Match transform to the template table by selecting the Match transform, holding down the mouse button, and dragging to the template table. Once the template table is selected, release the mouse button to create the link.
2. Execute the batch job Beta_Customers_Match4_Job with the default parameters.
   a) In the Project area, right-click the batch job Beta_Customers_Match4_Job to select the option Execute. Choose OK to save all objects you have created.
   b) Choose OK to accept the default execution parameters.
   c) In the data flow workspace, select the magnifying glass in the lower right corner of the target template table to view the data.
   d) In the target table, select the column heading for the MC_GROUP_NUMBER column to sort the data. Examine the DC_Phone and DC_Email columns for the match duplicates of Thomas Davenport and Gloria Peterson to verify that the Best Record action was correctly applied. Initially, the master record for Gloria Peterson did not have phone and e-mail data; after the Best Record action, the phone and e-mail values for the master and subordinate records are identical. Initially, the matching records for Thomas Davenport had the same phone number and different e-mail addresses. After the Best Record operation, these values remain unchanged: the phone numbers are identical, and none of Thomas Davenport's records were missing e-mail data.
Figure 65: Best Record Results
Lesson Summary
You should now be able to:
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
Lesson: Using Advanced Match Strategies

Lesson Overview
Householding and associative matching are advanced match strategies available in Data Quality.
Lesson Objectives
After completing this lesson, you will be able to:
• Use consumer and corporate householding
• Use associative matching
Business Example
In examining the records in the Beta Businesses USA_Customers table, you find multiple levels of consumer (for example, individual and family accounts). You also find multiple levels of company (for example, corporation, subsidiary, and contact levels). You want to know how you can match records among these multiple levels.
Defining Householding
Householding is a process in which organizations hierarchically structure their data to obtain context-sensitive views of customers, suppliers, patients, and so on. The initial requirement is to provide clear and concise marketing data by identifying all individuals who reside at the same physical residence, hence the name householding. This enables execution of efficient and cost-saving campaigns at any level of granularity. In Master Data Management (MDM), where master data is shared by several disparate groups within an organization, it is crucial to have a hierarchical customer view. These hierarchies, or households, can be constructed to any number of levels.

From the standpoint of running the business, having data linked together properly is essential. Customer service can provide a complete 360-degree view of the customer base, providing an improved Customer Relationship Management (CRM) perspective. For example, sales can better understand the true cost of sale and cost of acquisition, and management can get a clear understanding of the key financial costs related to the various groups.
A householding strategy should be applied whenever matching business rules consist of multiple levels of consumer or corporate relationships (for example, individual and family accounts). For a corporation, householding may be applied at corporation, subsidiary, and contact levels.
Using Consumer and Corporate Householding
Consumer householding is used to identify when multiple individuals in a family or group live together in the same residence. When setting up consumer householding, there are Match transform configurations available to match on the following criteria:
• For situations where individuals sharing a residence may not belong to the same family or share a family name, you can use the ConsumerHouseholdResInd_MatchBatch configuration. This allows you to match on two criteria: residence based on similar address data, and individual based on similar name data.
• For situations where individuals sharing a residence are expected to share a family name, you can use the ConsumerHouseholdResFamInd_MatchBatch configuration. This allows you to match on three criteria: residence based on similar address data, family based on similar family name data, and individual based on similar given name and post name data.
In the sample output below, notice that there are four records that match at the residence level. Of those four records, two records match at the individual level (Elizabeth and Liz).
Figure 66: Advanced Match Strategies 1–Householding
Corporate householding is used when your matching business rules consist of multiple levels of corporate relationships, such as corporate matches, subsidiary matches, and contact matches. For these records, you can use the CorporateHouseholdFirmInd_MatchBatch configuration. This allows you to match on two criteria: corporation based on similar firm and address data, and contact based on similar name data.
Using Associative Matching
Associative matching is the process where records are compared between match sets. If a record belongs to more than one match set, all records for these match sets become members of a new associated match group. The objective of this type of matching is to uncover matches that cannot be found based on a single match criterion. Use associative matching any time you want to uncover hidden relationships. Associative matching is commonly used to identify people who may have multiple addresses for marketing and mailing efforts. For example, college students have one address during the academic school year and another permanent address.

The resulting groups of previous Match transforms can be combined in the Associate transform. This transform uses the generated group data from previous match processes and determines the overlapping results, such as name and address with name and SSN, as a single ID. This is commonly referred to as association matching.

The Associate transform accepts a group number field, generated by the Match transforms, for each match result that will be combined. The transform can then output a new associated group number. The Associate transform can operate either on all the input records or on one data collection at a time. The latter is needed for transactional support.
Figure 67: Advanced Matching Strategies 2 – Associative
In the sample output above, notice the matches for Isabel Garcia-Cruz. There are three records in match group 1 (based on name and address matching), two records each in match groups 2 and 3 (based on name and email), and two records in match group 4 (based on name and phone). Since there is an overlap between the name and address and name and email match groups for Isabel Garcia-Cruz, all records in those match groups are considered a match. Since there is an overlap between the name and email and name and phone match groups for Isabel Garcia-Cruz, all records in these match groups are also considered a match. Finally, since match group 2 (based on name and email) matches match group 1 (based on name and address), as well as match group 4 (based on name and phone), all records of these match sets can be associated to form one large associated match group. Without associative matching, the largest match group for Isabel Garcia-Cruz had three members. By performing associative matching, three hidden relationships were uncovered and a match set of six records was found.
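The group-merging step the Associate transform performs can be pictured as a connected-components problem: any record that carries group numbers from two match sets links those groups. The following minimal union-find sketch illustrates that idea; the record layout, field names, and function name are assumptions for the example, not the transform's API.

    def associate(records, group_fields):
        """records: list of dicts; group_fields: one group-number field per
        upstream match set (value None when the record matched nothing there).
        Illustrative sketch of association matching, not SAP's implementation."""
        parent = {}

        def find(x):
            while parent.setdefault(x, x) != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        # A record carrying group numbers from two match sets links those groups.
        for rec in records:
            groups = [(f, rec[f]) for f in group_fields if rec.get(f) is not None]
            for a, b in zip(groups, groups[1:]):
                parent[find(a)] = find(b)

        # Emit one associated group number per connected component.
        labels = {}
        for rec in records:
            groups = [(f, rec[f]) for f in group_fields if rec.get(f) is not None]
            if groups:
                rec['ASSOC_GROUP'] = labels.setdefault(find(groups[0]), len(labels) + 1)
            else:
                rec['ASSOC_GROUP'] = None
        return records

Applied to the Isabel Garcia-Cruz example, the overlaps between the name-and-address, name-and-email, and name-and-phone groups would place all six records in one connected component, and therefore in one associated match group.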
Lesson Summary
You should now be able to:
• Use consumer and corporate householding
• Use associative matching
Unit Summary
You should now be able to:
• Create a match strategy
• Define match components
• Set up matching using the Match Wizard
• Complete a simple match for full names on the Beta Businesses USA_Customers table
• Control record comparisons
• Use match levels
• Define match criteria
• Use blank field options
• Use match criteria options
• Modify match criteria using the Match Editor
• Perform post-match processing
• Set priorities
• Configure the group prioritization operation
• Set up for match reports
• Consolidate matching records to form a best record
• Configure the best record operation
• Generate a unique identification number
• Use consumer and corporate householding
• Use associative matching
Course Summary
You should now be able to:
• Complete an Address Cleanse transform
• Work with global address data
• Complete a Data Cleanse transform
• Set up matching using the Match Wizard
• Consolidate matching records to form a best record
Feedback
SAP AG has made every effort in the preparation of this course to ensure the accuracy and completeness of the materials. If you have any corrections or suggestions for improvement, please record them in the appropriate place in the course evaluation.