
Velocity v8 Best Practices

Best Practices









B2B Data Exchange
    B2B Data Transformation Installation (for Unix)
    B2B Data Transformation Installation (for Windows)
    Deployment of B2B Data Transformation Services
    Establishing a B2B Data Transformation Development Architecture
    Testing B2B Data Transformation Services

Configuration Management and Security
    Configuring Security
    Data Analyzer Security
    Database Sizing
    Deployment Groups
    Migration Procedures - PowerCenter
    Migration Procedures - PowerExchange
    Running Sessions in Recovery Mode
    Using PowerCenter Labels

Data Analyzer Configuration
    Deploying Data Analyzer Objects
    Installing Data Analyzer

Data Connectivity
    Data Connectivity using PowerCenter Connect for BW Integration Server
    Data Connectivity using PowerExchange for WebSphere MQ
    Data Connectivity using PowerExchange for SAP NetWeaver
    Data Connectivity using PowerExchange for Web Services

Data Migration
    Data Migration Principles
    Data Migration Project Challenges
    Data Migration Velocity Approach

Data Quality and Profiling
    Build Data Audit/Balancing Processes
    Continuing Nature of Data Quality
    Data Cleansing
    Data Profiling
    Data Quality Mapping Rules
    Data Quality Project Estimation and Scheduling Factors
    Developing the Data Quality Business Case
    Effective Data Matching Techniques
    Effective Data Standardizing Techniques
    Integrating Data Quality Plans with PowerCenter
    Managing Internal and External Reference Data
    Real-Time Matching Using PowerCenter
    Testing Data Quality Plans
    Tuning Data Quality Plans
    Using Data Explorer for Data Discovery and Analysis
    Working with Pre-Built Plans in Data Cleanse and Match

Development Techniques
    Designing Data Integration Architectures
    Development FAQs
    Event Based Scheduling
    Key Management in Data Warehousing Solutions
    Mapping Auto-Generation
    Mapping Design
    Mapping SDK
    Mapping Templates
    Naming Conventions
    Naming Conventions - B2B Data Transformation
    Naming Conventions - Data Quality
    Performing Incremental Loads
    Real-Time Integration with PowerCenter
    Session and Data Partitioning
    Using Parameters, Variables and Parameter Files
    Using PowerCenter with UDB
    Using Shortcut Keys in PowerCenter Designer
    Working with JAVA Transformation Object

Error Handling
    Error Handling Process
    Error Handling Strategies - Data Warehousing
    Error Handling Strategies - General
    Error Handling Techniques - PowerCenter Mappings
    Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Integration Competency Centers and Enterprise Architecture
    Business Case Development
    Canonical Data Modeling
    Chargeback Accounting
    Engagement Services Management
    Information Architecture
    People Resource Management
    Planning the ICC Implementation
    Proposal Writing
    Selecting the Right ICC Model

Metadata and Object Management
    Creating Inventories of Reusable Objects & Mappings
    Metadata Reporting and Sharing
    Repository Tables & Metadata Management
    Using Metadata Extensions
    Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Metadata Manager Configuration
    Configuring Standard Metadata Resources
    Custom XConnect Implementation
    Customizing the Metadata Manager Interface
    Estimating Metadata Manager Volume Requirements
    Metadata Manager Business Glossary
    Metadata Manager Load Validation
    Metadata Manager Migration Procedures
    Metadata Manager Repository Administration
    Upgrading Metadata Manager

Operations
    Daily Operations
    Data Integration Load Traceability
    Disaster Recovery Planning with PowerCenter HA Option
    High Availability
    Load Validation
    Repository Administration
    Third Party Scheduler
    Updating Repository Statistics

Performance and Tuning
    Determining Bottlenecks
    Performance Tuning Databases (Oracle)
    Performance Tuning Databases (SQL Server)
    Performance Tuning Databases (Teradata)
    Performance Tuning in a Real-Time Environment
    Performance Tuning UNIX Systems
    Performance Tuning Windows 2000/2003 Systems
    Recommended Performance Tuning Procedures
    Tuning and Configuring Data Analyzer and Data Analyzer Reports
    Tuning Mappings for Better Performance
    Tuning Sessions for Better Performance
    Tuning SQL Overrides and Environment for Better Performance
    Using Metadata Manager Console to Tune the XConnects

PowerCenter Configuration
    Advanced Client Configuration Options
    Advanced Server Configuration Options
    Causes and Analysis of UNIX Core Files
    Domain Configuration
    Managing Repository Size
    Organizing and Maintaining Parameter Files & Variables
    Platform Sizing
    PowerCenter Admin Console
    PowerCenter Enterprise Grid Option

PowerExchange Configuration
    PowerExchange for Oracle CDC
    PowerExchange for SQL Server CDC
    PowerExchange Installation (for AS/400)
    PowerExchange Installation (for Mainframe)

Project Management
    Assessing the Business Case
    Defining and Prioritizing Requirements
    Developing a Work Breakdown Structure (WBS)
    Developing and Maintaining the Project Plan
    Developing the Business Case
    Managing the Project Lifecycle
    Understanding and Setting UNIX Resources for PowerCenter Installations
    Using Interviews to Determine Corporate Data Integration Requirements

Upgrades
    Upgrading Data Analyzer
    Upgrading PowerCenter
    Upgrading PowerExchange

B2B Data Transformation Installation (for Unix)

Challenge

Install and configure B2B Data Transformation on new or existing hardware, either in conjunction with PowerCenter or co-existing with other host applications on the same application server.

Note: B2B Data Transformation (B2BDT) was formerly called Complex Data Exchange (CDE). All references to CDE in this document now refer to B2BDT.

Description

Consider the following questions when determining what type of hardware to use for B2BDT.

If the hardware already exists:

1. Are the processor and operating system supported by B2BDT?
2. Are the necessary operating system patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the B2BDT application?
5. Will B2BDT share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the complex data transformation requirements for B2BDT. The hardware requirements for the B2BDT environment depend upon the data volumes, number of concurrent users, application server and operating system used, among other factors. For exact sizing recommendations, contact Informatica Professional Services for a B2BDT Sizing and Baseline Architecture engagement.

Planning for B2BDT Installation

There are several variations of the hosting environment from which B2BDT services will be called. This has implications for how B2BDT is installed and configured.

Host Software Environment

The most common configurations are:

● B2BDT used in conjunction with PowerCenter
● B2BDT as a standalone configuration
● B2BDT in conjunction with a non-PowerCenter integration, using an adapter for other middleware software such as WebMethods

In addition, B2BDT 4.4 included a mechanism for exposing B2BDT services through web services so that they can be called from applications capable of calling web services. Depending on which host options are chosen, installation may vary.

Installation of B2BDT for a PowerCenter Host Environment

Be sure to have the necessary licenses and the additional plug-in required for the PowerCenter integration. Refer to the appropriate installation guide or contact Informatica support for details on installing B2BDT in PowerCenter environments.

Installation of B2BDT for a Standalone Environment

When using B2BDT services in a standalone environment, it is expected that one of the invocation methods (e.g., web services, .NET, Java APIs, command line or CGI) will be used to invoke B2BDT services. Consult the accompanying B2BDT documentation for use in these environments.

Non-PowerCenter Middleware Platform Integration

Be sure to plan for additional agents to be installed. Refer to the appropriate installation guide or contact Informatica support for details for installing B2BDT in environments other than PowerCenter.

Other Decision Points

Where will the B2BDT service repository be located? The choices for the location of the service repository are i) a path on the local file system or ii) a shared network drive. The justification for using a shared network drive is typically to simplify service deployment when two separate B2BDT servers need to share the same repository. While the use of a shared repository is convenient for a multi-server production environment, it is not advisable for development, as there is a danger of multiple development teams overwriting the same project files. When a repository is shared between multiple machines and a service is deployed via B2BDT Studio, the Service Refresh Interval setting controls how quickly other running installations of B2BDT detect the deployment of the service.

What are the multi-user considerations? If multiple users share a machine (but not at the same time), the environment variable IFConfigLocation4 can be used to point to a different configuration file for each user.
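For example, a minimal sketch of per-user configuration on a shared UNIX machine, assuming each user keeps a private copy of CMConfig.xml in a hypothetical ~/b2bdt directory (the variable name IFConfigLocation4 is from the product; the paths are illustrative only):

# In each user's ~/.profile (sh, ksh or bash)
IFConfigLocation4=$HOME/b2bdt/CMConfig.xml   # per-user B2BDT configuration file
export IFConfigLocation4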

Security Considerations

As the B2BDT repository, workspace and logging locations are directory-based, all directories to be used should be granted read and write permissions for the user identity under which the B2BDT service will run. The identity associated with the caller of the B2BDT services will also need permission to execute the files installed in the B2BDT binary directory. Special consideration should be given to environments such as web services, where the user identity under which the B2BDT service runs may be different from the interactive user or the user associated with the calling application.
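A minimal permissions sketch for a UNIX installation, assuming the default install location and a hypothetical service account and group both named b2bdt (adjust owner, group and paths to the actual environment):

# Grant the B2BDT runtime identity read/write access to the repository and log locations
chown -R b2bdt:b2bdt /opt/Informatica/ComplexDataExchange/ServiceDB /opt/Informatica/ComplexDataExchange/CMReports
chmod -R u+rw /opt/Informatica/ComplexDataExchange/ServiceDB /opt/Informatica/ComplexDataExchange/CMReports

# Callers of B2BDT services need execute access to the binary directory;
# here the calling identities are assumed to be members of the b2bdt group
chgrp -R b2bdt /opt/Informatica/ComplexDataExchange/bin
chmod -R g+rx /opt/Informatica/ComplexDataExchange/bin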

Log File and Tracing Locations

Log files and tracing options should be configured with appropriate recycling policies. The calling application must have permissions to read, write and delete files in the path that is set for storing these files.

B2BDT Pre-install Checklist

B2BDT has client and server components. Only the server (or engine) component is installed on UNIX platforms. The client, or development studio, is only supported on the Windows platform. Reviewing the environment and recording the information in a detailed checklist facilitates the B2BDT install.

Minimum System Requirements

Verify that the minimum requirements for operating system, disk space, processor speed and RAM are met and record them in the checklist. Verify the following:

● B2BDT requires a Sun Java 2 Runtime Environment (version 1.5.x or above). B2BDT bundles the appropriate JRE version; the installer can be pointed to an existing JRE, or a JRE can be downloaded from Sun.
    ❍ If the server platform is AIX, Solaris or Linux, JRE version 1.5 or higher is installed and configured.
    ❍ If the server platform is HP-UX, JRE version 1.5 or higher and the Java "-AA" add-on are installed and configured.
● A login account and directory have been created for the installation.
● Confirm that the profile file is not write-protected. The setup program needs to update the profile.
    ❍ ~/.profile if you use the sh, ksh, or bash shell
    ❍ ~/.cshrc or ~/.tcshrc if you use the csh or tcsh shell
● 500MB or more of temporary workspace is available.
● Data and stack size:
    ❍ If the server platform is Linux, the data and the stack size are not limited.
    ❍ If the server platform is AIX, the data size is not limited.

PowerCenter Integration Requirements

Complete a separate checklist for integration if you plan to integrate B2BDT with PowerCenter. For an existing PowerCenter installation, the B2BDT client will need to be installed on at least one PC on which the PowerCenter client resides. Also, B2BDT components will need to be installed on the PowerCenter server. If utilizing an existing PowerCenter installation, determine the following:

● Which version of PowerCenter is being used (8.x required)?
● Is the PowerCenter version 32 bit or 64 bit?
● Are the PowerCenter client tools installed on the client PC?
● Is the PowerCenter server installed on the server?

For new PowerCenter installations, the PowerCenter Pre-Install Checklist needs to be completed. Keep in mind that the same hardware will be utilized for both PowerCenter and B2BDT.

Non-PowerCenter Integration Requirements

In addition to general B2BDT requirements, non-PowerCenter agents require that additional components are installed.

● B2BDT Agent for BizTalk requires that Microsoft BizTalk Server (version 2004 or 2006) is installed on the same computer as B2BDT. If B2BDT Studio is installed on the same computer as BizTalk Server 2004, the Microsoft SP2 service pack for BizTalk Server must be installed.
● B2BDT Translator for Oracle BPEL requires that BPEL 10.1.2 or above is installed.
● B2BDT Agent for WebMethods requires that WebMethods 6.5 or above is installed.
● B2BDT Agent for WebSphere Business Integration Message Broker requires that WBIMB 5.0 with CSD06 (or WBIMB 6.0) is installed. Also ensure that the platform supports both the B2BDT Engine and WBIMB.

A valid license key is needed to run a B2BDT project and must be installed before B2BDT services will run on the computer. Contact Informatica support to obtain a B2BDT license file (B2BDTLicense.cfg). B2BDT Studio can be used without installing a license file.


B2BDT Installation and Configuration

The B2BDT installation process involves two main components: the B2BDT development workbench (Studio) and the B2BDT Server, which is an application deployed on a server. The installation tips apply to UNIX environments. This section should be used as a supplement to the B2B Data Transformation Installation Guide.

Before installing B2BDT, complete the following steps:

● Verify that the hardware meets the minimum system requirements for B2BDT. Ensure that the combination of hardware and operating system is supported by B2BDT. Ensure that sufficient space has been allocated to the B2BDT ServiceDB.
● Apply all necessary patches to the operating system.
● Ensure that the B2BDT license file has been obtained from technical support.
● Be sure to have administrative privileges for the installation user ID. For *nix systems, ensure that read, write and execute privileges have been granted for the installation directory.

Adhere to the following sequence of steps to successfully install B2BDT:

1. Complete the B2BDT pre-install checklist and obtain valid license keys.
2. Install the B2BDT development workbench (Studio) on the Windows platform.
3. Install the B2BDT server on a server machine. When used in conjunction with PowerCenter, the server component must be installed on the same physical machine where PowerCenter resides.
4. Install the necessary client agents when used in conjunction with WebSphere, WebMethods and BizTalk.

B2BDT Install Components

● B2B Data Transformation Studio
● B2B Data Transformation Engine
● Processors
● Optional agents
● Optional libraries

Each component and its applicable platforms are described below:

● Engine (UNIX and Windows): The runtime module that executes B2BDT data transformations. This module is required in all B2BDT installations.
● Studio (Windows only): The design and configuration environment for creating and deploying data transformations. B2BDT Studio is hosted within Eclipse on Windows platforms. The Eclipse setup is included in the B2BDT installation package.
● Document Processors (UNIX and Windows): A set of components that perform global processing operations on documents, such as transforming their file formats. All the document processors run on Windows platforms, and most of them run on UNIX-type platforms.
● Libraries (Windows only; see description): Libraries of predefined B2BDT data transformations, which can be used with industry messaging standards such as EDI, ACORD, HL7, HIPAA, and SWIFT. Each library contains parsers, serializers, and XSD schemas for the appropriate messaging standard. The libraries can be installed on Windows platforms. B2BDT Studio can be used to import the library components to projects, and deploy the projects to Windows or UNIX-type platforms.
● Documentation (Windows only): An online help library, containing all the B2BDT documentation. A PDF version of the documentation is available for UNIX platforms.

Install the B2BDT Engine

Step 1: Run the UNIX installation file from the software folder on the installation CD and follow the prompts. Follow the wizard to complete the install.

TIP: During the installation a language must be selected. If there are plans to change the language at a later point in time in the Configuration Editor, Informatica recommends that a non-English language is chosen for the initial setup. If English is selected and then later changed to another language, some of the services that are required for other languages might not be installed.

B2BDT supports all of the major UNIX-type systems (e.g., Sun Solaris, IBM AIX, Linux and HP-UX). On UNIX-type operating systems, the installed components are the B2BDT Engine and the document processors.

Note: On UNIX-type operating systems, do not limit the data size and the stack size. To determine whether there is currently a limitation, run the following command:

For AIX, HP, and Solaris: ulimit -a
For Linux: limit

If very large documents are processed using B2BDT, try adjusting system parameters such as the memory size and the file size.

There are two install modes possible under UNIX: Graphical Interface and Console Mode. The default installation path is /opt/Informatica/ComplexDataExchange.

The default Service Repository Path is <installation path>/ServiceDB. This is the storage location for data transformations that are deployed as B2BDT services. The default Log path is <installation path>/CMReports. The Log Path is the location where the B2BDT Engine stores its log files; the log path is also known as the reports path. The repository location, JRE path and Log path can be changed subsequent to the installation using environment variables.
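As a minimal sketch of removing the data and stack size limits for the session that runs B2BDT, assuming a ksh or bash shell (the mechanism for making this permanent, such as system limit configuration files, is platform-specific):

# Check the current limits, then remove the data segment and stack size limits
ulimit -a
ulimit -d unlimited
ulimit -s unlimited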

Step 2: Install the license file.

Verify the validity of the license file with the following command:

CM_console -v

The system displays information such as the location and validity of the license file (sample output shown below):

$ ./bin/CM_console -v
Version: 4.4.0 (Build:186)
Syntax version: 4.00.10
Components: Engine Processors
Configuration file: /websrvr/informatica/ComplexDataExchange/CMConfig.xml
Package identifier: IF_AIX_OS64_pSeries_C64
License information:
License-file path: /websrvr/informatica/ComplexDataExchange/CDELicense.cfg
Expiration date: 21/02/08 (dd/mm/yyyy)
Maximum CPUs: 1
Maximum services: 1
Licensed components: Excel,Pdf,Word,Afp,Ppt

Step 3: Load the Environment Variables.

When the setup is complete, configure the system to load the B2BDT environment variables. The B2BDT setup assigns several environment variables that point to the installation directory and to other locations that the system needs. On UNIX-type platforms, the system must be configured to load the environment variables; B2BDT cannot run until this is done. The B2BDT setup creates an environment variables file, which can be loaded in either of the following ways.

Manually from the command line. In lieu of loading the environment variables automatically, they can be loaded manually from the command line. This must be done upon each log in before using B2BDT.

For the sh, ksh, or bash shell, the command is:

. <installation path>/setEnv.sh

For the csh or tcsh shell, the command is:

source <installation path>/setEnv.csh

Substitute the actual installation path for <installation path> as necessary.

Automatically, by inserting the appropriate command in the profile or in a script file. To configure the system to load the environment variables file automatically upon log in:

For the sh, ksh, or bash shell, insert the following line in the profile file:

. <installation path>/setEnv.sh

For the csh or tcsh shell, insert the following line in the login file:

source <installation path>/setEnv.csh

On UNIX-type platforms, B2BDT uses the following environment variables:

● PATH (Required): The environment variables file adds <installation directory>/bin to the path. Note: In rare instances, the B2BDT Java document processors require that the JRE be added to the path.
● LIBPATH on AIX; LD_LIBRARY_PATH on Solaris and Linux; SHLIB_PATH and LD_LIBRARY_PATH on HP-UX (Required): The environment variables file adds the installation directory to the library path. It also adds the JVM directory of the JRE and its parent directory to the path, for example, <installation directory>/jre1.4/lib/sparc/server and <installation directory>/jre1.4/lib/sparc. This value can be edited to use another compatible JRE.
● CLASSPATH (Required): The environment variables file adds <installation directory>/api/lib/CM_JavaAPI.jar to the Java class path.
● IFCONTENTMASTER_HOME (Required): The environment variables file creates this variable, which points to the B2BDT installation directory.
● IFConfigLocation4 (Optional): The path of the B2BDT configuration file. This is used for multiple configurations.

The following is an example of an environment variables file (setEnv.csh) on an AIX system. The variable names and values differ slightly on other UNIX-type operating systems.

## B2B Data Transformation Environment settings
setenv IFCMPath /opt/Informatica/ComplexDataExchange
setenv CMJAVA_PATH /opt/Informatica/ComplexDataExchange/jre1.4/jre/bin/classic:/opt/Informatica/ComplexDataExchange/jre1.4/jre/bin

# Prepend B2B Data Transformation to the PATH
if ( ! $?PATH ) then
    setenv PATH ""
endif
setenv PATH "${IFCMPath}/bin:${PATH}"

# Add CM & java path to LIBPATH
if ( ! $?LIBPATH ) then
    setenv LIBPATH ""
endif
setenv LIBPATH "${IFCMPath}/bin:${CMJAVA_PATH}:${LIBPATH}"

# Update IFCONTENTMASTER_HOME.
setenv IFCONTENTMASTER_HOME "${IFCMPath}"

# Prepend CM path CLASSPATH
if ( ! $?CLASSPATH ) then
    setenv CLASSPATH ""
endif
setenv CLASSPATH "${IFCMPath}/api/lib/CM_JavaAPI.jar:.:${CLASSPATH}"
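To confirm that the environment was loaded, a quick check such as the following can be used (assuming the default installation path shown above; adjust to the actual install location):

# Load the environment for a csh/tcsh session and confirm the key variable
source /opt/Informatica/ComplexDataExchange/setEnv.csh
echo $IFCONTENTMASTER_HOME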

Step 4: Configuration settings.

Directory Locations

During the B2BDT setup, prompts were completed for the directory locations of the B2BDT repository, log files and JRE. If necessary, alter these locations by editing the following parameters:

● CM Configuration/Directory services/File system/Base Path: The B2BDT repository location, where B2BDT services are stored.
● CM Configuration/CM Engine/JVM Location: This parameter is not available in the Configuration Editor on UNIX-type platforms. For more information about setting the JRE on UNIX, see the UNIX Environment Variable Reference.
● CM Configuration/General/Reports directory: The log path, also called the reports path, where B2BDT saves event logs and certain other types of reports.
● CM Configuration/CM Engine/Invocation and CM Configuration/CM Engine/CM Server: These settings control whether the B2BDT Engine runs in-process or out-of-process.

B2BDT has a Configuration Editor for editing the parameters of a B2BDT installation. To open the Configuration Editor on UNIX in graphical mode, enter the following command:

<installation directory>/CMConfig

Note: The Configuration Editor is not supported in UNIX console mode.

Some of the Configuration Editor settings are available for all B2BDT installations. Some additional settings vary depending on the B2BDT version and on the optional components that have been installed. The Configuration Editor saves the configuration in an XML file. By default, the file is <installation directory>/CMConfig.xml.

Note: Before editing the configuration, save a backup copy of CMConfig.xml. In the event of a problem the backup can be restored. The file <installation directory>/CMConfig.bak is a backup of the original which the setup program created when B2BDT was installed. Restoring CMConfig.bak reverts B2BDT to its original configuration.

OS environment variables are used to set aspects of the system such as the Java classpath to be used, the location of the configuration file for a specific user, the home location for the installed B2BDT instance, library paths, etc.

The following table lists some typical configuration items and where they are set:

Type of configuration item                      Where configured
Memory for Studio                               B2BDT Configuration application
JVM / JRE usage                                 B2BDT Configuration application
Tuning parameters (threads, timeouts, etc.)     B2BDT Configuration application
User-specific settings                          Environment variable pointing to a different configuration file
Memory for runtime                              B2BDT Configuration application
Workspace location                              B2BDT Configuration application (B2BDT 4.3), B2BDT Studio (B2BDT 4.4)
Event generation                                Set in project properties
Repository location                             B2BDT Configuration application
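Before changing any of these settings, a simple timestamped copy of the configuration file can be taken, for example (assuming the default UNIX installation path):

# Keep a dated copy of the configuration before editing it
cp /opt/Informatica/ComplexDataExchange/CMConfig.xml /opt/Informatica/ComplexDataExchange/CMConfig.xml.$(date +%Y%m%d)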

In-Process or Out-of-Process Invocation

Out-of-process invocation requires the use of the B2BDT Server application (which is already installed by the install process). The distinction is that running under server mode causes transformations to potentially run slower, but errors are isolated from the calling process. For web services, the use of server mode is sometimes recommended, as the lifetime of the host process then becomes independent of the lifetime of the process space allocated to run the web service. For example, IIS can run web services in a mode where a process dies or is recycled after a call to a web service. For B2BDT, the first call after a process startup can take up to 3 seconds (subsequent calls are usually milliseconds), hence it is not optimal to start a host process on each invocation. Running in server mode keeps process lifetimes independent.

TIP: B2BDT Studio and the CM_console command always run data transformations in-process.

Running out-of-process has the following advantages:

● Allows 64-bit processes to activate 32-bit versions of B2BDT Engine.
● An Engine failure is less likely to disrupt the calling application.
● Helps prevent binary collisions with other modules that run in the process of the calling application.

In-process invocation has the following advantage:

● Faster performance than out-of-process.

Thread pool settings

The thread pool controls the maximum number of Engine threads that can run client requests concurrently per process. If the number of client requests exceeds the number of available threads, the Server queues the requests until a thread is available. The default setting is 4. Some recommendations are summarized in the table below. Actual needs vary depending upon requirements. Best practices and additional recommendations are part of Jumpstart and Base Line Architecture engagements. Contact an Informatica representative for additional information.

Step 5: Configure ODBC connectivity.

Note: This step is only needed if the ODBC database support features of B2BDT will be used. In that case, an ODBC driver may need to be configured.

Step 6: Test the installation to confirm that B2BDT operates properly.

Note: Tests are available to test the engine and document processor installation. Refer to the directory <installation directory>/setupTests for the B2BDT test projects testCME and testCMDP. Sample output would be similar to the following:

cd $IFCONTENTMASTER_HOME
cp -R setupTests/testCME ServiceDB/
CM_console testCME
Test Succeeded

B2BDT Integration with PowerCenter

B2BDT supports using the runtime as a "server" process to be invoked from PowerCenter on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a "server" process in developer terminology, it does not provide the full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features. Part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.


Installation of B2BDT for PowerCenter is a straightforward process. All required plug-ins needed to develop and run B2BDT transformations are installed as part of a PowerCenter installation. However, certain B2BDT versions and B2BDT plug-ins need to be registered after the install process. Refer to the PowerCenter Installation Guide for details.

Note: A PowerCenter UDO Transformation can only be created if the UDO plug-in is successfully registered in the repository. If the UDO option is installed correctly, then UDO Transformations can be created in the PowerCenter Designer.

Note: ODBC drivers provided for PowerCenter are not automatically usable from B2BDT, as licensing terms prohibit this in some cases. Contact an Informatica support representative for further details.

Additional Details for PowerCenter Integration

● B2BDT is a Custom Transformation object within PowerCenter.
● INFA passes data via memory buffers to the B2BDT engine and retrieves the output via buffers.
● The B2BDT engine runs IN-PROCESS with the PowerCenter engine.
● The Custom Transformation object for Informatica can be dragged and dropped inside a PowerCenter mapping.
● When using a B2BDT transformation, PowerCenter does NOT process the input files directly, but instead takes a path and filename from a text file. The engine then processes the data through the B2BDT parser defined within the mapping. After this the data is returned to the PowerCenter B2BDT transformation for processing by other Informatica transformation objects.

TIP: Verify that the "Source filename" is the name of the text file in which both the file path and the file name are present. It cannot be the actual file being parsed by PowerCenter and B2BDT. This is the "direct" versus "indirect" sourcing of the file.
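As an illustration of the "indirect" sourcing described above, the PowerCenter session's "Source filename" could point at a small list file whose content is the full path of the document to be parsed (the file names and paths below are hypothetical):

# Contents of the indirect source file, e.g. /data/b2bdt/filelist.txt,
# which is named as the "Source filename" in the PowerCenter session:
/data/b2bdt/inbound/claims_batch_01.edi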

Useful Tips and Tricks

Version Compatibility

● Ensure that the version of B2BDT is compatible with PowerCenter; otherwise many issues can manifest in different forms.
● In general, B2BDT 4.4 is compatible with PowerCenter 8.5 and with 8.1.1 SP4 (and SP4 only), B2BDT 4.0.6 is compatible with 8.1.1, and B2BDT 3.2 is compatible with PowerCenter 7.x. For more information refer to the Product Availability Matrix.

Service Deployment

● Ensure that services are deployed on the remote machine where PowerCenter is installed.
● Services deployed from Studio show up in the PowerCenter Designer B2BDT transformation as a dropdown list.


Note: These services are only the ones that are deployed on the local machine. If any services are deployed on remote machines, the Designer will not display them. As it is easy to mistake local services for remote ones, manually ensure that the services on the local and remote machines are in sync.

● After making certain that the services are deployed on the remote server that also has PowerCenter installed, the B2BDT transformation can be specified from the UDO (8.1.1) or B2BDT (8.5) transformation on the Metadata Extensions tab.

Last updated: 02-Jun-08 16:23


B2B Data Transformation Installation (for Windows)

Challenge

Installing and configuring B2B Data Transformation (B2BDT) on new or existing hardware, either in conjunction with PowerCenter or co-existing with other host applications on the same server.

Note: B2B Data Transformation was formerly called Complex Data Exchange (CDE). Any references to PowerExchange Complex Data Exchange in this document now refer to B2B Data Transformation (B2BDT).

Description

Consider the following questions when determining what type of hardware to use for B2BDT.

If the hardware already exists:

1. Are the processor and operating system supported by B2BDT?
2. Are the necessary operating system patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the B2BDT application?
5. Will B2BDT share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the complex data transformation requirements for B2BDT. Among other factors, the hardware requirements for the B2BDT environment depend upon the data volumes, the number of concurrent users and the application server and operating system used. For exact sizing recommendations, contact Informatica Professional Services for a B2BDT Sizing and Baseline Architecture engagement.

Planning for the B2BDT Installation

There are several variations of the hosting environment from which B2BDT services will be invoked. This has implications for how B2BDT is installed and configured.

Host Software Environment

The most common configurations are:

1. B2BDT used in conjunction with PowerCenter
2. B2BDT as a standalone configuration
3. B2BDT in conjunction with a non-PowerCenter integration, using an adapter for other middleware software such as WebMethods or Oracle BPEL

B2BDT 4.4 includes a mechanism for exposing B2BDT services through web services so that they can be called from applications capable of calling web services. Depending on which host options are chosen, installation options may vary.

Installation of B2BDT for a PowerCenter Host Environment

Be sure to have the necessary licenses and the additional plug-in required for the PowerCenter integration. Refer to the appropriate installation guide or contact Informatica support for details on the installation of B2BDT in PowerCenter environments.

Installation of B2BDT for a Standalone Environment

When using B2BDT services in a standalone environment, it is expected that one of the invocation methods (e.g., Web Services, .NET, Java APIs, Command Line or CGI) will be used to invoke B2BDT services. Consult accompanying B2BDT documentation for use in these environments.

Non-PowerCenter Middleware Platform Integration

Be sure to plan for additional agents to be installed. Refer to the appropriate installation guide or contact Informatica support for details for installing B2BDT in environments other than PowerCenter.

Other Decision Points

Where will the B2BDT service repository be located? The choices for the location of the service repository are i) a path on the local file system or ii) a shared network drive. The justification for using a shared network drive is typically to simplify service deployment when two separate B2BDT servers need to share the same repository. While the use of a shared repository is convenient for a multi-server production environment, it is not advisable for development, as there is a danger of multiple development teams overwriting the same project files. When a repository is shared between multiple machines and a service is deployed via B2BDT Studio, the Service Refresh Interval setting controls how quickly other running installations of B2BDT detect the deployment of the service.

What are the multi-user considerations? If multiple users share a machine (but not at the same time), the environment variable IFConfigLocation4 can be used to point to a different configuration file for each user.

Security Considerations

As the B2BDT repository, workspace and logging locations are directory-based, all directories to be used should be granted read and write permissions for the user identity under which the B2BDT service will run. The identity associated with the caller of the B2BDT services will also need permission to execute the files installed in the B2BDT binary directory. Special consideration should be given to environments such as web services, where the user identity under which the B2BDT service runs may be different from the interactive user or the user associated with the calling application.

Log File and Tracing Locations

Log files and tracing options should be configured with appropriate recycling policies. The calling application must have permissions to read, write and delete files in the path that is set for storing these files.

B2BDT Pre-install Checklist

It is best to review the environment and record the information in a detailed checklist to facilitate the B2BDT install.

Minimum System Requirements

Verify that the minimum requirements for the operating system, disk space, processor speed and RAM are met and record them in the checklist.

● B2BDT Studio requires Microsoft .NET Framework, version 2.0. If this version is not already installed, the installer will prompt for and install the framework automatically.
● B2BDT requires a Sun Java 2 Runtime Environment, version 1.5.x or above. B2BDT bundles the appropriate JRE version; the installer can be pointed to an existing JRE, or a JRE can be downloaded from Sun.
● To install the optional B2BDT libraries, reserve additional space (refer to the documentation for additional information).

PowerCenter Integration Requirements

Complete the checklist for integration if B2BDT will be integrated with PowerCenter. For an existing PowerCenter installation, the B2BDT client needs to be installed on at least one PC on which the PowerCenter client resides. B2BDT components also need to be installed on the PowerCenter server. If utilizing an existing PowerCenter installation, determine the following:

● Which version of PowerCenter is being used (8.x required)?
● Is the PowerCenter version 32 bit or 64 bit?
● Are the PowerCenter client tools installed on the client PC?
● Is the PowerCenter server installed on the server?

For new PowerCenter installations, the PowerCenter Pre-Install Checklist should be completed. Keep in mind that the same hardware will be utilized for PowerCenter and B2BDT.

For Windows Server, verify the following:

● The login account used for the installation has local administrator rights.
● 500MB or more of temporary workspace is available.
● The Java 2 Runtime Environment version 1.5 or higher is installed and configured.
● Microsoft .NET Framework, version 2.0 is installed.

Non-PowerCenter Integration Requirements

In addition to the general B2BDT requirements, non-PowerCenter agents require that additional components are installed.

● B2BDT Agent for BizTalk requires that Microsoft BizTalk Server (version 2004 or 2006) is installed on the same computer as B2BDT. If B2BDT Studio is installed on the same computer as BizTalk Server 2004, the Microsoft SP2 service pack for BizTalk Server must be installed.
● B2BDT Translator for Oracle BPEL requires that BPEL 10.1.2 or above is installed.
● B2BDT Agent for WebMethods requires that WebMethods 6.5 or above is installed.
● B2BDT Agent for WebSphere Business Integration Message Broker requires that WBIMB 5.0 with CSD06 (or WBIMB 6.0) is installed. Also ensure that the Windows platform supports both the B2BDT Engine and WBIMB.

A valid license key is needed to run a B2BDT project and must be installed before B2BDT services will run on the computer. Contact Informatica support to obtain a B2BDT license file (B2BDTLicense.cfg). B2BDT Studio can be used without installing a license file.

B2BDT Installation and Configuration

The B2BDT installation consists of two main components: the B2BDT development workbench (Studio) and the B2BDT Server, which is an application deployed on a server. The installation tips apply to Windows environments. This section should be used as a supplement to the B2B Data Transformation Installation Guide.

Before installing B2BDT, complete the following steps:

● Verify that the hardware meets the minimum system requirements for B2BDT.
● Ensure that the combination of hardware and operating system is supported by B2BDT.
● Ensure that sufficient space has been allocated for the B2BDT ServiceDB.
● Ensure that all necessary patches have been applied to the operating system.
● Ensure that the B2BDT license file has been obtained from technical support.
● Be sure to have administrative privileges for the installation user ID.

Adhere to the following sequence of steps to successfully install B2BDT:

1. Complete the B2BDT pre-install checklist and obtain valid license keys.
2. Install the B2BDT development workbench (Studio) on the Windows platform.
3. Install the B2BDT server on a server machine. When used in conjunction with PowerCenter, the server component must be installed on the same physical machine where PowerCenter resides.
4. Install the necessary client agents when used in conjunction with WebSphere, WebMethods and BizTalk.

In addition to the standard B2BDT components that are installed by default, additional libraries can be installed. Refer to the B2BDT documentation for detailed information on these library components.

B2BDT Install Components

The install package includes the following components:

● B2B Data Transformation Studio
● B2B Data Transformation Engine
● Document Processors
● Documentation
● Optional agents
● Optional libraries

Each component is described below:

● Engine: The runtime module that executes B2BDT data transformations. This module is required in all B2BDT installations.
● Studio: The design and configuration environment for creating and deploying data transformations. B2BDT Studio is hosted within Eclipse on Windows platforms. The Eclipse setup is included in the B2BDT installation package.
● Document Processors: A set of components that perform global processing operations on documents, such as transforming their file formats. All the document processors run on Windows platforms, and most of them run on UNIX-type platforms.
● Optional Libraries: Libraries of predefined B2BDT data transformations, which can be used with industry messaging standards such as EDI, ACORD, HL7, HIPAA, and SWIFT. Each library contains parsers, serializers, and XSD schemas for the appropriate messaging standard. The libraries can be installed on Windows platforms. B2BDT Studio can be used to import the library components to projects in order to deploy the projects to Windows or UNIX-type platforms.
● Documentation: An online help library, containing all the B2BDT documentation.

Install the B2BDT Studio and Engine

Step 1: Run the Windows installation file from the software folder on the installation CD and follow the prompts. Follow the wizard to complete the install.

TIP: During the installation a language must be selected. If there are plans to change the language at a later point in time in the Configuration Editor, Informatica recommends that a non-English language is chosen for the initial setup. If English is selected and then later changed to another language, some of the services that are required for other languages might not be installed.

● The default installation path is C:\Informatica\ComplexDataExchange.
● The default Service Repository Path is <installation path>\ServiceDB. This is the storage location for data transformations that are deployed as B2BDT services.
● The default Log path is <installation path>\CMReports. The Log Path is the location where the B2BDT Engine stores its log files. The log path is also known as the reports path.

The repository location, JRE path and Log path can be changed subsequent to the installation using environment variables.

Step 2: Install the license file.

Verify the validity of the license file with the following command:

CM_console -v

The system displays information such as the location and validity of the license file.

Step 3: Configure the Environment Variables.

The B2BDT setup assigns several environment variables which point to the installation directory and to other locations that the system needs. On Windows, the B2BDT setup creates or modifies the following environment variables:

● PATH (Required): The setup adds <installation directory>\bin to the path. Note: In rare instances, the B2BDT Java document processors require that the JRE be added to the path.
● CLASSPATH (Required): The setup adds <installation directory>\api\lib\CM_JavaAPI.jar to the Java class path.
● IFCONTENTMASTER_HOME (Required): The setup creates this environment variable, which points to the B2BDT installation directory.
● IFConfigLocation4 (Optional): The path of the B2BDT configuration file.

Step 4: Configuration settings.

The configuration application allows for the setting of properties such as JVM parameters, thread pool settings, memory available to the Studio environment and many others. Consult the administrator's guide for a full list of settings and their effects. Properties set using the B2BDT configuration application affect both the operation of the standalone B2BDT runtime environment and the behavior of the B2BDT Studio environment.

To open the Configuration Editor in Windows, from the Start menu choose Informatica > B2BDT > Configuration.


Some of the Configuration Editor settings are available for all B2BDT installations. Additional settings vary depending on the B2BDT version and the optional components installed. The Configuration Editor saves the configuration in an XML file. By default, the file is <installation directory>\CMConfig.xml.

The B2BDT Studio environment should be installed on each developer's machine or environment. While advances in virtualization technologies and technologies such as Windows Remote Desktop connections theoretically allow multiple users to share the same B2BDT installation, the B2BDT Studio environment does not implement mechanisms such as file locking during the authoring of transformations, which would be needed to keep multiple users from overwriting each other's work.

An environment variable called IFConfigLocation4 can be defined. The value of the variable must be the path of a valid configuration file (e.g., c:\MyIFConfigLocation4\CMConfig1.xml). For example, if two users want to run B2BDT Engine with different configurations on the same platform, store their respective configuration files in their home directories. Both files must have the name CMConfig.xml. Alternately, store a CMConfig.xml file in the home directory for one of the users; the other user will use the default configuration file (e.g., <installation directory>\CMConfig.xml).

TIP: Always save a backup copy of CMConfig.xml prior to editing. In the event of a problem the last known backup can be restored. The file <installation directory>\CMConfig.bak is a backup of the original which the setup program created when B2BDT was installed. Restoring CMConfig.bak reverts B2BDT to its original configuration.

OS environment variables are used to set aspects of the system such as the Java classpath to be used, the location of the configuration file for a specific user, the home location for the installed B2BDT instance, library paths, etc.
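A minimal sketch of pointing one user at a private configuration file on Windows, using the IFConfigLocation4 variable described above (the file location is illustrative only):

rem Persist a per-user environment variable that points B2BDT at this user's configuration file
rem (a new console session is needed for the change to take effect)
setx IFConfigLocation4 "C:\MyIFConfigLocation4\CMConfig1.xml"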

The following table lists some typical configuration items and where they are set:

Type of configuration item                      Where configured
Memory for Studio                               B2BDT Configuration application
JVM / JRE usage                                 B2BDT Configuration application
Tuning parameters (threads, timeouts, etc.)     B2BDT Configuration application
User-specific settings                          Environment variable pointing to a different configuration file
Memory for runtime                              B2BDT Configuration application
Workspace location                              B2BDT Configuration application (B2BDT 4.3), B2BDT Studio (B2BDT 4.4)
Event generation                                Set in project properties
Repository location                             B2BDT Configuration application

In-Process or Out-of-Process Invocation

Out-of-process invocation requires the use of the B2BDT Server application (which is already installed by the install process). The distinction is that running under server mode causes transformations to potentially run slower, but errors are isolated from the calling process. For web services, the use of server mode is sometimes recommended, as the lifetime of the host process then becomes independent of the lifetime of the process space allocated to run the web service. For example, IIS can run web services in a mode where a process dies or is recycled after a call to a web service. For B2BDT, the first call after a process startup can take up to 3 seconds (subsequent calls are usually milliseconds), hence it is not optimal to start a host process on each invocation. Running in server mode keeps process lifetimes independent.

TIP: B2BDT Studio and the CM_console command always run data transformations in-process.

Running out-of-process has the following advantages:

● Allows 64-bit processes to activate 32-bit versions of B2BDT Engine.
● An Engine failure is less likely to disrupt the calling application.
● Helps prevent binary collisions with other modules that run in the process of the calling application.

In-process invocation has the following advantage:

● Faster performance than out-of-process.

Thread pool settings

The thread pool controls the maximum number of Engine threads that can run client requests concurrently per process. If the number of client requests exceeds the number of available threads, the Server queues the requests until a thread is available. The default setting is 4. Some recommendations are summarized below. Actual needs vary depending upon requirements. Best practices and additional recommendations are part of Jumpstart and Base Line Architecture engagements. Contact an Informatica representative for additional information.

Key settings and suggestions:

● Eclipse settings (memory available to Studio): By default Eclipse allocates up to 256MB to the Java VM. Set -vmargs -Xmx512M to allocate 512MB.
● Log file locations: Location security needs to match the identity of the B2BDT engine.
● ServiceDB: Read permissions are needed for the ServiceDB locations.
● Preprocessor buffer sizes: Change if running out of memory during source file processing.
● Service Refresh Interval

Step 5: Configure ODBC connectivity.

Note: This step is only needed if the ODBC database support features of B2BDT will be used. In that case, an ODBC driver may need to be configured.

Step 6: Test the installation to confirm that B2BDT operates properly.

Note: Tests are available to test the engine and document processor installation. Refer to the directory <installation directory>\setupTests for the B2BDT test projects testCME and testCMDP.
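A minimal sketch of running the engine test on Windows, mirroring the UNIX test steps shown earlier in this document (paths assume the default installation directory and the IFCONTENTMASTER_HOME variable created by the setup):

rem Copy the test project into the service repository and run it
cd /d %IFCONTENTMASTER_HOME%
xcopy /e /i setupTests\testCME ServiceDB\testCME
CM_console testCME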


B2BDT Integration With PowerCenter

B2BDT supports using the runtime as a "server" process to be invoked from PowerCenter on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a "server" process in developer terminology, it does not provide the full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features. Part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.


Installation of B2BDT for PowerCenter is a straightforward process. All required plug-ins needed to develop and run B2BDT transformations are installed as part of a PowerCenter installation. However, certain B2BDT versions and B2BDT plug-ins need to be registered after the install process. Refer to the PowerCenter Installation Guide for details. The repository option copies the B2BDT plug-ins to the Plugin directory. Register the B2BDT plug-ins in the PowerCenter repository.

PowerCenter 7.1.x

Register the UDT.xml plug-in in the PowerCenter Repository Server installation Plugin directory. The B2BDT plug-in will appear under the repository in the Repository Server Administration Console.

PowerCenter 8.1.x

Register the pmudt.xml plug-in in the Plugin directory of the PowerCenter Services installation. When the B2BDT plug-in is successfully registered in PowerCenter 8.1 it will appear in the Administration Console.


Note: A PowerCenter UDO Transformation can only be created if the UDO plug-in is successfully registered in the repository. If the UDO option is installed correctly, then UDO Transformations can be created in the PowerCenter Designer.

Note: ODBC drivers provided for PowerCenter are not automatically usable from B2BDT, as licensing terms prohibit this in some cases. Contact an Informatica support representative for further details.

Additional Details for PowerCenter Integration

● B2BDT is a Custom Transformation object within PowerCenter.
● INFA passes data via memory buffers to the B2BDT engine and retrieves the output via buffers.
● The B2BDT engine runs IN-PROCESS with the PowerCenter engine.
● The Custom Transformation object for Informatica can be dragged and dropped inside a PowerCenter mapping.
● When using a B2BDT transformation, PowerCenter does NOT process the input files directly, but instead takes a path and filename from a text file. The engine then processes the data through the B2BDT parser defined within the mapping. After this the data is returned to the PowerCenter B2BDT transformation for processing by other Informatica transformation objects.

TIP: Verify that the "Source filename" is the name of the text file in which both the file path and the file name are present. It cannot be the actual file being parsed by PowerCenter and B2BDT. This is the "direct" versus "indirect" sourcing of the file.


Useful Tips and Tricks

Can I use an existing Eclipse install with B2BDT?

● Yes, but make sure it is compatible with the version of the B2BDT installation. Check the product compatibility matrix for additional information. B2BDT can be made to work with a different version of Eclipse; however, this is not guaranteed.

Is there a silent install available for B2BDT on Windows?

● As of B2BDT 4.4 there is no silent install mode, but one is likely in a future release.

Version Compatibility

● Ensure that the version of B2BDT is compatible with PowerCenter; otherwise many issues can manifest in different forms.
● In general, B2BDT 4.4 is compatible with PowerCenter 8.5 and with 8.1.1 SP4 (and SP4 only), B2BDT 4.0.6 is compatible with 8.1.1, and B2BDT 3.2 is compatible with PowerCenter 7.x. For more information refer to the Product Availability Matrix.

Service Deployment

● Ensure that services are deployed on the remote machine where PowerCenter is installed.
● Services deployed from Studio show up in the PowerCenter Designer B2BDT transformation as a dropdown list.

Note: These services are only the ones that are deployed on the local machine. If any services are deployed on remote machines, the Designer will not display them. As it is easy to mistake local services for remote ones, manually ensure that the services on the local and remote machines are in sync.

● After making certain that the services are deployed on the remote server that also has PowerCenter installed, the B2BDT transformation can be specified from the UDO (8.1.1) or B2BDT (8.5) transformation on the Metadata Extensions tab.

Common Installation Troubleshooting Tips

Problem: The following error occurs when opening B2BDT Studio: "There was a problem running ContentMaster studio, Please make sure …/CMConfig.XML is a valid configuration file (Error code=2)"

Solution: Edit the CMConfig.xml file and set the Eclipse installation directory and the Studio workspace location entries, with values similar to the following:

C:/Program Files/Itemfield/ContentMaster4/eclipse
C:\Documents and Settings\kjatin.INFORMATICA\My Documents\Itemfield\ContentMaster\4.0\workspace

Note: Modify the path names as necessary to match the installation settings.

Problem: Content Master Studio fails to open with the following error: "Failed to Initialize CM engine! CM license is limited to 1 CPU, and is not compatible with this machine's hardware. Please contact support."

Cause: The Content Master license covers fewer CPUs than the machine actually has. Incorrect CPU information was entered during registration, so the license provided is for a machine with a smaller number of CPUs.

Solution: Repeat the registration, enter the correct number of CPUs, and send the new registration.txt to Informatica Support to obtain a new license. When the new license is received, replace the existing one in the Content Master installation directory.

Problem: When launching the Designer after installing the Unstructured Data Option (UDO), the following error is displayed: "Failed to load DLL: pmudtclient.dll for Plug-in: PC_UDT"

Cause: This error occurs when Content Master has not been installed along with the PowerCenter UDO.

Solution: Install Content Master.

Last updated: 31-May-08 19:00

Deployment of B2B Data Transformation Services

Challenge

Outline the steps and strategies for deploying B2B Data Transformation services.

Description

Deployment is the process by which a data transformation is made available as a service accessible to the B2B Data Transformation runtime engine. When a project is published as a transformation service, a directory corresponding to the published service name is created in the B2B Data Transformation Service DB, which forms the runtime repository of services, and a CMW file corresponding to the service name is created in that directory. The deployed service is stored in the Data Transformation service repository.

On Windows platforms, the default repository location is: c:\Program Files\Informatica\ComplexDataExchange\ServiceDB

On UNIX platforms, the default location is: /opt/Informatica/ComplexDataExchange/ServiceDB

Basics of B2B Data Transformation Service Deployment

When running in the B2B Data Transformation Studio environment, developers can test a service directly without deployment. However, in order to test integration with the host environment, platform agents or external code, it is necessary to deploy the service. Deploying the transformation service copies the service with its current settings to the B2B Data Transformation service repository (also known as the ServiceDB folder). Deploying a service also sets the entry point for the transformation service.

Note: The location of the service repository is set using the B2B Data Transformation configuration utility.

If changes are made to the project options or to the starting point of the service, the service must be redeployed for the changes to take effect. When the service is deployed, all service script files, schemas, sample data and other project artifacts are copied to the service repository specified by the B2B Data Transformation configuration options in effect in the Studio environment from which the service is deployed.

A transformation service can be deployed multiple times under different service names, with the same or different options for each deployed service. While Informatica recommends deploying only one service from each B2B Data Transformation project for production, it is useful to deploy a transformation service under different names when testing different option combinations.

Deployment for Test Purposes

It is important to finish configuring and testing data transformations before deploying them as B2B Data Transformation services. Deploying the service allows the B2B Data Transformation runtime engine to access and run the project. When running in the B2B Data Transformation Studio environment, developers can test the service directly without deployment. However, in order to test integration with the host environment, platform agents or external code, it is necessary to deploy the service.

Initial Production Deployment of B2B Data Transformation Services

Deploying services in the production environment allows applications to run the transformation services on live data. B2B Data Transformation services can be deployed from the B2B Data Transformation Studio computer to a remote computer, such as a production server. The remote computer can be a Windows or UNIX-type platform where the B2B Data Transformation Engine is installed. A service can be deployed to a remote computer either by deploying it directly to the remote computer, or by deploying the service locally and then copying the service to the remote computer.

To deploy a service to a remote computer:

1. Deploy the service on the development computer.
2. Copy the deployed project directory from the B2B Data Transformation repository on the development computer to the repository on the remote computer.
3. If you have added any custom components or files to the B2B Data Transformation autoInclude\user directory, copy them to the autoInclude\user directory on the remote computer.

Alternatively, if the development computer can access the remote file system, you can change the B2B Data Transformation repository to the remote location and deploy directly to the remote computer. A minimal copy sequence for the first approach is sketched below.
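As an illustration only, the following UNIX commands sketch step 2 for a single service, assuming the default repository locations given above, a hypothetical service named InvoiceParser and a hypothetical remote host named prodserver; adjust the names, paths and copy mechanism (scp, FTP, shared mount, etc.) to the actual environment:

    # copy the deployed service folder from the development ServiceDB to the production ServiceDB
    scp -r "/opt/Informatica/ComplexDataExchange/ServiceDB/InvoiceParser" \
        prodserver:/opt/Informatica/ComplexDataExchange/ServiceDB/

    # signal the remote engine that the repository has changed (see the update.txt discussion below)
    ssh prodserver "touch /opt/Informatica/ComplexDataExchange/ServiceDB/update.txt"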

Deployment of Production Updates to B2B Data Transformation Services

B2B Data Transformation Studio cannot open a deployed project that is located in the repository. If you need to edit the data transformation, modify the original project and redeploy it.

To edit and redeploy a project:

1. Open the development copy of the project in B2B Data Transformation Studio. Edit and test it as required.
2. Redeploy the service to the same location, under the same service name. You are prompted to overwrite the previously deployed version.

Redeploying overwrites the complete service folder, including any output files or other files that you have stored in it. There is no versioning available in B2B Data Transformation. If previous versions of the deployed services are required, copy the current service to a separate location (not in the ServiceDB directory) or use a commercial or open-source backup solution.

Renaming the service folder is also possible, in which case the project name must be renamed as well. This is not a recommended practice for backing up services or for deploying a service multiple times; it is preferable to use the Studio environment to deploy a service multiple times, as this behavior may change in future versions. For backup, many commercial and open-source solutions are available, and to quickly retain a copy of a service, copy it to a directory outside of the ServiceDB folder.

Important: There can be no more than one deployed service with the same service and project name. Project files contain configuration properties and indicate the transformation startup component. Having multiple services with identical project file names, even if the service names are different, causes service execution to fail.

Simple Service Deployment

There are two ways to deploy a service: deploy it directly as a service from within Data Transformation Studio, or deploy the service locally and copy the service folder to the appropriate ServiceDB.

Single Service Deployment from Within the B2B Data Transformation Studio Environment

1. In the B2B Data Transformation Explorer, select the project to be deployed.

2. On the B2B Data Transformation menu, click Project > Deploy.

3. The Deploy Service window displays the service details. Edit the information as required. Click the Deploy button.

4. Click OK.

5. At the lower right of the B2B Data Transformation Studio window, display the Repository view. The view lists the service that you have deployed, along with any other B2B Data Transformation services that have been deployed on the computer.

Single Service Deployment via File Movement

Alternatively, the service folder can be copied directly into the ServiceDB folder on the target machine; on Windows, the default location is C:\Program Files\Informatica\ComplexDataExchange\ServiceDB.

To check whether the deployed service is valid, run CM_Console from the command line.
Alternatively, the cAPITest.exe utility can be used to test the deployed service.

The B2B Data Transformation Engine determines whether any services have been revised by periodically examining the timestamp of a file called update.txt. By default, the timestamp is examined every thirty seconds. The update.txt file exists in the repository root directory, which is, by default, the ServiceDB directory. The content of the file can be empty. If this is the first time a service is deployed to the remote repository, update.txt might not exist; if the file is missing, copy it from the local repository. If update.txt exists, update its timestamp as follows:

● On Windows: Open update.txt in Notepad and save it.
● On UNIX: Open a command prompt, change to the repository directory, and enter the following command: touch update.txt

You can change the interval used to check for service updates by modifying the “Service refresh interval” in the B2B Data Transformation configuration editor.

Multi-Service Deployment

When a solution involves multiple services, these may be authored as multiple independent B2B Data Transformation projects, or as a single B2B Data Transformation project with multiple entry points that is deployed as multiple services under different names. For complex solutions, we recommend using separate projects for independent services, reserving the use of multiple runnable components within the same project for test utilities and troubleshooting items.

While it is possible to deploy the set of services that make up a multi-service solution into production directly from the Studio environment, we recommend deploying these services to a test environment where the solution can be verified before deploying into production. In this way, mismatches between different versions of the solution's transformation services can be avoided. In particular, when dependencies exist between services due to the use of B2B Data Transformation features such as TransformByService, or due to interdependencies in the calling system, it is necessary to avoid deploying mismatched versions of transformation services and to deploy services into production as a group. Simple batch files or shell scripts can be created to deploy the services as a group from the test environment to the production environment (a minimal sketch is shown below), and commercial enterprise system administration and deployment software will usually allow creation of a deployment package to facilitate scheduled, unattended deployment and monitoring of deployment operations.

As a best practice, creating a dependency matrix for each project to be deployed allows developers to identify the services required by each project and the services that are commonly accessed by the majority of the projects. This allows for better deployment strategies and helps to keep track of impacted services should changes be made to them.
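The following shell script is a minimal sketch of such a group deployment, assuming the default UNIX ServiceDB location, a hypothetical production host named prodserver and hypothetical service names; a real script would typically add error checking, logging and a backup of the services being replaced:

    #!/bin/sh
    # deploy a fixed group of verified services from the test ServiceDB to the production ServiceDB
    TEST_DB=/opt/Informatica/ComplexDataExchange/ServiceDB
    PROD_DB=prodserver:/opt/Informatica/ComplexDataExchange/ServiceDB

    for SERVICE in OrderParser OrderMapper AckSerializer   # hypothetical service names
    do
        scp -r "$TEST_DB/$SERVICE" "$PROD_DB/"
    done

    # notify the production engine that the repository has changed
    ssh prodserver "touch /opt/Informatica/ComplexDataExchange/ServiceDB/update.txt"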

Deploying for Full Uptime Systems

B2B Data Transformation can integrate with a variety of applications, allowing it to become part of a full uptime system. An integration component, called the B2B Data Transformation Agent, runs a B2B Data Transformation service that performs the data transformation. The integration system's capabilities are enhanced by supporting the conversion of many document formats it does not natively support.

Deploying services for full uptime systems follows the same process as for standalone B2B Data Transformation services. However, it is important to make sure that the user accounts used by the calling application have the necessary permissions to execute the B2B Data Transformation service and to write to the locations configured for error logs. After deploying the service, it may be necessary to stop and restart the workflow invoking the service. Also make sure that the update.txt timestamp is updated: the B2B Data Transformation Engine determines whether any services have been revised by periodically examining the timestamp of update.txt, and by default the timestamp is examined every thirty seconds.

Multiple Server Deployment

For enhanced performance, you can install B2B Data Transformation on multiple Windows or UNIX servers. The following discussion assumes that you use a load balancing module to connect to multiple, identically configured servers. The servers should share the same B2B Data Transformation services. There are two ways to implement a multiple server deployment:

● Shared file system. Store a single copy of the B2B Data Transformation repository on a shared disk. Configure all the servers to access the shared repository.
● Replicated file system. Configure each server with its own B2B Data Transformation repository. Use an automatic file deployment tool to mirror the B2B Data Transformation repository from a source location to the individual servers.

If the second approach is adopted, you must also replicate or touch the update.txt file, which exists in the repository directory; the timestamp of this file notifies the B2B Data Transformation Engine when the last service update was performed. A minimal mirroring sketch is shown below.
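As a sketch of the replicated approach, assuming rsync is used as the automatic file deployment tool, the default UNIX repository path and hypothetical server names, each target server could be refreshed as follows:

    #!/bin/sh
    # mirror the source repository to each server, then touch update.txt so the engines reload
    SRC=/opt/Informatica/ComplexDataExchange/ServiceDB
    for HOST in dtnode1 dtnode2 dtnode3   # hypothetical server names
    do
        rsync -a --delete "$SRC/" "$HOST:$SRC/"
        ssh "$HOST" "touch $SRC/update.txt"
    done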

Designing B2B Data Transformation Services for Deployment

Identifying Versions Currently Deployed

Whenever a service is deployed through B2B Data Transformation Studio, the user is prompted to set the following options:

● Service Name: The name of the service. By default, this is the project name. To ensure cross-platform compatibility, the name must contain only English letters (A-Z, a-z), numerals (0-9), spaces, and the following symbols: %&+-=@_{} B2B Data Transformation creates a folder with the service name in the repository location.
● Label: A version identifier. The default value is a time stamp indicating when the service was deployed.
● Startup Component: The runnable component that the service should start.
● Author: The person who developed the project.
● Description: A description of the service.

Although version tracking is not available in the current version of B2B Data Transformation, deployment does take the service deployment timestamps into account. The deployment options are stored in a log file called deploy.log, which keeps a history of all deployment options set through B2B Data Transformation Studio; the option settings entered in the Deploy Service window are appended to this log file.

Deploying services to different servers through file copying or FTP will not update the deployment log file; it must be updated manually if this information is required.

Security and User Permissions

User permissions are required by users who install and use B2B Data Transformation Studio and Engine. Depending on the B2B Data Transformation application the organization runs, and the host environment used to invoke the services, additional permissions might be required.

To configure data transformations in B2B Data Transformation Studio, users must have the following permissions:

● Read and write permission for the Eclipse workspace location
● Read and execute permission for the B2B Data Transformation installation directory and for all its subdirectories
● Read and write permission for the B2B Data Transformation repository, where the services are deployed
● Read and write permissions for the log application

For applications running the B2B Data Transformation Engine, a user account with the following permissions is required:

● Read and execute permission for the B2B Data Transformation installation directory and for its subdirectories
● Read permission for the B2B Data Transformation repository
● Read and write permission for the B2B Data Transformation log path, or for any other location where B2B Data Transformation applications are configured to store error logs

Aside from user permissions, it is important to identify the user types that will be assigned to work with B2B Data Transformation. In a Windows setup, an administrator or limited user can be registered in the Windows Control Panel. Windows users who have administrative privileges can perform all B2B Data Transformation operations. Limited users, however, do not have write permission for the B2B Data Transformation program directory and are NOT allowed to perform the following:

● Install or uninstall the B2B Data Transformation software
● Install a B2B Data Transformation license file
● Deploy services to the default B2B Data Transformation repository
● Add custom components such as document processors or transformers
● Change the setting values in the Configuration Editor

Backup Requirements

It is necessary to make regular backups of several B2B Data Transformation directories and files. In a production environment where B2B Data Transformation runs, it is important to back up three locations: the configuration file, the service repository, and the autoInclude\user directory. For the development environment, we recommend using a commercial or open-source source control system such as Subversion to manage backup and versioning of the developers' B2B Data Transformation Studio workspaces, in addition to backing up the same locations listed above for the production environment. If you use identical configurations on multiple servers, back up only a single copy of these items. In the event of a server failure, B2B Data Transformation can be reinstalled in the same location as on the failed server and the backup restored. A minimal backup sketch is shown below.
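On a UNIX platform, a minimal sketch of such a production backup, assuming the default installation path, that the default CMConfig.xml resides at the root of that installation directory, and a hypothetical backup target directory, could simply archive the three locations listed above:

    #!/bin/sh
    # archive the configuration file, the service repository and the autoInclude/user directory
    CDE=/opt/Informatica/ComplexDataExchange   # default install location; adjust as needed
    BACKUP_DIR=/backup/b2bdt                   # hypothetical backup target
    STAMP=$(date +%Y%m%d)

    tar -cf "$BACKUP_DIR/b2bdt_$STAMP.tar" \
        "$CDE/CMConfig.xml" \
        "$CDE/ServiceDB" \
        "$CDE/autoInclude/user"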

Failure Handling

If a B2B Data Transformation service fails to execute successfully, it returns a failure status to the calling application. It is the responsibility of the calling application to handle the error. For example, the application can transmit failed input data to a failure queue, or package related inputs in a transaction to ensure that important data is not lost.

In the event of a failure, the B2B Data Transformation Engine generates an event log if event logging has been enabled for the project. To view the contents of the event file, drag the *.cme file into the events pane in B2B Data Transformation Studio. The method used to invoke a B2B Data Transformation service affects how, and whether, events are generated. The following list compares the effect of each invocation method on the generation of events:

● CM_Console: A service deployed with events produces events; a service deployed without events does not.
● Java API: The service runs without events. In case of error, the service is rerun with events.
● C# / .NET: Same as the Java API.
● Agents: No events unless an error occurs, irrespective of how the service was deployed. The same behavior applies to PowerCenter.

While the event log provides a simple mechanism for error handling, it also has a high cost in resources such as memory and disk space for storing the event logs. For anything other than the simplest of projects, it is recommended to design an error handling mechanism into the transformations and calling logic that handles errors and raises the appropriate alerts when errors occur. In many production scenarios, the event log will need to be switched off for optimal performance and resource usage.

Updating Deployed Services

B2B Data Transformation Studio cannot directly update a deployed project in the transformation service repository. To update a data transformation, modify the original transformation project and redeploy it. Note: a different project can be deployed under the existing service name, so technically it does not have to be exactly the original project.

If it is required to track all deployed versions of the data transformation, make a copy of the current service in a separate location, or alternatively consider the use of a source control system such as Subversion. Redeploying overwrites the complete service folder, including any output files or other files stored in it.

It is important to test the deployed service following any modifications. While the Studio environment will catch some errors and block deployment if the transformation is invalid, some types of runtime errors cannot be caught by the Studio environment prior to deployment.

Upgrading B2B Data Transformation Software (Studio and Runtime Environment)

When upgrading from a previous B2B Data Transformation release, existing projects and deployed services can also be upgraded to the current release. The upgrade of projects from B2B Data Transformation version 3.1 or higher is automatic: individual projects can be opened or imported in B2B Data Transformation Studio, and the developer is prompted to upgrade the project if necessary. Once the upgrade is complete, test the project, confirm that it runs correctly, and deploy the service to the production environment.

Another way to upgrade services is through the syntax conversion tool that comes with B2B Data Transformation. It allows multiple projects and services to be upgraded quickly, in an automated operation, and is also used to upgrade global TGP script files, which are stored in the B2B Data Transformation autoInclude\user directory. The syntax conversion tool supports upgrades of projects or services from version 3.1 and higher on Windows, and from release 4 on UNIX-type platforms. Before the upgrade, the tool creates an automatic backup of the existing projects and files. It creates a log file and reports any upgrade errors that it detects; in case of an error, restore the backup, correct the problem, and run the tool again.

It is necessary to organize the projects before running the tool. The tool operates on projects or services that are stored in a single parent directory. It can operate on:

● A B2B Data Transformation Studio version 4 workspace
● A B2B Data Transformation repository
● Any other directory that contains B2B Data Transformation Studio projects or services

Within the parent directory, the projects must be at the top level of nesting, for example:
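For example, a valid parent directory might look like the following listing (the directory and project names are hypothetical; as noted later, each project directory contains a .cmw file named after the project):

    upgrade_source\
        CustomerParser\
            CustomerParser.cmw
            ...
        InvoiceMapper\
            InvoiceMapper.cmw
            ...
        AckSerializer\
            AckSerializer.cmw
            ...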

If the projects are not currently stored in a single parent directory, re-organize them before running the tool; alternatively, the tool can be run separately on the individual parent directories. To run the syntax conversion tool on Windows, go to the B2B Data Transformation folder in the Start menu and click Syntax Conversion Tool. The tool is a window with several tabs where the upgrade settings can be configured.

After the service upgrade is complete, change the repository location to the new location using the Configuration Editor. Test the projects and services to confirm that they work correctly and that their behavior has not changed. On UNIX platforms, run the CM_DBConverter.sh command from the bin directory; only 4.x is supported.

Optionally, you can run the syntax conversion tool from the command line, without displaying the graphical user interface. In an open console, change to the B2B Data Transformation bin directory and run the following command:

● On Windows: CM_DBConverter.bat
● On UNIX: CM_DBConverter.sh -console

Following each switch, leave a space and type the value. If a path contains spaces, you must enclose it in quotation marks. The available switches are listed in the following table; an example invocation follows the table.

-v  (Required)  Version from which you are upgrading (3 or 4). On UNIX, only 4 is supported.
-s  (Required)  Path of the source directory, containing projects or services.
-d  (Optional)  Path of the target directory. If you omit this switch, the tool overwrites the existing directory.
-si (Optional)  Path of the source autoInclude\user directory. If you omit this switch, the tool does not upgrade global TGP files.
-di (Optional)  Path of the target autoInclude\user directory. If you omit this switch, the tool overwrites the existing directory.
-l  (Optional)  Path of the upgrade log file. The default is \SyntaxConversionLog.txt.
-b  (Optional)  Path of the backup directory, where the tool backs up the original projects or services prior to the upgrade. The default is the value of the -s switch concatenated with the suffix _OLD_Backup.
-e  (Optional)  Path of the error directory, where the tool stores any projects or services that it cannot upgrade due to an error. The default is the value of the -s switch concatenated with the suffix _OLD_Failure.
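For example, assuming the command-line form accepts the switches exactly as listed above, a hypothetical Windows invocation that upgrades a version 4 source directory into a new target directory and writes a log might look like the following (all paths are examples only):

    CM_DBConverter.bat -v 4 -s "C:\upgrade_source" -d "C:\upgrade_target" -l "C:\upgrade_logs\SyntaxConversionLog.txt"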

Last updated: 29-May-08 16:47

Establishing a B2B Data Transformation Development Architecture

Challenge

Establish a development architecture that ensures support for team development of B2B Data Transformation solutions; establishes strategies for common development tasks such as error handling and the styles of B2B Data Transformation service authoring; and plans for the subsequent clean migration of solutions between development, test, quality assurance (QA) and production environments that can scale to handle additional users and applications as the business and development needs evolve.

Description

In this Best Practice, the term development architecture means establishing a development environment and establishing strategies for error handling, version control, naming conventions, mechanisms for integration with the host environment and other aspects of developing B2B Data Transformation services that are not specific to a particular solution. Planning for the migration of the completed solution is closely related to the development architecture. This can include transfer of finished and work-in-progress solutions between different members of the same team, between different teams such as development, QA and production teams, and between development, test and production environments.

Deciding how to structure the development environment for one or more projects depends upon several factors. These include technical factors, such as choices for hosting software and host environments, and organizational factors regarding the project team makeup and interaction with operations, support and external test organizations.

Technical factors:

What host environment is used to invoke the B2B Data Transformation services?



What are the OS platform(s) for development, test and production?



What software versions are being used for both B2B Data Transformation and for host environment software?



How much memory is available on development, test and production platforms?



Are there previous versions of the B2B Data Transformation software in use?





What shared technical artifacts, such as XML schemas, are shared between projects, services, applications and developers?

What environments are expected to be used for development, test and production? (Typically development is performed on Windows, while test and production may be AIX, Solaris, Linux, etc.)

Organizational factors:

How do development, test, production and operations teams interact?



Do individual developers work on more than one application at a time?



Are the developers focused on a single project, application or project component?



How are transformations in progress shared between developers?



What source code control system, if any, is used by the developers?



Are development machines shared between developers, either through sequential use of a physical machine, through the use of virtual machines, or through technologies such as Remote Desktop Access?

How are different versions of a solution, application or project managed?

What is the current stage of the project life cycle? For example, has the service being modified already been deployed to production?

Do developers maintain or create B2B Data Transformation services for multiple versions of the B2B Data Transformation products?

Each of these factors plays a role in determining the most appropriate development environment for a B2B Data Transformation project. In some cases, it may be necessary to create different approaches for different development groups according to their needs. B2B Data Transformation, together with the B2BDT Studio environment, offers flexible development configuration options that can be adapted to fit the needs of each project or application development team. This Best Practice is intended to help the development team decide what techniques are most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected.

Terminology B2B Data Transformation (abbreviated as “B2BDT”) is used as a generic term for the parsing, transformation and serialization technologies provided in Informatica’s B2B Data Exchange products. These technologies have been made available through the Unstructured Data Option for PowerCenter, and as standalone products such as B2B Data Transformation and its earlier versions known respectively as B2B Data Transformation, PowerExchange for Complex Data (and formerly known as ItemField ContentMaster). The B2B Data Transformation development environment uses the concepts of workspaces, projects and services to organize its transformation services. The overall business solution or solutions may impose additional structure requirements such as organizing B2BDT services into logical divisions such as solutions, applications, projects and business services corresponding to the needs of the business. There may be multiple B2BDT services corresponding to these logical solution elements. We will use the terms B2BDT service to refer to a single Complex Exchange transformation service, and B2BDT project to refer to the B2B Data Transformation project construct as exposed within the B2BDT Studio environment. Through out this document we use the term developers to refer to team members who create B2BDT services, irrespective of their actual roles in the organization. Actual roles may include business analysts, technical staff in a project or application development teams, members of test and QA organizations, or members of IT support and helpdesk operations who create new B2BDT transformations or maintain existing B2BDT transformations.

Fundamental Aspects of B2BDT Transformation Development There are a number of fundamental concepts and aspects to development of B2BDT transformations that affect design of the development architecture and distinguish B2BDT development architecture from other development architectures.

B2BDT is an Embedded Platform

When B2BDT transformations are placed into production, the runtime is typically used in conjunction with other enterprise application or middleware platforms. The B2BDT runtime is typically invoked from other platform software (such as PowerCenter, BizTalk, WebMethods or other EAI or application server software) through the use of integration platform adapters, custom code or some other means. While it is also possible to invoke B2BDT services from a command line utility ("CM_Console") without requiring the use of additional platform software, this is mainly provided for quick testing and troubleshooting purposes. CM_Console does not provide access to all available system memory or scale across multiple CPUs; specifically, its restrictions include always running the B2BDT transformation engine in-process and using the local directory for event output.

B2BDT does support using the runtime as a "server" process to be invoked from other software on the same physical machine (in addition to offering the ability to invoke the B2BDT runtime engine in-process with the calling environment). While this does constitute a "server" process in developer terminology, it does not provide the full server administration or monitoring capabilities that are typical of enterprise server products. It is the responsibility of the calling environment to provide these features, and part of the overall solution architecture is to define how these facilities are mapped to a specific B2BDT implementation.

B2BDT Deployment with PowerCenter

While the B2BDT runtime is usually deployed on the same machine as the host EAI environment, it is possible to locate the B2BDT services (stored in a file based repository) on the same machine or a remote machine. It is also possible to deploy B2BDT services to be exposed as a set of web services, and in this case the hosting web/ application server forms the server platform that provides these server software services. The web service platform in turn will invoke the B2BDT runtime either in-process with the web service stack or as a separate server process on the same machine. Note: Modern application servers often support mechanisms for process, application and thread pooling which blurs the distinctions between the effects of in-process vs. server invocation modes. In process invocation can be thought of as running the B2BDT transformation engine as a shared library within the calling process. B2BDT Deployed as Web Services Used With Web Application

Sample Data for Parse by Example and Visual Feedback During the process of authoring a B2BDT transformation, sample data may be used to perform actual authoring through drag-and-drop and other UI metaphors. Sample data is also used to provide visual confirmation at transformation design time of what elements in the data are being recognized, mapped and omitted. Without sample data, there is no way to verify correctness of a transformation or get feedback on the progress of a transformation during authoring. For these reasons, establishing a set of sample data for use during authoring is an important part of planning for the development of B2BDT transformations. Sample data to be used for authoring purposes should be representative of actual data used during production transformations but sized to avoid excessive memory requirements on the studio environment. While the studio environment does not impose specific limits on data to be processed, the cumulative effects of using document preprocessors within the studio environment in conjunction with use of the B2BDT event reporting can impose excessive memory requirements.

Eclipse-Based Service Authoring Environment

The B2BDT authoring environment, B2BDT Studio, is based on the widely supported Eclipse platform. This has two implications:

1. Many switches, configuration options, techniques and methods of operation that affect the Eclipse environment are also available in B2BDT Studio. These include settings for memory usage, the version of the JVM used by the Studio environment, etc.
2. Eclipse plug-ins that support additional behaviors and/or integration of other applications, such as source code control software, can be used with the B2BDT Studio environment. While the additional features offered by these plug-ins may not be available in the B2BDT authoring perspective, by switching perspectives B2BDT developers can often take advantage of the features and extensions provided by these plug-ins.

Note: An Eclipse perspective is a task-oriented arrangement of views, menu options, commands, etc. For example, while using the B2BDT authoring perspective, features for the creation of Java programs or source control will not be visible, but they may be accessed by changing perspectives. Some features of other perspectives may be incompatible with use of the B2BDT authoring perspective.

Service Authoring Environment Only Supported on Windows OS Variants While B2BDT services may be deployed and placed into production on many environments such a variety of Linux implementations, AIX, Solaris and Windows Server OS variations, the B2BDT Studio environment used to author B2BDT services only runs on Windows OS variants such as Windows 2000 and Windows XP. There are a number of features in B2BDT that may only run on Windows and some custom components such as custom COM based actions or transformations are Windows specific also. This means it is possible to create a transformation within the Studio environment that will only run on the development environment and may not be deployed into production on a non-Windows platform.

File System Based Repository for Authoring and Development B2BDT uses a file system based repository for runtime deployment of B2BDT services and a similar file based workspace model for the physical layout of services. This means that mechanisms for sharing of source artifacts such as schemas and test data; projects and scripts and deployed solutions must be created using processes and tools external to B2BDT. These might include use of software such as source control systems for sharing of transformation sources, third party application deployment software and processes implemented either manually or through scripting environments for management of shared artifacts, deployment of solutions etc.

Support for Learn-by-Example Authoring Techniques Authoring of a B2BDT solution may optionally use supplied sample data to determine how to extract or parse data from a representative source data construct. Under this mechanism, a transformation developer may elect to let the B2BDT runtime system decide how to extract or parse data from a sample data input. When this mechanism is used, the sample data itself becomes a source artifact for the transformation and changes to the sample data can affect how the system determines the extraction of appropriate data. Use of Learn by Example in B2BDT

When using learn-by-example transformations, the source data used as an example of the data must be deployed with the B2BDT project as part of the production B2BDT service. It is recommended in many cases to use the learn by example mechanism as a starting point only and to use specific transformation (non learn-by- example) mechanisms for data transformation with systems requiring a high degree of fine control over the transformation process. If learn by example mechanisms are employed, changes to the sample data should be treated as requiring the same degree of test verification as changes to the transformation scripts.

Support for Specification Driven Transformation Authoring Techniques As B2BDT transformations are also represented as a series of text files, it is possible to parse a specification (in a Microsoft Word, Microsoft Excel , Adobe PDF or other format document) to determine how to generate a transformation. Under this style of development, the transformation developer would parse one or more specifications rather than the actual source data and generate one or more B2BDT transformations as output. This can be used instead of or in addition to, standard transformation authoring techniques. Many of the Informatica supplied B2BDT libraries are built in this fashion. Note: Typically at least one transformation will be created manually in order to get an approximation of the target transformation desired.

In these cases, specifications should be treated as source artifacts and changes to specifications should be verified and tested (in conjunction with the spec driven transformation services) in the same manner as changes to the transformations.

B2B Data Transformation Project Structure The B2B Data Transformation Studio environment provides the user interface for the development of B2B Data Transformation services. It is based on the open source Eclipse environment and inherits many of its characteristics regarding project organization. From a solution designer’s viewpoint, B2B Data Transformation solutions are organized as one or more B2B Data Transformation projects in the B2BDT Studio workspace. Studio Environment Indicating Main Views

The B2BDT workspace defines the overall set of transformation projects that may be available to a developer working in a single studio session. Developers may have multiple workspaces, but only one workspace is active within the studio environment at any one time. All artifacts such as scripts, project files and other project elements are stored in the file system as text files and can be versioned using traditional version control systems. Each B2BDT project can be used to publish one or more B2BDT services. Typically a single project is only used to publish a single primary service although it may be desirable to publish debug or troubleshooting variants of a project under different service names. Note: The same B2BDT project can be published multiple times specifying different entry point or configuration parameters. The syntax displayed in the studio environment differs from the text representation of the script files such as TGP files, which make up the B2B Data Transformation project. This will be discussed further when reviewing considerations for multi-person team development. From a physical disk storage viewpoint, the workspace is a designated file system location where B2BDT Studio stores a set of B2BDT projects. By default, there is a single B2B Data Transformation workspace, which is located in the directory My Documents\Informatica\ComplexDataExchange\4.0\workspace All projects in the current B2B Data Transformation Studio workspace are displayed in the Explorer view.

Note: It is possible to have other workspaces for Java projects, etc. These are not visible in the Complex Data Authoring perspective in B2B Data Transformation Studio. Optionally, it is possible to create more than one workspace. For example, a solution designer might have multiple workspaces for different sets of B2B Data Transformation projects.

TIP: For B2BDT Studio 4.3 and earlier releases, use the B2BDT Studio\Eclipse\Workspace setting in the B2BDT configuration editor to change the workspace. In B2BDT 4.4, you may change the workspace by using the "File | Switch Workspace" menu option.

Each B2B Data Transformation project holds the business rules and operations for one or more transformation services. Once completed or while under development, the project may be published to the B2B Data Transformation repository to produce a deployable transformation service. During the publication of a transformation service, an entry point to the service is identified and a named transformation service is produced that specifies a particular transformation project along with a well known entry point where initial execution of the transformation service will take place. It is possible to publish the same project multiple times with different names, identifying a different entry point on each deployment or even to publish the same project multiple times with the same entry point under different names. Published B2B Data Transformation services are published to the runtime repository of services. In B2B Data Transformation, this takes the form of a file system directory (typically c:\program files\Informatica\ComplexDataExchange \ServiceDB) known as the service DB. This may be located on a local or network accessible file system Once the development version of a transformation service has been published, it may then be copied from the service database location by copying the corresponding named directory from the service DB location. This service directory can then be deployed by copying it to the service db directory on a production machine.

File System View of Workspace

The workspace is organized as a set of sub-directories, with one sub-directory representing each project. A specially designated directory named ".metadata" is used to hold metadata about the current workspace. Each subdirectory is named with the project name for that project.

Workspace Layout

Behind the scenes (by default) B2B Data Transformation creates a new project in a directory corresponding to the project name rooted in the Eclipse workspace. (In B2BDT 4.3, this can be overridden at project creation time to create projects outside of the workspace; while in B2BDT 4.4, the studio environment will determine whether it needs to copy a project into the workspace. If the path specified for the imported project is already within a workspace, B2BDT will simply add the project to the list of available projects in the workspace). A .cmw file with the same primary project name will also be created within the project directory – the cmw file defines what schemas, scripts and other artifacts make up the project. When a project is published to a specific transformation service, a directory name is created that corresponds to the published service name in the B2B Data Transformation Service DB which forms a runtime repository of services. A CMW file corresponding to the service name will be created in the same directory. Creating a new project while in the studio environment will cause changes to be made to the metadata directory in order for the project to be discoverable in the B2BDT Studio environment.
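For instance, a workspace containing two hypothetical projects might look like the following listing (project and file names are illustrative only; the .cmw, TGP script, schema and sample data files correspond to the project artifacts described above):

    workspace\
        .metadata\
        CustomerParser\
            CustomerParser.cmw
            CustomerParser.tgp
            sample_customer.txt
        InvoiceMapper\
            InvoiceMapper.cmw
            InvoiceMapper.tgp
            invoice.xsd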

File System View of Service DB

The service database is organized as a set of sub-directories under the service database root, with one subdirectory representing each deployed service. When a service is deployed, it is copied along with the settings in effect at the time of deployment. Subsequent changes to the source project will not affect deployed services unless the project is redeployed under the same service name. It is possible to deploy the same B2BDT project under multiple different service names.

TIP: If a project contains a sample data file with the extension .cmw, it can cause the B2BDT runtime to detect an error with that deployed service, which can prevent all services from being detected by the runtime. If it is necessary to have a sample data file with the extension .cmw, use a different extension for the sample data and adjust the scripts accordingly. This scenario commonly occurs with specification-driven transformations.

Solution Organization

The organizational structure of B2B Data Transformation solutions is summarized below; each element is listed with its parent:

● Service Repository (parent: none): the top-level organization structure for published Complex Data services.
● Published Complex Data Service (parent: repository): there may be multiple services in a repository.
● Project (parent: workspace): there may be multiple projects in a studio workspace.
● TGP Script (parent: project).
● XML Schema (parent: project).
● Parser, Mapper, Serializer (parent: TGP script): however, naming is global to a project and not qualified by the TGP script name.
● Global Variables, Actions, Markers (parent: TGP script): however, naming is global to a project and not qualified by the TGP script name.

Planning for B2BDT Development While the overall solution life cycle may encompass project management, analysis, architecture, design, implementation, test, deployment and operation following a methodology such as Informatica’s Velocity methodology, from a development architecture perspective we are mainly concerned with facilitating actual implementation of transformations, subsequent test and deployment of those transformations.

Pre-Implementation Development Environment Requirements During the analysis phase we are mainly concerned with identifying the business needs for data transformations, the related data characteristics (data format, size, volume, frequency, performance constraints) and any existing constraints on target host environments (if candidate target host environments have already been identified). Due to the nature of how B2BDT transformations are built with their utilization of sample data during the authoring process, we also need to plan for obtaining data and schema samples as part of the requirements gathering and architecture phases. Other considerations include identification of any security constraints on the use of storage of data and identification of the need to split data, any sizing and scaling of the eventual system which will depend to a large extent on the volume of data, performance constraints, responsiveness targets etc. For example, HIPAA specifications include privacy restrictions and constraints in addition to defining message and transactions formats.

Pre-Development Checklist

While many of these requirements address the solution or solutions as a whole rather than the development environment specifically, there are a number of criteria that have a direct impact on the development architecture:















What sample data will be used in creation of the B2BDT services? The size of the sample data used to create the B2BDT services will determine some of the memory requirements for development environments. Will specific preprocessors be required for B2BDT transformation authoring? Some preprocessors such as Excel or Word preprocessors require additional software such as Microsoft Office to be deployed to the development environments. In some cases, custom preprocessors and / or transformers may need to be created to facilitate authoring of transformation solutions. Are there specific libraries being used such as the B2BDT Accord, EDI or HIPAA libraries? Use of specific libraries will have an impact on how transformations are created and on the specific licenses required for development usage of the Complex Data Transformation tools. Are custom components being created such as custom actions, transformers, preprocessors that will be shared among developers of B2BDT transformations? In many cases, these custom components will need to be deployed to each B2BDT studio environment and a process needs to be defined for handling updates and distribution of these components Are there any privacy or security concerns? Will data need to be encrypted / decrypted? Will cleansed data be needed for use with learn-by-example based transformations? How will the B2BDT runtime be invoked? Via a platform adapter, custom code, command line, web services, HTTP etc.? Each of these communication mechanisms may impose specific development requirements with regard to testing of work in progress, licensing of additional B2BDT components, performance implications and design choices? Will data splitting be needed? Depending on the choice of 32 bit vs. 64 bit B2B Data Transformation runtimes, and both host software platform and underlying OS and hardware platform, data may need to be split through the use of B2BDT Streaming capabilities, custom transformations or preprocessors How are B2BDT transformations created? What artifacts affect their creation? What is the impact of changes to specifications, schemas, sample data, etc? In some cases such as spec driven transformation, changes to specifications go beyond design change requests but may require actual rerunning of transformations that produce other executable artifacts, documentation, test scripts etc.

Establishing the Development Environment B2B Data Transformation services are defined and designed in the B2B Data Transformation Studio environment. The B2B Data Transformation Studio application is typically installed on the developer’s local machines and allows the visual definition of transformations, the usage of libraries and use of import processes to build one or more B2B Data Transformation services. All extensions used during authoring such as custom transformations, preprocessors, actions etc., must be installed in each B2BDT Studio installation. While preprocessors are provided with the studio environment to support manipulation of files types such as Excel, Word and PDF files within the studio environment. For some formats it may be necessary to create custom preprocessors to optimize usage of source data within the B2BDT studio environment. Note: In some cases, additional optional studio features may need to be licensed in order to access necessary preprocessors and / or libraries During transformation authoring, B2BDT services are organized as a set of B2BDT projects within a B2BDT workspace. Each B2BDT project consists of a set of transformation scripts, XML schema definitions, and sample data used in authoring and / or runtime of the transformation. B2BDT projects and workspaces use file system based artifacts for all aspects of the definition of a B2BDT project. Due to the use of file based artifacts for all B2BDT transformation components, traditional source code controls systems may be used to share work in progress.

Development Environment Checklist

Many of the implementation issues will be specific to the particular solution. However there are a number of common issues for most B2BDT development projects: ●

What is the host environment and what tools are required to develop and test against that environment? While the B2BDT studio is a Windows only environment, addition consideration may need to be given to the ultimate host environment regarding what tools and procedures are required to deploy the overall solution and troubleshoot it on the host environment.



What is the communication mechanism with the host environment? How does the host environment invoke B2BDT transformations? Is it required for work in progress testing or can the invocation method be simulated through the use of command line tools, scripts or other means?



What are security needs during development? Deployment? Test? How will they affect the development architecture?



What are memory and resource constraints for the development, test and production environments?



What other platform tools are needed during development?



What naming conventions should be used?



How will work be shared between developers?



How will different versions of transformations be handled?





Where or how are intermediate XML schemas defined and disseminated? Are they specific to individual services? Shared between services? Externally defined either by other project teams or by external standards bodies? What is the folder and workspace layout for B2BDT projects?

Supporting Multiple Users

The B2BDT Studio environment is intended to be installed on each developer's own machine or environment. While advances in virtualization and technologies such as Windows Remote Desktop connections theoretically allow multiple users to share the same B2BDT installation, the B2BDT Studio environment does not implement mechanisms, such as file locking during authoring of transformations, that are needed to prevent multiple users from overwriting each other's work.

TIP: For PowerCenter users, it is important to note that B2BDT does not implement a server-based repository environment for work in progress, and other mechanisms are needed to support sharing of work in progress. The service database may be shared between different production instances of B2BDT by locating it on a shared file system mechanism such as a network file share or SAN. The B2BDT development environment should be installed on each B2BDT transformation author's private machine.

The B2BDT Studio environment does support multiple users of the same development environment; however, each user should be assigned a separate workspace. As the workspace, along with many other default B2BDT configuration parameters, is stored in the configuration file, the environment needs to be configured to support multiple configuration files, with one assigned to each user.


Creating Multiple Configurations

To create multiple configurations, you can edit and copy the default configuration file.

1. Make a backup copy of the default configuration file, CMConfig.xml. At the end of the procedure, you must restore the backup to the original CMConfig.xml location.
2. Use the Configuration Editor to edit the original copy of CMConfig.xml. Save your changes.
3. Copy the edited CMConfig.xml to another location or another filename.
4. Repeat steps 2 and 3 to create additional versions of the configuration file. In this way, you can define as many configurations as you need.
5. Restore the backup that you created in step 1. This ensures that the default configuration remains as before.
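Where many per-user configurations are needed, the copy steps above lend themselves to simple scripting. The following minimal sketch (Python) mirrors the backup/copy/restore procedure; the install path and user names are assumptions for illustration only, and the actual editing of each copy is still performed in the Configuration Editor.

import shutil
from pathlib import Path

# Assumption: adjust to the actual B2BDT installation directory.
install_dir = Path(r"C:\Informatica\B2BDT")
default_cfg = install_dir / "CMConfig.xml"
backup_cfg = install_dir / "CMConfig.xml.bak"

# Step 1: back up the default configuration before editing it.
shutil.copyfile(default_cfg, backup_cfg)

# Steps 2-4: after each round of edits in the Configuration Editor, snapshot
# the edited CMConfig.xml under a user-specific name (in practice, edit the
# file between snapshots; the user names here are illustrative).
for user in ("user1", "user2"):
    shutil.copyfile(default_cfg, install_dir / f"CMConfig_{user}.xml")

# Step 5: restore the backup so the default configuration remains as before.
shutil.copyfile(backup_cfg, default_cfg)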

Selecting the Configuration at Runtime

You can set the configuration file that B2B Data Transformation Engine should use in any of the following ways:

1. Define an environment variable called IFConfigLocation4. The value of the variable must be the path of a valid configuration file, for example: c:\MyIFConfigLocation4\CMConfig1.xml
2. On Unix only: store the configuration file under the name CMConfig.xml in the user's home directory.
3. Use the default configuration file, CMConfig.xml, in its default location.

When B2B Data Transformation Engine starts, it searches these locations in sequence and uses the first configuration file that it finds.

Example 1: Suppose you want to run two applications that run B2B Data Transformation Engine with different configuration files. Each application should set the value of IFConfigLocation4 before it starts B2B Data Transformation Engine.

Example 2: Two users want to run B2B Data Transformation Engine with different configurations on the same Unix-type platform. Store their respective configuration files in their home directories; both files must have the name CMConfig.xml. Alternatively, store a CMConfig.xml file in the home directory of one of the users; the other user uses the default configuration file.

Multiple JREs

On Windows platforms, the JVM Location parameter of the configuration file defines the JRE that B2B Data Transformation should use. By using multiple configuration files, you can switch JREs. On Unix-type systems, the configuration file does not contain a JVM Location parameter; to switch JREs, you must load a different environment-variable file.

Running Multiple Configurations Concurrently

B2B Data Transformation Engine loads the configuration file and the environment variables when it starts. After it starts, changing the configuration file or the environment variables has no effect. This means that two applications can use different configurations concurrently. Each application uses the configuration that was in effect when its instance of B2B Data Transformation Engine started.
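As a minimal sketch of the first option, a host application can be launched with IFConfigLocation4 set in its environment so that the engine it embeds picks up a specific configuration file. The host executable name below is purely illustrative.

import os
import subprocess

env = os.environ.copy()
# Point the engine at a specific configuration file (path from the example above).
env["IFConfigLocation4"] = r"c:\MyIFConfigLocation4\CMConfig1.xml"

# Launch the application that hosts B2B Data Transformation Engine with the
# modified environment; the engine reads the variable when it initializes.
subprocess.run(["my_host_application.exe"], env=env, check=True)  # illustrative command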

While this can theoretically allow Windows-based sharing mechanisms such as Remote Desktop Connection to share the same installation of B2BDT, it is important to specify different workspaces for each user because of the possibility of files being overwritten by different users.

As a best practice, it is recommended that each user of B2BDT Studio is provided with a separate install of the B2BDT Studio environment on a dedicated machine. Sharing of work in progress should be accomplished through the use of a source control system rather than multiple users using the same workspace simultaneously. In this manner, each transformation author’s environment is kept separate while allowing multiple authors to create transformations, and share them between each author’s environment.

Using Source Code Control for Development

As B2BDT transformations are all defined as text-based artifacts (scripts, XML schema definitions, project files, etc.), B2BDT transformation authoring lends itself well to integration with traditional source code control systems. There are a number of suitable source code control systems on the market, and open-source environments such as CVSNT and Subversion both have Eclipse plug-ins available that simplify the process. While source code control is a good mechanism for sharing work between multiple transformation authors, it also serves as a good mechanism for reverting to previous versions of a code base, keeping track of milestones and other change control aspects of a project. Hence it should be considered for all but the most trivial of B2B Data Transformation projects, irrespective of the number of transformation authors.

What should be placed under source code control?

All project files that make up a transformation should be checked in when a transformation project is checked in. These include sample data files, TGP script files, B2BDT project files (ending with the extension .CMW), and XML Schema definition files (ending with the extension .XSD).

During test execution of transformations, the B2BDT engine and Studio environment generate a results subdirectory in the project source directory. The files contained in this directory include temporary files generated during project execution under the Studio environment (typically "output.xml") and the Studio events file (ending in .CME). These should not be checked in and should be treated as temporary files. When a service is deployed, a deploy.log file is generated in the project directory. While it may seem desirable to keep track of the deployment information, a different deployment log file will be generated on each user's machine.

What are the effects of different authoring changes?

The file system changes that occur when different actions are taken are as follows:

● Creating a new B2BDT project: a new B2BDT project directory is created in the workspace.
● Importing an XML schema: the schema and its dependencies are copied to the B2BDT project directory.
● Adding a new script: a new TGP file in the B2BDT project directory; modifications to the CMW file.
● Adding a new test data file: files are copied to the B2BDT project directory.
● Running a transformation within the Studio environment: changes to the results directory; new entries in the events .CME file.
● Modifications to the project preferences: changes to the CMW file.
● Modifications to the Studio preferences: modifications to the B2BDT project file; modifications to the metadata directory in the workspace.
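A small housekeeping script can help keep these temporary artifacts out of the source code control system. The sketch below (Python) removes the results directory, Studio event files and deploy.log from each project before check-in; the workspace path and the literal directory name "results" are assumptions.

import shutil
from pathlib import Path

workspace = Path(r"C:\B2BDT\workspace")   # assumed workspace location

for project in (p for p in workspace.iterdir() if p.is_dir()):
    # Remove the results subdirectory generated during test execution
    # (the directory name "results" is assumed here).
    results = project / "results"
    if results.is_dir():
        shutil.rmtree(results)
    # Remove Studio event files (*.cme) and the deployment log.
    for temp in list(project.rglob("*.cme")) + list(project.glob("deploy.log")):
        temp.unlink()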

Special Considerations for Spec Driven Transformation

Spec driven transformation is the use of a B2BDT transformation to generate a different B2BDT transformation based on one or more inputs that form the "specification" for a transformation. As a B2BDT transformation is itself a set of text files, it is possible to automate the generation of B2BDT transformations, with or without subsequent user modification. Specifications may include Excel files that define mappings between source and target data formats, PDF files generated by a standards body, or a variety of custom specification formats. As the specification itself becomes part of what determines the generated transformation scripts, the specifications should be placed under source code control. In some cases, the time taken to generate the transformation may be too great to regenerate the transformations on every team merge event, or it may be necessary to preserve the generated transformation for compliance with auditing procedures. In these cases, it is necessary to place the generated transformations under source code control as well.

Sharing B2BDT Services

In addition to defining production transformations, B2BDT supports the creation of shared components such as shared library transformations that may be shared using the "autoinclude" mechanism. B2BDT also supports the creation of custom transformers, preprocessors and other components that may be shared across users and B2BDT projects. These should all be placed under source code control and must also be deployed to production B2BDT environments if used by production code. Note: For PowerCenter users, these can be thought of as the B2B Data Transformation equivalent of Mapplets and Worklets and offer many of the same advantages.

Sharing Metadata Between B2BDT Projects

Many B2BDT solutions are comprised of multiple transformation projects. Often there is shared metadata, such as XML schema definitions, and other shared artifacts.


When an XML schema is added to a project, local copies of the schema, along with any included schemas, are placed in the project directory. If one or more schemas are used in multiple projects, they must be copied to each project when a change occurs to the schema. One recommendation for sharing schemas is to place the schemas and other shared artifacts into a dummy project; when a schema changes, transformation authors sync that project and copy the schemas from the dummy project to each of the other projects. This copy mechanism can be added to synchronization scripts, as shown in the sketch below. In these cases, the local copy of the shared schema should not be placed under source control.
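A schema synchronization helper might look like the following sketch. It assumes a dummy project named SharedSchemas holds the master copies; the workspace path and project names are illustrative assumptions.

import shutil
from pathlib import Path

WORKSPACE = Path(r"C:\B2BDT\workspace")          # assumed workspace location
SHARED_PROJECT = WORKSPACE / "SharedSchemas"     # dummy project holding master schemas
CONSUMING_PROJECTS = ["OrderParser", "InvoiceSerializer"]  # illustrative project names

for xsd in SHARED_PROJECT.glob("*.xsd"):
    for project in CONSUMING_PROJECTS:
        target = WORKSPACE / project / xsd.name
        shutil.copyfile(xsd, target)             # overwrite the local copy in each project
        print(f"Updated {target}")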

Using Multiple Workspaces

A typical B2BDT solution may be comprised of multiple B2BDT transformations, shared components, schemas and other artifacts. These transformations and other components may all be part of the same logical aspect of a B2B Data Transformation solution, or may form separate logical aspects of a solution. Some B2B Data Transformation solutions result in the production of hundreds of transformation services, parsers and B2BDT components. When the B2BDT Studio environment is launched, it attempts to load all transformation projects into memory. While B2BDT Studio allows closing a project to conserve memory and system resources (by right-clicking on the project and selecting the close option), large numbers of B2BDT projects can make use of the Studio environment unwieldy.

Closing a Project Within B2BDT Studio

Reopen Project Using Open Option


There may also be a need (due to complexity, security or other reasons) to separate work between different developers so that only some projects need to be opened within a given developer's workspace. For these reasons (number of transformations, separation of logical aspects of the solution, enforcement of change control), it may be appropriate to use separate workspaces to partition the projects.

Staging Development Environments

When there are multiple developers on a project and/or large numbers of transformations, it is recommended to have a staging development environment where all transformations are assembled prior to deployment to test environments. While it is possible to have each developer transfer their work to the staging development environment directly, it is recommended that the staging development environment is synchronized from the source code control system. This enforces good check-in practices, as only those transformations that have been checked in will be propagated to the staging development environment. It is also possible to require that each developer publishes their working services to a local service DB on their machine and uses source code control to check in the published services. If this approach is chosen, it should be considered in addition to, not instead of, using source code control to manage work in progress. In Agile development methodologies, one of the core concepts is always having a working build available at any time; by using source code control to manage working copies of deployed services, it is possible to enforce this concept. When the target platform is not Windows, it is also necessary to consider where the version of the services for that platform should be assembled. For example, you can assemble the non-Windows version of the B2BDT solution on the staging development machine and either transfer the transformation services to the QA environment manually or use additional check-in/check-out procedures to perform the transfer.
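One way to keep the staging development environment honest is to refresh it only from the source code control system. The sketch below assumes a Subversion working copy at an illustrative path; an update from the repository pulls down only what has actually been checked in.

import subprocess

STAGING_WORKING_COPY = r"/opt/b2bdt/staging"   # assumed path of the staging checkout

# Refresh the staging area from source control; anything a developer has not
# checked in simply does not appear here.
subprocess.run(["svn", "update", STAGING_WORKING_COPY], check=True)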

Synchronization of Changes from the Source Code Control System

If a synchronization operation in a source code control system adds an additional project to the workspace, it is necessary to use the file import command in the B2BDT Studio environment to import the project into the workspace. If a change occurs to a schema while the Studio environment is open, it is sometimes necessary to switch to the schema view in the Studio environment to detect the schema change.

Best Practices for Multi-Developer or Large Transformation Solutions

● DO install a separate instance of the B2BDT Studio environment on each author's machine.
● DO use a source code control system to synchronize and share work in progress.
● DO consider using a dummy project to share common metadata and artifacts such as XML schemas.
● DON'T rely on Remote Desktop Connection to share simultaneous usage of B2BDT Studio for the same workspace.
● DO use a separate workspace location for each user on the same machine.
● DO place shared components under version control.
● DO define scripts to aid with synchronization of changes to shared resources such as schemas.
● DO consider use of a staging development environment for projects with a large number of transformations, multiple transformation authors or non-Windows target platforms.
● DO consider having an identical folder structure if each developer has a dedicated machine.

Configuring the B2BDT Environment

B2BDT configuration settings can be set through a number of means. These include the B2BDT Configuration application (which modifies the CMConfig.xml configuration file), global properties in the B2BDT Studio configuration environment, project-specific properties on a B2BDT project, and platform environment variables. The B2BDT Configuration application allows global B2BDT properties to be set through a GUI-based application. Changing property settings through the Configuration application causes changes to be made to the CMConfig.xml file (once saved).

B2BDT Configuration Application

The Configuration application allows setting of properties such as JVM parameters, thread pool settings, memory available to the Studio environment and many other settings. Consult the administrator's guide for a full list of settings and their effects. Properties set using the B2BDT Configuration application affect both the operation of the standalone B2BDT runtime environment and the behavior of the B2BDT Studio environment.

Within the B2BDT Studio environment, properties may be changed for the Studio environment as a whole and on a project-specific basis.

B2BDT Studio Preferences

The B2BDT Studio preferences allow customization of properties that affect all B2BDT projects, such as which events are generated for troubleshooting, logging settings, auto-save settings and other B2BDT Studio settings.

B2BDT Project Properties


Project properties may be set in the B2BDT Studio environment specific to a B2BDT project. These include settings such as the encoding being used, namespaces used for XML Schemas, control over the XML generation, control over the output from a project and other project-specific settings. Finally, OS environment variables are used to set aspects of the system such as the Java classpath to be used, the location of the configuration file for a specific user, the home location of the installed B2BDT instance to be used, library paths, etc. Some typical configuration items and where they are set:

● Memory for Studio: B2BDT Configuration application
● JVM / JRE usage: B2BDT Configuration application
● Tuning parameters (threads, timeouts, etc.): B2BDT Configuration application
● User-specific settings: use an environment variable to point to a different configuration file
● Memory for runtime: B2BDT Configuration application
● Transformation encoding, output and event generation settings: project properties
● Workspace location: B2BDT Configuration application (B2BDT 4.3, formerly known as PowerExchange for Complex Data); B2BDT Studio (B2BDT 4.4)
● Event generation: set in project properties
● Repository location: B2BDT Configuration application

Development Configuration Settings

The following settings need to be set up for correct operation of the development environment:

● Java home directory – set using the "CMConfiguration | General | Java" setting in the Configuration Editor.
● Java maximum heap size – set using the "CMConfiguration | General | Java" setting in the Configuration Editor.
● Repository location – needed to deploy projects from within Studio.
● JVM path for use with the Studio environment.
● B2BDT Studio Eclipse command line parameters – used to set the memory available to the Studio environment. Use -Xmx to set the maximum allocation pool size and -Xms to set the initial allocation pool size (for example, -Xmx1024m).
● Control over project output – by default, automatic output is enabled. This needs to be switched off for most production-quality transformations.
● Use of event files – disable for production.
● Use of working encoding.

For most development scenarios, a minimum of 2 GB of memory is recommended for authoring environments.

Development Security Considerations

The user under which the Studio environment is running needs write access to the directories where logging occurs and where event files are placed, read and write access to the workspace locations, and read and execute access to JVMs and any tools used in the operation of preprocessors. The B2BDT transformation author needs read and execute permissions for the B2BDT install directory and all of its subdirectories. In some circumstances, the user under which a transformation is run differs from the logged-in user. This is especially true when running under the control of an application integration platform such as BizTalk, or under a web services host environment. Note: Under IIS, the default user identity for a web service is the local ASPNET user. This can be configured in the AppPool settings, in the .Net configuration settings and in the web service configuration files.

Best Practices – Workspace Organization

As B2B Data Transformation Studio will load all projects in the current workspace into the Studio environment, keeping all projects under design in a single workspace leads to both excessive memory usage and logical clutter between transformations belonging to different, possibly unrelated, solutions. Note: B2B Data Transformation Studio allows projects to be closed to reduce memory consumption. While this helps with memory consumption, it does not address the logical organization benefits of using separate workspaces.

TIP: Right-click on the B2BDT project node in the explorer view to open or close a B2BDT project in the workspace. Closing a project reduces the memory requirements in the Studio environment.

Separate Workspaces for Separate Solutions

For distinct logical solutions, it is recommended to use separate workspaces to organize the B2BDT projects relating to each solution. The B2B Data Transformation Studio configuration editor may be used to set the current workspace:

Separate Transformation Projects for Each Distinct Service

From a logical organization perspective, it is easier to manage Complex Data solutions if only one primary service is published from each project. Secondary services from the same project should be reserved for the publication of test or troubleshooting variations of the same primary service. The one exception to this is where multiple services are substantially the same, with the same transformation code but with minor differences to inputs. One alternative to publishing multiple services from the same project is to publish a shared service which is then called by the other services in order to perform the common transformation routines. For ease of maintenance, it is often desirable to name the project after the primary service which it publishes. While these do not have to be the same, it is a useful convention and simplifies the management of projects.

Implementing B2BDT Transformations

There are a number of considerations to be taken into account when looking at the actual implementation of B2BDT transformation services:

● Naming standards for B2BDT components
● Determining need and planning for data splitting
● How B2BDT will be invoked at runtime
● Patterns of data input and output
● Error handling strategies
● Initial deployment of B2BDT transformations
● Testing of B2BDT transformations

Naming Standards for Development

While naming standards for B2BDT development are the subject of a separate Best Practice, the key points can be summarized as follows:

● B2BDT service names must be unique.
● B2BDT project names must be unique.
● Avoid the use of file system names for B2BDT artifacts. For example, do not use names such as CON: as they may conflict with file system names.
● Avoid the use of names inconsistent with programming models. Consider that the B2BDT service name or service parameter may need to be passed as a web service parameter, or may drive the naming of an identifier in Java, C, C# or C++.
● Avoid names that are invalid as command line parameters. As authors may need to use command line tools to test the service, use names that may be passed as unadorned command line arguments; don't use spaces, '>' etc.
● Only expose one key B2BDT service per project. Only expose additional services for debug and troubleshooting purposes.

Data Splitting

There are a number of factors influencing whether source data should be split, how it can be split, or indeed whether a splitting strategy is necessary at all. First, consider when data may need to be split. For many systems, the fundamental characteristic to consider is the size of the inbound data. For many EAI platforms, files or blobs in excess of 10 MB can impose problems. For example, PowerCenter, Process Server and BizTalk impose limits on how much XML can be processed, depending on what operations are needed on the XML files (do they need to be parsed or are they just passed as files), the version of the platform software (64-bit vs. 32-bit) and other factors. A midrange B2BDT system can typically handle hundreds of MB of data in situations where other systems may only handle 10 MB. But there are additional considerations to take into account:

● Converting flat file or binary data can result in a 5x size increase for the resulting XML.
● Excel files larger than 10 MB can result in very large XML files, depending on the choice of document processor in B2BDT.
● B2BDT generates very large event files for file sources such as Excel files.

In general, files of less than 10 MB in size can be processed in B2BDT without problems and without splitting.

When the 64-bit version of B2BDT is used, a much greater volume of data can be handled without splitting. For example, existing solutions at one customer handle 1.6 GB of XML input data on a dual-processor machine with 16 GB of RAM (using x86-based 64-bit RHEL); average processing time was 20 minutes per file. 32-bit Windows environments are often limited to 3 GB of memory (2 GB available to applications), so this can limit what may be processed. For development environments, much less memory will be available to process the file (especially when event generation is turned on). It is common practice to use much smaller files as data samples when operating in the Studio environment, especially for files that require large amounts of memory to preprocess. For Excel files, sample files of 2 MB or less are recommended, depending on file contents. B2BDT provides a built-in streaming mechanism which supports splitting of files (although it does not support splitting of Excel files in the current release). Considerations for splitting using the streaming capabilities include:

● Is there a natural boundary to split on? For example, EDI functional groups, transactions and other constructs can be used to provide a natural splitting boundary. Batch files composed of multiple distinct files also provide natural splitting boundaries.
● In general, a file cannot be split if a custom document preprocessor is required for the file.

In some cases, disabling the event generation mechanism will alleviate the need for splitting.

How Will B2BDT Be Invoked at Run Time?

B2BDT supports a variety of mechanisms for invocation:

● Command line: command line tools are intended mainly for troubleshooting and testing. Use of command line tools does not span multiple CPU cores for transformations and always generates the event file in the current directory.
● HTTP (via CGI): supports exposing a B2BDT transformation via a web server.
● Web services: B2BDT services may be hosted in a J2EE-based web service environment. Service assets in progress will support hosting of B2BDT services as IIS-based web services.
● APIs (C++, C, Java, .Net): offer great flexibility. The calling program needs to organize parallel calls to B2BDT to optimize throughput.
● EAI agents: agents exist for BizTalk, webMethods and many other platforms.
● PowerCenter: through the use of UDO, B2BDT services may be included as a transformation within a PowerCenter workflow.

In addition, B2BDT supports two modes of activation: in-process and server operation.

In process:
● The B2BDT call runs in the process space of the caller.
● Can result in excessive initialization costs, as each call may incur initialization overhead, especially with a custom code client.
● A fault in a B2BDT service may result in a failure in the caller.
● In measurements for a custom BizTalk-based system (not via the standard agent), the initial call took 3 seconds and subsequent calls 0.1 second; but if the process is not kept alive, the initial 3-second hit was incurred multiple times.
● Not supported for some APIs.

Server:
● A B2BDT service call results in a call into another process.
● Slower overall communication, but can avoid the initial startup overhead, as the process possibly remains alive between invocations.
● In practice, web service invocation is sped up by the use of server invocation.
● No effect for Studio or command line invocation.
● Can allow a 64-bit process to activate a 32-bit B2BDT runtime or vice versa.

Patterns of Data Input and Output

There are a number of patterns of inputs and outputs commonly used in B2BDT transformations:

● Direct data: the data to be transformed is passed directly to the transformation and the output data is returned directly. Under this mechanism the output data format needs to allow for returning errors, or errors need to be returned through well-known error file locations or some other pre-agreed mechanism.
● Indirect via file: the transformation receives a string that designates a file to process, and the transformation reads the real data from that file. A slightly more complex version of this may include passing input, output and error file paths as semicolon-delimited strings or some similar mechanism.
● Indirect via digest or envelope file: the data passed to the transformation specifies a wide range of parameters as a single file, in a similar manner to a SOAP envelope. This digest file could contain many input file paths, output file paths, parameters to services, error handling arguments, performance characteristics, etc. The processing of the digest file becomes much more complex, but it is essential when many input files must be processed. It avoids much of the overhead of the host system having to load the data files into memory; however, transaction semantics offered by host systems cannot be utilized in these scenarios. This also offers a great means for implementing custom error handling strategies.
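As an illustration of the digest pattern, the sketch below builds a minimal envelope file. The element names and paths are invented for the example and would need to match whatever structure the transformation is written to parse.

import xml.etree.ElementTree as ET

# Hypothetical digest structure; element names are illustrative only.
request = ET.Element("TransformationRequest")
for path in (r"C:\data\in\batch1.txt", r"C:\data\in\batch2.txt"):
    ET.SubElement(request, "InputFile").text = path
ET.SubElement(request, "OutputDirectory").text = r"C:\data\out"
ET.SubElement(request, "ErrorFile").text = r"C:\data\errors\batch_errors.xml"

# Write the digest; the transformation would receive this file's path as its input.
ET.ElementTree(request).write("request.xml", encoding="utf-8", xml_declaration=True)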

Error Handling Strategies

B2BDT offers the following error handling features:

● B2BDT event log: a B2BDT-specific event generation mechanism where each event corresponds to an action taken by a transformation, such as recognizing a particular lexical sequence. It is useful in troubleshooting of work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event system offered by other B2BDT products and from the OS-based event system. Custom events can be generated within transformation scripts. Event-based failures are reported as exceptions or other errors in the calling environment.
● B2BDT trace files: trace files are controlled by the B2BDT configuration application. Automated strategies may be applied for recycling of trace files.
● Custom error information: at the simplest level, custom errors can be generated as B2BDT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include generation of custom error files, integration with OS event tracking mechanisms and integration with 3rd party management platform software. Integration with OS eventing or 3rd party platform software requires custom extensions to B2BDT.

Overall, the B2BDT event mechanism is the simplest to implement. But for large or high volume production systems, the event mechanism can create very large event files, and it offers no integration with popular enterprise software administration platforms. It is recommended that B2BDT events are used for troubleshooting purposes during development only. In some cases, performance constraints may determine the error handling strategy. For example, updating an external event system may cause performance bottlenecks, or producing a formatted error report can be time consuming. In some cases operator interaction may be required, which could potentially block a B2BDT transformation from completing. Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2BDT to avoid performance bottlenecks. When using custom error schemes, consider the following (see the sketch after this list):

● Multiple invocations of the same transformation may execute in parallel.
● Don't hardwire error file paths.
● Don't assume a single error output file.
● Avoid use of the B2BDT event log for production, especially when processing Excel files.
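A minimal sketch of those considerations in an external helper: each invocation derives its own error file path rather than hardwiring one, so parallel runs do not collide. The directory and naming scheme below are illustrative assumptions.

import uuid
from datetime import datetime
from pathlib import Path

ERROR_DIR = Path(r"C:\b2bdt\errors")       # illustrative location, not hardwired in the service

def new_error_file() -> Path:
    """Return a unique error file path for one transformation invocation."""
    ERROR_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return ERROR_DIR / f"errors_{stamp}_{uuid.uuid4().hex}.xml"

# Example: pass the generated path to the transformation as a service parameter
# instead of baking a single error file name into the transformation itself.
print(new_error_file())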

Effects of the API on event generation:

● CM_console: a service deployed with events will produce events; a service deployed without events will not produce events.
● Java API: the service runs without events. In case of error, the service is rerun with events.
● C# / .Net: same as Java.
● Agents: no events unless an error occurs, irrespective of how the service was deployed. The same behavior is used for PowerCenter.

Testing

A full treatment of testing B2BDT services is covered by a separate Best Practice. For simple cases, and as a first step in most B2BDT transformation development projects, the B2BDT development environment offers a number of features that can be used to verify the correctness of B2BDT transformations. Initial testing of many transformations can be accomplished using these features alone.

1. The B2BDT Studio environment provides visual feedback on which components of the input data are recognized by a parser. This can be viewed in the data browser window of a B2BDT project; the B2BDT Studio environment automatically marks up the first set of occurrences of patterns matched and literals found. Through the use of a simple menu option, all recognized occurrences of matched data can be marked up within the B2BDT Studio authoring environment.
2. The B2BDT Studio environment exposes a structured event log mechanism that allows developers to browse the flow of a transformation, which can be used to verify execution of the various components of a transformation.
3. The B2BDT Studio environment supports the specification of additional sources on which to perform a transformation, in order to verify the transformation's execution against a set of sample or test data inputs. This is accomplished inside the Studio design environment by simply setting the "sources to extract" property to point to the test data, either as specific files or as a directory search for data files matching a file pattern. The unit test can also be automated using the command line API. Results of executed transformations can be previewed in the Studio environment, along with events generated during the transformation.

In many production scenarios, the B2BDT transformation is called from an overall workflow process (EAI, ETL, MSMQ, etc.), and this integrated environment is what is typically reflected in a lab environment (Dev/Test/QA).

Deployment

Published B2BDT services are stored in a B2BDT repository, which is a designated file system location where the B2BDT runtime looks for services when requested to invoke a transformation service. This may be a shared file system location, such as a network share or SAN-based mechanism, facilitating the sharing of services between multiple production servers. A B2BDT project may be published within the B2BDT Studio environment to deploy a single B2BDT service to the B2BDT repository. A project can be used to deploy multiple B2BDT services by setting different options such as the transformation entry point (the same identical service can even be deployed under multiple B2BDT service names). At the simplest level, a B2BDT transformation may be deployed through one of two options.

Direct

The transformation deployment target directory is set using the CMConfiguration Editor. If the CM repository is set to a location such as a network share which is referenced by a production or QA environment, publishing the service will have the effect of making it available directly to the QA or production environment. Note: The refresh interval B2BDT configuration setting determines how often a runtime instance checks the file system for updated services.

Indirect

The B2BDT transformation deployment target directory is set (via the CM repository configuration setting) to a developer-specific directory. This directory is subsequently copied to the QA/Production environment using mechanisms outside of the B2BDT Studio environment (a simple copy, or source management environments such as CVS, SourceSafe, etc.). Staging environments may be employed where it is necessary to assemble multiple dependent services prior to deployment to a test environment. The section on source code control covered a number of strategies for deployment of services using version control. Other alternatives may include the use of custom scripts or setup creation tools (such as InstallShield).

Configuration Settings Affecting Deployment

The following configuration setting affects how soon a newly deployed service is detected: Service refresh interval.

Further Considerations for Deployment

More detailed descriptions of deployment scenarios will be provided in a separate Best Practice.

Last updated: 30-May-08 19:24


Testing B2B Data Transformation Services

Challenge

Establish a testing process that ensures support for team development of B2B Data Transformation (B2BDT) solutions, strategies for verification of scaling and performance requirements, testing for transformation correctness, and overall unit and system test procedures as business and development needs evolve.

Description

When testing B2B Data Transformation services, the goal to keep in mind throughout the process is the ability to test transformations for measurable correctness, performance and scalability. The testing process is broken into three main functions, which are addressed through the test variants. The testing process scenarios addressed in this document include finding bugs/defects, testing and ensuring functional compliance with desired specifications, and ensuring compliance with industry standards and certifications. The success of the testing process should be based on a standard of measurable milestones that provide an assessment of overall transformation completion.

Finding Defects

The first topic to address within the QA process is the ability to find defects within the transformation and to test them against specifications for compliance. This process has a number of options available; choose the method that best fulfills the testing requirements based upon time and resource constraints. In the testing process, the QA cycle refers to the ability to find, fix or defer errors and retest them until the error count reaches zero (or a specified target). To ensure compliance with defined specifications during the QA process, test basic functionality and ensure that outlying transformation cases behave as defined. For these types of tests, ensure that failure cases fail as expected, in addition to ensuring that the transformation succeeds as expected.

Ensuring Compliance

Another integral part of the testing process with B2B Data Transformations is the validation of transformations against industry standards such as HIPAA. In order to test standardized output there needs to be validation of well-formed inputs and outputs, such as HIPAA levels 1-6, and testing against a publicly available data set. An optimally tested solution can be ensured through the use of 3rd party verification software, validation support in the B2B Data Transformation libraries that verifies data compliance, or B2BDT transformations created in the course of a project specifically for test purposes.

Performance

Performance and stress testing are additional components of the testing methodology for B2BDT transformations. To effectively test performance, compare the effects of different configurations on the Informatica server. To achieve this, compare the effects of configuration parameters based on server and machine configurations. Based on data sizes and the complexity of transformations, optimize server configurations for best and worst case scenarios. One way to track benchmarking results is to create a reference spreadsheet that defines the amount of time needed for each source file to process through the transformation based upon file size.

Setting Measurable Milestones


In order to track the progress of testing transformations it is best to set milestones to gauge the overall efficiency of the development and QA processes. Best practices include tracking failure rates for different builds. This builds a picture of pass/failure rate over time which can be used to determine expected delays and to gauge achievements in development over time.

Testing Practices

The Basics

This section focuses on the initial testing of a B2B Data Transformation. For simple cases, and as a first step in most transformation development projects, the Studio development environment offers a number of features that can be used to verify the correctness of B2B Data Transformations. The initial testing of many transformations can be accomplished using these features alone. It is useful to create small sample data files that are representative of the actual data to ensure quick load times and responsiveness.

1. The B2B Data Transformation Studio environment provides visual feedback on which components of the input data are recognized by a parser. This can be viewed in the data browser window of a B2B Data Transformation project. The Studio environment will automatically mark up the first set of occurrences of patterns matched and literals found. Through the use of the "mark all" menu option or button, all recognized occurrences of matched data can be marked up within the Studio authoring environment. This provides a quick verification of correct operation. As shown in the figure below, the color coding indicates which data was matched.

2. The Studio environment exposes a structured event log mechanism that allows developers to browse the flow of a transformation which can then be used to verify the execution of various components of a transformation. Reviewing the event log after running the transformation often provides an indication for the error.


3. Viewing the results file provides a quick indication of which data was matched. By default it contains parsed XML data. Through the use of DumpValues statements and WriteValue statements in the transformation, the contents of the results files can be customized.

4. The Studio environment supports the specification of additional sources on which to perform a transformation, in order to verify the transformation's execution against a set of sample or test data inputs. This is accomplished inside the Studio Design environment by simply setting the "sources to extract" property to point to the test data, either as specific files or as a directory search for data files matching a file pattern. The unit test can also be automated using the command line API. Results of executed transformations can be previewed in the Studio environment, along with events generated during the transformation. When running through the initial test process, the Studio environment provides a basic indication of the overall integrity of the transformation. These tests allow for simple functional checks of whether the transformation failed and whether the correct output was produced. The events navigation pane provides a visual description of transformation processing. An illustration of the events view log within Studio is shown below.


In the navigation pane, blue flags depict warnings that can be tested for functional requirements whereas red flags indicate fatal errors. Event logs are available when running a transformation from the Studio environment. Once a service has been deployed (with event output turned on) event logs are written to the directory from which CM_Console is run (when testing a service with CM_Console). When invoking a service with other invocation mechanisms the following rules apply for event log generation.

Effects of API on Event Generation

● CM_console: a service deployed with events enabled will produce events; a service deployed without events enabled will not produce events.
● Java API: the service runs without events. In case of error, the service is rerun automatically with events enabled.
● C# / .Net: same as Java.
● Agents: no events unless an error occurs, irrespective of how the service was deployed. The same behavior is used for PowerCenter.

To view the error logs, use the Studio event pane to scan through the log for specific events. To view an external event file (usually named events.cme), drag and drop the file from Windows Explorer into the B2BDT Studio events pane. It is also possible to create a B2BDT transformation to look for specific information within the event file.

Other Troubleshooting Output

B2B Data Transformation services can be configured to produce trace files that can be examined for troubleshooting purposes. Trace file generation is controlled by the B2BDT configuration application, and automated strategies may be applied for the recycling of trace files. For other forms of troubleshooting output, the following options are available:

● Simple (non-dynamic) custom errors can be generated as B2BDT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted.
● A transformation could be used to keep track of errors in the implementation of the transformation and output these to a custom error file.
● Through the use of the external code integration APIs for Java, COM (and .Net), integration with OS event tracking mechanisms and with 3rd party management platform software is possible through custom actions and custom data transformations.

Other Test Methods

Additional checks that can be performed include comparison with well-known input and expected output, the use of validation tools and transformations, and the use of reverse transformations and spot checks to verify expected data subsets. The sections below describe how each of these testing options works, along with their overall strengths and weaknesses for the QA process.

Comparing Inputs and Outputs

For many transformations, comparing the output generated from known good input data with expected output data generated through other means provides a valuable mechanism for testing the correctness of a transformation. However, this process requires that adequate sample input data is available, as well as examples of output data for these inputs. While in some cases a simple binary comparison between the generated output and the correct output is sufficient, it may be necessary to use 3rd party tools to perform the comparison where the output is XML or where the order of the output can vary. Another test that is valid for some transformations is to check whether the output data contains a subset of the expected data; this is useful if only part of the expected output is known. Comparison techniques may need to ignore time and date stamp data in files unless they are expected to be the same in the output. If no comparison tools are available due to the complexity of the data, it is also possible to create a B2BDT service that performs the comparison and writes the results of the comparison to the results file or to a specific output file.


In the event that there is no sample data output available, one solution is to run well known good data through the transformation and create a set of baseline outputs. These should be verified for correctness either through manual examination or another method. This baseline output data can subsequently be used for comparison techniques and for the creation of further variations of expected output data. While this does not verify the correctness of the initial execution of the data transformation, the saved baseline output data can be used to verify that expected behavior has not been broken by maintenance changes. Tools that can be used for the comparison of inputs and outputs include 3rd party software applications such as KDiff3, an open source comparison tool. This application is good for the comparison of XML as well as text files (see an example in the figure below).
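Where a full third-party diff tool is not available, a simple scripted comparison against the saved baselines can serve for regression checks. The sketch below compares two text or XML files line by line while ignoring lines that carry timestamps; the timestamp pattern and file paths are assumptions to adapt to the actual output format.

import re
from pathlib import Path

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")  # assumed timestamp format

def significant_lines(path: Path):
    """Yield the lines that should participate in the comparison."""
    for line in path.read_text(encoding="utf-8").splitlines():
        if not TIMESTAMP.search(line):
            yield line.strip()

def outputs_match(actual: Path, baseline: Path) -> bool:
    return list(significant_lines(actual)) == list(significant_lines(baseline))

if __name__ == "__main__":
    print(outputs_match(Path("results/output.xml"), Path("baseline/output.xml")))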

Validation Transformations

For some types of data, validation software is available commercially or may already exist in an organization. In the absence of available commercial or in-house validation software, Informatica recommends creating B2BDT services that provide validation of the data. The developers assigned to create the validation transformations should be different from those who created the original transformations. A strict "no code sharing" rule should be enforced to ensure that the validation is not simply a copy of the transformation.

Reverse Transformations

Another option is to use reverse transformations, that is, a transformation that performs a reverse transformation on the output, which should recreate the input data. This can then be used as the basis for comparison techniques. Running the output data from B2B Data Transformations through an independently created reverse transformation is optimal; the reason for the independent creation of the reverse transformation is that an auto-generated reverse transformation has a tendency to propagate the same bugs. Partial or full comparisons of the input against the output of the reverse transformation can be performed using this strategy. While this allows for testing of functional compliance, the downside is the high time cost required to fully implement the reverse transformations.

Spot Checking

In some cases it may not be feasible to perform a full comparison test on outputs. Creating a set of spot check transformations provides some measure of quality assurance. The basic concept is that one or more transformations are created that perform spot checks on the output data using B2BDT services. As new issues arise in QA, enhance the spot checks to detect the new problems and to look for common mistakes in the output; as time progresses, a library of checks is built up. Programmatic checks can be embedded within the transformation itself, such as inserting actions to self-test output using the AddEventAction feature. If the B2B Data Transformation service is being called through an API, exceptions within the calling code can be checked for as well; this is a subset of spot checking which can assist within the testing process. An error tracking layer can also be applied to the XML output, and through the use of programmatic checks all errors associated with the transformation can be written to the output XML. The figure below illustrates how to embed programmatic checks within the transformation.

In the example above, flags are set and error codes are assigned to the specific XML error fields that were defined earlier in the XML Schema definition. In the event that the "ensure condition" fails, the error flags are set and reported to the output XML stream.
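When an error-tracking layer is written into the output XML as described above, an external spot check can scan the output for populated error fields. The element names below are hypothetical placeholders and would need to match the fields defined in the actual schema.

import xml.etree.ElementTree as ET
from pathlib import Path

def collect_errors(output_file: Path):
    """Return the text of any populated error elements in the output XML."""
    tree = ET.parse(output_file)
    # "ErrorCode" / "ErrorFlag" are placeholders for the fields defined in the schema.
    return [elem.text for elem in tree.iter()
            if elem.tag in ("ErrorCode", "ErrorFlag") and (elem.text or "").strip()]

errors = collect_errors(Path("results/output.xml"))
if errors:
    print(f"Spot check failed: {len(errors)} error field(s) populated: {errors}")
else:
    print("Spot check passed: no error fields populated.")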

Unit Testing

The concept behind unit testing is to avoid using a traditional QA cycle to find many basic defects in the transformation, in order to reduce the cost in time and effort. Unit tests are sets of small tests that are run by the developer of a transformation before signing off on code changes. Unit tests should optimally be created and maintained by the developer and should be used for regression control and functionality testing. Unit tests are often used with a test-first development methodology. It is important to note that unit tests are not a replacement for full QA processes, but provide a way for developers to quickly verify that functionality has not been broken by changes. Unit tests may be programmatic or manual tests, although implementing unit tests as a programmatic set of tests necessitates running the unit test cases after every change.
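A minimal programmatic unit test can be driven through the command line interface and a set of saved baselines. The exact CM_console arguments are an assumption here (consult the product documentation for the real syntax); the sample and baseline directory layout and service name are also illustrative.

import subprocess
from pathlib import Path

SERVICE = "MyParserService"                  # hypothetical deployed service name
SAMPLES = Path("unit_tests/samples")
BASELINES = Path("unit_tests/baselines")

def run_service(sample: Path) -> str:
    # Assumption: CM_console is invoked with the service name and an input file
    # and writes the result to stdout; adjust to the documented syntax.
    result = subprocess.run(["CM_console", SERVICE, str(sample)],
                            capture_output=True, text=True, check=True)
    return result.stdout

failures = 0
for sample in sorted(SAMPLES.glob("*.txt")):
    expected = (BASELINES / (sample.stem + ".xml")).read_text(encoding="utf-8")
    if run_service(sample) != expected:
        failures += 1
        print(f"FAIL: {sample.name}")
print(f"{failures} failure(s)")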

Testing Transformations Integrated with PowerCenter

When testing B2B Data Transformations used with PowerCenter, it is best to initially test the transformation using the aforementioned test processes before utilizing the transformation within the mapping. However, using B2B Data Transformations with PowerCenter has its advantages, as data output within a PowerCenter mapping can be visualized as it comes out of each transformation during the debugging process. When using a combination of PowerCenter with B2B Data Transformations, write the output to a flat file to allow for quick spot check testing.

Design Practices to Facilitate Testing

Use of Indirect Pattern for Parameters

One way to facilitate the testing of B2B Data Transformations is through the use of an indirect pattern for parameters. This is similar to referencing the source input in a parameter file for testing purposes. In this instance, the input to the transformation service is a request file in a location specified by the host. The request file can indicate where to read the input, where to place the output, and where to report the status of executing transformations. This can be done through an XML file input which can be managed by the local administrator, and it can reduce the host environment footprint. Staging areas for inputs and outputs can be created which provide a way to easily track completed transformations. During the mapping process, the request file is processed to determine the actual data to be mapped along with the target locations, etc. When these have been read, control is passed to the transformation which performs the actual mapping. The figures below demonstrate this strategy.


In the mapper illustrated above, the main service input and output data takes the form of references (provided as individual service parameters or combined into a single XML block) which refer to the real input and output data, located by paths to specific files and/or collections of files designated by a path to an accessible directory. Alternatively, a collection of files may be referred to using a sequence of individual paths; however, the latter approach limits the parallel operation of parts of the transformation.

Last updated: 30-May-08 23:55


Configuring Security

Challenge

Configuring a PowerCenter security scheme to prevent unauthorized access to folders, sources and targets, design objects, run-time objects, global objects, security administration, domain administration, tools access, and data in order to ensure system integrity and data confidentiality.

Description

Security is an often overlooked area within the Informatica ETL domain. However, without paying close attention to domain security, one ignores a crucial component of ETL code management. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite to security design. Implement security with the goals of easy maintenance and scalability. When establishing domain security, keep it simple; although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:

● Create users and groups
● Define access requirements
● Grant privileges, roles and permissions

Before implementing security measures, ask and answer the following questions:

● Who will administer the domain?
● How many projects need to be administered? Will the administrator be able to manage security for all PowerCenter projects or just a select few?
● How many environments will be supported in the domain?
● Who needs access to the domain objects (e.g., repository service, reporting service, etc.)? What do they need the ability to do?
● How will the metadata be organized in the repository? How many folders will be required?
● Where can we limit repository service privileges by granting folder permissions instead?
● Who will need Administrator or Super User-type access?

After you evaluate the needs of the users, you can create appropriate user groups and assign repository service privileges and folder permissions. In most implementations, the administrator takes care of maintaining the repository. Limit the number of administrator accounts for PowerCenter. While this concept is important in a development/unit test environment, it is critical for protecting the production environment.

Domain Repository Overview

All of the PowerCenter Advanced Edition applications are centrally administered through the administration console, and the settings are stored in the domain repository. User and group information, permissions, and role definitions for domain objects are managed through the administration console and are stored in the domain repository.


Although privileges and roles are assigned to users and groups centrally from the administration console, they are also stored in each application repository. The domain periodically synchronizes this information (when an assignment is made) to each application repository. Object permissions for individual applications are also managed and stored within each application repository.

PowerCenter Repository Security Overview

A security system needs to properly control access to all sources, targets, mappings, reusable transformations, tasks, and workflows in both the test and production repositories. A successful security model needs to support all groups in the project lifecycle and also consider the repository structure. Informatica offers multiple layers of security, which enables you to customize the security within your data warehouse environment.

Metadata-level security controls access to PowerCenter repositories, which contain objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group of users and the access permissions granted on each folder. Some privileges do not apply by folder, as they are granted by privilege alone (i.e., repository-level tasks).

Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to the repository is handled by the PowerCenter Repository Service over a TCP/IP connection. The database account and password are specified at installation and during the configuration of the Repository Service. Developers do not need to know this database account and password; they should only use their individual repository user ids and passwords. This information should be restricted to the administrator.

Other forms of security available in PowerCenter include permissions for connections. Connections include database, FTP, and external loader connections. These permissions are useful when you want to limit access to schemas in a relational database and can be set up in the Workflow Manager when source and target connections are defined.

Occasionally, you may want to restrict changes to source and target definitions in the repository. A common way to approach this is to use shared folders owned by an Administrator or Super User. Granting read access to developers on these folders allows them to create read-only copies in their work folders.

PowerCenter Security Architecture

The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including access to the repository, Repository Service, Integration Service, and the command-line utilities pmrep and pmcmd. As shown in the diagram, the Repository Service is the central component for repository metadata security. It sits between the PowerCenter repository and all client applications, including GUI tools, command line tools, and the Integration Service. Each application must be authenticated against metadata stored in several tables within the repository. Each Repository Service manages a single repository database where all security data is stored as part of its metadata; this is a second layer of security. Only the Repository Service has access to this database; it authenticates all client applications against this metadata.


Repository Service Security

Connection to the PowerCenter repository database is one level of security. The Repository Service uses native drivers to communicate with the repository database. PowerCenter Client tools and the Integration Service communicate with the Repository Service over TCP/IP. When a client application connects to the repository, it connects directly to the Repository Service process. You can configure a Repository Service to run on multiple machines, or nodes, in the domain. Each instance running on a node is called a Repository Service process. This process accesses the database tables and performs most repository-related tasks.

When the Repository Service is installed, the database connection information is entered for the metadata repository. At this time you need to know the database user id and password to access the metadata repository. The database user id must be able to read and write to all tables in the database. As a developer creates, modifies, and executes mappings and sessions, this information continuously updates the metadata in the repository. Actual database security should be controlled by the DBA responsible for that database, in conjunction with the PowerCenter Repository Administrator. After the Repository Service is installed and started, all subsequent client connectivity is automatic; the database id and password are transparent at this point.

Integration Service Security

Like the Repository Service, the Integration Service communicates with the metadata repository when it executes workflows or when users are using the Workflow Monitor. During configuration of the Integration Service, the repository database is identified with the appropriate user id and password. Connectivity to the repository is made using native drivers supplied by Informatica. Certain permissions are also required to use the pmrep and pmcmd command line utilities.

Encrypting Repository Passwords

You can encrypt passwords and create an environment variable to use with pmcmd and pmrep. For example, you can encrypt the repository and database passwords for pmrep to maintain security when using pmrep in scripts, and store the encrypted password in an environment variable. Use the following steps as a guideline to use an encrypted password as an environment variable:

1. Use the command line program pmpasswd to encrypt the repository password.
2. Configure the password environment variable to hold the encrypted value.

To configure a password as an environment variable on UNIX:

1. At the command line, type: pmpasswd <repository password>
   pmpasswd returns the encrypted password.
2. In a UNIX C shell environment, type: setenv <variable name> <encrypted password>
   In a UNIX Bourne shell environment, type: <variable name>=<encrypted password>; export <variable name>
   You can assign the environment variable any valid UNIX name.

To configure a password as an environment variable on Windows:

1. At the command line, type: pmpasswd <repository password>
   pmpasswd returns the encrypted password.
2. Enter the password environment variable in the Variable field. Enter the encrypted password in the Value field.
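As an illustration, the sequence below shows the UNIX case end to end. This is a sketch, not a complete script: the variable name PMREP_PASSWD and the sample password are arbitrary, and the encrypted value printed by pmpasswd must be pasted in by hand.

# Encrypt the plain-text repository password; pmpasswd prints the encrypted string.
pmpasswd MyRepoPassword

# Bourne shell: store the encrypted string in an environment variable of your choosing.
PMREP_PASSWD=<encrypted value printed by pmpasswd>
export PMREP_PASSWD

# C shell equivalent:
#   setenv PMREP_PASSWD <encrypted value printed by pmpasswd>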

Setting the Repository User Name

For pmcmd and pmrep, you can create an environment variable to store the repository user name.

To configure a user name as an environment variable on UNIX:

1. In a UNIX C shell environment, type: setenv <variable name> <user name>
2. In a UNIX Bourne shell environment, type: <variable name>=<user name>; export <variable name>
   You can assign the environment variable any valid UNIX name.

To configure a user name as an environment variable on Windows:

1. Enter the user name environment variable in the Variable field.
2. Enter the repository user name in the Value field.
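To show how the two variables are typically consumed, the sketch below starts a workflow with pmcmd, referencing the user-name and encrypted-password variables instead of clear-text credentials. The service, domain, folder, workflow, and variable names are illustrative, and the option flags should be verified against the pmcmd reference for your PowerCenter version.

# Store the repository user name (the variable name is arbitrary).
PMREP_USER=etl_batch_user
export PMREP_USER

# -uv names the user-name variable, -pv names the encrypted-password variable.
pmcmd startworkflow -sv INT_SVC_DEV -d Domain_Dev -uv PMREP_USER -pv PMREP_PASSWD -f MARKETING_DEV wf_load_customers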

Connection Object Permissions

Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or users for all types of connection objects. This controls who can create, view, change, and execute workflow tasks that use those specific connections, providing another level of security for these global repository objects. Users with 'Use Workflow Manager' permission can create and modify connection objects.

Connection objects allow the PowerCenter server to read and write to source and target databases. Any database the server can access requires a connection definition. As shown below, connection information is stored in the repository. Users executing workflows need execution permission on all connections used by the workflow. The PowerCenter server looks up the connection information in the repository and verifies permission for the required action. If permissions are properly granted, the server reads and writes to the defined databases, as specified by the workflow.


Users

Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the PowerCenter domain or its services should have a unique user account; Informatica does not recommend creating shared accounts. Each domain user needs a user name and password, provided by the Informatica Administrator, to access the domain. Users are created and managed through the administration console. Users should change their passwords from the default immediately after receiving the initial user id from the Administrator.

When you create a PowerCenter repository, the repository automatically creates two default repository users within the domain:

● Administrator - The default password for Administrator is Administrator.
● Database user - The username and password used when you created the repository.

These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor can their group affiliation be changed. To administer repository users, you must have one of the following privileges:

● Administer Repository
● Super User

LDAP (Lightweight Directory Access Protocol)

In addition to default domain user authentication, LDAP can be used to authenticate users. With LDAP authentication, the domain maintains an association between the domain user and the external login name. When a user logs into the domain services, the security module authenticates the user name and password against the external directory. The domain maintains a status for each user; users can be enabled or disabled by modifying this status.

Prior to implementing LDAP, the administrator must know:

● The domain username and password
● An administrator or superuser user name and password for the domain
● An external login name and password

To configure LDAP, follow these steps:

1. Edit ldap_authen.xml and modify the following attributes:
   ● NAME – the .dll that implements the authentication
   ● OSTYPE – the host operating system
2. Register ldap_authen.xml in the Domain Administration Console.
3. In the Domain Administration Console, configure the authentication module.

Privileges

The following categories of privileges have been defined. Depending on the category, each privilege controls various actions for a particular object type. The categories are:

● Folders -- Create, Copy, Manage Versions
● Sources & Targets -- Edit, Create and Delete, Manage Versions
● Design Objects -- Edit, Create and Delete, Manage Versions
● Run-time Objects -- Edit, Create and Delete, Manage Versions, Monitor, Manage Execution
● Global Objects (Queries, Labels, Connections, Deployment Groups) -- Create
● Security Administration -- Manage, Grant Privileges and Permissions
● Domain Administration (Nodes, Grids, Services) -- Execute, Manage, Manage Execution
● Tools Access -- Designer, Workflow Manager, Workflow Monitor, Administration Console, Repository Manager

Assigning Privileges

A user must have permissions to grant privileges and roles (as well as administration console privileges in the domain) in order to assign privileges. The user must also have permission for the service to which the privileges apply. Only a user who has permissions to the domain can assign privileges in the domain. For PowerCenter, only a user who has permissions to the repository service can assign privileges for that repository service. For Metadata Manager and Data Analyzer, only a user who has permissions to the corresponding metadata or reporting service can assign privileges in that application.

Privileges are assigned per repository or application instance. For example, you can assign a user create, edit, and delete privileges for run-time and design objects in a development repository but not in the production repository.

Roles

A user needs privileges to manage users, groups and roles (and administration console privileges in the domain) in order to define custom roles. Once roles are defined, they can be assigned to users or groups for specific services. Just like privileges, roles are assigned per repository or application instance. For example, the developer role (with its associated privileges) can be assigned to a user in the development repository only, but not in the test or production repository.

A user must have permissions to grant privileges and roles (as well as administration console privileges in the domain) in order to assign roles. The user must also have permission for the services to which the roles are to be applied. Only a user who has permissions to the domain can assign roles in the domain. For PowerCenter, only a user who has permissions to the repository service can assign roles for that repository service. For Metadata Manager and Data Analyzer, only a user who has permissions to the corresponding metadata or reporting service can assign roles in that application.

Domain Administrator Role

The domain administrator role is essentially a super-user role, not only for the domain itself but also for all of the services and applications in the domain. This role has permissions to all objects in the domain (including the domain itself) and all available privileges in the domain. As a result, it includes the privileges to manage users, groups and roles, as well as to grant privileges and roles. Because of these privileges and permissions on all objects in the domain, this role can grant itself the administrator role on all services and therefore become the super-user for all services in the domain. The domain administrator role also has implicit privileges that include:




● Configuring a node as a gateway node
● Creating, editing, and deleting the domain
● Configuring SMTP
● Configuring service levels in the domain
● Shutting down the domain
● Receiving domain alerts
● Exporting and truncating domain logs
● Configuring restart of service processes

Audit Trails

You can track changes to repository users, groups, privileges, and permissions by selecting the SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter Administration Console. When you enable the audit trail, the Repository Service logs security changes to the Repository Service log. The audit trail logs the following operations:

● Changing the owner, owner's group, or permissions for a folder.
● Changing the password of another user.
● Adding or removing a user.
● Adding or removing a group.
● Adding or removing users from a group.
● Changing global object permissions.
● Adding or removing user and group privileges.

Sample Security Implementation

The following steps provide an example of how to establish users, groups, permissions, and privileges in your environment. Again, the requirements of your projects and production systems should dictate how security is established.

1. Identify users and the environments they will support (e.g., Development, UAT, QA, Production, Production Support, etc.).
2. Identify the PowerCenter repositories in your environment (this may be similar to the basic groups listed in Step 1; for example, Development, UAT, QA, Production, etc.).
3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter repository.
5. Assign users to groups.
6. Define privileges for each group.

The following table provides an example of groups and privileges that may exist in the PowerCenter repository. This example assumes one PowerCenter project with three environments co-existing in one PowerCenter repository.

GROUP NAME | FOLDER | FOLDER PERMISSIONS | PRIVILEGES
ADMINISTRATORS | All | All | Super User (all privileges)
DEVELOPERS | Individual development folder; integrated development folder | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
DEVELOPERS | UAT | Read | Use Designer, Browse Repository, Use Workflow Manager
UAT | UAT working folder | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
UAT | Production | Read | Use Designer, Browse Repository, Use Workflow Manager
OPERATIONS | Production | Read, Execute | Browse Repository, Workflow Operator
PRODUCTION SUPPORT | Production maintenance folders | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
PRODUCTION SUPPORT | Production | Read | Browse Repository

Informatica PowerCenter Security Administration

As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual is responsible for a number of tasks in the Informatica environment, including security. To summarize, the security-related tasks an administrator is responsible for are:

● Creating user accounts.
● Defining and creating groups.
● Defining and granting permissions.
● Defining and granting privileges and roles.
● Enforcing changes in passwords.
● Controlling requests for changes in privileges.
● Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
● Working with the operations group to ensure tight security in the production environment.

Summary of Recommendations

When implementing your security model, keep in mind the recommendations outlined above: keep the configuration simple, limit the number of administrator accounts, grant access through groups, roles, and folder permissions rather than broad individual privileges, and protect the production environment most strictly.

Last updated: 04-Jun-08 15:34


Data Analyzer Security

Challenge

Using Data Analyzer's sophisticated security architecture to establish a robust security system that safeguards valuable business information across a range of technologies and security models, and ensuring that Data Analyzer security provides appropriate mechanisms to support and augment the security infrastructure of a Business Intelligence environment at every level.

Description

Four main architectural layers must be completely secure: the user layer, transmission layer, application layer and data layer. Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the following LDAP-compliant directory servers:

● SunOne/iPlanet Directory Server 4.1
● Sun Java System Directory Server 5.2
● Novell eDirectory Server 8.7
● IBM SecureWay Directory 3.2
● IBM SecureWay Directory 4.1
● IBM Tivoli Directory Server 5.2
● Microsoft Active Directory 2000
● Microsoft Active Directory 2003

In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing authentication and access control for the various web applications in the organization.

Transmission Layer

Data transmission must be protected against interception and tampering. Data Analyzer supports the standard security protocol Secure Sockets Layer (SSL) to provide a secure environment.

Application Layer

Only appropriate application functionality should be provided to users with associated privileges. Data Analyzer provides three basic types of application-level security:

● Report, Folder and Dashboard Security. Restricts access for users or groups to specific reports, folders, and/or dashboards.
● Column-level Security. Restricts users and groups to particular metric and attribute columns.
● Row-level Security. Restricts users to specific attribute values within an attribute column of a table.

Components for Managing Application Layer Security

Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data Analyzer provides the following components for managing application layer security:

● Roles. A role can consist of one or more privileges. You can use system roles or create custom roles. You can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.
● Groups. A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to the groups. You can also assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.
● Users. A user has a user name and password. Each person accessing Data Analyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or assign the user to a group with predefined roles.

Types of Roles

● System roles - Data Analyzer provides a set of roles when the repository is created. Each role has sets of privileges assigned to it.
● Custom roles - The end user can create and assign privileges to these roles.

Managing Groups

Groups allow you to classify users according to a particular function. You may organize users into groups based on their departments or management level. When you assign roles to a group, you grant the same privileges to all members of the group. When you change the roles assigned to a group, all users in the group inherit the changes. If a user belongs to more than one group, the user has the privileges from all groups.

To organize related users into related groups, you can create group hierarchies. With hierarchical groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you edit a group, all subgroups contained within it inherit the changes. For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer role privileges.

Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the object.


Preventing Data Analyzer from Updating Group Information

If you use Windows Domain or LDAP authentication, you typically maintain the users and groups in the Windows Domain or LDAP directory service. However, some organizations keep only user accounts in the Windows Domain or LDAP directory service and set up groups in Data Analyzer to organize the Data Analyzer users. Data Analyzer provides a way for you to keep user accounts in the authentication server and still keep the groups in Data Analyzer.

Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory service, it updates the users and groups in the repository and deletes users and groups that are not found in the Windows Domain or LDAP directory service. To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.

The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR file, use the EAR Repackager utility provided with Data Analyzer.

Note: Be sure to back up the web.xml file before you modify it.

To prevent Data Analyzer from updating group information in the repository:

1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the following directory: /custom/properties
2. Open the web.xml file with a text editor and locate the line containing the enableGroupSynchronization property (its full name is InfSchedulerStartup.com.informatica.ias.scheduler.enableGroupSynchronization). This property determines whether Data Analyzer updates the groups in the repository.
3. To prevent Data Analyzer from updating group information in the Data Analyzer repository, change the value of the enableGroupSynchronization property to false. When the value of the enableGroupSynchronization property is false, Data Analyzer does not synchronize the groups in the repository with the groups in the Windows Domain or LDAP directory service.
4. Save the web.xml file and add it back to the Data Analyzer EAR file.
5. Restart Data Analyzer.

When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows Domain or LDAP authentication server. You must create and manage groups, and assign users to groups, in Data Analyzer.

Managing Users

Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a user must have the appropriate privileges. You can assign privileges to a user with roles or groups.

Data Analyzer creates a System Administrator user account when you create the repository. The default user name for the System Administrator user account is admin. The system daemon, ias_scheduler/padaemon, runs the updates for all time-based schedules. System daemons must have a unique user name and password in order to perform Data Analyzer system functions and tasks. You can change the password for a system daemon, but you cannot change the system daemon user name via the GUI. Data Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to system daemons or assign them to groups.

To change the password for a system daemon, complete the following steps:

1. Change the password in the Administration tab in Data Analyzer.
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.

Access LDAP Directory Contacts


To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings page. After you set up the connection to the LDAP directory service, users can email reports and shared documents to LDAP directory contacts. When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property. In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished name entries define the type of information that is stored in the LDAP directory. If you do not know the value for BaseDN, contact your LDAP system administrator.

Customizing User Access

You can customize Data Analyzer user access with the following security options:

● Access permissions. Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.
● Data restrictions. Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.
● Password restrictions. Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.

When you create an object in the repository, every user has default read and write permissions for that object. By customizing access permissions for an object, you determine which users and/or groups can read, write, delete, or change access permissions for that object. When you set data restrictions, you determine which users and groups can view particular attribute values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to that user.

Types of Access Permissions

Access permissions determine the tasks that you can perform for a specific repository object. When you set access permissions, you determine which users and groups have access to the folders and repository objects. You can assign the following types of access permissions to repository objects:

● Read. Allows you to view a folder or object.
● Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
● Delete. Allows you to delete a folder or an object from the repository.
● Change permission. Allows you to change the access permissions on a folder or object.

By default, Data Analyzer grants read and write access permissions to every user in the repository. You can use the General Permissions area to modify default access permissions for an object, or turn off default access permissions.


Data Restrictions

You can restrict access to data based on the values of related attributes. Data restrictions are set to keep sensitive data from appearing in reports. For example, you may want to restrict data related to the performance of a new store from outside vendors. You can set a data restriction that excludes the store ID from their reports. You can set data restrictions using one of the following methods:

● Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, and real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.
● Set data restrictions for one user at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups.

Types of Data Restrictions

You can set two kinds of data restrictions:

● Inclusive. Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an “IN 2001” rule.
● Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a “NOT IN 2001” rule.

Restricting Data Access by User or Group

You can edit a user or group profile to restrict the data the user or group can access in reports. When you edit a user profile, you can set data restrictions for any schema in the repository, including operational schemas and fact tables. You can set a data restriction to limit user or group access to data in a single schema based on the attributes you select. If the attributes apply to more than one schema in the repository, you can also restrict the user or group access from related data across all schemas in the repository. For example, you may have a Sales fact table and a Salary fact table. Both tables use the Region attribute. You can set one data restriction that applies to both the Sales and Salary fact tables based on the region you select.

To set data restrictions for a user or group, you need one of the following roles or privileges:

● System Administrator role
● Access Management privilege

When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the data restrictions for the report owner. However, if the reports have consumer-based security, the Data Analyzer Server creates a separate report for each unique security profile.


The following information applies only to the steps required to change the admin user on WebLogic.

To change the Data Analyzer system administrator username on WebLogic 8.1 (DA 8.1):

● Repository authentication. You must use the Update System Accounts utility to change the system administrator account name in the repository.
● LDAP or Windows Domain authentication. Set up the new system administrator account in the Windows Domain or LDAP directory service. Then use the Update System Accounts utility to change the system administrator account name in the repository.

To change the Data Analyzer default users from admin and ias_scheduler/padaemon:

1. Back up the repository.
2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib
3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class.
4. Extract the file InfChangeSystemUserNames.class into a temporary directory (for example, d:\temp).
5. This extracts the file as d:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class.
6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\:

REM To change the system user name and password
REM *******************************************
REM Change the BEA home here
REM ************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM Change the DB information here and also
REM the user -Dias_scheduler and -Dadmin to values of your choice
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
REM END OF BATCH FILE


7. Make changes in the batch file as directed in the remarks (REM lines).
8. Save the file, open a command prompt window, and navigate to D:\Temp\Repository Utils\Refresh\.
9. At the prompt, type change_sys_user.bat and press Enter. The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively.
10. Modify web.xml and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing ias_scheduler with pa_scheduler.
11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml. This file is in the iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\. To edit the file, make a copy of the iasEjb.jar:

mkdir \tmp
cd \tmp
jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
cd META-INF
(update META-INF\weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler)
cd \
jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .

Note: There is a trailing period at the end of the jar uvf command above.

12. Restart the server.

Last updated: 04-Jun-08 15:51


Database Sizing

Challenge

Database sizing involves estimating the types and sizes of the components of a data architecture. This is important for determining the optimal configuration for the database servers in order to support the operational workloads. Individuals involved in a sizing exercise may be data architects, database administrators, and/or business analysts.

Description

The first step in database sizing is to review system requirements to define such things as:

● Expected data architecture elements (will there be staging areas? operational data stores? centralized data warehouse and/or master data? data marts?). Each additional database element requires more space. This is even more true in situations where data is being replicated across multiple systems, such as a data warehouse maintaining an operational data store as well. The same data in the ODS will be present in the warehouse as well, albeit in a different format.
● Expected source data volume. It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target system can be calculated by following the data flows from the source to the target. For example, say a sales order table is being built by denormalizing a source table. The source table holds sales data for 12 months in a single row (one column for each month). Each row in the source translates to 12 rows in the target, so a source table with one million rows ends up as a 12 million row table.
● Data granularity and periodicity. Granularity refers to the lowest level of information that is going to be stored in a fact table. Granularity affects the size of a database to a great extent, especially for aggregate tables. The level at which a table has been aggregated increases or decreases a table's row count. For example, a sales order fact table's size is likely to be greatly affected by whether the table is being aggregated at a monthly level or at a quarterly level. The granularity of fact tables is determined by the dimensions linked to that table; the number of dimensions connected to the fact table affects its granularity and hence its size.
● Load frequency and method (full refresh? incremental updates?). Load frequency affects the space requirements for the staging areas. A load plan that updates a target less frequently is likely to load more data at one go, so more space is required by the staging areas. A full refresh requires more space for the same reason.
● Estimated growth rates over time and retained history.

Determining Growth Projections

One way to estimate projections of data growth over time is to use scenario analysis. As an example, for scenario analysis of a sales tracking data mart you can use the number of sales transactions to be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are expected; this equates to 10 million fact-table records. Next, use the sales growth forecasts for the upcoming years for database growth calculations. That is, an annual sales growth rate of 10 percent translates into 11 million fact-table records for the next year. At the end of five years, the fact table is likely to contain about 60 million records. You may want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very helpful.
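The arithmetic behind such a projection is simple enough to script. The sketch below is illustrative only (it is not part of any Informatica product); it reproduces the 10-million-row, 10-percent scenario and can be rerun with 0.05 or 0.20 for the best- and worst-case projections.

awk 'BEGIN {
  rows = 10000000; rate = 0.10; total = 0          # year-1 volume and annual growth rate
  for (year = 1; year <= 5; year++) {
    total += rows                                  # cumulative fact-table rows
    printf "Year %d: %.0f new rows, %.0f cumulative\n", year, rows, total
    rows *= (1 + rate)                             # grow the following year's volume
  }
}'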

Oracle Table Space Prediction Model

Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature can be useful in predicting table space requirements. Oracle incorporates a table space prediction model in the database engine that provides projected statistics for space used by a table. The following Oracle 10g query returns projected space usage statistics:

SELECT *
FROM TABLE(DBMS_SPACE.object_growth_trend('schema','tablename','TABLE'))
ORDER BY timepoint;

The results of this query are shown below:

TIMEPOINT                     SPACE_USAGE  SPACE_ALLOC  QUALITY
----------------------------  -----------  -----------  ------------
11-APR-04 02.55.14.116000 PM         6372        65536  INTERPOLATED
12-APR-04 02.55.14.116000 PM         6372        65536  INTERPOLATED
13-APR-04 02.55.14.116000 PM         6372        65536  INTERPOLATED
13-MAY-04 02.55.14.116000 PM         6372        65536  PROJECTED
14-MAY-04 02.55.14.116000 PM         6372        65536  PROJECTED
15-MAY-04 02.55.14.116000 PM         6372        65536  PROJECTED
16-MAY-04 02.55.14.116000 PM         6372        65536  PROJECTED

The QUALITY column indicates the quality of the output as follows:








● GOOD - The data for the timepoint relates to data within the AWR repository with a timestamp within 10 percent of the interval.
● INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was based on data gathered before and after the timepoint.
● PROJECTED - The timepoint is in the future, so the data is estimated based on previous growth statistics.

Baseline Volumetric

Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache. Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Various database products use different storage methods for data types, so be sure to use the database manuals to determine the size of each data type. Add up the field sizes to determine row size, then multiply the row size by the projected number of rows to estimate table size. The default estimate for index size is to assume the same size as the table. Also estimate the temporary space for sort operations. For data warehouse applications where summarizations are common, plan on large temporary spaces; the temporary space can be as much as 1.5 times larger than the largest table in the database.

Another approach that is sometimes useful is to load the data architecture with representative data and determine the resulting database sizes. This test load can be a fraction of the actual data and is used only to gather basic sizing statistics. You then need to apply growth projections to these statistics. For example, after loading ten thousand sample records to the fact table, you determine the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60 million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.
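The extrapolation from a sample load can also be scripted. The sketch below simply generalizes the worked example above; the sample size, projected row count, index-equals-table assumption, and 1.5x temporary-space rule of thumb are taken from the text, and everything else is illustrative.

SAMPLE_ROWS=10000          # rows loaded in the representative test load
SAMPLE_MB=10               # measured fact-table size after the test load, in MB
PROJECTED_ROWS=60000000    # expected rows after five years (from the scenario analysis)

awk -v sr="$SAMPLE_ROWS" -v smb="$SAMPLE_MB" -v pr="$PROJECTED_ROWS" 'BEGIN {
  table_mb = smb * (pr / sr)      # projected fact-table size
  index_mb = table_mb             # default assumption: indexes roughly equal table size
  temp_mb  = 1.5 * table_mb       # sort/temp space up to 1.5x the largest table
  printf "fact table %.0f MB, indexes %.0f MB, temp space %.0f MB\n", table_mb, index_mb, temp_mb
}'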

Guesstimating

When there is not enough information to calculate an estimate as described above, use educated guesses and “rules of thumb” to develop as reasonable an estimate as possible.

● If you don’t have the source data model, use what you do know of the source data to estimate average field size and average number of fields in a row to determine table size. Based on your understanding of transaction volume over time, determine your growth metrics for each type of data and calculate your source data volume (SDV) from table size and growth metrics.
● If your target data architecture is not completed so that you can determine table sizes, base your estimates on multiples of the SDV:
   ❍ If it includes staging areas: add another SDV for any source subject area that you will stage, multiplied by the number of loads you’ll retain in staging.
   ❍ If you intend to consolidate data into an operational data store, add the SDV multiplied by the number of loads to be retained in the ODS for historical purposes (e.g., keeping one year’s worth of monthly loads = 12 x SDV).
   ❍ Data warehouse architectures are based on the periodicity and granularity of the warehouse; this may be another SDV + (.3n x SDV, where n = number of time periods loaded in the warehouse over time).
   ❍ If your data architecture includes aggregates, add a percentage of the warehouse volumetrics based on how much of the warehouse data will be aggregated and to what level (e.g., if the rollup level represents 10 percent of the dimensions at the detail level, use 10 percent).
   ❍ Similarly, for data marts add a percentage of the data warehouse based on how much of the warehouse data is moved into the data mart.
   ❍ Be sure to consider the growth projections over time and the history to be retained in all of your calculations.

And finally, remember that there is always much more data than you expect so you may want to add a reasonable fudge-factor to the calculations for a margin of safety.

Last updated: 19-Jul-07 14:14


Deployment Groups

Challenge

Selectively migrating objects from one repository folder to another requires a versatile and flexible mechanism that can overcome such limitations as confinement to a single source folder.

Description

Regulations such as Sarbanes-Oxley (SOX) and HIPAA require tracking, monitoring, and reporting of changes in information technology systems. Automating change control processes using deployment groups and pmrep commands provides organizations with a means to comply with regulations for configuration management of software artifacts in a PowerCenter repository.

Deployment groups are containers that hold references to objects that need to be migrated. This includes objects such as mappings, mapplets, reusable transformations, sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the repository folders). Deployment groups are faster and more flexible than folder moves for incremental changes. In addition, they allow for migration “rollbacks” if necessary. Migrating a deployment group involves moving objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. When copying a deployment group, individual objects to be copied can be selected as opposed to the entire contents of a folder.

There are two types of deployment groups - static and dynamic:

● Static deployment groups contain direct references to the versions of objects that need to be moved. Users explicitly add the version of the object to be migrated to the deployment group. If the set of deployment objects is not expected to change between deployments, static deployment groups can be created.
● Dynamic deployment groups contain a query that is executed at the time of deployment. The results of the query (i.e., object versions in the repository) are then selected and copied to the deployment group. If the set of deployment objects is expected to change frequently between deployments, dynamic deployment groups should be used.


Dynamic deployment groups are generated from a query. While any available criteria can be used, it is advisable to have developers use labels to simplify the query. For more information, refer to the “Strategies for Labels” section of Using PowerCenter Labels. When generating a query for deployment groups with mappings and mapplets that contain non-reusable objects, in addition to specific selection criteria, a query condition should be used. The query must include a condition for Is Reusable with a qualifier of either Reusable or Non-Reusable. Without this qualifier, the deployment may encounter errors if there are non-reusable objects held within the mapping or mapplet.

A deployment group exists in a specific repository. It can be used to move items to any other accessible repository/folder. A deployment group maintains a history of all migrations it has performed. It tracks which versions of objects were moved from which folders in which source repositories, and into which folders in which target repositories those versions were copied (i.e., it provides a complete audit trail of all migrations performed). Given that the deployment group knows what it moved and to where, an administrator can, if necessary, have the deployment group “undo” the most recent deployment, reverting the target repository to its pre-deployment state. Using labels (as described in the Using PowerCenter Labels Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment.

It is important to note that the deployment group only migrates the objects it contains to the target repository/folder. It does not, itself, move to the target repository; it still resides in the source repository.

Deploying via the GUI

Migrations can be performed via the GUI or the command line (pmrep). To migrate objects via the GUI, simply drag a deployment group from the repository it resides in onto the target repository where the referenced objects are to be moved. The Deployment Wizard appears and steps the user through the deployment process. Once the wizard is complete, the migration occurs, and the deployment history is created.

Deploying via the Command Line

Alternatively, the PowerCenter pmrep command can be used to automate both folder-level deployments (e.g., in a non-versioned repository) and deployments using deployment groups. The pmrep commands DeployFolder and DeployDeploymentGroup are used respectively for these purposes. Whereas deployment via the GUI requires stepping through a wizard and answering a series of questions, command-line deployment requires an XML control file that contains the same information that the wizard requests. This file must be present before the deployment is executed.

The following steps can be used to create a script to wrap pmrep commands and automate PowerCenter deployments (a minimal example follows the list):

1. Use pmrep ListObjects to return the object metadata to be parsed in another pmrep command.
2. Use pmrep CreateDeploymentGroup to create a dynamic or static deployment group.
3. Use pmrep ExecuteQuery to output the results to a persistent input file. This input file can also be used for the AddToDeploymentGroup command.
4. Use DeployDeploymentGroup to copy a deployment group to a different repository. A control file with all the specifications is required for this command.

Additionally, a web interface can be built for entering/approving/rejecting code migration requests. This can provide additional traceability and reporting capabilities to the automation of PowerCenter code migrations.
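A minimal wrapper script along these lines is sketched below. The repository, domain, deployment group, query, and file names are placeholders, and the exact option letters vary between PowerCenter releases, so they should be checked against the pmrep command reference before use.

# Connect to the source repository (credentials can also come from environment variables).
pmrep connect -r REP_DEV -d Domain_Dev -n deploy_user -x DeployPassword

# Create a static deployment group to hold the objects to be migrated.
pmrep createdeploymentgroup -p DG_RELEASE_1_0

# Run a saved object query, persist the result set to a file,
# then feed that file to AddToDeploymentGroup.
pmrep executequery -q qry_release_1_0 -u release_1_0_objects.txt
pmrep addtodeploymentgroup -p DG_RELEASE_1_0 -i release_1_0_objects.txt

# Copy the deployment group to the target repository; the XML control file answers
# the same questions as the Copy Deployment Group wizard (folder mappings, conflict rules).
pmrep deploydeploymentgroup -p DG_RELEASE_1_0 -c deploy_control.xml -r REP_QA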

Considerations for Deployment and Deployment Groups

Simultaneous Multi-Phase Projects

If multiple phases of a project are being developed simultaneously in separate folders, it is possible to consolidate them by mapping folders appropriately through the deployment group migration wizard. When migrating with deployment groups in this way, the override buttons in the migration wizard are used to select specific folder mappings.

Rolling Back a Deployment

Deployment groups help to ensure that there is a back-out methodology and that the latest version of a deployment can be rolled back. To do this, in the target repository (where the objects were migrated to), go to: Versioning >> Deployment >> History >> View History >> Rollback. The rollback purges all objects (of the latest version) that were in the deployment group. Initiate a rollback on a deployment in order to roll back only the latest versions of the objects. The rollback ensures that the check-in time for the repository objects is the same as the deploy time. The pmrep command RollBackDeployment can also be used to automate rollbacks. Remember that you cannot roll back part of a deployment; you must roll back all the objects in the deployment group.

Managing Repository Size

As objects are checked in and deployed to target repositories, the number of object versions in those repositories increases, as does the size of the repositories. In order to manage repository size, use a combination of Check-in Date and Latest Status (both are query parameters) to purge the desired versions from the repository and retain only the very latest version. All deleted versions of objects should also be purged to reduce the size of the repository. If it is necessary to keep more than the latest version, labels can be included in the query. These labels are ones that have been applied to the repository for the specific purpose of identifying objects for purging.

Off-Shore, On-Shore Migration

When migrating from an off-shore development environment to an on-shore environment, other aspects of the computing environment may make it desirable to generate a dynamic deployment group. Instead of migrating the group itself to the next repository, a query can be used to select the objects for migration and save them to a single XML file, which can then be transmitted to the on-shore environment through alternative methods. If the on-shore repository is versioned, importing the file activates the import wizard as if a deployment group were being received.

Code Migration from Versioned Repository to a Non-Versioned Repository

In some instances, it may be desirable to migrate objects from a versioned repository to a non-versioned repository. Note that migrating in this manner changes the wizards used, and the export from the versioned repository must take place using XML export.

Last updated: 27-May-08 13:20


Migration Procedures - PowerCenter

Challenge

Develop a migration strategy that ensures clean migration between development, test, quality assurance (QA), and production environments, thereby protecting the integrity of each of these environments as the system evolves.

Description

Ensuring that an application has a smooth migration process between development, QA, and production environments is essential for its deployment. Deciding which migration strategy works best for a project depends on two primary factors:

● How is the PowerCenter repository environment designed? Are there individual repositories for development, QA, and production, or are there just one or two environments that share one or all of these phases?
● How has the folder architecture been defined?

Each of these factors plays a role in determining the migration procedure that is most beneficial to the project. PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter migration options include repository migration, folder migration, object migration, and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which provides the capability to migrate any combination of objects within the repository with a single command. This Best Practice is intended to help the development team decide which technique is most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected. Each section describes the major advantages of its use, as well as its disadvantages.

Repository Environments

The following section outlines the migration procedures for standalone and distributed repository environments. The distributed environment section touches on several migration architectures, outlining the pros and cons of each. Also, please note that any methods described in the Standalone section may also be used in a Distributed environment.

Standalone Repository Environment

In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata store. Separate folders are used to represent the development, QA, and production workspaces and segregate work. This type of architecture within a single repository ensures seamless migration from development to QA, and from QA to production. The following example shows a typical architecture.

In this example, the company has chosen to create separate development folders for each of the individual developers for development and unit test purposes. A single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.


Proposed Migration Process – Single Repository

DEV to TEST – Object Level Migration

Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings to test, and then eventually to production. After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder. This can be done using one of two methods:

● The first, and most common, method is object migration via an object copy. In this case, a user opens the SHARED_MARKETING_TEST folder and drags the object from SHARED_MARKETING_DEV into the appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.
● The second approach is object migration via object XML import/export. A user can export each of the objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into SHARED_MARKETING_TEST via XML import. With XML import/export, the XML files can be uploaded to a third-party versioning tool, if the organization has standardized on such a tool; otherwise, versioning can be enabled in PowerCenter. Migration with versioned PowerCenter repositories is covered later in this document. (A scripted equivalent of this export/import using pmrep is sketched below.)

After you've copied all common or shared objects, the next step is to copy the individual mappings from each development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration methods described above to copy the mappings to the folder, although the XML import/export method is the most intuitive method for resolving shared object conflicts. However, the migration method is slightly different here when you're copying the mappings, because you must ensure that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST folder. Designer prompts the user to choose the correct shortcut folder created in the previous example, which points to SHARED_MARKETING_TEST. You can then continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later versions, you can export multiple objects into a single XML file, and then import them at the same time.

The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a Workflow Copy Wizard to guide you through the process. The following steps outline the full process for successfully copying a workflow and all of its associated tasks.

1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, the default name is used. Then click "Next" to continue the copy process.
2. The Wizard then checks whether each task in the workflow already exists in the destination folder. If a task is present, you can rename or replace the current one; if it does not exist, the default name is used. Then click "Next."

3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow. Select the mapping and continue by clicking “Next".

4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for the source and target. If no connections exist, the default settings are used. When this step is completed, click "Finish" and save the work.

Initial Migration – New Folders Created

The initial move to production is very different from subsequent changes to mappings and workflows. Since the repository only contains folders for development and test, we need to create two new folders to house the production-ready objects. Create these folders after testing of the objects in SHARED_MARKETING_TEST and MARKETING_TEST has been approved. The following steps outline the creation of the production folders and, at the same time, address the initial test-to-production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.

4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced options. In this example, we'll use the advanced options.

5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that appears on this screen is the folder name followed by the date. In this case, enter the name as “SHARED_MARKETING_PROD.”

6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you are transporting the folder, you won’t need to select anything.

7. The final screen begins the actual copy process. Click "Finish" when the process is complete.

Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the original to copy, and associate the shared objects with the SHARED_MARKETING_PROD folder that you just created. At the end of the migration, you should have two additional folders in the repository environment for production: SHARED_MARKETING_PROD and MARKETING_PROD. These folders contain the initially migrated objects. Before you can actually run the workflows in these production folders, you need to modify the session source and target connections to point to the production environment.

When you copy or replace a PowerCenter repository folder, the Copy Wizard copies the permissions for the folder owner to the target folder. The wizard does not copy permissions for users, groups, or all others in the repository to the target folder. Previously, the Copy Wizard copied the permissions for the folder owner, owner’s group, and all users in the repository to the target folder.

Incremental Migration – Object Copy Example Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the folder.

Any time an object is modified, it must be re-tested and migrated into production for the actual change to occur. These types of changes in production take place on a case-by-case or periodically-scheduled basis. The following steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you to choose whether to Rename or Replace the object. Choose the option to Replace the object.

3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are making are what you intend. See below for an example of the mapping compare window.

4. After the object has been successfully copied, save the folder so the changes can take place.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or workflow so it can update itself with the changes.

Standalone Repository Example

In this example, we look at moving development work to QA and then from QA to production, using multiple development folders for each developer, with the test and production folders divided by the data mart they represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to move objects and mappings from each individual folder to the test folder and then how to move tasks, worklets, and workflows to the new area. Follow these steps to copy a mapping from Development to QA:

1. If using shortcuts, first follow these sub-steps; if not using shortcuts, skip to step 2:
❍ Copy the tested objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
❍ Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST. Save your changes.

2. Copy the mapping from Development into Test.
❍ In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.
❍ When copying each mapping, Designer prompts you to Replace, Rename, or Reuse the object, or to Skip each reusable object, such as source and target definitions. Choose Reuse for all shared objects in the mappings copied into the MARKETING_TEST folder.



Save your changes.

3. If a reusable session task is being used, follow these sub-steps; otherwise, skip to step 4.
❍ In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and drop each reusable session from the developers’ folders into the MARKETING_TEST folder. A Copy Session Wizard guides you through the copying process.
❍ Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment. Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace within the Target tab to ensure that the load options are correct. Save your changes.

4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.
❍ Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new folder.
❍ As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare conflicts from within Workflow Manager to ensure that the correct migrations are being made. Save your changes.

5. Implement the appropriate security.
❍ In Development, the owner of the folders should be a user(s) in the development group.
❍ In Test, change the owner of the test folder to a user(s) in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the production folders.

Rules to Configure Folder and Global Object Permissions

Rules in 8.5: The folder or global object owner or a user assigned the Administrator role for the Repository Service can grant folder and global object permissions.
Rules in previous versions: Users with the appropriate repository privileges could grant folder and global object permissions.

Rules in 8.5: Permissions can be granted to users, groups, and all others in the repository.
Rules in previous versions: Permissions could be granted to the owner, owner’s group, and all others in the repository.

Rules in 8.5: The folder or global object owner and a user assigned the Administrator role for the Repository Service have all permissions, which you cannot change.
Rules in previous versions: You could change the permissions for the folder or global object owner.

Disadvantages of a Single Repository Environment

The biggest disadvantage, or challenge, with a single repository environment is the migration of repository objects with respect to database connections. When migrating objects from Dev to Test to Prod, you cannot reuse the same database connections, because those connections still point to the development or test environment. A single repository structure can also create confusion, as the same users and groups exist in all environments, and the number of folders can increase exponentially.

Distributed Repository Environment

A distributed repository environment maintains separate, independent repositories, hardware, and software for development, test, and production environments. Separating repository environments is preferable for handling development to production migrations. Because the environments are segregated from one another, work performed in development cannot impact QA or production. With a fully distributed approach, separate repositories function much like the separate folders in a standalone environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our Marketing example we would have three repositories: INFADEV, INFATEST, and INFAPROD. In the following example, we discuss a distributed repository architecture. There are four techniques for migrating from development to production in a distributed repository architecture, each involving some advantages and disadvantages:

● Repository Copy
● Folder Copy
● Object Copy
● Deployment Groups

Repository Copy

So far, this document has covered object-level migrations and folder migrations through drag-and-drop object copying and object XML import/export. This section discusses migrations in a distributed repository environment through repository copies. The main advantages of this approach are:

● The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformations, etc.) at once from one environment to another.
● The ability to automate this process using pmrep commands, thereby eliminating many of the manual processes that users typically perform.
● The ability to move everything without breaking or corrupting any of the objects.

This approach also involves a few disadvantages:

● The first is that everything is moved at once (which is also an advantage). The problem is that everything is moved -- ready or not. For example, we may have 50 mappings in QA, but only 40 of them are production-ready. The 10 untested mappings are moved into production along with the 40 production-ready mappings, which leads to the second disadvantage.
● Significant maintenance is required to remove any unwanted or excess objects. There is also a need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything must be set up correctly before the actual production runs can take place.
● Lastly, the repository copy process requires that the existing Production repository be deleted before the Test repository can be copied. This results in a loss of production environment operational metadata such as load statuses and session run times. High-performance organizations leverage operational metadata to track trends over time related to load success/failure and duration, and this metadata can be a competitive advantage for organizations that use it to plan for future growth.

Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository Copy method:

● Copying the Repository
● Repository Backup and Restore
● PMREP

Copying the Repository

Copying the Test repository to Production through the GUI client tools is the easiest of all the migration methods. First, ensure that all users are logged out of the destination repository and then connect to the PowerCenter Repository Administration Console (as shown below).

If the Production repository already exists, you must delete the repository before you can copy the Test repository. Before you can delete the repository, you must run the Repository Service in exclusive mode.

1. Click on the INFA_PROD repository service in the left pane to select it, and change the running mode to exclusive by clicking the Edit button in the right pane under the Properties tab.

2. Delete the Production repository by selecting it and choosing “Delete” from the context menu.

3. Click on the Action drop-down list and choose "Copy Contents From".

4. In the new window, choose the domain name, repository service “INFA_TEST” from the drop-down menu. Enter the username and password of the Test repository.

5. Click OK to begin the copy process.
6. When you've successfully copied the repository to the new location, exit from the PowerCenter Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log in with a valid username and password.
8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the MARKETING_TEST folder to MARKETING_PROD, and the SHARED_MARKETING_TEST folder to SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before beginning the actual testing process.
10. When this cleanup is finished, you can log into the repository through the Workflow Manager. Modify the server information and all connections so they point to the new Production locations for all existing tasks and workflows.

Repository Backup and Restore

Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the repository to a binary file that can be restored to any new location. This method is preferable to the repository copy process because, if any error occurs, the repository is still preserved in the binary backup file on the repository server.

From 8.5 onwards, security information is maintained at the domain level. Before you back up a repository and restore it in a different domain, verify that users and groups with privileges for the source Repository Service exist in the target domain. The Service Manager periodically synchronizes the list of users and groups in the repository with the users and groups in the domain configuration database. During synchronization, users and groups that do not exist in the target domain are deleted from the repository. You can use infacmd to export users and groups from the source domain and import them into the target domain: use infacmd ExportUsersAndGroups to export the users and groups to a file, and infacmd ImportUsersAndGroups to import them into a different PowerCenter domain (a command sketch follows the numbered steps below).

The following steps outline the process of backing up and restoring the repository for migration.

1. Launch the PowerCenter Administration Console, and highlight the INFA_TEST repository service. Select Action -> Backup Contents from the drop-down menu.

2. A screen appears and prompts you to supply a name for the backup file as well as the Administrator username and password. The file is saved to the Backup directory within the repository server’s home directory.

3. After you've selected the location and file name, click OK to begin the backup process.
4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (i.e., the Production repository).
5. The system will prompt you to supply a username, password, and the name of the file to be restored. Enter the appropriate information and click OK. When the restoration process is complete, you must repeat the steps listed in the copy repository option to delete all of the unused objects and rename the folders.
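For the cross-domain scenario described at the start of this section, the user and group export/import can also be scripted with infacmd. The following is a minimal sketch only; the domain names, credentials, and export file are placeholders, and the exact option letters vary between PowerCenter versions, so confirm them with infacmd help before relying on this:

REM Hypothetical example: export users and groups from the source domain...
infacmd ExportUsersAndGroups -dn Domain_Dev -un Administrator -pd Adminpwd -f c:\backup\users_and_groups.xml
REM ...then import them into the target domain
infacmd ImportUsersAndGroups -dn Domain_Prod -un Administrator -pd Adminpwd -f c:\backup\users_and_groups.xml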

PMREP

Using the pmrep commands is essentially the same as the Backup and Restore Repository method, except that it is run from the command line rather than through the GUI client tools. pmrep is installed in the PowerCenter Client and PowerCenter Services bin directories. pmrep utilities can be used from the Informatica server or from any client machine connected to the server. Refer to the Repository Manager Guide for a list of pmrep commands.

pmrep Backup backs up the repository to the file specified with the -o option. You must provide the backup file name. Use this command when the repository is running; you must be connected to the repository to use it. The Backup command uses the following syntax:

backup -o <output_file_name> [-d <description>] [-f (overwrite existing output file)] [-b (skip workflow and session logs)] [-j (skip deploy group history)] [-q (skip MX data)] [-v (skip task statistics)]

The following is a sample of the command syntax used within a Windows batch file to connect to and back up a repository. Using this code example as a model, you can write scripts to be run on a daily basis to perform functions
such as connect, backup, restore, etc.:

backupproduction.bat

REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
@echo off
echo Connecting to Production repository...
"\Server\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
"\Server\bin\pmrep" backup -o c:\backup\Production_backup.rep

Alternatively, the following steps can be used:
1. Use infacmd commands to run the Repository Service in exclusive mode.
2. Use the pmrep backup command to back up the source repository.
3. Use the pmrep delete command to delete the content of the target repository (if content already exists in the target repository).
4. Use the pmrep restore command to restore the backup file into the target repository.
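The backup script above covers only the source side. As a hedged sketch of the alternative steps on the target side (the repository names, host, port, and credentials are placeholders, and pmrep option letters differ slightly between versions, so verify each command with pmrep help), such a script might look like this:

restoreproduction.bat

REM Hypothetical batch file: removes existing Production content and restores the Test backup into it
@echo off
echo Connecting to Production repository service...
"\Server\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Deleting existing Production repository content...
"\Server\bin\pmrep" delete -f
echo Restoring backup file into Production...
"\Server\bin\pmrep" restore -u Administrator -p Adminpwd -i c:\backup\Production_backup.rep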

Post-Repository Migration Cleanup

After you have used one of the repository migration procedures to migrate into Production, follow these steps to convert the repository to Production:

1. Disable workflows that are not ready for Production, or simply delete the unneeded mappings, tasks, and workflows.
❍ Disable the workflows not being used by opening the workflow properties in the Workflow Manager and checking the Disabled checkbox under the General tab.
❍ Delete the tasks not being used in the Workflow Manager, and the unused mappings in the Designer.

2. Modify the database connection strings to point to the production sources and targets.
❍ In the Workflow Manager, select Relational connections from the Connections menu.
❍ Edit each relational connection by changing the connect string to point to the production sources and targets. If you are using lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.

3. Modify the pre- and post-session commands and SQL as necessary.
❍ In the Workflow Manager, open the session task properties, and from the Components tab make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:
❍ In Development, ensure that the owner of the folders is a user in the development group.
❍ In Test, change the owner of the test folders to a user in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the Production folders.

Folder Copy

Although deployment groups are becoming a very popular migration method, the folder copy method has historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly promote all of the objects located within that folder. All source and target objects, reusable transformations, mapplets, mappings, tasks, worklets, and workflows are promoted at once. Because of this, however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is copied. The three advantages of using the folder copy method are:

● The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the objects located within it.
● If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships are automatically converted to point to this newly copied common or shared folder.
● All connections, sequences, mapping variables, and workflow variables are copied automatically.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least utilized. Remember that a locked repository means that no jobs can be launched during this process. This can be a serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first example uses three separate repositories for development, test, and production.

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
● Open the Repository Manager client tool.
● Connect to both the Development and Test repositories.
● Highlight the folder to copy and drag it to the Test repository.
● The Copy Folder Wizard appears to step you through the copy process.
● When the folder copy process is complete, open the newly copied folder in both the Repository Manager and Designer to ensure that the objects were copied properly.

2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:
● Open the Repository Manager client tool.
● Connect to both the Development and Test repositories.
● Highlight the folder to copy and drag it to the Test repository. The Copy Folder Wizard will appear.

3. Follow these steps to ensure that all shortcuts are reconnected:
● Use the advanced options when copying the folder across.
● Select Next to use the default name of the folder.

4. If the folder already exists in the destination repository, choose to replace the folder.

The following screen appears to prompt you to select the folder where the new shortcuts are located.

In a situation where the folder names do not match, a folder compare will take place. The Copy Folder Wizard then completes the folder copy process. Rename the folder as appropriate and implement the security.

5. When testing is complete, repeat the steps above to migrate to the Production repository. When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to the appropriate target location. Ensure that all tasks were updated correctly and that folder and repository security is modified for test and production.

Object Copy

Copying mappings into the next stage in a networked environment involves many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the earlier description of Object Copy for the standalone environment. One advantage of Object Copy in a distributed environment is that it provides more granular control over objects. Two distinct disadvantages of Object Copy in a distributed environment are:

● Much more work is required to deploy an entire group of objects.
● Shortcuts must exist prior to importing/copying mappings.

Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:
● In each of the distributed repositories, create a common folder with the exact same name and case.
● Copy the shortcuts into the common folder in Production, making sure each shortcut has the exact same name.

2. Copy the mapping from the Test environment into Production.
● In the Designer, connect to both the Test and Production repositories and open the appropriate folders in each.
● Drag-and-drop the mapping from Test into Production. During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this mapping to an existing copy of the mapping already in Production. Note that the ability to compare objects is not limited to mappings, but is available for all repository objects including workflows, sessions, and tasks.

3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping (first ensure that the mapping exists in the current repository).
● If copying the workflow, follow the Copy Wizard.
● If creating the workflow, add a session task that points to the mapping and enter all the appropriate information.

4. Implement appropriate security.
● In Development, ensure the owner of the folders is a user in the development group.
● In Test, change the owner of the test folders to a user in the test group.
● In Production, change the owner of the folders to a user in the production group.
● Revoke all rights to Public other than Read for the Production folders.

Deployment Groups

For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as you would in an object copy migration, but you also have the convenience of a repository- or folder-level migration because all objects are deployed at once. There are no restrictions on the objects included in a deployment group; they can come from one or multiple folders. For added convenience, you can set up a dynamic deployment group, which allows the objects in the deployment group to be defined by a repository query rather than being added manually. Lastly, because deployment groups are available on versioned repositories, a deployment can be rolled back, reverting to the previous versions of the objects, when necessary.

Advantages of Using Deployment Groups

● Backup and restore of the Repository needs to be performed only once.
● Copying a Folder replaces the previous copy.
● Copying a Mapping allows for different names to be used for the same object.

Uses for Deployment Groups

❍ Deployment Groups are containers that hold references to objects that need to be migrated.
❍ Allows for version-based object migration.
❍ Faster and more flexible than folder moves for incremental changes.
❍ Allows for migration “rollbacks”.
❍ Allows specifying individual objects to copy, rather than the entire contents of a folder.

Types of Deployment Groups

● Static
❍ Contain direct references to versions of objects that need to be moved.
❍ Users explicitly add the version of the object to be migrated to the deployment group.
● Dynamic
❍ Contain a query that is executed at the time of deployment.
❍ The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository.

Pre-Requisites Create required folders in the Target Repository

Creating Labels

A label is a versioning object that you can associate with any versioned object or group of versioned objects in a repository.

● Advantages
❍ Tracks versioned objects during development.
❍ Improves query results.
❍ Associates groups of objects for deployment.
❍ Associates groups of objects for import and export.
● Create label
❍ Create labels through the Repository Manager.
❍ After creating the labels, go to edit mode and lock them.
❍ The "Lock" option is used to prevent other users from editing or applying the label.
❍ This option can be enabled only when the label is edited.
❍ Some standard label examples are:
■ Development
■ Deploy_Test
■ Test
■ Deploy_Production
■ Production
● Apply Label
❍ Create a query to identify the objects that need to be labeled.
❍ Run the query and apply the labels.

Note: By default, the latest version of the object gets labeled.
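Label creation and application can also be scripted with pmrep, which is convenient when the same label must be applied at the end of every development cycle. The following is a hedged sketch; the repository, label, folder, and object names are placeholders, and the option letters should be verified against pmrep help for your version:

REM Hypothetical example: create a label and apply it to a mapping and its child objects
pmrep connect -r INFADEV -d Domain_Dev -n Administrator -x Adminpwd
pmrep createlabel -a Deploy_Test
pmrep applylabel -a Deploy_Test -n m_load_customers -o mapping -f MARKETING_DEV -p children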

Queries

A query is an object used to search for versioned objects in the repository that meet specific conditions.

● Advantages
❍ Tracks objects during development.
❍ Associates a query with a deployment group.
❍ Finds deleted objects you want to recover.
❍ Finds groups of invalidated objects you want to validate.
● Create a query
❍ The Query Browser allows you to create, edit, run, or delete object queries.
● Execute a query
❍ Execute through the Query Browser, or
❍ Execute through the pmrep ExecuteQuery command, which uses the following syntax:

ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose
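Using the syntax above, a concrete run might look like the following (the repository, domain, and query names are placeholders for illustration):

pmrep connect -r INFADEV -d Domain_Dev -n Administrator -x Adminpwd
pmrep executequery -q Deploy_Production_Query -t shared -u query_results.txt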

Creating a Deployment Group

Follow these steps to create a deployment group:

1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on “Deployment Groups” and choose “New Group.”

3. In the dialog window, give the deployment group a name, and choose whether it should be static or dynamic. In this example, we are creating a static deployment group. Click OK.
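For teams that prefer to script this step, pmrep can create the deployment group as well. This is a sketch only; the repository, group, and query names are placeholders, and the exact options should be checked against pmrep help for your version:

pmrep connect -r INFADEV -d Domain_Dev -n Administrator -x Adminpwd
REM Create a static deployment group
pmrep createdeploymentgroup -p MARKETING_RELEASE_DG -t static
REM Or create a dynamic deployment group tied to an existing shared query
pmrep createdeploymentgroup -p MARKETING_RELEASE_DG_DYN -t dynamic -q RELEASE_QUERY -u shared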

Adding Objects to a Static Deployment Group

Follow these steps to add objects to a static deployment group:

1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to the deployment group and choose “Versioning” -> “View History.” The “View History” window appears.

2. In the “View History” window, right-click the object and choose “Add to Deployment Group.”
3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to, and click OK.

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want to add dependent objects to the deployment group so that they will be migrated as well. Click OK.

NOTE: The “All Dependencies” option should be used for any new code that is migrating forward. However, this option can cause issues when moving existing code forward because “All Dependencies” also flags shortcuts. During the deployment, PowerCenter tries to re-insert or replace the shortcuts; this does not work and causes the deployment to fail.

The object is added to the deployment group at this point. Although the deployment group allows the most flexibility, the task of adding each object to the deployment group is similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter provides the capability to create dynamic deployment groups.

Adding Objects to a Dynamic Deployment Group

Dynamic deployment groups are similar in function to static deployment groups, but differ in the way that objects are added. In a static deployment group, objects are manually added one by one. In a dynamic deployment group, the contents of the deployment group are defined by a repository query. Don't worry about the complexity of writing a repository query; it is quite simple and aided by the PowerCenter GUI interface. Follow these steps to add objects to a dynamic deployment group:

1. First, create a deployment group, just as you did for a static deployment group, but in this case, choose the dynamic option. Also, select the “Queries” button.

2. The “Query Browser” window appears. Choose “New” to create a query for the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata categories. In this case, the developers have assigned the “RELEASE_20050130” label to all objects that need to be migrated, so the query is defined as “Label Is Equal To ‘RELEASE_20050130’”. The creation and application of labels are discussed in Using PowerCenter Labels.

4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the Deployment Group editor window.

Executing a Deployment Group Migration

A Deployment Group migration can be executed through the Repository Manager client tool, or through the pmrep command line utility. With the client tool, you simply drag the deployment group from the source repository and drop it on the destination repository. This opens the Copy Deployment Group Wizard, which guides you through the step-by-step options for executing the deployment group.

Rolling Back a Deployment

To roll back a deployment, you must first locate the deployment via the TARGET repository's menu bar (i.e., Deployments -> History -> View History -> Rollback).

Automated Deployments

For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which can execute a deployment group migration without human intervention. This is ideal since the deployment group allows ultimate flexibility and convenience, and the script can be scheduled to run overnight, thereby causing minimal impact on developers and the PowerCenter administrator. You can also use the pmrep utility to automate importing objects via XML.
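As a hedged illustration of such a script (the repository, domain, deployment group, and control file names are placeholders, the deployment control file must be prepared separately according to the DTD shipped with the client, and the option letters should be verified with pmrep help):

deploy_marketing.bat

REM Hypothetical script: deploys a deployment group from the Test repository to Production
@echo off
pmrep connect -r INFATEST -d Domain_Test -n Administrator -x Adminpwd
pmrep deploydeploymentgroup -p MARKETING_RELEASE_DG -c c:\deploy\deploy_control.xml -r INFAPROD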

Recommendations

Informatica recommends using the following process when running in a three-tiered environment with development, test, and production servers.
Non-Versioned Repositories For migrating from development into test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It also ensures that the latest development mappings can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps listed in the Object Copy section.

Versioned Repositories For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in a distributed repository environment. This method provides the greatest flexibility in that you can promote any object from within a development repository (even across folders) into any destination repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group migration method results in automated migrations that can be executed without manual intervention.

Third-Party Versioning Some organizations have standardized on third-party version control software. PowerCenter’s XML import/export functionality offers integration with such software and provides a means to migrate objects. This method is most useful in a distributed environment because objects can be exported into an XML file from one repository and imported into the destination repository. The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets, reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7 and later versions, the export/import functionality allows the export/import of multiple objects to a single XML file. This can significantly cut down on the work associated with object level XML import/export. The following steps outline the process of exporting the objects from source repository and importing them into the destination repository:

Exporting

1. From Designer or Workflow Manager, log in to the source repository. Open the folder and highlight the object to be exported.
2. Select Repository -> Export Objects.
3. The system prompts you to select a directory location on the local workstation. Choose the directory to save the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the C:\Program Files\Informatica PowerCenter 7.x\Client directory. (This may vary depending on where you installed the client tools and on the product version.)
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software.

Importing

Open Designer or the Workflow Manager client tool and log in to the destination repository. Open the folder where the object is to be imported.

1. Select Repository -> Import Objects.
2. The system prompts you to select a directory location and file to import into the repository.
3. The following screen appears with the steps for importing the object.

4. Select the mapping and add it to the Objects to Import list.

5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping will now point to the new shortcuts and their parent folder.
6. It is important to note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and later versions, allowing the activities associated with XML import/export to be automated through pmrep.
7. Click on the destination repository service in the left pane and choose the Action drop-down list -> "Restore." (Remember, if the destination repository has content, it has to be deleted prior to restoring.)
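As a hedged sketch of that pmrep automation (object, folder, and file names are placeholders; objectimport also expects an import control file, which is not shown here and must be built separately; confirm the options with pmrep help for your version):

REM Export a mapping from the Test repository to XML
pmrep connect -r INFATEST -d Domain_Test -n Administrator -x Adminpwd
pmrep objectexport -n m_load_customers -o mapping -f MARKETING_TEST -u c:\exports\m_load_customers.xml

REM Import the XML into the Production repository using a prepared control file
pmrep connect -r INFAPROD -d Domain_Prod -n Administrator -x Adminpwd
pmrep objectimport -i c:\exports\m_load_customers.xml -c c:\exports\import_control.xml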

Last updated: 04-Jun-08 16:18

Migration Procedures - PowerExchange

Challenge

To facilitate the migration of PowerExchange definitions from one environment to another.

Description

There are two approaches to perform a migration:

● Using the DTLURDMO utility
● Using the PowerExchange Client tool (Detail Navigator)

DTLURDMO Utility

Step 1: Validate connectivity between the client and listeners

● Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<node_name>.
● Run selected jobs to exercise data access through PowerExchange data maps.
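For example, assuming node1 is the production listener location defined in the dbmover configuration file (the node name here is purely illustrative), the ping test would be:

dtlrexe prog=ping loc=node1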

Step 2: Run DTLURDMO to copy PowerExchange objects.

At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than existing libraries, you need to copy the datamaps. To do this, use the PowerExchange copy utility DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO does have the ability to copy selectively, however, and the full functionality of the utility is documented in the PowerExchange Utilities Guide. The types of definitions that can be managed with this utility are:

● PowerExchange data maps
● PowerExchange capture registrations
● PowerExchange capture extraction data maps

On MVS, the input statements for this utility are taken from SYSIN. On non-MVS platforms, the input argument points to a file containing the input definition. If no input argument is provided, the utility looks for a file dtlurdmo.ini in the current path. The utility runs on all capture platforms.

Windows and UNIX

Command line syntax: DTLURDMO <definition file>

For example: DTLURDMO e:\powerexchange\bin\dtlurdmo.ini

● DTLURDMO definition file specification - This file is used to specify how the DTLURDMO utility operates. If no definition file is specified, the utility looks for a file dtlurdmo.ini in the current path.

MVS DTLURDMO job utility

Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.

● DTLURDMO definition file specification - This file is used to specify how the DTLURDMO utility operates and is read from the SYSIN card.

AS/400 utility

Syntax: CALL PGM(<library>/DTLURDMO)

For example: CALL PGM(dtllib/DTLURDMO)

● DTLURDMO definition file specification - This file is used to specify how the DTLURDMO utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib library.

If you want to create a separate DTLURDMO definition file rather than use the default location, you must give the library and filename of the definition file as a parameter. For example: CALL PGM(dtllib/DTLURDMO) parm ('datalib/deffile(dtlurdmo)')

Running DTLURDMO

The utility should be run extracting information from the files locally, then writing out the datamaps through the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again for the registrations, and then for the extract maps if this is a capture environment. Datamap, registration, and extract map copy commands cannot be mixed in a single run.
If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then selective copies can be carried out. Details of performing selective copies are documented fully in the PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the existing environment to the new V8.x.x format.

Definition File Example

The following example shows a definition file to copy all datamaps from the existing local datamaps (the local datamaps are defined in the DATAMAP DD card in the MVS JCL or by the path on Windows or UNIX) to the V8.x.x listener (defined by the TARGET location node1):

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
DM_COPY;
SELECT schema=*;

Note: The encrypted password (EPWD) is generated from the FILE, ENCRYPT PASSWORD option in the PowerExchange Navigator.
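Because registrations must be copied in a separate run, a second definition file is needed. The sketch below simply swaps the copy command, on the assumption that REG_COPY is the registration copy command in your release; the selection statements are omitted here because registration SELECT criteria differ from datamap criteria, so take the exact statements from the PowerExchange Utilities Guide rather than from this illustration:

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
REG_COPY;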

PowerExchange Client Tool (Detail Navigator)

Step 1: Validate connectivity between the client and listeners

● Test communication between clients and all listeners in the production environment with: dtlrexe prog=ping loc=<node_name>.
● Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Start the PowerExchange Navigator

● Select the datamap that is going to be promoted to production.
● On the menu bar, select the option to send the file to the remote node.
● In the drop-down list box, choose the appropriate location (in this case, mvs_prod).
● Supply the user name and password and click OK.
● A confirmation message for successful migration is displayed.

Last updated: 06-Feb-07 11:39

Running Sessions in Recovery Mode

Challenge

Understanding the recovery options that are available for PowerCenter when errors are encountered during the load.

Description

When a task in the workflow fails at any point, one option is to truncate the target and run the workflow again from the beginning. As an alternative, the workflow can be suspended and the error can be fixed, rather than re-processing the portion of the workflow with no errors. This option, "Suspend on Error", results in accurate and complete target data, as if the session completed successfully with one run. There are also recovery options available for workflows and tasks that can be used to handle different failure scenarios.

Configure Mapping for Recovery

For consistent recovery, the mapping needs to produce the same result, and in the same order, in the recovery execution as in the failed execution. This can be achieved by sorting the input data using either the sorted ports option in the Source Qualifier (or Application Source Qualifier) or by using a Sorter transformation with the distinct rows option immediately after the Source Qualifier transformation. Additionally, ensure that all targets receive data from transformations that produce repeatable data.

Configure Session for Recovery

The recovery strategy can be configured on the Properties page of the Session task. Enable the session for recovery by selecting one of the following three recovery strategies:

● Resume from the last checkpoint
❍ The Integration Service saves the session recovery information and updates recovery tables for a target database. If a session interrupts, the Integration Service uses the saved recovery information to recover it.
❍ The Integration Service recovers a stopped, aborted, or terminated session from the last checkpoint.
● Restart task
❍ The Integration Service does not save session recovery information. If a session interrupts, the Integration Service reruns the session during recovery.
● Fail task and continue workflow
❍ The Integration Service recovers the workflow; it does not recover the session. The session status becomes failed and the Integration Service continues running the workflow.

Configure Workflow for Recovery

The Suspend on Error option directs the Integration Service to suspend the workflow while the error is being fixed, and then resume the workflow. The workflow is suspended when any of the following tasks fail:

● Session
● Command
● Worklet
● Email

When a task fails in the workflow, the Integration Service stops running tasks in the path. The Integration Service does not evaluate the output link of the failed task. If no other task is running in the workflow, the Workflow Monitor displays the status of the workflow as "Suspended." If one or more tasks are still running in the workflow when a task fails, the Integration Service stops running the failed task and continues running tasks in other paths. The Workflow Monitor displays the status of the workflow as "Suspending." When the status of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target database error, and recover the workflow in the Workflow Monitor. When you recover a workflow, the Integration Service restarts the failed tasks and continues evaluating the rest of the tasks in the workflow. The Integration Service does not run any task that already completed successfully.

Truncate Target Table

If the truncate table option is enabled in a recovery-enabled session, the target table is not truncated during the recovery process.

Session Logs In a suspended workflow scenario, the Integration Service uses the existing session log when it resumes the workflow from the point of suspension. However, the earlier runs that caused the suspension are recorded in the historical run information in the repository.

Suspension Email The workflow can be configured to send an email when the Integration Service suspends the workflow. When a task fails, the workflow is suspended and suspension email is sent. The error can be fixed and the workflow can be resumed subsequently. If another task fails while the Integration Service is suspending the workflow, another suspension email is not sent. The Integration Service only sends out another suspension email if another task fails after the workflow resumes. Check the "Browse Emails" button on the General tab of the Workflow Designer Edit sheet to configure the suspension email.

Suspending Worklets When the "Suspend On Error" option is enabled for the parent workflow, the Integration Service also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the Integration Service stops executing the failed task and other tasks in its path. If no other task is running in the worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is "Suspended" or "Suspending".

Starting Recovery

The recovery process can be started using the Workflow Manager or Workflow Monitor. Alternatively, the recovery process can be started by using pmcmd in command line mode or from a script.
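A hedged sketch of a command-line recovery follows; the Integration Service, domain, folder, and workflow names are placeholders, and the option flags should be confirmed with pmcmd help for your version:

pmcmd recoverworkflow -sv INT_SVC_PROD -d Domain_Prod -u Administrator -p Adminpwd -f MARKETING_PROD wf_load_marketing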

Recovery Tables and Recovery Process

When the Integration Service runs a session that has a resume recovery strategy, it writes to recovery tables on the target database system. When the Integration Service recovers the session, it uses information in the recovery tables to determine where to begin loading data to target tables. If you want the Integration Service to create the recovery tables, grant table creation privilege to the database user name that is configured in the target database connection. If you do not want the Integration Service to create the recovery tables, create the recovery tables manually.

The Integration Service creates the following recovery tables in the target database:

● PM_RECOVERY - Contains target load information for the session run. The Integration Service removes the information from this table after each successful session and initializes the information at the beginning of subsequent sessions.
● PM_TGT_RUN_ID - Contains information that the Integration Service uses to identify each target on the database. The information remains in the table between session runs. If you manually create this table, you must create a row and enter a value other than zero for LAST_TGT_RUN_ID to ensure that the session recovers successfully.
● PM_REC_STATE - When the Integration Service runs a real-time session that uses the recovery table and that has recovery enabled, it creates this table on the target database to store message IDs and commit numbers. When the Integration Service recovers the session, it uses this information to determine whether it needs to write each message to the target table during recovery for a real-time session.

If you edit or drop the recovery tables before you recover a session, the Integration Service cannot recover the session. If you disable recovery, the Integration Service does not remove the recovery tables from the target database, and you must remove them manually.

Session Recovery Considerations

The following options affect whether the session is incrementally recoverable:

● Output is deterministic. A property that determines if the transformation generates the same set of data for each session run.
● Output is repeatable. A property that determines if the transformation generates the data in the same order for each session run. You can set this property for Custom transformations.
● Lookup source is static. A Lookup transformation property that determines if the lookup source is the same between the session and recovery. The Integration Service uses this property to determine if the output is deterministic.

Inconsistent Data During the Recovery Process

For recovery to be effective, the recovery session must produce the same set of rows, and in the same order. Any change after the initial failure (in the mapping, the session, and/or the Integration Service) that affects the ability to produce repeatable data results in inconsistent data during the recovery process. The following situations may produce inconsistent data during a recovery session:

● The session performs incremental aggregation and the Integration Service stops unexpectedly.
● The mapping uses a Sequence Generator transformation.
● The mapping uses a Normalizer transformation.
● The source and/or target changes after the initial session failure.
● The data movement mode changes after the initial session failure.
● A code page (server, source, or target) changes after the initial session failure.
● The mapping changes in a way that causes the server to distribute, filter, or aggregate rows differently.
● The session uses a configuration that is not supported by PowerCenter for session recovery.
● The mapping uses a lookup table and the data in the lookup table changes between session runs.
● The session sort order changes, when the server is running in Unicode mode.

HA Recovery

Highly-available recovery allows the workflow to resume automatically in the case of an Integration Service failover. The following options are available in the Properties tab of the workflow:

● Enable HA recovery. Allows the workflow to be configured for high availability.
● Automatically recover terminated tasks. Recovers terminated Session or Command tasks without user intervention.
● Maximum automatic recovery attempts. When you automatically recover terminated tasks, you can choose the number of times the Integration Service attempts to recover the task. The default setting is 5.

Last updated: 26-May-08 11:28

Using PowerCenter Labels

Challenge

Using labels effectively in a data warehouse or data integration project to assist with administration and migration.

Description

A label is a versioning object that can be associated with any versioned object or group of versioned objects in a repository. Labels provide a way to tag a number of object versions with a name for later identification. A label is therefore a named object in the repository whose purpose is to be a "pointer" or reference to a group of versioned objects. For example, a label called "Project X version X" can be applied to all object versions that are part of that project and release. Labels can be used for many purposes:

● Track versioned objects during development.
● Improve object query results.
● Create logical groups of objects for future deployment.
● Associate groups of objects for import and export.

Note that labels apply to individual object versions, not to objects as a whole. If a mapping has ten versions checked in and a label is applied to version 9, then only version 9 has that label; the other versions of that mapping do not automatically inherit it. However, multiple labels can point to the same object version for greater flexibility. The "Use Repository Manager" privilege is required in order to create or edit labels. To create a label, choose Versioning > Labels from the Repository Manager.


When creating a new label, choose a name that is as descriptive as possible. For example, a suggested naming convention for labels is: Project_Version_Action. Include comments for further meaningful description. Locking the label is also advisable. This prevents anyone from accidentally associating additional objects with the label or removing object references for the label. Labels, like other global objects such as Queries and Deployment Groups, can have user and group privileges attached to them. This allows an administrator to create a label that can only be used by specific individuals or groups. Only those people working on a specific project should be given read/write/execute permissions for labels that are assigned to that project.


Once a label is created, it should be applied to related objects. To apply the label to objects, invoke the "Apply Label" wizard from the Versioning > Apply Label menu option on the menu bar in the Repository Manager.

Applying Labels

Labels can be applied to any object and cascaded upwards and downwards to parent and/or child objects. For example, to group dependencies for a workflow, apply a label to all child objects; the Repository Server applies labels to the sources, targets, mappings, and tasks associated with the workflow. Use the "Move label" property to point the label to the latest version of the object(s).

Note: Labels can be applied to any object version in the repository except checked-out versions. Execute permission is required for applying labels.

After the label has been applied to related objects, it can be used in queries and deployment groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the size of the repository (i.e., to purge object versions).
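For scripted or repeatable environments, labels can also be created from the pmrep command-line utility rather than the Repository Manager GUI. The lines below are a hedged sketch only: the option letters are taken from memory of the pmrep reference and may differ between PowerCenter versions, and the repository, domain, user, and label names are purely illustrative.

# Connect to the repository (names are illustrative)
pmrep connect -r DEV_REPO -d Domain_Dev -n admin -x admin_password

# Create a descriptive label; verify the option letters with "pmrep help createlabel"
pmrep createlabel -a ProjectX_v1_ReadyForMigration -c "Objects ready for migration to QA"

# pmrep also provides an ApplyLabel command for attaching a label to objects and their
# dependencies; check "pmrep help applylabel" for the options in your version.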

Using Labels in Deployment

An object query can be created using existing labels. Labels can be associated only with a dynamic deployment group. Based on the object query, the objects associated with a label can be included in the deployment.


Strategies for Labels

Repository Administrators and other individuals in charge of migrations should develop their own label strategies and naming conventions in the early stages of a data integration project. Be sure that developers are aware of the uses of these labels and when they should apply them. For each planned migration between repositories, choose three labels for the development and subsequent repositories:

● The first is to identify the objects that developers can mark as ready for migration.
● The second should apply to migrated objects, thus developing a migration audit trail.
● The third is to apply to objects as they are migrated into the receiving repository, completing the migration audit trail.


When preparing for the migration, use the first label to construct a query to build a dynamic deployment group. The second and third labels in the process are optionally applied by the migration wizard when copying folders between versioned repositories; developers and administrators do not need to apply them manually. Additional labels can be created, in agreement with the developers, to track the progress of mappings if desired. For example, when an object has been successfully unit-tested, the developer can mark it as such. Developers can also label the object with a migration label at a later time if necessary. Using labels in this fashion, along with the query feature, allows complete or incomplete objects to be identified quickly and easily, thereby providing an object-based view of progress.

Last updated: 04-Jun-08 13:47


Deploying Data Analyzer Objects

Challenge

To understand the methods for deploying Data Analyzer objects among repositories and the limitations of such deployment.

Description

Data Analyzer repository objects can be exported to and imported from Extensible Markup Language (XML) files. Export/import facilitates archiving the Data Analyzer repository and deploying Data Analyzer dashboards and reports from development to production. The following repository objects in Data Analyzer can be exported and imported:

● Schemas
● Reports
● Time Dimensions
● Global Variables
● Dashboards
● Security profiles
● Schedules
● Users
● Groups
● Roles

The XML file created after exporting objects should not be modified. Any change might invalidate the XML file and cause the import of objects into a Data Analyzer repository to fail. For more information on exporting objects from the Data Analyzer repository, refer to the Data Analyzer Administration Guide.

Exporting Schema(s)


To export the definition of a star schema or an operational schema, select a metric or folder from the Metrics system folder in the Schema Directory. When you export a folder, you export the schema associated with the definitions of the metrics in that folder and its subfolders. If the folder you select for export does not contain any objects, Data Analyzer does not export any schema definition and displays the message: "There is no content to be exported."

There are two ways to export metrics or folders containing metrics:

● Select the "Export Metric Definitions and All Associated Schema Table and Attribute Definitions" option. If you export a metric and its associated schema objects, Data Analyzer exports the definitions of the metric and the schema objects associated with that metric. If you export an entire metric folder and its associated objects, Data Analyzer exports the definitions of all metrics in the folder, as well as the schema objects associated with every metric in the folder.
● Alternatively, select the "Export Metric Definitions Only" option. When you choose to export only the definition of the selected metric, Data Analyzer does not export the definition of the schema table from which the metric is derived or any other associated schema object.

1. Login to Data Analyzer as a System Administrator.
2. Click the Administration tab » XML Export/Import » Export Schemas.
3. All the metric folders in the schema directory are displayed. Click "Refresh Schema" to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click the "Export as XML" option.
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Report(s)

To export the definitions of more than one report, select multiple reports or folders. Data Analyzer exports only report definitions; it does not export the data or the schedule for cached reports. As part of the report definition export, Data Analyzer exports the report table, report chart, filters, indicators (i.e., gauge, chart, and table indicators), custom metrics, links to similar reports, and all reports in an analytic workflow, including links to similar reports.

Reports can have public or personal indicators associated with them. By default, Data Analyzer exports only the public indicators associated with a report. To export the personal indicators as well, select the Export Personal Indicators check box. To export an analytic workflow, you need to export only the originating report. When you export the originating report of an analytic workflow, Data Analyzer exports the definitions of all the workflow reports. If a report in the analytic workflow has similar reports associated with it, Data Analyzer exports the links to the similar reports. Data Analyzer does not export the alerts, schedules, or global variables associated with the report. Although Data Analyzer does not export global variables, it lists all global variables it finds in the report filter. You can, however, export these global variables separately.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Reports.
3. Select the folder or report to be exported.
4. Click "Export as XML".
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Global Variables

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Global Variables.
3. Select the global variable to be exported.
4. Click "Export as XML".
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Dashboard

Whenever a dashboard is exported, Data Analyzer exports the reports, indicators, shared documents, and gauges associated with the dashboard. Data Analyzer does not, however, export the alerts, access permissions, attributes or metrics in the report(s), or real-time objects. You can export any of the public dashboards defined in the repository, and can export more than one dashboard at a time.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Dashboards.
3. Select the dashboard to be exported.
4. Click "Export as XML".
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a User Security Profile

Data Analyzer maintains a security profile for each user or group in the repository. A security profile consists of the access permissions and data restrictions that the system administrator sets for a user or group. When exporting a security profile, Data Analyzer exports access permissions for objects under the Schema Directory, which include folders, metrics, and attributes. Data Analyzer does not export access permissions for filtersets, reports, or shared documents. Data Analyzer allows you to export only one security profile at a time. If the user or group security profile you export does not have any access permissions or data restrictions, Data Analyzer does not export any object definitions and displays the message: "There is no content to be exported."

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Security Profile.
3. Click "Export from users" and select the user whose security profile is to be exported.
4. Click "Export as XML".
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Schedule

You can export a time-based or event-based schedule to an XML file. Data Analyzer runs a report with a time-based schedule on a configured schedule, and runs a report with an event-based schedule when a PowerCenter session completes. When you export a schedule, Data Analyzer does not export the history of the schedule.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Schedules.
3. Select the schedule to be exported.
4. Click "Export as XML".
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Users, Groups, or Roles

Exporting Users

You can export the definition of any user defined in the repository. However, you cannot export the definitions of system users defined by Data Analyzer. If you have more than one thousand users defined in the repository, Data Analyzer allows you to search for the users that you want to export. You can use the asterisk (*) or the percent symbol (%) as wildcard characters to search for users to export. You can export the definitions of more than one user, including the following information:

● Login name
● Description
● First, middle, and last name
● Title
● Password
● Change password privilege
● Password never expires indicator
● Account status
● Groups to which the user belongs
● Roles assigned to the user
● Query governing settings

Data Analyzer does not export the email address, reply-to address, department, or color scheme assignment associated with the exported user(s).

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click "Export Users/Group(s)/Role(s)".
4. Select the user(s) to be exported.
5. Click "Export as XML".
6. Enter the XML filename and click "Save" to save the XML file.
7. The XML file will be stored locally on the client machine.


Exporting Groups

You can export any group defined in the repository, and can export the definitions of multiple groups. You can also export the definitions of all the users within a selected group. Use the asterisk (*) or percent symbol (%) as wildcard characters to search for groups to export. Each group definition includes the following information:

● Name
● Description
● Department
● Color scheme assignment
● Group hierarchy
● Roles assigned to the group
● Users assigned to the group
● Query governing settings

Data Analyzer does not export the color scheme associated with an exported group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click "Export Users/Group(s)/Role(s)".
4. Select the group to be exported.
5. Click "Export as XML".
6. Enter the XML filename and click "Save" to save the XML file.
7. The XML file will be stored locally on the client machine.

Exporting Roles

You can export the definitions of the custom roles defined in the repository. However, you cannot export the definitions of system roles defined by Data Analyzer. You can export the definitions of more than one role. Each role definition includes the name and description of the role and the permissions assigned to it.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click "Export Users/Group(s)/Role(s)".
4. Select the role to be exported.
5. Click "Export as XML".
6. Enter the XML filename and click "Save" to save the XML file.
7. The XML file will be stored locally on the client machine.

Importing Objects

You can import objects into the same repository or a different repository. If you import objects that already exist in the repository, you can choose to overwrite the existing objects. However, you can import only global variables that do not already exist in the repository.

When you import objects, you can validate the XML file against the DTD provided by Data Analyzer. Informatica recommends that you do not modify the XML files after you export them from Data Analyzer. Ordinarily, you do not need to validate an XML file that you create by exporting from Data Analyzer. However, if you are not sure of the validity of an XML file, you can validate it against the Data Analyzer DTD file when you start the import process.

To import repository objects, you must have the System Administrator role or the Access XML Export/Import privilege. When you import a repository object, you become the owner of the object as if you created it. However, other system administrators can also access imported repository objects. You can limit access to reports for users who are not system administrators. If you select to publish imported reports to everyone, all users in Data Analyzer have read and write access to them. You can change the access permissions to reports after you import them.

Importing Schemas

When importing schemas, if the XML file contains only the metric definition, you must make sure that the fact table for the metric exists in the target repository. You can import a metric only if its associated fact table exists in the target repository or the definition of its associated fact table is also in the XML file. When you import a schema, Data Analyzer displays a list of all the definitions contained in the XML file. It then displays a list of all the object definitions in the XML file that already exist in the repository. You can choose to overwrite objects in the repository. If you import a schema that contains time keys, you must import or create a time dimension.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Schema.
3. Click "Browse" to choose an XML file to import.
4. Select "Validate XML against DTD".
5. Click "Import XML".
6. Verify all attributes on the summary page, and choose "Continue".

Importing Reports

A valid XML file of exported report objects can contain definitions of cached or on-demand reports, including prompted reports. When you import a report, you must make sure that all the metrics and attributes used in the report are defined in the target repository. If you import a report that contains attributes and metrics not defined in the target repository, you can cancel the import process. If you choose to continue the import process, you may not be able to run the report correctly. To run the report, you must import or add the attribute and metric definitions to the target repository.

You are the owner of all the reports you import, including the personal or public indicators associated with the reports. You can publish the imported reports to all Data Analyzer users. If you publish reports to everyone, Data Analyzer provides read access to the reports to all users. However, it does not provide access to the folder that contains the imported reports. If you want another user to access an imported report, you can put the imported report in a public folder and have the user save or move the imported report to his or her personal folder. Any public indicator associated with the report also becomes accessible to the user.

If you import a report and its corresponding analytic workflow, the XML file contains all workflow reports. If you choose to overwrite the report, Data Analyzer also overwrites the workflow reports. Also, when importing multiple workflows, note that Data Analyzer does not import analytic workflows containing the same workflow report names. Thus, ensure that all imported analytic workflows have unique report names prior to being imported.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Report.
3. Click "Browse" to choose an XML file to import.
4. Select "Validate XML against DTD".
5. Click "Import XML".
6. Verify all attributes on the summary page, and choose "Continue".

Importing Global Variables


You can import global variables that are not defined in the target repository. If the XML file contains global variables already in the repository, you can cancel the process. If you continue the import process, Data Analyzer imports only the global variables not in the target repository.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Global Variables.
3. Click "Browse" to choose an XML file to import.
4. Select "Validate XML against DTD".
5. Click "Import XML".
6. Verify all attributes on the summary page, and choose "Continue".

Importing Dashboards

Dashboards display links to reports, shared documents, alerts, and indicators. When you import a dashboard, Data Analyzer imports the following objects associated with the dashboard:

● Reports
● Indicators
● Shared documents
● Gauges

Data Analyzer does not import the following objects associated with the dashboard:

● Alerts
● Access permissions
● Attributes and metrics in the report
● Real-time objects

If an object already exists in the repository, Data Analyzer provides an option to overwrite it. Data Analyzer does not import the attributes and metrics in the reports associated with the dashboard. If the attributes or metrics in a report associated with the dashboard do not exist, the report does not display on the imported dashboard.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Dashboard.
3. Click "Browse" to choose an XML file to import.
4. Select "Validate XML against DTD".
5. Click "Import XML".
6. Verify all attributes on the summary page, and choose "Continue".

Importing Security Profile(s)

To import a security profile, you must begin by selecting the user or group to which you want to assign the security profile. You can assign the same security profile to more than one user or group. When you import a security profile and associate it with a user or group, you can either overwrite the current security profile or add to it. When you overwrite a security profile, you assign the user or group only the access permissions and data restrictions found in the new security profile; Data Analyzer removes the old restrictions associated with the user or group. When you append a security profile, you assign the user or group the new access permissions and data restrictions in addition to the old permissions and restrictions. When exporting a security profile, Data Analyzer exports the security profile for objects in the Schema Directory, including folders, attributes, and metrics. However, it does not include the security profile for filtersets.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Security Profile.
3. Click "Import to Users".
4. Select the user with which you want to associate the security profile you import.
   ❍ To associate the imported security profiles with all the users on the page, select the "Users" check box at the top of the list.
   ❍ To associate the imported security profiles with all the users in the repository, select "Import to All".
   ❍ To overwrite the selected user's current security profile with the imported security profile, select "Overwrite".
   ❍ To append the imported security profile to the selected user's current security profile, select "Append".
5. Click "Browse" to choose an XML file to import.
6. Select "Validate XML against DTD".
7. Click "Import XML".
8. Verify all attributes on the summary page, and choose "Continue".


Importing Schedule(s)

A time-based schedule runs reports based on a configured schedule. An event-based schedule runs reports when a PowerCenter session completes. You can import a time-based or event-based schedule from an XML file. When you import a schedule, Data Analyzer does not attach the schedule to any reports.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Schedule.
3. Click "Browse" to choose an XML file to import.
4. Select "Validate XML against DTD".
5. Click "Import XML".
6. Verify all attributes on the summary page, and choose "Continue".

Importing Users, Groups, or Roles

When you import a user, group, or role, you import all the information associated with each user, group, or role. The XML file includes definitions of roles assigned to users or groups, and definitions of users within groups. For this reason, you can import the definition of a user, group, or role in the same import process. When importing a user, you import the definitions of roles assigned to the user and the groups to which the user belongs. When you import a user or group, you import the user or group definitions only. The XML file does not contain the color scheme assignments, access permissions, or data restrictions for the user or group. To import the access permissions and data restrictions, you must import the security profile for the user or group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import User/Group/Role.
3. Click "Browse" to choose an XML file to import.
4. Select "Validate XML against DTD".
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

Tips for Importing/Exporting

● Schedule importing/exporting of repository objects for a time of minimal Data Analyzer activity, when most of the users are not accessing the Data Analyzer repository. This helps prevent users from experiencing timeout errors or degraded response time. Only the System Administrator should perform import/export operations.
● Take a backup of the Data Analyzer repository prior to performing an import/export operation. This backup should be completed using the Repository Backup Utility provided with Data Analyzer.
● Manually add user/group permissions for the report. These permissions are not exported as part of exporting reports and should be manually added after the report is imported on the desired server.
● Use a version control tool. Prior to importing objects into a new environment, it is advisable to check the XML documents into a version-control tool such as Microsoft Visual SourceSafe or PVCS. This facilitates the versioning of repository objects and provides a means for rollback to a prior version of an object, if necessary.
● Attach cached reports to schedules. Data Analyzer does not import the schedule with a cached report. When you import cached reports, you must attach them to schedules in the target repository. You can attach multiple imported reports to schedules in the target repository in one process immediately after you import them.
● Ensure that global variables exist in the target repository. If you import a report that uses global variables in the attribute filter, ensure that the global variables already exist in the target repository. If they are not in the target repository, you must either import the global variables from the source repository or recreate them in the target repository.
● Manually add indicators to the dashboard. When you import a dashboard, Data Analyzer imports all indicators for the originating report and workflow reports in a workflow. However, indicators for workflow reports do not display on the dashboard after you import it until they are added manually.
● Check with your System Administrator to understand what level of LDAP integration has been configured (if any). Users, groups, and roles need to be exported and imported during deployment when using repository authentication. If Data Analyzer has been integrated with an LDAP (Lightweight Directory Access Protocol) tool, then users, groups, and/or roles may not require deployment.

When you import users into a Microsoft SQL Server or IBM DB2 repository, Data Analyzer blocks all user authentication requests until the import process is complete.


Installing Data Analyzer

Challenge

Installing Data Analyzer on new or existing hardware, either as a dedicated application on a physical machine (as Informatica recommends) or co-existing with other applications on the same physical server or with other Web applications on the same application server.

Description

Consider the following questions when determining what type of hardware to use for Data Analyzer.

If the hardware already exists:

1. Is the processor, operating system, and database software supported by Data Analyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the Data Analyzer application?
5. Will Data Analyzer share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported? (e.g., Solaris, Windows, AIX, HP-UX, Redhat AS, SuSE)
3. What database and version is preferred and supported for the Data Analyzer repository?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the reporting response time requirements for Data Analyzer. The following questions should be answered in order to estimate the size of a Data Analyzer server:

1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?

The hardware requirements for the Data Analyzer environment depend on the number of concurrent users, the types of reports being used (i.e., interactive vs. static), the average number of records in a report, and the application server and operating system used, among other factors. The following tables should be used as a general guide for hardware recommendations for a Data Analyzer installation. Actual results may vary depending upon the exact hardware configuration and user volume. For exact sizing recommendations, contact Informatica Professional Services for a Data Analyzer Sizing and Baseline Architecture engagement.

Windows

# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment
50  | 1000  | 2  | 2   | 1 GB   | 1
100 | 1000  | 2  | 3   | 2 GB   | 1-2
200 | 1000  | 2  | 6   | 3.5 GB | 3
400 | 1000  | 2  | 12  | 6.5 GB | 6
100 | 1000  | 2  | 3   | 2 GB   | 1-2
100 | 2000  | 2  | 3   | 2.5 GB | 1-2
100 | 5000  | 2  | 4   | 3 GB   | 2
100 | 10000 | 2  | 5   | 4 GB   | 2-3
100 | 1000  | 2  | 3   | 2 GB   | 1-2
100 | 1000  | 5  | 3   | 2 GB   | 1-2
100 | 1000  | 7  | 3   | 2.5 GB | 1-2
100 | 1000  | 10 | 3-4 | 3 GB   | 1-2

Notes:

1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3 on Windows 2000, on a 4-CPU 2.5 GHz Xeon processor. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the number of total users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. There will be an increase in overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has 4 or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.

IBM AIX

# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App Servers in a Clustered Environment
50  | 1000  | 2  | 2    | 1 GB   | 1
100 | 1000  | 2  | 2-3  | 2 GB   | 1
200 | 1000  | 2  | 4-5  | 3.5 GB | 2-3
400 | 1000  | 2  | 9-10 | 6 GB   | 4-5
100 | 1000  | 2  | 2-3  | 2 GB   | 1
100 | 2000  | 2  | 2-3  | 2 GB   | 1-2
100 | 5000  | 2  | 2-3  | 3 GB   | 1-2
100 | 10000 | 2  | 4    | 4 GB   | 2
100 | 1000  | 2  | 2-3  | 2 GB   | 1
100 | 1000  | 5  | 2-3  | 2 GB   | 1
100 | 1000  | 7  | 2-3  | 2 GB   | 1-2
100 | 1000  | 10 | 2-3  | 2.5 GB | 1-2

Notes:

1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the number of total users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has 4 or more CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.

Data Analyzer Installation

The Data Analyzer installation process involves two main components: the Data Analyzer repository and the Data Analyzer Server, which is an application deployed on an application server. A Web server is necessary to support these components and is included with the installation of the application servers. This section discusses the installation process for JBOSS, BEA WebLogic, and IBM WebSphere. The installation tips apply to both Windows and UNIX environments. This section is intended to serve as a supplement to the Data Analyzer Installation Guide.

Before installing Data Analyzer, be sure to complete the following steps:

● Verify that the hardware meets the minimum system requirements for Data Analyzer. Ensure that the combination of hardware, operating system, application server, repository database, and, optionally, authentication software is supported by Data Analyzer.
● Ensure that sufficient space has been allocated to the Data Analyzer repository.
● Apply all necessary patches to the operating system and database software.
● Verify connectivity to the data warehouse database (or other reporting source) and repository database.
● If LDAP or NT Domain is used for Data Analyzer authentication, verify connectivity to the LDAP directory server or the NT primary domain controller.
● Verify that the Data Analyzer license file has been obtained from technical support.
● On UNIX/Linux installations, verify that the OS user that runs Data Analyzer has execute privileges on all Data Analyzer installation executables (see the example following this list).
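As a simple illustration of that last point, the installer files can be made executable with chmod before the installation is started. The directory and file names below are placeholders for whichever installation binaries were actually downloaded.

# Run as the OS user that will own the Data Analyzer installation
cd /tmp/dataanalyzer_install     # illustrative staging directory
chmod u+x *.bin *.sh             # grant execute permission on the installer executables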

In addition to the standard Data Analyzer components that are installed by default, you can also install Metadata Manager. With Version 8.0, the Data Analyzer SDK and Portal Integration Kit are now installed with Data Analyzer. Refer to the Data Analyzer documentation for detailed information for these components.

Changes to the Installation Process

Beginning with Data Analyzer version 7.1.4, Data Analyzer is packaged with PowerCenter Advanced Edition. To install only the Data Analyzer portion, choose the Custom Installation option during the installation process. On the following screen, uncheck all of the check boxes except the Data Analyzer check box and then click Next.

Repository Configuration

To properly install Data Analyzer, you need the connectivity information for the database server where the repository will reside. This information includes:

● Database URL
● Repository username
● Password for the repository username

Installation Steps: JBOSS

The following are the basic installation steps for Data Analyzer on JBOSS:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install Data Analyzer. The Data Analyzer installation process will install JBOSS if a version does not already exist, or an existing instance can be selected.
3. Apply the Data Analyzer license key.
4. Install the Data Analyzer Online Help.

Installation Tips: JBOSS

The following are the basic installation tips for Data Analyzer on JBOSS:

● Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of JBOSS. Also, other applications can coexist with Data Analyzer on a single instance of JBOSS. Although this architecture should be considered during hardware sizing estimates, it allows greater flexibility during installation.
● For JBOSS installations on UNIX, the JBOSS Server installation program requires an X-Windows server. If JBOSS Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI-based installation program. For more information on installing on UNIX, see the "UNIX Servers" section of the installation and configuration tips below.
● If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
● To enable an installation error log, read the Knowledgebase article "HOW TO: Debug PowerAnalyzer Installations", available through My Informatica (http://my.informatica.com).
● During the Data Analyzer installation process, the user is prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation as the installer configures all properties files at installation.
● The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen


Installation Steps: BEA WebLogic

The following are the basic installation steps for Data Analyzer on BEA WebLogic:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.

TIP

When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too large. Since many target tablespaces are initially set for very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly. The following example shows how to set the recommended storage parameters, assuming the repository is stored in the "REPOSITORY" tablespace:

ALTER TABLESPACE "REPOSITORY" DEFAULT STORAGE (
    INITIAL 10K
    NEXT 10K
    MAXEXTENTS UNLIMITED
    PCTINCREASE 50
);

Installation Tips: BEA WebLogic

The following are the basic installation tips for Data Analyzer on BEA WebLogic:

● Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebLogic. Also, other applications can coexist with Data Analyzer on a single instance of WebLogic. Although this architecture should be factored in during hardware sizing estimates, it allows greater flexibility during installation.
● With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebLogic installations.
● If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
● To enable an installation error log, read the Knowledgebase article "HOW TO: Debug PowerAnalyzer Installations", available through My Informatica (http://my.informatica.com).
● During the Data Analyzer installation process, the user is prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation since the installer configures all properties files at installation.
● The Data Analyzer license file and BEA WebLogic license must be applied prior to starting Data Analyzer.

Configuration Screen

Installation Steps: IBM WebSphere

The following are the basic installation steps for Data Analyzer on IBM WebSphere:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but the empty database schema needs to exist and be able to be connected to via JDBC prior to installation.
2. Install IBM WebSphere and apply the WebSphere patches. WebSphere can be installed in its "Base" configuration, or in its "Network Deployment" configuration if clustering will be utilized. In both cases, patchsets will need to be applied.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.
6. Configure the PowerCenter Integration Utility. See the section "Configuring the PowerCenter Integration Utility for WebSphere" in the PowerCenter Installation and Configuration Guide.

Installation Tips: IBM WebSphere

● Starting in Data Analyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebSphere. Also, other applications can coexist with Data Analyzer on a single instance of WebSphere. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.
● With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebSphere installations.
● For WebSphere on UNIX installations, Data Analyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation, and the root account should be added to both of these groups.
● For WebSphere on Windows installations, ensure that Data Analyzer is installed under the "padaemon" local Windows user ID that is in the Administrative group and has the advanced user rights "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account will need to be added to the mqm group.
● If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP'd in binary format.
● To enable an installation error log, read the Knowledgebase article "HOW TO: Debug PowerAnalyzer Installations", available through My Informatica (http://my.informatica.com).
● During the WebSphere installation process, the user is prompted to enter a directory for the application server and the HTTP (web) server. In both instances, it is advisable to keep the default installation directory. Directory names for the application server and HTTP server that include spaces may result in errors.
● During the Data Analyzer installation process, the user is prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is utilized, have the configuration parameters available during installation as the installer configures all properties files at installation.
● The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen


Installation and Configuration Tips: UNIX Servers

With Data Analyzer 8, there is a console version of the installation available. For previous versions of Data Analyzer, a graphics display server is required for a Data Analyzer installation on UNIX. On UNIX, the graphics display server is typically an X-Windows server, although an X-Windows Virtual Frame Buffer (XVFB) or personal computer X-Windows software such as WRQ Reflection-X can also be used. In any case, the X-Windows server does not need to exist on the local machine where Data Analyzer is being installed, but it does need to be accessible. A remote X-Windows, XVFB, or PC-X server can be used by setting the DISPLAY variable to the appropriate IP address, as discussed below.

If the X-Windows server is not installed on the machine where Data Analyzer will be installed, Data Analyzer can be installed using an X-Windows server installed on another machine. Simply redirect the DISPLAY variable to use the X-Windows server on the other UNIX machine. To redirect the host output, define the environment variable DISPLAY, substituting the IP address of the machine running the X-Windows server. On the command line, type the appropriate command and press Enter:

C shell:

setenv DISPLAY <ip_address>:0

Bourne/Korn shell:

export DISPLAY=<ip_address>:0

Configuration

● Data Analyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators do not display properly on dashboards or reports. For Data Analyzer installations using an application server with JDK 1.4 or greater, the "java.awt.headless=true" setting can be set in the application server startup scripts to facilitate graphics rendering for Data Analyzer. If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics. The DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting Data Analyzer.

● The application server heap size is the memory allocation for the JVM. The recommended heap size depends on the memory available on the machine hosting the application server and the server load, but the recommended starting point is 512MB. This is the first setting that should be examined when tuning a Data Analyzer instance, as sketched below.
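A hedged illustration of these two settings follows. The exact variable and script differ by application server (for example, JAVA_OPTS in the JBOSS run script versus MEM_ARGS/JAVA_OPTIONS in WebLogic start scripts), so treat this as a sketch of the JVM flags rather than a definitive startup-script change.

# Example JVM settings for the application server start script:
# a 512 MB starting heap and headless graphics rendering (JDK 1.4 or later).
JAVA_OPTS="$JAVA_OPTS -Xms512m -Xmx512m -Djava.awt.headless=true"
export JAVA_OPTS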

Last updated: 24-Jul-07 16:40


Data Connectivity using PowerCenter Connect for BW Integration Server

Challenge

Understanding how to use PowerCenter Connect for SAP NetWeaver - BW Option to load data into the SAP BW (Business Information Warehouse).

Description

The PowerCenter Connect for SAP NetWeaver - BW Option supports the SAP Business Information Warehouse as both a source and a target.

Extracting Data from BW

PowerCenter Connect for SAP NetWeaver - BW Option lets you extract data from SAP BW to use as a source in a PowerCenter session. PowerCenter Connect for SAP NetWeaver - BW Option integrates with the Open Hub Service (OHS), SAP's framework for extracting data from BW. OHS uses data from multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS framework includes InfoSpoke programs, which extract data from BW and write the output to SAP transparent tables.

Loading Data into BW

PowerCenter Connect for SAP NetWeaver - BW Option lets you import BW target definitions into the Designer and use the target in a mapping to load data into BW. PowerCenter Connect for SAP NetWeaver - BW Option uses the Business Application Program Interface (BAPI), SAP's strategic technology for linking components into the Business Framework, to exchange metadata and load data into BW. PowerCenter can also use SAP's business content framework to provide a high-volume data warehousing solution. PowerCenter extracts and transforms data from multiple sources and uses SAP's high-speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific models for analysis through the SAP Business Explorer tool.


Using PowerCenter with PowerCenter Connect to Populate BW

The following points summarize some of the key differences in using PowerCenter with PowerCenter Connect to populate SAP BW rather than working with standard RDBMS sources and targets:

● BW uses a pull model. BW must request data from a source system before the source system can send data to BW. PowerCenter must first register with BW using SAP's Remote Function Call (RFC) protocol.
● The native interface to communicate with BW is the Staging BAPI, an API published and supported by SAP. Three products in the PowerCenter suite use this API: PowerCenter Designer uses the Staging BAPI to import metadata for the target transfer structures; PowerCenter Integration Server for BW uses the Staging BAPI to register with BW and receive requests to run sessions; and the PowerCenter Server uses the Staging BAPI to perform metadata verification and load data into BW.
● Programs communicating with BW use the SAP standard saprfc.ini file. The saprfc.ini file is similar to the tnsnames file in Oracle or the interface file in Sybase. The PowerCenter Designer reads metadata from BW and the PowerCenter Server writes data to BW.
● BW requires that all metadata extensions be defined in the BW Administrator Workbench. The definition must be imported into the Designer. An active structure is the target for PowerCenter mappings loading BW.
● Because of the pull model, BW must control all scheduling. BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW.
● BW only supports insertion of data. There is no concept of updates or deletes through the Staging BAPI.

Steps for Extracting Data from BW

The process of extracting data from SAP BW is quite similar to extracting data from SAP. Similar transports are used on the SAP side, and data type support is the same as that supported for SAP PowerCenter Connect. The steps required for extracting data are:

1. Create an InfoSpoke. Create an InfoSpoke in BW to extract the data from the BW database and write it to either a database table or a file output target.
2. Import the ABAP program. Import the Informatica-provided ABAP program, which calls the workflow created in the Workflow Manager.
3. Create a mapping. Create a mapping in the Designer that uses the database table or file output target as a source.
4. Create a workflow to extract data from BW. Create a workflow and session task to automate data extraction from BW.
5. Create a Process Chain. A BW Process Chain links programs together to run in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs together.
6. Schedule the data extraction from BW. Set up a schedule in BW to automate data extraction.

Steps To Load Data into BW

1. Install and Configure PowerCenter Components.

The installation of the PowerCenter Connect for SAP NetWeaver - BW Option includes both a client and a server component. The Connect server must be installed in the same directory as the PowerCenter Server. Informatica recommends installing the Connect client tools in the same directory as the PowerCenter Client. For more details on installation and configuration, refer to the PowerCenter and PowerCenter Connect installation guides.

Note: For SAP transports for PowerConnect version 8.1 and above, it is crucial to install or upgrade the PowerCenter 8.1 transports on the appropriate SAP system when installing or upgrading PowerCenter Connect for SAP NetWeaver - BW Option. If you are extracting data from BW using OHS, you must also configure the mySAP option. If the BW system is separate from the SAP system, install the designated transports on the BW system. It is also important to note that there are now three categories of transports (as compared to two in previous versions). These are as follows:

● Transports for SAP versions 3.1H and 3.1I.
● Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and above.
● Transports for SAP Unicode versions 4.7 and above; this category has been added for Unicode extraction support, which was not previously available in SAP versions 4.6 and earlier.


2. Build the BW Components.

To load data into BW, you must build components in both BW and PowerCenter. You must first build the BW components in the Administrator Workbench:

● Define PowerCenter as a source system to BW. BW requires an external source definition for all non-R/3 sources.
● Create the InfoObjects in BW (an InfoObject is similar to a database table).
● Create the InfoSource. The InfoSource represents a provider structure. Create the InfoSource in the BW Administrator Workbench and import the definition into the PowerCenter Warehouse Designer.
● Assign the InfoSource to the PowerCenter source system. After you create an InfoSource, assign it to the PowerCenter source system.
● Activate the InfoSource. When you activate the InfoSource, you activate the InfoObjects and the transfer rules.

3. Configure the saprfc.ini file.

This file is required for PowerCenter and Connect to connect to BW. PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:

● Type A. Used by the PowerCenter Client and PowerCenter Server. Specifies the BW application server.
● Type R. Used by the PowerCenter Connect for SAP NetWeaver - BW Option. Specifies the external program, which is registered at the SAP gateway.

Note: Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt the file. Set the RFC_INI environment variable on all Windows NT, Windows 2000, and Windows 95/98 machines that have a saprfc.ini file; RFC_INI is used to locate the saprfc.ini file. An illustrative saprfc.ini fragment follows.

4. Start the Connect for BW Server.

Start the Connect for BW Server after you start the PowerCenter Server and before you create the InfoPackage in BW.


5. Build mappings.

Import the InfoSource into the PowerCenter repository and build a mapping using the InfoSource as a target. The following restrictions apply to building mappings with a BW InfoSource target:

● You cannot use BW as a lookup table.
● You can use only one transfer structure for each mapping.
● You cannot execute stored procedures in a BW target.
● You cannot partition pipelines with a BW target.
● You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
● You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use an Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.

You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations. You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.

6. Load data To load data into BW from PowerCenter, both PowerCenter and the BW system must be configured. Use the following steps to load data into BW: ●



Configure a workflow to load data into BW. Create a session in a workflow that uses a mapping with an InfoSource target definition. Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter session with the InfoSource. When the Connect for BW Server starts, it communicates with the BW to register itself as a server. The Connect for BW Server waits for a request from the BW to start the workflow. When the InfoPackage starts, the BW communicates with the registered Connect for BW Server and sends the workflow name to be scheduled with the PowerCenter Server. The Connect for BW Server reads information about the workflow and sends a request to the PowerCenter Server to run the workflow. The PowerCenter Server validates the workflow name in the repository and the


workflow name in the InfoPackage. The PowerCenter Server executes the session and loads the data into BW. You must start the Connect for BW Server after you restart the PowerCenter Server.

Supported Datatypes
The PowerCenter Server transforms data based on the Informatica transformation datatypes. BW can only receive data in 250-byte packets. The PowerCenter Server converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one byte for a continuation flag. BW receives data until it reads the continuation flag set to zero. Within the transfer structure, BW then converts the data to the BW datatype. Currently, BW only supports the following datatypes in transfer structures assigned to BAPI source systems (PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT. All other datatypes result in the following error in BW: Invalid data type (data type name) for source system of type BAPI.

Date/Time Datatypes
The transformation date/time datatype supports dates with precision to the second. If you import a date/time value that includes milliseconds, the PowerCenter Server truncates to seconds. If you write a date/time value to a target column that supports milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the date.

Binary Datatypes
BW does not allow you to build a transfer structure with binary datatypes. Therefore, you cannot load binary data from PowerCenter into BW.

Numeric Datatypes
PowerCenter does not support the INT1 datatype.

Performance Enhancement for Loading into SAP BW


If you see a performance slowdown for sessions that load into SAP BW, set the default buffer block size to 15 to 20 MB to enhance performance. You can put 5,000 to 10,000 rows per block, so you can calculate the required buffer block size with the following formula: Row size x Rows per block = Default buffer block size. For example, if your target row size is 2 KB: 2 KB x 10,000 rows = 20 MB.
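As an illustration only, the calculation above can be sketched in a few lines of Python; the 2 KB row size and 10,000 rows per block are simply the figures from the example.

   def default_buffer_block_size_mb(row_size_kb: float, rows_per_block: int) -> float:
       # Row size x rows per block, converted from KB to MB
       return row_size_kb * rows_per_block / 1024

   # 2 KB rows at 10,000 rows per block -> roughly the 20 MB suggested above
   print(default_buffer_block_size_mb(2, 10000))   # 19.53125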

Last updated: 04-Jun-08 16:31


Data Connectivity using PowerExchange for WebSphere MQ

Challenge
Integrate WebSphere MQ applications with PowerCenter mappings.

Description
With increasing requirements for both “on-demand” real-time data integration and the development of Enterprise Application Integration (EAI) architectures, WebSphere MQ has become an important part of the Informatica data integration platform. PowerExchange for WebSphere MQ provides data integration for transactional data generated by continuous messaging systems. PowerCenter’s Zero Latency (ZL) Engine provides immediate processing of trickle-feed data for these types of messaging systems, allowing both uni-directional and bi-directional processing of real-time data flows.

High Volume System Considerations
When working with high volume systems, two things to consider are the volume and size of the messages coming over the network and whether the messages are persistent or non-persistent. Although a queue may be configured for persistence, a specific message can override this setting. When a message is persistent, the Queue Manager first writes the message out to a log before it allows it to be visible in the queue. In a very high volume flow, if this is not handled correctly, it can lead to performance degradation and cause the logging to fill up the file system. Non-persistent messages are immediately visible in the queue for processing but, unlike persistent messages, they cannot be recovered if the Queue Manager or server crashes. To handle this type of flow volume, PowerCenter workflows can be configured to run in a Grid environment. The image below shows the two options that are available for persistence when creating a Local Queue:


In conjunction with the PowerCenter Grid option, WebSphere MQ can also be clustered to allow multiple Queue Managers to process the same message flow(s). In this type of configuration, separate Integration Services can be created to each hold unique MQSERVER environment variables. Alternatively, a Client Connection can be created for one Integration Service, with multiple connection properties configured for each Queue Manager in the cluster that holds the flow.

Message Affinity
Message affinity is a consideration that is unique to clustered environments. Message affinity occurs when messages that must be processed in a particular order arrive or are processed out of that order. Example:

In a trading system environment, a user’s sell message comes before the buy message.

Solution:

To help limit this behavior, messages can carry a unique ID in the message header to indicate grouping as well as order. IMPORTANT: It is not common practice to place the resequencing of these messages on the middleware software; the sending and receiving applications should be responsible for this algorithm.

Message Sizes
The message size for any given flow needs to be determined before the development and architecture of workflows and queues. By default, all messaging communication objects are set to allow up to a 4 MB message size. If a message in the flow is larger than 4 MB, the Queue Manager will log an error and will not allow the message through.


To overcome this issue, the MQCHLLIB/MQCHLTAB environment variables must be used. The following settings must also be modified to allow for the larger message(s) in the queue:
1. Client Connection Channel: Set the Maximum Message Length to the largest estimated message size (100 MB limit).
2. Local Queue: Set the Maximum Message Length to the largest message size (100 MB limit).


3. Queue Manager: The Queue Manager Maximum Message Length setting is key because it caps what the other objects can allow through. If the Queue Manager has a Maximum Message Length set to anything smaller than what is set on a Channel or a Local Queue, the message will fail. For large messaging systems, create a separate Queue Manager just for those flows. The maximum message size a Queue Manager can handle is 100 MB.
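As a hedged illustration, the three settings above could be applied with MQSC commands similar to the following sketch; the object and host names are placeholders, and 104857600 bytes corresponds to the 100 MB limit. Confirm the exact commands against your WebSphere MQ documentation before using them.

   * Raise the queue manager limit first; it caps what channels and queues can accept
   ALTER QMGR MAXMSGL(104857600)
   * Local queue that will hold the large messages
   DEFINE QLOCAL('PWC.LARGE.MSG.QUEUE') MAXMSGL(104857600) REPLACE
   * Client connection channel used by the PowerCenter Integration Service
   DEFINE CHANNEL('PWC.CLIENT.CHL') CHLTYPE(CLNTCONN) CONNAME('mqhost(1414)') MAXMSGL(104857600) REPLACE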


Example:

Consider a high volume application requiring PowerCenter to process a minimum of 200 messages per second, 24/7. One message has four segments, and each segment loads to a separate table. Three of the segments are optional and may not be present in a given message. The message is XML and must go through a midstream XML parser in order to get the separate data out for each table. If a midstream XML parser cannot segment the XML and load it to the correct database tables fast enough to keep up with the message flow, messages can back up and cause the Queue Manager to overflow.

Solution:

First estimate each message’s maximum size and then create a separate queue for each of the segments within the message. Create individual workflows to handle each queue and to load the data to the correct table. Then use an expression in PowerCenter to break out each segment and load it to the associated queue. For the optional segments, if they do not exist, there is nothing to load. Each workflow can then separately load the segmented XML into its own midstream XML parser and into the correct database.

Result:

Processing speed through PowerCenter increased to 400–450 messages per second.

Last updated: 27-May-08 13:07


Data Connectivity using PowerExchange for SAP NetWeaver

Challenge
Understanding how to install PowerExchange for SAP NetWeaver, extract data from SAP R/3, and load data into SAP R/3.

Description
SAP R/3 is ERP software that provides multiple business applications/modules, such as financial accounting, materials management, sales and distribution, human resources, CRM and SRM. The core R/3 system (BASIS layer) is programmed in Advanced Business Application Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP. PowerExchange for SAP NetWeaver can write/read/change data in R/3 via the BAPI/RFC and IDoc interfaces. The ABAP interface of PowerExchange for SAP NetWeaver can only read data from SAP R/3.
PowerExchange for SAP NetWeaver provides the ability to extract SAP R/3 data into data warehouses, data integration applications, and other third-party applications. All of this is accomplished without writing complex ABAP code. PowerExchange for SAP NetWeaver generates ABAP programs and is capable of extracting data from transparent tables, pool tables, and cluster tables. When integrated with R/3 using ALE (Application Link Enabling), PowerExchange for SAP NetWeaver can also extract data from R/3 using outbound IDocs (Intermediate Documents) in near real-time.
The ALE concept, available since R/3 Release 3.0, supports the construction and operation of distributed applications. It incorporates controlled exchange of business data messages while ensuring data consistency across loosely-coupled SAP applications. The integration of the various applications is achieved by using synchronous and asynchronous communication, rather than by means of a central database.
The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.

Communication Interfaces
TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:
● Common Program Interface-Communications (CPI-C). The CPI-C communication protocol enables online data exchange and data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and the SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to execute ABAP stream mode sessions.
● Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP programs, and running ABAP file mode sessions.
● Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to


another system. The transport system is primarily used to migrate code and configuration from development to QA and production systems. It can be used in the following cases:
   ❍ PowerExchange for SAP NetWeaver installation transports
   ❍ PowerExchange for SAP NetWeaver generated ABAP programs
Note: If the ABAP programs are installed in the $TMP development class, they cannot be transported from development to production. Ensure you have a transportable development class/package for the ABAP mappings.

Security
You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs to create authorizations, profiles, and users for PowerCenter users.

Integration Feature                     Authorization Object   Activity
Import Definitions, Install Programs    S_DEVELOP              All activities; also set Development Object ID to PROG
Extract Data                            S_TABU_DIS             READ
Run File Mode Sessions                  S_DATASET              WRITE
Submit Background Job                   S_PROGRAM              BTCSUBMIT, SUBMIT
Release Background Job                  S_BTCH_JOB             DELE, LIST, PLAN, SHOW; also set Job Operation to RELE
Run Stream Mode Sessions                S_CPIC                 All activities
Authorize RFC privileges                S_RFC                  All activities

You also need access to the SAP GUI, as described in the following SAP GUI Parameters table:

Parameter       Feature references to this variable   Comments
User ID         $SAP_USERID                           Identify the username that connects to the SAP GUI and is authorized for read-only access to the following transactions: SE12, SE15, SE16, SPRO
Password        $SAP_PASSWORD                         Identify the password for the above user
System Number   $SAP_SYSTEM_NUMBER                    Identify the SAP system number
Client Number   $SAP_CLIENT_NUMBER                    Identify the SAP client number
Server          $SAP_SERVER                           Identify the server on which this instance of SAP is running

Key Capabilities of PowerExchange for SAP NetWeaver
Some key capabilities of PowerExchange for SAP NetWeaver include:
● Extract data from SAP R/3 using the ABAP, BAPI/RFC and IDoc interfaces.
● Migrate/load data from any source into R/3 using the IDoc, BAPI/RFC and DMI interfaces.
● Generate DMI files ready to be loaded into SAP via SXDA TOOLS, LSMW, or SAP standard delivered programs.
● Support calling BAPI and RFC functions dynamically from PowerCenter for data integration. PowerExchange for SAP NetWeaver can make BAPI and RFC function calls dynamically from mappings to extract or load.
● Capture changes to master and transactional data in SAP R/3 using ALE. PowerExchange for SAP NetWeaver can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real time using ALE, install PowerExchange for SAP NetWeaver on PowerCenterRT.
● Provide rapid development of the data warehouse based on R/3 data using Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content that includes mappings, mapplets, source objects, targets, and transformations.
● Set partition points in a pipeline for outbound/inbound IDoc sessions; sessions that fail when reading outbound IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files and write data to inbound IDoc files.
● Insert ABAP code blocks to add functionality to the ABAP program flow and use static/dynamic filters to reduce returned rows.
● Customize the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table = table1-field1 = table2-field2, where the qualifying table is the last table in the condition based on the join order, including outer joins.
● Create ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program.
● Remove ABAP program information from SAP R/3 and the repository when a folder is deleted.
● Provide enhanced platform support by running on 64-bit AIX and HP-UX (Itanium). You can install PowerExchange for SAP NetWeaver for the PowerCenter Server and Repository Server on SuSE Linux or on Red Hat Linux.

Installation and Configuration Steps
PowerExchange for SAP NetWeaver setup programs install components for the PowerCenter Server, Client, and Repository Server. These programs install drivers, connection files, and a repository plug-in XML file that enables integration between PowerCenter and SAP R/3. Setup programs can also install PowerExchange for SAP NetWeaver Analytic Business Components and PowerExchange for SAP NetWeaver Metadata Exchange. The PowerExchange for SAP NetWeaver repository plug-in is called sapplg.xml. After the plug-in is installed, it needs to be registered in the PowerCenter repository.

For SAP R/3
Informatica provides a group of customized objects required for R/3 integration in the form of transport files. These objects include tables, programs, structures, and functions that PowerExchange for SAP NetWeaver exports to data files. The R/3 system administrator must use the transport control program, tp import, to transport these object files to the R/3 system. The transport process creates a development class called ZERP. The SAPTRANS directory contains “data” and “co” files. The “data” files are the actual transport objects. The “co” files are control files containing information about the transport request.
The R/3 system needs development objects and user profiles established to communicate with PowerCenter. Preparing R/3 for integration involves the following tasks:
● Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it makes a request to the R/3 system.
● Run the transport program that generates unique IDs.
● Establish profiles in the R/3 system for PowerCenter users.
● Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.

For PowerCenter
The PowerCenter Server and Client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:
● Run the installation programs on the PowerCenter Server and Client machines.
● Configure the connection files (a sketch of both files follows the summary of steps below):
   ❍ The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C communication with the R/3 system. The required parameters for sideinfo are:
      DEST - logical name of the R/3 system
      TYPE - set to A to indicate a connection to a specific R/3 system
      ASHOST - host name of the SAP R/3 application server
      SYSNR - system number of the SAP R/3 application server
   ❍ The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client. The required parameters for saprfc.ini are:
      DEST - logical name of the R/3 system
      LU - host name of the SAP application server machine
      TP - set to sapdp<system number>
      GWHOST - host name of the SAP gateway machine
      GWSERV - set to sapgw<system number>
      PROTOCOL - set to I for a TCP/IP connection
Following is a summary of the required steps:
1. Install PowerExchange for SAP NetWeaver on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure an SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.
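For illustration, minimal sideinfo and saprfc.ini entries built from the parameters listed above might look like the following sketch. The logical destination name, host names, and system number are placeholders and must match your own R/3 environment; treat this as an assumption-laden example rather than a definitive template.

   sideinfo:
      DEST=R3DEV
      TYPE=A
      ASHOST=r3app01.example.com
      SYSNR=00

   saprfc.ini:
      DEST=R3DEV
      LU=r3app01.example.com
      TP=sapdp00
      GWHOST=r3gw01.example.com
      GWSERV=sapgw00
      PROTOCOL=I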

Configuring the Services File

Windows
If SAPGUI is not installed, you must make entries in the services file to run stream mode sessions. The file is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries should be similar to the following:
   sapdp<system number>  <port number>/tcp
   sapgw<system number>  <port number>/tcp
Note: SAPGUI is not technically required, but experience has shown that evaluators typically want to log into the R/3 system to use the ABAP Workbench and to view table contents.

UNIX
The services file is located in /etc:
   sapdp<system number>  <port number>/tcp
   sapgw<system number>  <port number>/tcp
The system number and port numbers are provided by the BASIS administrator.
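As a hedged example, for system number 00 the conventional SAP ports are 3200 for the dispatcher and 3300 for the gateway, so the entries would typically look like the lines below; confirm the actual values with the BASIS administrator.

   sapdp00  3200/tcp
   sapgw00  3300/tcp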

Configure Connections to Run Sessions
Informatica supports two methods of communication between the SAP R/3 system and the PowerCenter Server:
● Stream Mode does not create any intermediate files on the R/3 system. This method is faster, but uses more CPU cycles on the R/3 system.
● File Mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server.

If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly


unlikely. If you want to use file mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:
● Provide the login and password for the UNIX account used to run the SAP R/3 system.
● Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.
● Create a directory on the machine running SAP R/3 and run “chmod g+s” on that directory. Provide the login and password for the account used to create this directory.

Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then configure an FTP connection to access staging files through FTP.

Extraction Process
R/3 source definitions can be imported from the logical tables using the RFC protocol. Extracting data from R/3 is a four-step process:
1. Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC. The Designer calls a function in the R/3 system to import source definitions.
Note: If you plan to join two or more tables in SAP, be sure you have optimized join conditions. Make sure you have identified your driving table (e.g., if you plan to extract data from the bkpf and bseg accounting tables, be sure to drive your extracts from the bkpf table). There is a significant difference in performance if the joins are properly defined.
2. Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.
3. Generate and install the ABAP program. You can install two types of ABAP programs for each mapping:
● File mode. Extracts data to a file. The PowerCenter Server accesses the file through FTP or an NFS mount. This mode is used for large extracts because there are timeouts set in SAP for long-running queries.
● Stream mode. Extracts data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication. This mode is preferred for short-running extracts.
You can modify the ABAP program block and customize it according to your requirements (e.g., if you want to extract data incrementally, create a mapping variable/parameter and use it in the ABAP program).
4. Create the session and run the workflow.
● Stream mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data as it is received.
● File mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.

Data Integration Using RFC/BAPI Functions

PowerExchange for SAP NetWeaver can generate RFC/BAPI function mappings in the Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes the RFC function calls on R/3 directly to process the R/3 data. It doesn’t have to generate and install the ABAP program for data extraction.

Data Integration Using ALE
PowerExchange for SAP NetWeaver can integrate PowerCenter with SAP R/3 using ALE. With PowerExchange for SAP NetWeaver, PowerCenter can generate mappings in the Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it doesn’t have to generate and install the ABAP program for data extraction.

Analytic Business Components
Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business logic to extract and transform R/3 data. It works in conjunction with PowerCenter and PowerExchange for SAP NetWeaver to extract master data, perform lookups, and provide documents and other fact and dimension data from the following R/3 modules:
● Financial Accounting
● Controlling
● Materials Management
● Personnel Administration and Payroll Accounting
● Personnel Planning and Development
● Sales and Distribution

Refer to the ABC Guide for complete installation and configuration information.

Last updated: 04-Jun-08 17:30


Data Connectivity using PowerExchange for Web Services

Challenge
Understanding PowerExchange for Web Services and configuring PowerCenter to access a secure web service.

Description
PowerExchange for Web Services is a service-oriented integration technology that can be used to bring application logic embedded in existing systems into the PowerCenter data integration platform. Leveraging the logic in existing systems is a cost-effective method for data integration. For example, insurance policy score calculation logic available in a mainframe application can be exposed as a web service and then used by PowerCenter mappings.
PowerExchange for Web Services (Web Services Consumer) allows PowerCenter to act as a web services client to consume external web services. PowerExchange for Web Services uses the Simple Object Access Protocol (SOAP) to communicate with the external web service provider. An external web service can be invoked from PowerCenter in three ways:
● Web Service source
● Web Service transformation
● Web Service target

In order to increase performance of message transmission, SOAP requests and responses can be compressed. Furthermore, pass-through partitioned sessions can be used for increasing parallelism in the case of large data volumes.

Web Service Source Usage
PowerCenter supports a request-response type of operation when using a Web Services source. The web service can be used as a source if the input in the SOAP request remains fairly constant (since input values for a web service source can only be provided at the source transformation level). Although Web services source definitions


can be created without using a WSDL, they can be edited in the WSDL workspace in PowerCenter Designer.

Web Service Transformation Usage
PowerCenter also supports a request-response type of operation when using a Web Services transformation. The web service can be used as a transformation if input data is available midstream and the response values will be captured from the web service. The following steps provide an example of invoking a stock quote web service to learn the price of each of the ticker symbols available in a flat file:
1. In the Transformation Developer, create a Web Services Consumer transformation.
2. Specify the URL for the stock quote WSDL and choose the get quote operation.
3. Connect the input port of this transformation to the field containing the ticker symbols.
4. To invoke the web service for each input row, change to source-based commit with an interval of 1. Also change the Transaction Scope to Transaction in the Web Services Consumer transformation.

Web Service Target Usage
PowerCenter supports a one-way type of operation when using a Web Services target. The web service can be used as a target if it is needed only to send a message (and no response is needed). PowerCenter only waits for the web server to start processing the message; it does not wait for the web server to finish processing the web service operation. Existing relational and flat file definitions can be used for the target definitions, or target columns can be defined manually.

PowerExchange for Web Services and Web Services Provider
PowerCenter Web Services Provider is a separate product from PowerExchange for Web Services. An advantage of using PowerCenter Web Services Provider is that it decouples the web service that needs to be consumed from the client. By using PowerCenter as the glue, changes can be made that are transparent to the client. This is useful because often there is no access to the client code or to the web service. Other considerations include:
● PowerCenter Web Services Provider acts as a service provider and exposes many key functionalities as web services.
● In PowerExchange for Web Services, PowerCenter acts as a web service client and consumes external web services. It is not necessary to install or configure Web Services Provider in order to use PowerExchange for Web Services.
● Web services exposed through PowerCenter have two formats that can be invoked by different kinds of client programs (e.g., C#, Java, .NET) by using the WSDL that can be generated from the Web Services Hub:
   ❍ Real-time: In real-time mode, web-enabled workflows are exposed. The Web Services Provider must be used and pointed to the workflow that is going to be invoked as a web service. Workflows can be started and protected.
   ❍ Batch: In batch mode, a pre-set group of services is exposed to run and monitor workflows in PowerCenter. This feature can be used for reporting and monitoring purposes.

Last but not least, PowerCenter’s open architecture facilitates HTTP and HTTPS requests with an HTTP transformation for GET, POST, and SIMPLE POST methods to read data from or write data to an HTTP server.

Configuring PowerCenter to Invoke a Secure Web Service
Secure Sockets Layer (SSL) is used to provide security features such as authentication and encryption to web services applications. The authentication certificates follow the Public Key Infrastructure (PKI) standard, a system of digital certificates provided by certificate authorities to verify and authenticate parties of Internet communications or transactions. These certificates are managed in the following two keystore files:
● Trust store. A trust store holds the public keys for the entities it can trust. The Integration Service uses the entries in the trust store file to authenticate the external web services servers.
● Client store. A client store holds both the entity’s public and private keys. The Integration Service sends the entries in the client store file to the web services provider so that the web services provider can authenticate the Integration Service.

By default, the trust certificates file is named ca-bundle.crt and contains certificates issued by major, trusted certificate authorities. The ca-bundle.crt file is located in /server/bin.


SSL authentication can be performed in three ways:
● Server authentication
● Client authentication
● Mutual authentication

All of the SSL authentication configurations can be set by entering values for Web Service application connections in Workflow Manager.

Server Authentication: Since the web service provider is the server and the Integration Service is the client, the web service provider is responsible for authenticating the Integration Service. The Integration Service sends the web service provider a client certificate file containing a public key and the web service provider verifies this file. The client certificate file and the corresponding private key file should be configured for this option.

Client Authentication: Since the Integration Service is the client of the web service provider, it establishes an SSL session to authenticate the web service provider. The Integration Service verifies that the authentication certificate sent by the web service provider exists in the trust certificates file. The trust certificates file should be configured for this option.

Mutual Authentication: The Integration Service and web service provider exchange certificates and verify each other. For this option, the trust certificates file, the client certificate, and the corresponding private key file should be configured.

Converting Other Formats of Certificate Files
There are a number of other formats of certificate files available: DER format (.cer and .der extensions), PEM format (.pem extension), and PKCS#12 format (.pfx or .p12 extension). The private key for a client certificate must be in PEM format. Files can be converted from one certificate format to another using the OpenSSL utility. Refer to the OpenSSL documentation for complete information on such conversions. A few examples are given below:


To convert from DER to PEM (assuming there is a DER file called server.der):
   openssl x509 -in server.der -inform DER -out server.pem -outform PEM
To convert a PKCS12 file called server.pfx to PEM:
   openssl pkcs12 -in server.pfx -out server.pem

Web Service Performance Tips
Web services communication takes place in the form of XML documents, and performance is affected by the type of requests being transmitted. Below are some tips that can help improve performance:
● Avoid frequent transmissions of huge data elements.
● The nesting of elements in a SOAP request has a significant effect on performance. Run these requests in verbose data mode in order to check for this.
● When data is being retrieved for aggregation purposes or for financial calculations (i.e., not real-time), shift those requests to non-peak hours to improve response time.
● Capture the response time for each request sent by using SYSDATE in an expression before the web service transformation and in an expression after it. This shows the true latency, which can then be averaged to determine scaling needs.
● Try to limit the number of web service calls (when possible). If you are using the same calls multiple times to return pieces of information for different targets, it is better to return a complete set of results with a unique ID and then stage the sourcing for the different targets.
● Sending simple datatypes (e.g., integer, float, string) improves performance.

Last updated: 27-May-08 16:45


Data Migration Principles

Challenge
A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica’s suite of products provides functionality and process to minimize the cost of the migration, lower risk and increase the probability of success (i.e., completing the project on-time and on-budget).
In this Best Practice we discuss basic principles for data migration that lower project time, development effort, risk, and the total cost of ownership. These principles include:
1. Leverage staging strategies
2. Utilize table-driven approaches
3. Develop via modular design
4. Focus on re-use
5. Common exception handling processes
6. Multiple simple processes versus few complex processes
7. Take advantage of metadata

Description

Leverage Staging Strategies
As discussed elsewhere in Velocity, it is recommended to employ both a legacy staging area and a pre-load staging area in data migration. The reason is simple: it provides the ability to pull data from the production system and use it for data cleansing and harmonization activities without interfering with the production systems. By leveraging this type of strategy you are able to see real production data sooner and follow the guiding principle of ‘Convert Early, Convert Often, and with Real Production Data’.

Utilize Table Driven Approaches

Developers frequently find themselves in positions where they need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable transformations during a Data Migration. These transformations often have a probability to change over time. Without a table-driven approach, this causes code changes, bug fixes, re-testing, and re-deployments during the development effort. This work is often unnecessary and could be avoided with the use of configuration or reference data tables. It is recommended to use table-driven approaches such as these whenever possible. Some common table-driven approaches include:
● Default Values – hard-coded values for a given column, stored in a table where the values can be changed whenever a requirement changes. For example, if you have a hard-coded value of NA for any value not populated and then want to change that value to NV, you can simply change the value in a default value table rather than change numerous hard-coded values.
● Cross-Reference Values – frequently in data migration projects there is a need to take values from the source system and convert them to the values of the target system. These values are usually identified up front, but as the source system changes, additional values are also needed. In a typical mapping development situation this would require adding additional values to a series of IIF or DECODE statements. With a table-driven solution, new data can be added to a cross-reference table and no coding, testing, or deployment is required (a minimal sketch follows this list).
● Parameter Values – by using a table-driven parameter file you can reduce the need for scripting and accelerate the development process.
● Code-Driven Table – in some instances a set of understood rules is known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if you had a rules table keyed by table/column/rule ID, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table could change for the column and no additional coding would be required.

Develop Via Modular Design
As part of the migration methodology, modular design is encouraged. Modular design is the act of developing a standard way in which similar mappings should function. These standards are then published as templates, and developers are required to build similar mappings in that same manner. This provides rapid development, increases efficiency for testing, and increases ease of maintenance, resulting in a dramatically lower total cost of ownership.


Focus On Re-Use
Re-use should always be considered during Informatica development. However, due to such a high degree of repeatability, re-use is paramount to success on data migration projects. There is often tremendous opportunity for re-use of mappings, strategies, processes, scripts, and testing documents. This reduces the staff time for migration projects and lowers project costs.

Common Exception Handling Processes
The Velocity Data Migration Methodology is iterative by intent: new data quality rules are added as problems are found in the data. Because of this, it is critical to find data exceptions and write appropriate rules to correct them throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions and to deploy that common method across all data migration processes.

Multiple Simple Processes versus Few Complex Processes
For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex processes.

Take Advantage of Metadata
The Informatica data integration platform is highly metadata driven. Take advantage of those capabilities on data migration projects. This can be done via a host of reports against the data integration repository, such as reports that:
1. Illustrate how the data is being transformed (i.e., lineage reports)
2. Illustrate who has access to what data (i.e., security group reports)
3. Illustrate what source or target objects exist in the repository
4. Identify how many mappings each developer has created
5. Identify how many sessions each developer has run during a given time period
6. Identify how many successful/failed sessions have been executed

In summary, these design principles provide significant benefits to data migration


projects and add to the large set of typical best practice items that are available in Velocity. The key to data migration projects is to architect well, design better, and execute best.
Last updated: 01-Feb-07 18:52


Data Migration Project Challenges

Challenge
A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica’s suite of products provides functionality and process to minimize the cost of the migration, lower risk and increase the probability of success (i.e., completing the project on-time and on-budget). This best practice discusses the three main data migration project challenges:
1. Specifications that are incomplete, inaccurate, or not completed on time.
2. Data quality problems impacting project timelines.
3. Difficulties in project management executing the data migration project.

Description
Unlike other Velocity Best Practices, we will not specify the full solution to each challenge. Rather, it is more important to understand these three challenges and take action to address them throughout the implementation.

Migration Specifications
During the execution of data migration projects, a challenge that projects always encounter is problems with migration specifications. Projects require the completion of functional specs to identify what is required of each migration interface. Definitions:



● A migration interface is defined as one to many mappings/sessions/workflows or scripts used to migrate a data entity from one source system to one target system.
● A Functional Requirements Specification is normally comprised of a document covering details including security, database join needs, audit needs, and primary contact details. These details are normally at the interface level rather than at the column level. It also includes a Target-Source Matrix, which identifies details at the column level such as how source tables/columns map to target tables/columns, business rules, data cleansing rules, validation rules, and other column-level specifics.

Many projects attempt to complete these migrations without these types of specifications. Often these projects have little to no chance to complete on-time or on-budget. Time and subject matter expertise


is needed to complete this analysis; this is the baseline for project success. Projects are disadvantaged when functional specifications are not completed on time. Developers can be in a wait mode for extended periods when these specs are not completed at the time specified by the project plan. Another project risk occurs when the right individuals are not used to write these specs, or when inappropriate levels of importance are applied to this exercise. These situations cause inaccurate or incomplete specifications, which prevent data integration developers from successfully building the migration processes. To address the specification challenge, migration projects must have specifications that are completed with accuracy and delivered on time.

Data Quality
Most projects are affected by data quality due to the need to address problems in the source data that fit into the six dimensions of data quality:

Data Quality Dimension   Description
Completeness             What data is missing or unusable?
Conformity               What data is stored in a non-standard format?
Consistency              What data values give conflicting information?
Accuracy                 What data is incorrect or out of date?
Duplicates               What data records or attributes are repeated?
Integrity                What data is missing or not referenced?

Data migration data quality problems are typically worse than planned for. Projects need to allow enough time to identify and fix data quality problems BEFORE loading the data into the new target system. Informatica’s data integration platform provides data quality capabilities that can help identify data quality problems in an efficient manner, but subject matter experts are required to determine how these data problems should be addressed within the business context and process.

Project Management

Project managers are often disadvantaged on these types of projects because the projects are typically much larger, more expensive, and more complex than any prior project they have been involved with. They need to understand early in the project the importance of correctly completed specifications and of addressing data quality, and they need to establish a set of tools to accurately and objectively plan the project and evaluate progress. Informatica’s Velocity Migration Methodology, its tool sets, and the metadata reporting capabilities are key to addressing these project challenges. The key is to fully understand the pitfalls early in the project, how PowerCenter and Informatica Data Quality can address these challenges, and how metadata reporting can provide objective information on project status.
In summary, data migration projects are challenged by specification issues, data quality issues, and project management difficulties. By understanding the Velocity Methodology focus on data migration and how Informatica’s products can address these challenges, they can be minimized for a successful migration.

Last updated: 01-Feb-07 18:52


Data Migration Velocity Approach

Challenge
A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica’s suite of products provides functionality and process to minimize the cost of the migration, lower risk and increase the probability of success (i.e., completing the project on-time and on-budget). To meet these objectives, a set of best practices focused on Data Migration has been provided in Velocity. This Best Practice provides an overview of how to use Informatica’s products in an iterative methodology to expedite a data migration project. The keys to the methodology are further discussed in the Best Practice Data Migration Principles.

Description
The Velocity approach to data migration is illustrated here. While it is possible to migrate data in one step, it is more productive to break these processes up into two or three simpler steps. The goal for data migration is to get the data into the target application as early as possible for large-scale implementations. Typical implementations will have three to four trial cutovers or mock runs before the final implementation, or ‘Go-Live’. The mantra for the Informatica-based migration is to ‘Convert Early, Convert Often, and Convert with Real Production Data.’ To do this, the following approach is encouraged:

Analysis
In the analysis phase the functional specifications are completed; these include both the functional spec document and the target-source matrix. See the Best Practice Data Migration Project Challenges for related information.

Acquire

In the acquire phase the target-source matrix is reviewed and all source systems/tables are identified. These tables are used to develop one mapping per source table to populate a mirrored structure in a legacy staging schema. For example, if there were 50 source tables identified across all the Target-Source Matrix documents, 50 legacy tables would be created and one mapping would be developed for each table. It is recommended to perform the initial development against test data but, once complete, to run a single extract of the current production data. This assists in addressing data quality problems without impacting production systems. It is recommended to run these extracts in low-use time periods and with the cooperation of the operations group responsible for these systems. It is also recommended to take advantage of the Visio Generation Option if available. These mappings are very straightforward, and the use of auto-generation can increase consistency and lower the required staff time for the project.

Convert
In this phase data is extracted from the legacy stage tables (merged, transformed, and cleansed) to populate a mirror of the target application. As part of this process a standard exception process should be developed to capture exceptions and expedite data cleansing activities. The results of this convert process should be profiled, and appropriate data quality scorecards should be reviewed. During the convert phase the basic set of exception tests should be executed, with exception details collected for future reporting and correction. The basic exception tests include:
1. Data Type
2. Data Size
3. Data Length
4. Valid Values
5. Range of Values

Exception Type    Exception Description
Data Type         Will the source data value load correctly to the target data type (e.g., a numeric date loading into an Oracle date type)?
Data Size         Will a numeric value from the source load correctly to the target column, or will a numeric overflow occur?
Data Length       Is the input value too long for the target column? (This is appropriate for all data types but of particular interest for string data types. For example, in one system a field could be char(256) but most of the values are char(10). In the target the new field is varchar(20), so any value over char(20) should raise an exception.)
Range of Values   Is the input value within a tolerable range for the new system? (For example, does the birth date for an insurance subscriber fall between Jan 1, 1900 and Jan 1, 2006? If this test fails, the date is unreasonable and should be addressed.)
Valid Values      Is the input value in the list of valid values in the target system? (For example, does the state code for an input record match the list of states in the new target system? If not, the data should be corrected prior to entry into the new system.)
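For illustration only, the five tests can be sketched as simple record-level checks like the following Python snippet. The field names, formats, and thresholds are hypothetical, and in practice these rules would be implemented in PowerCenter or Informatica Data Quality rather than hand-written code.

   from datetime import datetime

   VALID_STATES = {"CA", "NY", "TX"}          # hypothetical valid-value list from the target system
   MIN_BIRTH = datetime(1900, 1, 1)
   MAX_BIRTH = datetime(2006, 1, 1)

   def basic_exception_tests(rec: dict) -> list:
       """Return a list of (exception_type, detail) tuples for one source record."""
       exceptions = []
       # Data Type: will the numeric date load into a real date column?
       try:
           birth = datetime.strptime(rec["birth_date"], "%Y%m%d")
       except ValueError:
           exceptions.append(("Data Type", "birth_date is not a valid date: " + rec["birth_date"]))
           birth = None
       # Data Size: numeric overflow against a NUMBER(10) target column
       if abs(int(rec["amount"])) >= 10**10:
           exceptions.append(("Data Size", "amount overflows NUMBER(10)"))
       # Data Length: string longer than the varchar(20) target column
       if len(rec["name"]) > 20:
           exceptions.append(("Data Length", "name longer than 20 characters"))
       # Range of Values: unreasonable birth date
       if birth and not (MIN_BIRTH <= birth <= MAX_BIRTH):
           exceptions.append(("Range of Values", "birth_date out of range: " + rec["birth_date"]))
       # Valid Values: state code must exist in the target list
       if rec["state"] not in VALID_STATES:
           exceptions.append(("Valid Values", "unknown state code: " + rec["state"]))
       return exceptions

   print(basic_exception_tests(
       {"birth_date": "18991231", "amount": "125", "name": "A" * 25, "state": "ZZ"}))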

Once the profiling exercises, exception reports and data quality scorecards are complete, a list of data quality issues should be created. This list should then be reviewed with the functional business owners to generate new data quality rules to correct the data. These details should be added to the specification, and the original convert process should be modified with the new data quality rules. The convert process, along with the profiling, exception reporting and data scorecarding, should then be re-executed until the data is correct and ready to load to the target application.

Migrate
In the migrate phase the data from the convert phase is loaded to the target application. The expectation is that there should be no failures on these loads; the data should have been corrected in the convert phase prior to loading the target application. Once the migrate phase is complete, validation should occur. It is recommended to complete an audit/balancing step prior to validation; this is discussed in the Best Practice Build Data Audit/Balancing Processes. Additional detail about these steps is provided in the Best Practice Data Migration Principles.
Last updated: 06-Feb-07 12:08


Build Data Audit/Balancing Processes

Challenge
Data Migration and Data Integration projects are often challenged to verify that the data in an application is complete; more specifically, to verify that all the appropriate data was extracted from a source system and propagated to its final target. This Best Practice illustrates how to do this in an efficient and repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are highly regulated internally and externally or that have to comply with a host of government compliance regulations such as Sarbanes-Oxley, BASEL II, HIPAA, the Patriot Act, and many others.

Description
The common practice for audit and balancing solutions is to produce a set of common tables that can hold various control metrics regarding the data integration process. Ultimately, business intelligence reports provide insight at a glance to verify that the correct data has been pulled from the source and completely loaded to the target. Each control measure that is being tracked requires development of a corresponding PowerCenter process to load the metrics to the Audit/Balancing Detail table. To drive out this type of solution, execute the following tasks:
1. Work with business users to identify what audit/balancing processes are needed. Some examples:
   a. Customers – (Number of Customers or Number of Customers by Country)
   b. Orders – (Qty of Units Sold or Net Sales Amount)
   c. Deliveries – (Number of Shipments, Qty of Units Shipped, or Value of All Shipments)
   d. Accounts Receivable – (Number of Accounts Receivable Shipments or Total Accounts Receivable Outstanding)
2. For each process defined in step 1, define which columns should be used for tracking purposes in both the source and target systems.
3. Develop a data integration process that reads from the source system and populates the detail audit/balancing table with the control totals.
4. Develop a data integration process that reads from the target system and populates the detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that queries the audit/balancing table and identifies whether the source and target entries match or whether there is a discrepancy.
An example audit/balancing table definition looks like this:

Audit/Balancing Details


Column Name        Data Type      Size
AUDIT_KEY          NUMBER         10
CONTROL_AREA       VARCHAR2       50
CONTROL_SUB_AREA   VARCHAR2       50
CONTROL_COUNT_1    NUMBER         10
CONTROL_COUNT_2    NUMBER         10
CONTROL_COUNT_3    NUMBER         10
CONTROL_COUNT_4    NUMBER         10
CONTROL_COUNT_5    NUMBER         10
CONTROL_SUM_1      NUMBER (p,s)   10,2
CONTROL_SUM_2      NUMBER (p,s)   10,2
CONTROL_SUM_3      NUMBER (p,s)   10,2
CONTROL_SUM_4      NUMBER (p,s)   10,2
CONTROL_SUM_5      NUMBER (p,s)   10,2
UPDATE_TIMESTAMP   TIMESTAMP
UPDATE_PROCESS     VARCHAR2       50

Control Column Definition by Control Area/Control Sub Area

Column Name        Data Type   Size
CONTROL_AREA       VARCHAR2    50
CONTROL_SUB_AREA   VARCHAR2    50
CONTROL_COUNT_1    VARCHAR2    50
CONTROL_COUNT_2    VARCHAR2    50
CONTROL_COUNT_3    VARCHAR2    50
CONTROL_COUNT_4    VARCHAR2    50
CONTROL_COUNT_5    VARCHAR2    50
CONTROL_SUM_1      VARCHAR2    50
CONTROL_SUM_2      VARCHAR2    50
CONTROL_SUM_3      VARCHAR2    50
CONTROL_SUM_4      VARCHAR2    50
CONTROL_SUM_5      VARCHAR2    50
UPDATE_TIMESTAMP   TIMESTAMP
UPDATE_PROCESS     VARCHAR2    50

The following screenshot shows a single mapping that populates both the source and target values:


The following two screenshots show how two mappings could be used to provide the same results:


Note: One key challenge is how to capture the appropriate control values from the source system if it is continually being updated. The first example with one mapping will not work due to the changes that occur in the time between the extraction of the data from the source and the completion of the load to the target application. In those cases you may want to take advantage of an aggregator transformation to collect the appropriate control totals as illustrated in this screenshot:

The following are two straw-man examples of an Audit/Balancing Report, which is the end result of this type of process:

Data Area    Leg count   TT count   Diff   Leg amt     TT amt      Diff
Customer     11000       10099      1      0
Orders       9827        9827       0      11230.21    11230.21    0
Deliveries   1298        1288       0      21294.22    21011.21    283.01
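The reporting mechanism described in task 5 can be as simple as a query that pairs the source-side and target-side rows for each control area. The sketch below assumes Oracle syntax, the audit/balancing table sketched earlier, and a hypothetical naming convention in which UPDATE_PROCESS identifies whether the row was written by the source-side or the target-side load:

SELECT src.CONTROL_AREA,
       src.CONTROL_SUB_AREA,
       src.CONTROL_COUNT_1 AS source_count,
       tgt.CONTROL_COUNT_1 AS target_count,
       src.CONTROL_COUNT_1 - tgt.CONTROL_COUNT_1 AS count_diff,
       src.CONTROL_SUM_1 AS source_amt,
       tgt.CONTROL_SUM_1 AS target_amt,
       src.CONTROL_SUM_1 - tgt.CONTROL_SUM_1 AS amt_diff
FROM   AUDIT_BALANCING_DETAIL src
JOIN   AUDIT_BALANCING_DETAIL tgt
  ON   tgt.CONTROL_AREA = src.CONTROL_AREA
 AND   tgt.CONTROL_SUB_AREA = src.CONTROL_SUB_AREA
WHERE  src.UPDATE_PROCESS LIKE 'SRC%'
  AND  tgt.UPDATE_PROCESS LIKE 'TGT%';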

In summary, there are two big challenges in building audit/balancing processes:

1. Identifying what the control totals should be.
2. Building processes that will collect the correct information at the correct granularity.

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing needs. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions and still provide risk reduction by having this type of solution in place.


Continuing Nature of Data Quality

Challenge

A data quality (DQ) project usually begins with a specific use case in mind, such as resolving data quality issues as part of a data migration effort or attempting to reconcile data acquired as part of a merger or acquisition. Regardless of the specific need, a data quality project should be planned as an iterative process. Because change is always prevalent, data quality should never be considered an absolute, and an organization must be cognizant of its continuing nature whenever undertaking a project that involves data quality. The goal of this Best Practice is to set forth principles that outline the iterative nature of data quality and the steps that should be considered when planning a data quality initiative. Experience has shown that applying these principles and steps maximizes the potential for ongoing success in data quality projects.

Description

The reasons for considering data quality an iterative process stem from two core concepts. First, the level of sophistication around data quality will continue to improve as a DQ process is implemented. Specifically, as the results are disseminated throughout the organization, it becomes easier to make decisions on the types of rules and standards that should be implemented, since everyone is working from a single view of the truth. Although everyone may not agree on how data is being entered or identified, the baseline analysis identifies the standards (or lack thereof) currently in place and provides a starting point to work from.

Once the initial data quality process is implemented, the iterative nature begins. Users become more familiar with the data as they review the results of the data quality plans that standardize, cleanse, and de-duplicate it. As each iteration continues, the data stewards should determine whether the business rules and reference dictionaries initially put into place need to be modified to effectively address any new issues that arise.

The second reason that data quality continues to evolve is that the data does not remain static. Although a baseline set of data quality rules will eventually be agreed upon, legacy data that has been cleansed, standardized, and de-duplicated will ultimately change. This change could come from a user updating a record or from a new data source being introduced that needs to become part of the master data. In either case, additional iterations on the updated records and/or new sources should be considered. The frequency of these iterations varies and is ultimately driven by the processes for data entry and manipulation within an organization; this can result in anything from a need to cleanse data in real time to a nightly or weekly batch process. Regardless, scorecards should be monitored to determine whether the business rules initially implemented need to be modified or whether they continue to meet the organization's data quality needs.

The questions that should be considered when evaluating the continuing and iterative nature of data quality include:

● Are the business rules and reference dictionaries meeting the needs of the organization when attempting to report on the underlying data?






● If a new data source is introduced, can the same data quality rules be applied, or do new rules need to be developed to meet the type of data found in this new source?
● From a trend perspective, is the quality of data improving over time? If not, what needs to be done to remedy the situation?

The answers to these questions will provide a framework to measure the current level of success achieved in implementing an iterative data quality initiative. Just as data quality should be viewed as iterative, so should these questions. They should be reflected upon frequently to determine if changes are needed to how data quality is implemented within the environment, or to the underlying business rules within a specific DQ process.

Although the reasons to iterate through the data may vary, the following steps will be prevalent in each iteration:

1. Identify the problematic data element that needs to be addressed. This problematic data could include bad addresses, duplicate records, or incomplete data elements, among other examples.
2. Define the data quality rules and targets that need to be resolved. This includes rules for specific sources and content around which data quality areas are being addressed.
3. Design data quality plans to correct the problematic data. This could be one or many data quality plans, depending upon the scope and complexity of the source data.
4. Implement quality improvement processes to identify problematic data on an ongoing basis. These processes should detect data anomalies which could lead to known and unknown data problems.
5. Monitor and repeat. This is done to ensure that the data quality plans correct the data to desired thresholds. Since data quality definitions can be adjusted based on business and data factors, this iterative review is essential to ensure that the stakeholders understand what will change with the data as it is cleansed and how that cleansed data may affect existing business processes and management reporting.

Example of the Iterative Process


As noted in the above diagram, the iterative data quality process will continue to be leveraged within an organization as new master data is introduced. By having defined processes in place up front, the ability to effectively leverage the data quality solution is enhanced. The departments charged with implementing and monitoring data quality will be doing so within the confines of the enterprise-wide rules and procedures that have been identified for the organization. The following points should be considered as an expansion of the five steps noted above:

1. Identify and Measure Data Quality: This first point is key. The ability to understand the data within the confines of the six dimensions of data quality forms the foundation for the business rules and processes that will be put in place. Without performing an upfront assessment, the ability to effectively implement a data quality strategy will be negatively impacted. From an ongoing perspective, the data quality assessment allows an organization to see how the data quality procedures put into place have caused the quality of the data to improve. Additionally, as new data enters the organization, the assessment provides key information for making ongoing modifications to the data quality processes.

2. Define Data Quality Rules and Targets: Once the assessment is complete, the second part of the analysis phase involves scorecarding the results in order to put into place success criteria and metrics for the data quality management initiative. From an ongoing perspective, this phase involves performing trend analysis on the data and the rules in place to ensure the data continues to conform to the rules that were put into place during the data quality management initiative.

3. Design Quality Improvement Processes: This phase involves the manipulation of the data to align it with the business rules put into place. Examples of potential improvements include standardization, removing noise, aligning product attributes, and implementing measures or classifications.

4. Implement Quality Improvement Processes: Once the data has been standardized, an adjunct to the enhancement process involves the identification of duplicate data and taking action based upon the business rules that have been identified. The rules to identify and address duplicate data will continue to evolve. This evolution occurs as data stewards become more familiar with the data and as the policies and procedures set in place by the data governance committee become widely adopted throughout the organization. As this occurs, the ability to find additional duplicates or new relationships within the data begins to arise.

5. Monitor Data Quality versus Targets: The ability to monitor the data quality processes is critical, as it provides the organization with a quick snapshot of the health of the data. Through analysis of the scorecard results, the data governance committee will have the information needed to confidently make additional modifications to the data quality strategies in place, if needed. Conversely, the scorecards and trend analysis results provide the peace of mind that data quality is being effectively addressed within the organization.

Last updated: 20-May-08 22:53


Data Cleansing

Challenge

Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005 study by the Gartner Group stated that the majority of currently planned data warehouse projects will suffer limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of attention to data quality. Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues. Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management.

It is essential that data quality issues are tackled during any large-scale data project to enable project success and future organizational success. Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure that all data entering the organizational data stores provides for consistent and reliable decision-making.

Description

A significant portion of time in the project development process should be dedicated to data quality, including the implementation of data cleansing processes. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are integrated into the environment. There should also be provision for rolling back if data quality testing indicates that the data is unacceptable.

Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a complete solution for identifying and resolving all types of data quality problems and preparing data for the consolidation and load processes.

Concepts

Following are some key concepts in the field of data quality. These concepts provide a foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to consolidation.

Profiling and Analysis - whereas data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practices recommend the use of IDQ for such purposes.

Note: The remaining items in this document are therefore discussed in the context of IDQ usage.


Parsing - the process of extracting individual elements within the records, files, or data entry forms in order to check the structure and content of each field and to create discrete fields devoted to specific information types. Examples may include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - refers to arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs. For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - refers to adding useful, but optional, information to existing data or complete data. Examples may include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data sources to check and validate information. Example: validating addresses with postal directories.

Matching and De-duplication - refers to removing, or flagging for removal, redundant or poor-quality records where high-quality records of the same information exist. Use matching components and business rules to identify records that may refer, for example, to the same customer. For more information, see the Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or approved data into a single, consolidated view. Examples are building a best record or master record, or house-holding.

Informatica Applications

The Informatica Data Quality software suite has been developed to resolve a wide range of data quality issues, including data cleansing. The suite comprises the following elements:

● IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).








● IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.
● IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.

IDQ stores all its processes as XML in the Data Quality repository (MySQL). IDQ Server enables the creation and management of multiple repositories.

Using IDQ in Data Projects

IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its own applications or to provide them for addition to PowerCenter transformations. Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is, Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input components, output components, and operational components. Plans can perform analysis, parsing, standardization, enhancement, validation, matching, and consolidation operations on the specified data. Plans are saved into projects that can provide a structure and sequence to your data quality endeavors. The following figure illustrates how data quality processes can function in a project setting:


In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile and easy to use dashboards to communicate data quality metrics to all interested parties. In stage 2, you verify the target levels of quality for the business according to the data quality measurements taken in stage 1, and in accordance with project resourcing and scheduling. In stage 3, you use Workbench to design the data quality plans and projects to achieve the targets. Capturing business rules and testing the plans are also covered in this stage. In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy plans and resources to remote repositories and file systems through the user interface. If you are running Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which data cleansing and other data quality tasks are performed on the project data. In stage 5, you’ll test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized. In a large data project, you may find that data quality processes of varying sizes and impact are necessary at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the level of unit testing required.

Using the IDQ Integration


Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality repository and import data quality plans to a PowerCenter transformation. With the Integration component, you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ Workbench or Server. The Integration interacts with PowerCenter in two ways:

● On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plans' functional details are saved as XML in the PowerCenter repository.
● On the PowerCenter server side, it enables the PowerCenter Server (or Integration Service) to send data quality plan XML to the Data Quality engine for execution.

The Integration requires that at least the following IDQ components are available to PowerCenter:

● Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
● Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan instructions.

An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North American name and postal address records. The Integration component enables the following process:

● Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality repository.
● The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.
● The PowerCenter Designer user saves the transformation and the mapping containing it to the PowerCenter repository. The plan information is saved with the transformation as XML.

The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant source data and plan information will be sent to the Data Quality engine, which processes the data (in conjunction with any reference data files used by the plan) and returns the results to PowerCenter.

Last updated: 06-Feb-07 12:43


Data Profiling

Challenge

Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This Best Practice is intended to provide an introduction to its usage for new users. Bear in mind that Informatica's Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following Velocity Best Practice documents for more information:

● Data Cleansing
● Using Data Explorer for Data Discovery and Analysis

Description

Creating a Custom or Auto Profile

The data profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data.

An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct value and null value counts, and min, max, and average (if numeric) at the column level. Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short amount of time.

A custom data profile is useful when there is a specific question about a source, such as a business rule you want to validate or a pattern you want to verify the data matches.

Setting Up the Profile Wizard

To customize the profile wizard for your preferences:

● Open the Profile Manager and choose Tools > Options.
● If you are profiling data using a database user that is not the owner of the tables to be sourced, check the "Use source owner name during profile mapping generation" option.
● If you are in the analysis phase of your project, choose "Always run profile interactively" since most of your data profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data profiles are useful in these phases.)

Running and Monitoring Profiles

Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking "Configure Session" on the "Function-Level Operations" tab of the wizard.

● Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration parameters.
● For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.

Generating and Viewing Profile Reports

Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.

For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer schema and reports xml file. The xml files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client installation. You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques

Four types of sampling techniques are available with the PowerCenter data profiling option:

● No sampling: Uses all source data. Appropriate for relatively small data sources.
● Automatic random sampling: PowerCenter determines the appropriate percentage to sample, then samples random rows. Appropriate for larger data sources where you want a statistically significant data analysis.
● Manual random sampling: PowerCenter samples random rows of the source data based on a user-specified percentage. Use this to sample more or fewer rows than the automatic option chooses.
● Sample first N rows: Samples the number of user-selected rows. Provides a quick readout of a source (e.g., the first 200 rows).

Profile Warehouse Administration

Updating Data Profiling Repository Statistics

The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script that is generated and run it.

ORACLE

select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';

Microsoft SQL Server

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX

select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'

IBM DB2

select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;' from syscat.tables where tabname like 'PMDP%'

TERADATA

select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'

In this query, database_name is the name of the repository database.

Purging Old Data Profiles

Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse > Connect and connect to the profiling warehouse. Choose Target Warehouse > Purge to open the purging tool.

Last updated: 01-Feb-07 18:52


Data Quality Mapping Rules

Challenge

Use PowerCenter to create data quality mapping rules to enhance the usability of the data in your system.

Description

The issue of poor data quality is one that frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users. This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets.

Bear in mind that you can augment or supplant the data quality handling capabilities of PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite dedicated to data quality issues. Data analysis and data enhancement processes, or plans, defined in IDQ can deliver significant data quality improvements to your project data. A data project that has built-in data quality steps, such as those described in the Analyze and Design phases of Velocity, enjoys a significant advantage over a project that has not audited and resolved issues of poor data quality. If you have added these data quality steps to your project, you are likely to avoid the issues described below. A description of the range of IDQ capabilities is beyond the scope of this document. For a summary of Informatica's data quality methodology, as embodied in IDQ, consult the Best Practice Data Cleansing.

Common Questions to Consider

Data integration/warehousing projects often encounter general data problems that may not merit a full-blown data quality project, but which nonetheless must be addressed. This document discusses some methods to ensure a base level of data quality; much of the content discusses specific strategies to use with PowerCenter. The quality of data is important in all types of projects, whether it be data warehousing, data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project's requirements and the business users that are being serviced. Ideally, these questions should be addressed during the Design and Analyze Phases of the project because they can require a significant amount of re-coding if identified later. Some of the areas to consider are:

Text Formatting

The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its "raw" format without any capitalization, trimming, or formatting applied to it. This is easily achievable, as it is the default behavior, but there is danger in taking this requirement literally, since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems. One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the "raw" data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication. Another possibility is to explain to the users that "raw" data in unique, identifying fields is not as clean and consistent as data in a common format; in other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results, since the spaces are stored as part of the data value. The project team must understand how spaces are handled from the source systems to determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.) Remember that certain RDBMS products use the data type CHAR, which stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.
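As a minimal sketch of the standardized key columns described above, an Expression transformation port can apply uppercasing and trimming with standard PowerCenter functions; the port names here are hypothetical:

OUT_CUST_NAME_STD = UPPER(LTRIM(RTRIM(IN_CUST_NAME)))
OUT_ADDR_LINE1_STD = UPPER(LTRIM(RTRIM(IN_ADDR_LINE1)))

The "raw" ports are still passed through to the target unchanged; only the additional standardized ports are used when identifying uniqueness.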


Note that many fixed-width files do not use a null character as a space. Therefore, developers must put one space beside the text radio button and also tell the product that the space is repeating, to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value. Embedding database text manipulation functions in lookup transformations is not recommended, because a developer must then cache the lookup table due to the presence of a SQL override. (In PowerCenter, avoid embedding database text manipulation functions in lookup transformations.) On very large tables, caching is not always realistic or feasible.

Datatype Conversions

It is advisable to use explicit tool functions when converting the data type of a particular data value. [In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or

maintenance headaches.]
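A brief sketch of explicit conversions in a PowerCenter Expression transformation follows; the port names and the scale value are hypothetical:

OUT_ORDER_ID_TXT = TO_CHAR(IN_ORDER_ID)
OUT_UNIT_PRICE = TO_DECIMAL(IN_UNIT_PRICE_TXT, 2)
OUT_LINE_COUNT = TO_INTEGER(IN_LINE_COUNT_TXT)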

Dates

Dates can cause many problems when moving and transforming data from one place to another, because an assumption must be made that all data values are in a designated format. [Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, then a developer increases the risk of transformation errors, which can cause data to be lost.] An example piece of code would be:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then it is often wise to create a reusable expression that handles dates, so that the proper checks are made. It is also advisable to determine if any default dates should be defined, such as a low date or high date. These should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates, as some are meant to be NULL until the appropriate time (e.g., birth date or death date). The NULL in the example above could be changed to one of the standard default dates described here.

Decimal Precision

With numeric data columns, developers must determine the expected or required precisions of the columns. (By default, to increase performance, PowerCenter treats all numeric columns as 15-digit floating point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.) If it is determined that a column realistically needs a higher precision, then the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must be enabled when comparing two numbers for equality.

Trapping Poor Data Quality Techniques


The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant when discussing data warehousing or ODS. This section discusses techniques that you can use to prevent bad data from reaching the system.

Checking Data for Completeness Before Loading

When requesting a data feed from an upstream system, be sure to request an audit file or report that contains a summary of what to expect within the feed. Common requests here are record counts or summaries of numeric data fields. If you have performed a data quality audit, as specified in the Analyze Phase, these metrics and others should be readily available. Assuming that the metrics can be obtained from the source system, it is advisable to then create a pre-process step that ensures your input source matches the audit file. If the values do not match, stop the overall process from loading into your target system. The source system can then be alerted to verify where the problem exists in its feed.

Enforcing Rules During Mapping

Another method of filtering bad data is to have a set of clearly defined data rules built into the load job. The records are then evaluated against these rules and routed to an Error or Bad table for further re-processing accordingly. An example of this is to check all incoming country codes against a Valid Values table. If the code is not found, then the record is flagged as an error record and written to the Error table.

A pitfall of this method is that you must determine what happens to the record once it has been loaded to the Error table. If the record is pushed back to the source system to be fixed, then a delay may occur until the record can be successfully loaded to the target system. In fact, if the proper governance is not in place, the source system may refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the data manually and risk not matching with the source system; or 2) relax the business rule to allow the record to be loaded. Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person's responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as to help make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences to the user community of data decisions.
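As a sketch of this rule check, an Expression transformation can flag each row using an unconnected Lookup against the valid-values table, and a downstream Router can send flagged rows to the Error table. The lookup, port, and table names below are hypothetical:

ERROR_FLAG = IIF(ISNULL(:LKP.LKP_VALID_COUNTRY(IN_COUNTRY_CODE)), 'E', 'V')

A Router transformation group with the condition ERROR_FLAG = 'E' then routes the record to the Error table, while the remaining rows continue to the target.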


Another solution applicable in cases with a small number of code values is to try to anticipate any mistyped error codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time. Each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.

Dimension Not Found While Loading Fact

The majority of current data warehouses are built using a dimensional model. A dimensional model relies on the presence of dimension records existing before loading the fact tables. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules need to handle this so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key a value that represents Not Found or Not Available in the dimension. These keys must also exist in the dimension tables to satisfy referential integrity, but they provide a clear and easy way to identify records that may need to be reprocessed at a later date.

Another solution is to filter the record from processing since it may no longer be relevant to the fact table. The team will most likely want to flag the row through the use of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not found there, even while loading the fact table. This should be done very carefully since it may add unwanted or junk values to the dimension table. One occasion when this may be advisable is in cases where dimensions are simply made up of the distinct combination values in a data set. Thus, this dimension may require a new record if a new combination occurs.

It is imperative that all of these solutions be discussed with the users before making any decisions since they will eventually be the ones making decisions based on the reports.
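A minimal sketch of the first solution in a PowerCenter expression, assuming the dimension Lookup returns NULL when no match is found and that -1 is the agreed Not Found surrogate key already seeded in the dimension (port names are hypothetical):

OUT_CUSTOMER_KEY = IIF(ISNULL(LKP_CUSTOMER_KEY), -1, LKP_CUSTOMER_KEY)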

Last updated: 01-Feb-07 18:52


Data Quality Project Estimation and Scheduling Factors

Challenge

This Best Practice is intended to assist project managers who must estimate the time and resources necessary to address data quality issues within data integration or other data-dependent projects. Its primary concerns are the project estimation issues that arise when you add a discrete data quality stage to your data project. However, it also examines the factors that determine when, or whether, you need to build a larger data quality element into your project.

Description

At a high level, there are three ways to add data quality to your project:

● Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.
● Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.
● Incorporate data quality actions throughout the project.

This document should help you decide which of these methods best suits your project and assist in estimating the time and resources needed for the first and second methods.

Using Pre-Built Plans with Informatica Data Cleanse and Match

Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter users to add data quality processes defined in IDQ to custom transformations in PowerCenter. It incorporates the following components:

● Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.
● Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.
● At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.

Data Quality Engagement Scenarios

Data Cleanse and Match delivers its data quality capabilities "out of the box"; a PowerCenter user can select data quality plans and add them to a Data Quality transformation without leaving PowerCenter. In this way, Data Cleanse and Match capabilities can be added into a project plan as a relatively short and discrete stage.

In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.

The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many cases, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering. Often, businesses do not realize the extent to which their business and project goals depend on the quality of their data.

The project impact of these three types of data quality activity can be summarized as follows:

DQ Approach                                   Estimated Project Impact
Simple stage                                  10 days, 1-2 Data Quality Developers
Expanded data quality stage                   15-20 days, 2 Data Quality Developers, high visibility to business
Data quality integrated with data project     Duration of data project, 2 or more project roles, impact on business and project objectives

Note: The actual time that should be allotted to the data quality stages noted above depends on the factors discussed in the remainder of this document.

Factors Influencing Project Estimation

The factors influencing project estimation for a data quality stage range from high-level project parameters to lower-level data characteristics. The main factors are listed below and explained in detail later in this document.

● Base and target levels of data quality
● Overall project duration/budget
● Overlap of sources/complexity of data joins
● Quantity of data sources
● Matching requirements
● Data volumes
● Complexity and quantity of data rules
● Geography

Determine which scenario — out of the box (Data Cleanse and Match), expanded Data Cleanse and Match, or a thorough data quality integration — best fits your data project by considering the project's overall objectives and its mix of factors.

The Simple Data Quality Stage

Project managers can consider the use of pre-built plans with Data Cleanse and Match as a simple scenario with a predictable number of function points that can be added to the project plan as a single package. You can add the North America Content Pack plans to your project if the project meets most of the following criteria. Similar metrics apply to other types of pre-built plans:

● Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
● Complexity of data rules is relatively low.
● Business rules present in pre-built plans need minimum fine-tuning.
● Target data quality level is achievable.

As a hardware sizing guideline for matching, 3 million records calls for a multi-processor server, 2GB RAM, and a RAID 5 hard disk.

Single Processor vs. Multi-Processor

With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware, however, that this requires additional effort to create the groups and consolidate the match output. Also, matching plans split across four processors do not run four times faster than a single-processor matching plan. As a result, multi-processor matching may not significantly improve performance in every case.

Using IDQ with PowerCenter and taking advantage of PowerCenter's partitioning capabilities may also improve throughput. This approach has the advantage that splitting plans into multiple independent plans is not typically required. The following estimates can help in comparing the execution time of single-processor and multi-processor match plans.

● Standardization/grouping: on a single processor, the time depends on the operations and the size of the data set (call this time Y); on multiple processors, allow the single-processor time plus 20 percent (Y * 1.20).
● Matching: on a single processor, estimate 1 million comparisons a minute (call this time X); on multiple processors, divide the single-processor matching time by the number of processors (NP) and add 25 percent ((X / NP) * 1.25).

For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 20 minutes to group and standardize and two and one half hours to match. The time difference between a single- and multi-processor plan in this case would be more than five hours (i.e., nine hours for the single-processor plan versus three hours and 50 minutes for the quad-processor plan).

Deterministic vs. Probabilistic Comparisons

No best-practice research has yet been completed on which type of comparison is most effective at determining a match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for deterministic comparisons since they remove the burden of identifying a universal match threshold from the user. Bear in mind that IDQ supports deterministic matching operations only. However, IDQ's Weight Based Analyzer component lets plan designers calculate weighted match scores for matched fields.

Database vs. File-Based Matching

File-based matching and database matching perform essentially the same operations. The major differences between the two methods revolve around how data is stored and how the outputs can be manipulated after matching is complete. With regard to selecting one method or the other, there are no best-practice recommendations, since this is largely defined by requirements. The following table outlines the strengths and weaknesses of each method:

                                        File-Based Method                               Database Method
Ease of implementation                  Easy to implement                               Requires SQL knowledge
Performance                             Fastest method                                  Slower than file-based method
Space utilization                       Requires more hard-disk space                   Lower hard-disk space requirement
Operating system restrictions           Possible limit to number of groups              None
                                        that can be created
Ability to control/manipulate output    Low                                             High

High-Volume Data Matching Techniques

This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In order to detect matching information, a record must be compared against every other record in a dataset. For a single data source, the quantity of comparisons required to check an entire dataset increases geometrically as the volume of data increases. A similar situation arises when matching between two datasets, where the number of comparisons required is a multiple of the volumes of data in each dataset. When the volume of data increases into the tens of millions, the number of comparisons required to identify matches, and consequently the amount of time required to check for matches, reaches impractical levels.

Approaches to High-Volume Matching

Two key factors control the time it takes to match a dataset:

● The number of comparisons required to check the data.
● The number of comparisons that can be performed per minute.

The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy.

IDQ affects the number of comparisons per minute in two ways:

● Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.
● IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ scalability with regard to high-volume matching problems.
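Because each group of n records requires n * (n - 1) / 2 within-group comparisons, a quick way to estimate the comparison workload of grouped data is to count the group sizes before a match plan is run. The sketch below uses standard SQL; the staging table and group-key column names are hypothetical:

SELECT group_key,
       COUNT(*) AS group_size,
       COUNT(*) * (COUNT(*) - 1) / 2 AS comparisons_in_group
FROM   match_staging
GROUP  BY group_key
ORDER  BY comparisons_in_group DESC;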

The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.

Multi-Processor Matching: Solution Overview

IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel. To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are executed in parallel.

The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.

Informatica Corporation Match Plan Tests

Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows 2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors effectively provided four CPUs on which to run the tests. Several tests were performed using file-based and database-based matching methods and single and multiple processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total number of comparisons to approximately 500,000,000.

Test results using file-based and database-based methods showed a near linear scalability as the number of available processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding additional processor capacity. This is demonstrated in the graph below.


Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.

Last updated: 26-May-08 17:52


Effective Data Standardizing Techniques

Challenge

To enable users to streamline their data cleansing and standardization processes (or plans) with Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a consistent and methodological approach to cleansing and standardizing project data.

Description

Data cleansing refers to operations that remove non-relevant information and "noise" from the content of the data. Examples of cleansing operations include the removal of person names, "care of" information, excess character spaces, or punctuation from postal addresses. Data standardization refers to operations related to modifying the appearance of the data, so that it takes on a more uniform structure, and to enriching the data by deriving additional details from existing content.

Cleansing and Standardization Operations

Data can be transformed into a "standard" format appropriate for its business type. This is typically performed on complex data types such as name and address or product data. A data standardization operation typically profiles data by type (e.g., word, number, code) and parses data strings into discrete components. This reveals the content of the elements within the data as well as standardizing the data itself. For best results, the Data Quality Developer should carry out these steps in consultation with a member of the business. Often, this individual is the data steward, the person who best understands the nature of the data within the business scenario.

Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the correct fields. However, when using the Profile Standardizer, be aware that there is a finite number of profiles (500) that can be contained within a cleansing plan. Users can extend the number of profiles by using the first 500 profiles within one component and then feeding the data overflow into a second Profile Standardizer via the Token Parser component.

After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to further standardize the data. It may take several iterations of dictionary construction and review before the data is standardized to an acceptable level. Once acceptable standardization has been achieved, data quality scorecard or dashboard reporting can be introduced. For information on dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User Guide.


Discovering Business Rules

At this point, the business user may discover and define business rules applicable to the data. These rules should be documented and converted to logic that can be contained within a data quality plan. When building a data quality plan, be sure to group related business rules together in a single rules component whenever possible; otherwise the plan may become very difficult to read. If there are rules that do not lend themselves easily to regular IDQ components (i.e., when standardizing product data information), it may be necessary to perform some custom scripting using IDQ's Scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.

Standard and Third-Party Reference Data

Reference data can be a useful tool when standardizing data. Terms with variant formats or spellings can be standardized to a single form. IDQ installs with several reference dictionary files that cover common name and address and business terms. The illustration below shows part of a dictionary of street address suffixes.

Common Issues when Cleansing and Standardizing Data

If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize the score-carding and graded-data approach to cleansing and standardizing. This helps to ensure that the customer develops reasonable expectations of what can be achieved with the data set within an agreed-upon timeframe.


Standardizing Ambiguous Data

Data values can often appear ambiguous, particularly in name and address data where name, address, and premise values can be interchangeable. For example, Hill, Park, and Church are all common surnames. In some cases, the position of the value is important. “ST” can be a suffix for street or a prefix for Saint, and sometimes they can both occur in the same string. The address string “St Patrick’s Church, Main St” can reasonably be interpreted as “Saint Patrick’s Church, Main Street.” In this case, if the delimiter is a space (thus ignoring any commas and periods), the string has five tokens. You may need to write business rules using the IDQ Scripting component, as you are treating the string as an array. “St” at position 1 within the string would be standardized to meaning_1, whereas “St” at position 5 would be standardized to meaning_2. Each data value can then be compared to a discrete prefix and suffix dictionary.
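The position-based rule described above can be sketched outside IDQ as follows. This is an illustration only; the two Python dictionaries are stand-ins for the discrete prefix and suffix reference dictionaries, and the code is not IDQ Scripting component syntax.

    # Position-based standardization of the ambiguous token "St".
    PREFIX_DICT = {"ST": "Saint"}     # "St" at the start of the string
    SUFFIX_DICT = {"ST": "Street"}    # "St" at the end of the string

    def standardize(address):
        tokens = [t.strip(".,") for t in address.split()]
        out = []
        for pos, token in enumerate(tokens):
            key = token.upper()
            if key in PREFIX_DICT and pos == 0:
                out.append(PREFIX_DICT[key])            # position 1 in the string
            elif key in SUFFIX_DICT and pos == len(tokens) - 1:
                out.append(SUFFIX_DICT[key])            # final position in the string
            else:
                out.append(token)
        return " ".join(out)

    print(standardize("St Patrick's Church, Main St"))
    # Saint Patrick's Church Main Street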

Conclusion

Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into their development methodology. Because data quality is an iterative process, the business rules initially developed may require ongoing modification, as the results produced by IDQ will be affected by the starting condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country and applying the same rules across these plans. The data would typically be staged in a database, and the plans developed using a SQL statement as input, with a “where country_code = ‘DE’” clause, for example. Country dictionaries are identifiable by country code to facilitate such statements. Remember that IDQ installs with a large set of reference dictionaries and additional dictionaries are available from Informatica.

IDQ provides several components that focus on verifying and correcting the accuracy of name and postal address data. These components leverage address reference data that originates from national postal carriers such as the United States Postal Service. Such datasets enable IDQ to validate an address to premise level. Please note, the reference datasets are licensed and installed as discrete Informatica products, and thus it is important to discuss their inclusion in the project with the business in advance so as to avoid budget and installation issues. Several types of reference data, with differing levels of address granularity, are available from Informatica. Pricing for the licensing of these components may vary and should be discussed with the Informatica Account Manager.

Last updated: 01-Feb-07 18:52


Integrating Data Quality Plans with PowerCenter

Challenge

This Best Practice outlines the steps to integrate an Informatica Data Quality (IDQ) plan into a PowerCenter mapping. This document assumes that the appropriate setup and configuration of IDQ and PowerCenter have been completed as part of the software installation process and these steps are not included in this document.

Description

Preparing IDQ Plans for PowerCenter Integration

IDQ plans are typically developed and tested by executing them from Workbench. Plans running locally from Workbench can use any of the available IDQ Source and Sink components. This is not true for plans that are integrated into PowerCenter, as they can only use Source and Sink components that contain the “Enable Real-time processing” check box. Specifically, those components are CSV Source, CSV Match Source, CSV Sink and CSV Match Sink. In addition, the Real-time Source and Sink can be used; however, they require additional setup as each field name and length must be defined. Database sources and sinks are not allowed in PowerCenter integration. When IDQ plans are integrated within a PowerCenter mapping, the source and sink need to be enabled by setting the enable real-time processing option on them. Consider the following points when developing a plan for integration in PowerCenter:













● If the IDQ plan was developed using a database source and/or sink, you must replace them with CSV Sink/Source or CSV Match Sink/Source.
● If the IDQ plan was developed using a group sink/source (or dual group sink), you must replace them with either CSV Sink/Source or CSV Match Sink/Source, depending on the functionality you are replacing. When replacing a group sink you must also add functionality to the PowerCenter mapping to replicate the grouping. This is done by placing a join and sort prior to the IDQ plan containing the match.
● PowerCenter only sees the input and output ports of the IDQ plan from within the PowerCenter mapping. This is driven by the input file used for the Workbench plan and the fields selected as output in the sink. If you don’t see a field after the plan is integrated in PowerCenter, it means the field is not in the input file or not selected as output.
● PowerCenter integration does not allow input ports to be selected as output if the IDQ transformation is defined as a passive transformation. If the IDQ transformation is configured as active this is not an issue, as you must select all fields needed as output from the IDQ transformation within the sink transformation of the IDQ plan. Passive and active IDQ transformations follow the general restrictions and rules for active and passive transformations in PowerCenter.
● The delimiter of the Source and Sink must be comma for integrated IDQ plans. Other delimiters, such as pipe, will cause an error within the PowerCenter Designer. If you encounter this error, go back to Workbench, change the delimiter to comma, save the plan, and then go back to PowerCenter Designer and perform the import of the plan again.
● For reusability of IDQ plans, use generic naming conventions for the input and output ports. For example, rather than naming fields customer address1, customer address2, and customer city, name the fields address1, address2, city, etc. Thus, if the same standardization and cleansing is needed by multiple sources you can integrate the same IDQ plan, which reduces development time as well as ongoing maintenance.
● Use only necessary fields as input to each mapping plan. If you are working with an input file that has 50 fields and you only really need 10 fields for the IDQ plan, create a file that contains only the necessary field names, save it as a comma-delimited file and then point to that newly created file from the source of the IDQ plan. This limits the input field references to only those fields that must be visible in the PowerCenter integration.




● Once the source and sink are converted to real-time, you cannot run the plan within Workbench, only within the PowerCenter mapping. However, you may change the check box at any time to revert to standalone processing.
● Be careful not to refresh the IDQ plan in the mapping within PowerCenter while real-time is not enabled. If you do so, the PowerCenter mapping will display an error message and will not allow that mapping to be integrated until real-time processing is enabled again.

Integrating IDQ Plans into PowerCenter Mappings

After the IDQ plans are converted to real time-enabled, they are ready to integrate into a PowerCenter mapping. Integrating into PowerCenter requires proper installation and configuration of the IDQ/PowerCenter integration, including:

Making appropriate changes to environment variables (to .profile for UNIX)



Installing IDQ on the PowerCenter server



Running IDQ Integration and Content install on the server



Registering the IDQ plug-in via the PowerCenter Admin console (note that the plug-in must be registered in each repository from which an IDQ transformation is to be developed)



Installing IDQ workbench on the workstation



Installing IDQ Integration and Content on the workstation using the PowerCenter Designer

When all of the above steps are executed correctly, the IDQ transformation icon, shown below, is visible in the PowerCenter repository.

To integrate an IDQ plan, open the mapping, and click on the IDQ icon. Then click in the mapping workspace to insert the transformation into the mapping. The following dialog box appears:

Select Active or Passive, as appropriate. Typically, an active transformation is necessary only for a matching plan. If selecting Active, the IDQ plan input needs to have all input fields passed through, as typical PowerCenter rules apply to Active and Passive transformation processing. As the following figure illustrates, the IDQ transformation is “empty” in its initial, un-configured state. Notice all ports are currently blank; they will be populated upon import/integration of the IDQ plan.

Double-click on the title bar for the IDQ transformation to open it for editing.


Then select the far right tab, “Configuration”.

When first integrating an IDQ plan, the connection and repository displays are blank. Click the Connect button to establish a connection to the appropriate IDQ repository.

In the Host Name box, specify the name of the computer on which the IDQ repository is installed. This is usually the PowerCenter server. If the default Port Number (3306) was changed during installation, specify the correct value. Next, click Test Connection.

Note: In some cases if the User Name has not been granted privileges on the Host server you will not be allowed to connect. The procedure for granting privileges to the IDQ (MySQL) repository is explained at the end of this document. When the connection is established, click the down arrow to the right of the Plan Name box, and the following dialog is displayed:

Browse to the plan you want to import, then click on the Validate button. If there is an error in the plan, a dialog box appears. For example, if the Source and Sink have not been configured correctly, the following dialog box appears.

If the plan is valid for PowerCenter integration, the following dialog is displayed.


After a valid plan has been configured, the PowerCenter ports (equivalent to the IDQ Source and Sink fields) are visible and can be connected just as any other PowerCenter transformation.

Refreshing IDQ Plans for PowerCenter Integration

After Data Quality plans are integrated in PowerCenter, changes made to the IDQ plan in Workbench are not reflected in the PowerCenter mapping until the plan is manually refreshed in the PowerCenter mapping. When you save an IDQ plan, it is saved in the MySQL repository. When you integrate that plan into PowerCenter, a copy of that plan is then integrated in the PowerCenter metadata; the MySQL repository and the PowerCenter repository do not communicate updates automatically. The following steps detail the process for refreshing integrated IDQ plans when necessary to reflect changes made in Workbench:

Double-click on IDQ transformation in PowerCenter Mapping



Select the Configurations tab:



Select Refresh. This reads the current version of the plan and refreshes it within PowerCenter.



Select apply. If any PowerCenter-specific errors were created when the plan was modified, an error dialog is displayed.




Update input, output, and pass-through ports as necessary, then save the mapping in PowerCenter, and test the changes.

Saving IDQ Plans to the Appropriate Repository – MySQL Permissions

Plans that are to be integrated into PowerCenter mappings must be saved to an IDQ repository that is visible to the PowerCenter Designer prior to integration. The usual practice is to save the plan to the IDQ repository located on the PowerCenter server.

In order for a Workbench client to save a plan to that repository, the client machine must be granted permissions to the MySQL repository on the server. If the client machine has not been granted access, the client receives an error message when attempting to access the server repository. The person at your organization who has login rights to the server on which IDQ is installed needs to perform this task for all users who will need to save or retrieve plans from the IDQ Server. This procedure is detailed below.

Identify the IP address for any client machine that needs to be granted access.



Log in to the server on which the MySQL repository is located and log in to MySQL: mysql -u root



For a user to connect to the IDQ server and save and retrieve plans, enter the following command, substituting the client machine's IP address (identified above) between the final pair of quotes: grant all privileges on *.* to 'admin'@''



For a user to integrate an IDQ plan into PowerCenter, grant the following privilege:


grant all privileges on *.* to 'root'@''

Last updated: 20-May-08 23:18


Managing Internal and External Reference Data

Challenge

To provide guidelines for the development and management of the reference data sources that can be used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition from development to production for reference data files and the plans with which they are associated.

Description

Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan. A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those terms. It may be a list of employees, package measurements, or valid postal addresses — any data set that provides an objective reference against which project data sources can be checked or corrected. Reference files are essential to some, but not all, data quality processes.

Reference data can be internal or external in origin. Internal data is specific to a particular project or client. Such data is typically generated from internal company information. It may be custom-built for the project. External data has been sourced or purchased from outside the organization. External data is used when authoritative, independently-verified data is needed to provide the desired level of data quality to a particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal address data sets that have been verified as current and complete by a national postal carrier, such as the United States Postal Service, or company registration and identification information from an industry-standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that requires intermediary (third-party) software in order to be read by Informatica applications. Internal data files, as they are often created specifically for data quality projects, are typically saved in the dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases can also be used as a source for internal data. External files are more likely to remain in their original format. For example, external data may be contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal discrete data values.

Working with Internal Data

Obtaining Reference Data

Most organizations already possess much information that can be used as reference data — for example, employee tax numbers or customer names. These forms of data may or may not be part of the project source data, and they may be stored in different parts of the organization.


The question arises, are internal data sources sufficiently reliable for use as reference? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.

Saving the Data in .DIC File Format

IDQ installs with a set of reference dictionaries that have been created to handle many types of business data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from dictionary, and dictionary files are essentially comma delimited text files. You can create a new dictionary in three ways:





● You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of your IDQ (client or server) installation.
● You can use the Dictionary Manager within Data Quality Workbench. This method allows you to create text and database dictionaries.
● You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).

The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct or standardized form of each datum from the dictionary’s perspective. The Item columns contain versions of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore, each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration below). A dictionary can have multiple Item columns.

To edit a dictionary value, open the DIC file and make your changes. You can make changes either through a text editor or by opening the dictionary in the Dictionary Manager.


To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row, and add a Label string and at least one Item string. You can also add values in a text editor by placing the cursor on a new line and typing Label and Item values separated by commas. Once saved, the dictionary is ready for use in IDQ.

Note: IDQ users with database expertise can create and specify dictionaries that are linked to database tables, and that thus can be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data has been originated for other purposes and is likely to change independently of data quality. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data.

Sharing Reference Data Across the Organization

As you can publish or export plans from a local Data Quality repository to server repositories, so you can copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like mechanism for moving files to other machines across the network. Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when running a plan. By default, Data Quality relies on dictionaries being located in the following locations:

The Dictionaries folders installed with Workbench and Server.



The user’s file space in the Data Quality service domain.

IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will fail. This is most relevant when you publish or export a plan to another machine on the network. You must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain — in the user space on the server, or at a location in the server’s Dictionaries folders that corresponds to the dictionaries’ location on Workbench — when the plan is copied to the server-side repository. Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file. However, this is the master configuration file for the product and you should not edit it without consulting Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.

Version Controlling Updates and Managing Rollout from Development to Production

Plans can be version-controlled during development in Workbench and when published to a domain repository. You can create and annotate multiple versions of a plan, and review/roll back to earlier versions when necessary. Dictionary files are not version controlled by IDQ, however. You should define a process to log changes and back up your dictionaries using version control software if possible or a manual method. If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.


Database reference data can also be version controlled, although this presents difficulties if the database is very large in size. Bear in mind that third-party reference data, such as postal address data, should not ordinarily be changed, and so the need for a versioning strategy for these files is debatable.

Working with External Data

Formatting Data into Dictionary Format

External data may or may not permit the copying of data into text format — for example, external data contained in a database or in library files. Currently, third-party postal address validation data is provided to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The third-party software has a very small footprint.) However, some software files can be amenable to data extraction to file.

Obtaining Updates for External Reference Data

External data vendors produce regular data updates, and it’s vital to refresh your external reference data when updates become available. The key advantage of external data — its reliability — is lost if you do not apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept up to date with the latest data as it becomes available for as long as your data subscription warrants. You can check that you possess the latest versions of third-party data by contacting your Informatica Account Manager.

Managing Reference Updates and Rolling Out Across the Organization

If your organization has a reference data subscription, you will receive either regular data files on compact disc or regular information on how to download data from Informatica or vendor web sites. You must develop a strategy for distributing these updates to all parties who run plans with the external data. This may involve installing the data on machines in a service domain. Bear in mind that postal address data vendors update their offerings every two or three months, and that a significant percentage of postal addresses can change in such time periods. You should plan for the task of obtaining and distributing updates in your organization at frequent intervals. Depending on the number of IDQ installations that must be updated, updating your organization with third-party reference data can be a sizable task.

Strategies for Managing Internal and External Reference Data

Experience working with reference data leads to a series of best practice tips for creating and managing reference data files.

Using Workbench to Build Dictionaries

With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-compatible format.


Let’s say you have designed a data quality plan that identifies invalid or anomalous records in a customer database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file to create a dictionary-compatible file. For example, let’s say you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the labels column of your dictionary. By opening this file in Microsoft Excel or a comparable program and copying the contents of Column A into Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns. Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account numbers that you can use in any plans checking the validity of the organization's account records.
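The same Label/Item1 construction can also be scripted rather than built by hand in a spreadsheet. The sketch below assumes a plain-text exception file containing one account serial number per line; the file names are hypothetical.

    import csv

    # Build a dictionary-compatible (.DIC) file from an exception file of bad
    # account numbers: each serial number becomes both the Label and the Item1 entry.
    with open("bad_account_numbers.txt") as src, \
         open("Bad_Account_Numbers.DIC", "w", newline="") as dic:
        writer = csv.writer(dic)
        for line in src:
            serial = line.strip()
            if serial:
                writer.writerow([serial, serial])   # Label, Item1

The resulting file can then be placed in the Dictionaries folder of the IDQ installation as described above, making it visible to the Dictionary Manager.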

Using Report Viewer to Build Dictionaries

The IDQ Report Viewer allows you to create exception files and dictionaries on-the-fly from report data. The figure below illustrates how you can drill-down into report data, right-click on a column, and save the column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to the column data. In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically, records containing bad zip codes). The plan designer can now create plans to check customer databases against these serial numbers. You can also append data to an existing dictionary file in this manner.

As a general rule, it is a best practice to follow the dictionary organization structure installed by the application, adding to that structure as necessary to accommodate specialized and supplemental dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible modifications, thereby lowering the risk of accidental errors during migration. When following the original dictionary organization structure is not practical or contravenes other requirements, take care to document the customizations.

Since external data may be obtained from third parties and may not be in file format, the most efficient way to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically, this is the machine that hosts the Execution Service.)

Moving Dictionary Files After IDQ Plans are Built

This is a similar issue to that of sharing reference data across the organization. If you must move or relocate your reference data files post-plan development, you have three options:



● You can reset the location to which IDQ looks by default for dictionary files.
● You can reconfigure the plan components that employ the dictionaries to point to the new location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
● If deploying plans in a batch or scheduled task, you can append the new location to the plan execution command. You can do this by appending a parameter file to the plan execution instructions on the command line. The parameter file is an XML file that can contain a simple command to use one file path instead of another.

Last updated: 08-Feb-07 17:09


Real-Time Matching Using PowerCenter

Challenge

This Best Practice describes the rationale for matching in real time along with the concepts and strategies used in planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this process using Informatica’s PowerCenter and Data Quality.

The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from ever being entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds, messages on a queue, changes captured from a database, or other common data feeds, matching these records against the master data that already exists allows only the new, unique records to be added.

Benefits of preventing duplicate records include:



Better ability to service customers, with the most accurate and complete information readily available



Reduced risk of fraud or over-exposure



Trusted information at the source



Less effort in BI, data warehouse, and/or migration projects

Description

Performing effective real-time matching involves multiple puzzle pieces:

1. There is a master data set (or possibly multiple master data sets) that contains clean and unique customers, prospects, suppliers, products, and/or many other types of data.
2. To interact with the master data set, there is an incoming transaction, typically thought to be a new item. This transaction can be anything from a new customer signing up on the web to a list of new products; it is anything that is assumed to be new and intended to be added to the master.
3. There must be a process to determine if a “new” item really is new or if it already exists within the master data set.

In a perfect world of consistent IDs, spellings, and representations of data across all companies and systems, checking for duplicates would simply be some sort of exact lookup into the master to see if the item already exists. Unfortunately, this is not the case, and even being creative and using %LIKE% syntax does not provide thorough results. For example, comparing Bob to Robert or GRN to Green requires a more sophisticated approach.
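The point can be pictured with a rough sketch. This is not the IDQ matching algorithm; a small alias table and a generic string-similarity ratio merely stand in for the fuzzy matching and nickname handling that the product provides.

    from difflib import SequenceMatcher

    # Why exact or LIKE-style comparison is not enough: an alias table plus a
    # similarity score catch matches that an equality test would miss.
    ALIASES = {"BOB": "ROBERT", "BILL": "WILLIAM", "GRN": "GREEN"}

    def name_score(a, b):
        a, b = a.upper(), b.upper()
        a, b = ALIASES.get(a, a), ALIASES.get(b, b)
        return SequenceMatcher(None, a, b).ratio()

    print("Bob" == "Robert")              # False; an exact comparison misses it
    print(name_score("Bob", "Robert"))    # 1.0 after alias expansion
    print(name_score("Jon", "John"))      # ~0.86, still a likely match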

Standardizing Data in Advance of Matching

The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent, conformant, valid data, which really means trusted data. These rules should also be reusable so they can be used with the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.

Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching, there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be compared. For example, when matching consumer data on name and address, it may be sensible to limit candidate pull records to those having the same zip code and the same first letter of the last name, because we can reason that if those elements are different between two records, those two records will not match.


There also may be cases where multiple candidate sets are needed. This would be the case if there are multiple sets of match rules that the two records will be compared against. Adding to the previous example, think of matching on name and address for one set of match rules and name and phone for a second. This would require selecting records from the master that have the same phone number and first letter of the last name.

Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching one to many elements of the input record to each candidate pulled from the master. Once the data is compared, each pair of records (one input and one candidate) will have a match score or a series of match scores. Scores below a certain threshold can then be discarded and potential matches can be output or displayed.

The full real-time match process flow includes:

1. The input record comes into the server.
2. The server standardizes the incoming record and retrieves candidate records from the master data source that could match the incoming record.
3. Match pairs are then generated, one for each candidate, consisting of the incoming record and the candidate.
4. The match pairs go through the matching logic, resulting in a match score.
5. Records with a match score below a given threshold are discarded.
6. The returned result set consists of the candidates that are potential matches to the incoming record.

Developing an Effective Candidate Selection Strategy

Determining which records from the master should be compared with the incoming record is a critical decision in an effective real-time matching system. For most organizations it is not realistic to match an incoming record to all master records. Consider even a modest customer master data set with one million records; the amount of processing, and thus the wait in real time, would be unacceptable. Candidate selection for real-time matching is synonymous with grouping or blocking for batch matching.

The goal of candidate selection is to select only that subset of the records from the master that are definitively related by a field, part of a field, or combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this key would be constructed and stored in an indexed field within the master table(s), allowing for the quickest retrieval. There are many instances where multiple keys are used to allow for one key to be missing or different, while another pulls in the record as a candidate.

What specific data elements the candidate key should consist of very much depends on the scenario and the match rules. The one common theme with candidate keys is that the data elements used should have the highest levels of completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code or a National ID. The table below lists multiple common matching elements and how group keys could be used around the data. The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For acceptable two to three second response times, candidate record counts should be kept under 5000 records.
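As an illustration of the group-key approach (outside PowerCenter and IDQ), the following sketch mirrors the zip/house-number/street-initial strategy used later in this scenario. The table name, column names and incoming-record keys are hypothetical; a production implementation would query the indexed master table directly.

    import sqlite3

    # Candidate selection using a group key: first three characters of the zip,
    # the house number, and the first letter of the street name.
    def build_group_key(zipcode, house_number, street_name):
        return f"{zipcode[:3]}{house_number}{street_name[:1].upper()}"

    def get_candidates(conn, incoming):
        key = build_group_key(incoming["zip"], incoming["house_number"], incoming["street"])
        cur = conn.execute(
            "SELECT cust_id, first_name, last_name, house_number, street, zip, phone "
            "FROM customer_master WHERE group_key = ?", (key,))
        # Ideally a few hundred rows at most, per the sizing guidance above.
        return cur.fetchall()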

Step by Step Development

The following instructions further explain the steps for building a solution to real-time matching using the Informatica suite. They involve the following applications:

Informatica PowerCenter 8.5.1 - utilizing Web Services Hub



Informatica Data Explorer 5.0 SP4



Informatica Data Quality 8.5 SP1 – utilizing North American Country Pack



SQL Server 2000


Scenario:

A customer master file is provided with the following structure



In this scenario, we are performing a name and address match





Because address is part of the match, we will use the recommended address grouping strategy for our candidate key (see table1) The desire is that different applications from the business will be able to make a web service call to determine if the data entry represents a new customer or an existing customer

Solution:

1. The first step is to analyze the customer master file. Assume that this analysis shows the postcode field is complete for all records and the majority of it is of high accuracy. Assume also that neither the first name nor the last name field is completely populated; thus the match rules must account for blank names.

2. The next step is to load the customer master file into the database. Below is a list of tasks that should be implemented in the mapping that loads the customer master data into the database:



● ●



Standardize and validate the address, outputting the discreet address components such as house number, street name, street type, directional, and suite number. (Pre-built mapplet to do this; country pack) Generate the candidate key field, populate that with the selected strategy (assume it is the first 3 characters of the zip, house number, and the first character of street name), and generate an index on that field. (Expression, output of previous mapplet, hint: substr(in_ZIPCODE, 0, 3)|| in_HOUSE_NUMBER||substr(in_STREET_NAME, 0, 1)) Standardize the phone number. (Pre-built mapplet to do this; country pack) Parse the name field into individual fields. Although the data structure indicates names are already parsed into first, middle, and last, assume there are examples where the names are not properly fielded. Also remember to output a value to handle of nicknames. (Pre-built mapplet to do this; country pack) Once complete, your customer master table should look something like this:


3. Now that the customer master has been loaded, a Web Service mapping must be created to handle real-time matching. For this project, assume that the incoming record will include a full name field, address, city, state, zip, and a phone number. All fields will be free-form text. Since we are providing the Service, we will be using a Web Service Provider source and target. Follow these steps to build the source and target definitions. ●



Within PowerCenter Designer, go to the source analyzer and select the source menu. From there select Web Service Provider and the Create Web Service Definition.

You will see a screen like the one below where the Service can be named and input and output ports can be created. Since this is a matching scenario, the potential that multiple records will be returned must be taken into account. Select the Multiple Occurring Elements checkbox for the output ports section. Also add a match score output field to return the percentage at which the input record matches the different potential matching records from the master.




Both the source and target should now be present in the project folder.

4. An IDQ match plan must be built to use within the mapping. In developing a plan for real time, using a CSV source and CSV sink (both enabled for real-time processing) is the most significant difference from a similar match plan designed for use in IDQ standalone. The source will have the _1 and the _2 fields that a Group Source would supply built into it, e.g., Firstname_1 and Firstname_2. Another difference from batch matching in PowerCenter is that the DQ transformation can be set to passive. The following steps illustrate converting the North America Country Pack’s Individual Name and Address Match plan from a plan built for use in a batch mapping to a plan built for use in a real-time mapping:



● Open the DCM_NorthAmerica project and from within the Match folder make a copy of the “Individual Name and Address Match” plan. Rename it to “RT Individual Name and Address Match”.
● Create a new stub CSV file with only the header row. This will be used to generate a new CSV Source within the plan. This header must use all of the input fields used by the plan before modification. For convenience, a sample stub header is listed below. The header for the stub file will duplicate all of the fields, with one set having a suffix of _1 and the other _2.

IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1,IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1,IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1,IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1,IN_POSTAL_CODE_1,IN_GROUP_KEY_2,IN_FIRSTNAME_2,IN_FIRSTNAME_ALT_2,IN_MIDNAME_2,IN_LASTNAME_2,IN_POSTNAME_2,IN_HOUSE_NUM_2,IN_STREET_NAME_2,IN_DIRECTIONAL_2,IN_ADDRESS2_2,IN_CITY_2,IN_STATE_2,IN_POSTAL_CODE_2







Now delete the CSV Match Source from the plan and add a new CSV Source, and point it at the new stub file. Because the components were originally mapped to the CSV Match Source and that was deleted, the fields within your plan need to be reselected. As you open the different match components and RBAs, you can see the different instances that need to be reselected as they appear with a red diamond, as seen below.

Also delete the CSV Match Sink and replace it with a CSV Sink. Only the match score field(s) must be selected for output. This plan will be imported into a passive transformation. Consequently, data can be passed around it and does not need to be carried through the transformation. With this implementation you can output multiple match scores so it is possible to see why two records matched or didn’t match on a field by field basis. Select the check box for Enable Real-time Processing in both the source and the sink and the plan will be ready to be imported into PowerCenter.

5. The mapping will consist of:
a. The source and target previously generated
b. An IDQ transformation importing the plan just built
c. The same IDQ cleansing and standardization transformations used to load the master data (refer to step 2 for specifics)
d. An Expression transformation to generate the group key and build a single directional field
e. A SQL transformation to get the candidate records from the master table
f. A Filter transformation to filter out those records whose match score is below a certain threshold
g. A Sequence transformation to build a unique key for each matching record returned in the SOAP response














Within PowerCenter Designer, create a new mapping and drag the web service source and target previously created into the mapping. Add the following country pack mapplets to standardize and validate the incoming record from the web service: ❍

mplt_dq_p_Personal_Name_Standardization_FML



mplt_dq_p_USA_Address_Validation



mplt_dq_p_USA_Phone_Standardization_Validation

Add an Expression Transformation and build the candidate key from the Address Validation mapplet output fields. Remember to use the same logic as in the mapping that loaded the customer master. Also within the expression, concatenate the pre and post directional field into a single directional field for matching purposes. Add a SQL transformation to the mapping. The SQL transform will present a dialog box with a few questions related to the SQL transformation. For this example select Query mode, MS SQL Server (change as desired), and a Static connection. For details on the other options refer to the PowerCenter help. Connect all necessary fields from the source qualifier, DQ mapplets, and Expression transformation to the SQL transformation. These fields should include: ❍

XPK_n4_Envelope (This is the Web Service message key)



Parsed name elements



Standardized and parsed address elements, which will be used for matching.



Standardized phone number

The next step is to build the query from within the SQL transformation to select the candidate records. Make sure that the output fields agree with the query in number, name, and type.

The output of the SQL transform will be the incoming customer record along with the candidate record.

These will be stacked records where the Input/Output fields will represent the input record and the Output only fields will represent the Candidate record. A simple example of this is shown in the table below where a single incoming record will be paired with two candidate records:











Comparing the new record to the candidates is done by embedding the IDQ plan converted in step 4 into the mapping through the use of the Data Quality transformation. When this transformation is created, select passive as the transformation type. The output of the Data Quality transformation will be a match score. This match score will be in a float type format between 0.0 and 1.0. Using a filter transformation, all records that have a match score below a certain threshold will get filtered off. For this scenario, the cut-off will be 80%. (Hint: TO_FLOAT(out_match_score) >= .80) Any record coming out of the filter transformation is a potential match that exceeds the specified threshold, and the record will be included in the response. Each of these records needs a new Unique ID so the Sequence Generator transformation will be used. To complete the mapping, the output of the Filter and Sequence Generator transformations need to be mapped to the target. Make sure to map the input primary key field (XPK_n4_Envelope_output) to the primary key field of the envelope group in the target (XPK_n4_Envelope) and to the foreign key of the response element group in the target (FK_n4_Envelope). Map the output of the Sequence Generator to the primary key field of the response element group. The mapping should look like this:


6. Before testing the mapping, create a workflow.





Using the Workflow Manager, generate a new workflow and session for this mapping using all the defaults. Once created, edit the session task. On the Mapping tab select the SQL transformation and make sure the connection type is relational. Also make sure to select the proper connection. For more advanced tweaking and web service settings see the PowerCenter documentation.

The final step is to expose this workflow as a Web Service. This is done by editing the workflow. The workflow needs to be Web Services enabled, and this is done by selecting the enabled checkbox for Web Services. Once the Web Service is enabled, it should be configured. For all the specific details of this please refer to the PowerCenter documentation, but for the purpose of this scenario:
a. Give the service the name you would like to see exposed to the outside world
b. Set the timeout to 30 seconds
c. Allow 2 concurrent runs
d. Set the workflow to be visible and runnable

7. The web service is ready for testing.
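A simple way to smoke-test the service from outside PowerCenter is to post a SOAP request directly to the Web Services Hub endpoint. In the sketch below, the URL, element names, namespace and SOAPAction value are placeholders only; the real ones must be taken from the WSDL that the Web Services Hub generates for the workflow.

    import requests

    # Placeholder endpoint; use the service URL published by the Web Services Hub.
    ENDPOINT = "http://powercenter-host:7333/wsh/services/RealTime/CustomerMatch"

    envelope = """<?xml version="1.0" encoding="UTF-8"?>
    <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
      <soapenv:Body>
        <!-- Replace with the request element defined in the generated WSDL -->
        <CustomerMatchRequest>
          <FullName>Robert Smith</FullName>
          <Address>100 Main St</Address>
          <City>Anytown</City>
          <State>CA</State>
          <Zip>90210</Zip>
          <Phone>555-555-1234</Phone>
        </CustomerMatchRequest>
      </soapenv:Body>
    </soapenv:Envelope>"""

    response = requests.post(
        ENDPOINT,
        data=envelope,
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": ""},   # value, if required, comes from the WSDL
        timeout=30)
    print(response.status_code)
    print(response.text)   # candidate matches and their match scores, if any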


Testing Data Quality Plans

Challenge

To provide a guide for testing data quality processes or plans created using Informatica Data Quality (IDQ) and to manage some of the unique complexities associated with data quality plans.

Description

Testing data quality plans is an iterative process that occurs as part of the Design Phase of Velocity. Plan testing often precedes the project’s main testing activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to determine that the plans are being developed in accordance with design and project requirements. This method of iterative testing helps support rapid identification and resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content issues. These are not typically cut-and-dried problems, but more often represent a continuum of data improvement issues where it is possible that every data instance is unique and there is a target level of data quality rather than a “right or wrong answer”. Data quality plans tend to resolve problems in terms of percentages and probabilities that a problem is fixed. For example, the project may set a target of 95 percent accuracy in its customer addresses. The level of acceptable inaccuracy is also likely to change over time, based upon the importance of a given data field to the underlying business process. As well, accuracy should continuously improve as the data quality rules are applied and the existing data sets adhere to a higher standard of quality.

Common Questions in Data Quality Plan Testing

● What dataset will you use to test the plans? While the ideal situation is to use a data set that exactly mimics the project production data, you may not gain access to this data. If you obtain a full cloned set of the project data for testing purposes, bear in mind that some plans (specifically some data matching plans) can take several hours to complete. Consider testing data matching plans overnight.





● Are the plans using reference dictionaries? Reference dictionary management is an important factor since it is possible to make changes to a reference dictionary independently of IDQ and without making any changes to the plan itself. When you pass an IDQ plan as tested, you must ensure that no additional work is carried out on any dictionaries referenced in the plan. Moreover, you must ensure that the dictionary files reside in locations that are valid for IDQ.
● How will the plans be executed? Will they be executed on a remote IDQ Server and/or via a scheduler? In cases like these, it’s vital to ensure that your plan resources, including source data files and reference data files, are in valid locations for use by the Data Quality engine. For details on the local and remote locations to which IDQ looks for source and reference data files, refer to the Informatica Data Quality 8.5 User Guide.
● Will the plans be integrated into a PowerCenter transformation? If so, the plans must have real-time enabled data source and sink components.

Strategies for Testing Data Quality Plans

The best practice steps for testing plans can be grouped under two headings.

Testing to Validate Rules

1. Identify a small, representative sample of source data.
2. To determine the results to expect when the plans are run, manually process the data based on the rules for profiling, standardization or matching that the plans will apply.
3. Execute the plans on the test dataset and validate the plan results against the manually-derived results.
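Where the sample contains more than a handful of records, the comparison in step 3 can be automated. The sketch below assumes the manually derived expected results and the plan output are both available as CSV files sharing a key column; the file and column names are hypothetical.

    import csv

    # Compare manually derived expected results with the plan's actual output.
    def load(path, key="record_id"):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    expected = load("expected_results.csv")
    actual = load("plan_output.csv")

    mismatches = [rid for rid, row in expected.items()
                  if actual.get(rid, {}).get("result") != row["result"]]
    print(f"{len(mismatches)} of {len(expected)} sample records differ from the expected results")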

Testing to Validate Plan Effectiveness

This process is concerned with establishing that a data enhancement plan has been properly designed; that is, that the plan delivers the required improvements in data quality. This is largely a matter of comparing the business and project requirements for data quality and establishing if the plans are on course to deliver these. If not, the plans may need a thorough redesign – or the business and project targets may need to be revised. In either case, discussions should be held with the key business stakeholders to review the results of the IDQ plan and determine the appropriate course of action. In addition, once the entire data set is processed against the business rules, there may be other data anomalies that were unaccounted for that may require additional modifications to the underlying business rules and IDQ plans.

INFORMATICA CONFIDENTIAL

BEST PRACTICES

301 of 954

Tuning Data Quality Plans

Challenge

This document gives an insight into the type of considerations and issues a user needs to be aware of when making changes to data quality processes defined in Informatica Data Quality (IDQ). In IDQ, data quality processes are called plans. The principal focus of this best practice is to know how to tune your plans without adversely affecting the plan logic. This best practice is not intended to replace training materials, but to serve as a guide for decision making in the areas of adding, removing or changing the operational components that comprise a data quality plan.

Description

You should consider the following questions prior to making changes to a data quality plan:









● What is the purpose of changing the plan? You should consider changing a plan if you believe the plan is not optimally configured, if the plan is not functioning properly and there is a problem at execution time, or if the plan is not delivering expected results as per the plan design principles.
● Are you trained to change the plan? Data quality plans can be complex. You should not alter a plan unless you have been trained or are highly experienced with IDQ methodology.
● Is the plan properly documented? You should ensure all plan documentation on the data flow and the data components is up-to-date. For guidelines on documenting IDQ plans, see the Sample Deliverable Data Quality Plan Design.
● Have you backed up the plan before editing? If you are using IDQ in a client-server environment, you can create a baseline version of the plan using IDQ version control functionality. In addition, you should copy the plan to a new project folder (viz., Work_Folder) in the Workbench for changing and testing, and leave the original plan untouched during testing.
● Is the plan operating directly on production data? This applies especially to standardization plans. When editing a plan, always work on staged data (database or flat-file). You can later migrate the plan to the production environment after complete and thorough testing.


You should have a clear goal whenever you plan to change an existing plan. An event may prompt the change: for example, input data changing (in format or content), or changes in business rules or business/project targets. You should take into account all current change-management procedures, and the updated plans should be thoroughly tested before production processes are updated. This includes integration and regression testing too. (See also Testing Data Quality Plans.) Bear in mind that at a high level there are two types of data quality plans: data analysis and data enhancement plans. ●



● Data analysis plans produce reports on data patterns and data quality across the input data. The key objective in data analysis is to determine the levels of completeness, conformity, and consistency in the dataset. In pursuing these objectives, data analysis plans can also identify cases of missing, inaccurate or “noisy” data.
● Data enhancement plans correct completeness, conformity and consistency problems; they can also identify duplicate data entries and fix accuracy issues through the use of reference data.

Your goal in a data analysis plan is to discover the quality and usability of your data. It is not necessarily your goal to obtain the best scores for your data. Your goal in a data enhancement plan is to resolve the data quality issues discovered in the data analysis.

Adding Components

In general, simply adding a component to a plan is not likely to directly affect results if no further changes are made to the plan. However, once the outputs from the new component are integrated into existing components, the data process flow is changed and the plan must be re-tested and results reviewed in detail before migrating the plan into production. Bear in mind, particularly in data analysis plans, that improved plan statistics do not always mean that the plan is performing better. It is possible to configure a plan that moves “beyond the point of truth” by focusing on certain data elements and excluding others.

When added to existing plans, some components have a larger impact than others. For example, adding a “To Upper” component to convert text into upper case may not cause the plan results to change meaningfully, although the presentation of the output data will change. However, adding and integrating a Rule Based Analyzer component (designed to apply business rules) may cause a severe impact, as the rules are likely to change the plan logic.

As well as adding a new component — that is, a new icon — to the plan, you can add a new instance to an existing component. This can have the same effect as adding and integrating a new component icon. To avoid overloading a plan with too many components, it is a good practice to add multiple instances to a single component, within reason. Good plan design suggests that instances within a single component should be logically similar and work on the selected inputs in similar ways. The overall name for the component should also be changed to reflect the logic of the instances contained in the component. If you add a new instance to a component, and that instance behaves very differently to the other instances in that component — for example, if it acts on an unrelated set of outputs or performs an unrelated type of action on the data — you should probably add a new component for this instance. This will also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often a good practice to split tasks into multiple plans where a large number of data quality measures need to be checked. This makes plans and business rules easier to maintain and provides a good framework for future development. For example, in an environment where a large number of attributes must be evaluated against the six standard data quality criteria (i.e., completeness, conformity, consistency, accuracy, duplication and consolidation), using one plan per data quality criterion may be a good way to move forward. Alternatively, splitting plans up by data entity may be advantageous. Similarly, during standardization, you can create plans for specific function areas (e.g., address, product, or name) as opposed to adding all standardization tasks to a single large plan. For more information on the six standard data quality criteria, see Data Cleansing.

Removing Components

Removing a component from a plan is likely to have a major impact since, in most cases, the data flow in the plan will be broken. If you remove an integrated component, configuration changes are required to all components that use the outputs from the removed component; the plan cannot run until these configuration changes are completed. The only exceptions are cases where the output(s) of the removed component are used solely by a CSV Sink component or by a frequency component. Even in these cases, note that the plan output changes, since the column(s) no longer appear in the result set.


Editing Component Configurations

Changing the configuration of a component can have an impact on the overall plan comparable to adding or removing a component: the plan's logic changes, and therefore so do the results it produces. However, while adding or removing a component may make a plan non-executable, changing the configuration of a component can affect the results in more subtle ways. For example, changing the reference dictionary used by a parsing component does not "break" a plan, but it may have a major impact on the resulting output. Similarly, changing the name of a component instance output does not break a plan. By default, component output names "cascade" through the other components in the plan, so when you change an output name, all subsequent components automatically update with the new output name; it is not necessary to change the configuration of dependent components.

Last updated: 26-May-08 11:12


Using Data Explorer for Data Discovery and Analysis

Challenge

To understand and make full use of Informatica Data Explorer's potential to profile and define mappings for your project data. Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration, consolidation, or integration, from data warehouse/data mart development, ERP migrations, and enterprise application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate understanding of the true structure of the source data in order to correctly transform the data for a given target database design. However, the data's actual form rarely coincides with its documented or supposed form. The key to success for data-related projects is to fully understand the data as it actually is, before attempting to cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this purpose. This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.

Description

Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality, content, and structure of project data sources. Data profiling analyzes several aspects of data structure and content, including characteristics of each column or field, the relationships between fields, and the commonality of data values between fields, which is often an indicator of redundant data.

Data Profiling

Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality standards may either be the native rules expressed in the source data's metadata, or an external standard (e.g., corporate, industry, or government) to which the source data must be mapped in order to be assessed.


Data profiling in IDE is based on two main processes:

●  Inference of characteristics from the data

●  Comparison of those characteristics with specified standards, as an assessment of data quality

Data mapping involves establishing relationships among data elements in various data structures or sources, in terms of how the same information is expressed or stored in different ways in different sources. By performing these processes early in a data project, IT organizations can preempt the “code/load/explode” syndrome, wherein a project fails at the load stage because the data is not in the anticipated form. Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure summarizes and abstracts these scenarios into a single depiction of the IDE solution.

The overall process flow for the IDE Solution is as follows:

1. Data and metadata are prepared and imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas.

3. The resultant metadata are exported to and managed in the IDE Repository.

4. In a derived-target scenario, the project team designs the target database by modeling the existing data sources and then modifying the model as required to meet current business and performance requirements. In this scenario, IDE is used to develop the normalized schema into a target database. The normalized and target schemas are then exported to IDE's FTM/XML tool, which documents transformation requirements between fields in the source, normalized, and target schemas.

OR

5. In a fixed-target scenario, the design of the target database is a given (i.e., because another organization is responsible for developing it, or because an off-the-shelf package or industry standard is to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to map the source data fields to the corresponding fields in an externally-specified target schema, and to document transformation requirements between fields in the normalized and target schemas. FTM is used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based metadata structures. Externally-specified targets are typical for ERP package migrations, business-to-business integration projects, or situations where a data modeling team is independently designing the target schema.

6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE applications.

IDE's Methods of Data Profiling

IDE employs three methods of data profiling:

●  Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely metadata and alternate metadata which is consistent with the data.

●  Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This process can discover primary and foreign keys, functional dependencies, and sub-tables.

●  Cross-Table profiling - determines the overlap of values across a set of columns, which may come from multiple tables.
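
The Python sketch below illustrates the kind of metadata a column profile infers for each field: an inferred type, length range, null count, and cardinality. It is not IDE's algorithm; the sample columns, the 0.95 threshold, and the type labels are assumptions made for the example.

from collections import Counter

def profile_column(name, values):
    # Keep only populated values; everything else counts as a null.
    non_null = [v for v in values if v not in (None, "")]
    lengths = [len(str(v)) for v in non_null]

    def looks_numeric(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    numeric_ratio = (sum(looks_numeric(v) for v in non_null) / len(non_null)
                     if non_null else 0.0)
    return {
        "column": name,
        "inferred_type": "NUMERIC" if numeric_ratio > 0.95 else "VARCHAR",
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "null_count": len(values) - len(non_null),
        "distinct_values": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical sample columns keyed by column name.
sample = {"ZIP": ["94063", "94063", "10001", "", "7302"],
          "STATE": ["CA", "CA", "NY", "NJ", "NJ"]}
for column, values in sample.items():
    print(profile_column(column, values))

A real profile would then compare these inferred characteristics against the declared metadata or an external standard, which is the second of the two processes listed above.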


Profiling against external standards requires that the data source be mapped to the standard before being assessed (as shown in the following figure). Note that the mapping is performed by IDE’s Fixed Target Mapping tool (FTM). IDE can also be used in the development and application of corporate standards, making them relevant to existing systems as well as to new systems.

Data profiling projects may involve iterative profiling and cleansing as well since data cleansing may improve the quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality should be considered as an alternative tool for data cleansing.

IDE and Fixed-Target Migration

Fixed-target migration projects involve the conversion and migration of data from one or more sources to an externally defined, or fixed, target. IDE is used to profile the data and develop a normalized schema representing the data source(s), while IDE's Fixed Target Mapping tool (FTM) is used to map from the normalized schema to the fixed target. The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.

3. The resultant metadata are exported to and managed by the IDE Repository.

4. FTM maps the source data fields to the corresponding fields in an externally-specified target schema, and documents transformation requirements between fields in the normalized and target schemas. Externally-specified targets are typical for ERP migrations or projects where a data modeling team is independently designing the target schema.

5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE and FTM.

6. The cleansing, transformation, and formatting specs can be used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may discover ‘hidden’ tables within tables.


Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to establish several of the staging databases between the sources and target, as shown below:

Derived-Target Migration

Derived-target migration projects involve the conversion and migration of data from one or more sources to a target database defined by the migration team. IDE is used to profile the data and develop a normalized schema representing the data source(s), and to further develop the normalized schema into a target schema by adding tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or denormalizing the schema to enhance performance. When the target schema is developed from the normalized schema within IDE, the product automatically maintains the mappings from the source to normalized schema, and from the normalized to target schemas. The figure below shows that the general sequence of activities for a derived-target migration project is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.

2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and document cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.

3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves removing obsolete or spurious data elements, incorporating new business requirements and data elements, adapting to corporate data standards, and denormalizing to enhance performance.

4. The resultant metadata are exported to and managed by the IDE Repository.

5. FTM is used to develop and document transformation requirements between the normalized and target schemas. The mappings between the data elements are automatically carried over from the IDE-based schema development process.

6. The IDE Repository is used to export an XSLT document containing the transformation and formatting specs developed with IDE and FTM/XML.

7. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

Last updated: 09-Feb-07 12:55


Working with Pre-Built Plans in Data Cleanse and Match

Challenge

To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data Cleanse and Match (DC&M) product offering. Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter system:

●  Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until needed.

●  Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users can connect to the Data Quality repository and read data quality plan information into this transformation.

Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address reference data files. This document covers the following areas:

●  when to use one plan vs. another for data cleansing.

●  what behavior to expect from the plans.

●  how best to manage exception data.

Description

The North America Content Pack installs several plans to the Data Quality Repository:

●  Plans 01-04 are designed to parse, standardize, and validate United States name and address data.

●  Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual source matching operations (identifying matching records between two datasets).

The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.

Plans 01-04: Parsing, Cleansing, and Validation

These plans provide modular solutions for name and address data. The plans can operate on highly unstructured and well-structured data sources. The level of structure contained in a given data set determines the plan to be used. The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and validate an address.


In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields, only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address standardization, and validation plans may be required to obtain meaning from the data. (A simple sketch of this decision logic follows the list below.) The purpose of making the plans modular is twofold:

●  It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in sequence with each other. For example, the address validation plan (plan 04) can be run successfully to validate input addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on the same dataset; Plans 01 and 02 are not designed to operate in sequence, nor are Plans 06 and 07.

●  Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation and extremely complex plan logic that would be difficult to modify and maintain.
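
As a simple illustration of the plan-selection logic above, the following Python sketch chooses a plan combination from the level of structure in the source addresses. The structure labels and the function itself are assumptions for illustration; they are not part of the Content Pack.

def plans_for_address_source(structure_level):
    # Well-structured: city, state, and zip already sit in labeled, discrete fields.
    if structure_level == "discrete_city_state_zip_fields":
        return ["04 NA Address Validation"]
    # Address data is present but spread across generic Address1-Address5 fields.
    if structure_level == "unlabeled_address_lines":
        return ["03 US Canada Standardization", "04 NA Address Validation"]
    # Extreme case: the data is not mapped to address columns at all.
    return ["01 General Parser", "03 US Canada Standardization",
            "04 NA Address Validation"]

print(plans_for_address_source("unlabeled_address_lines"))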

01 General Parser

The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example, consider data stored in the following format:

Field1           | Field2           | Field3           | Field4            | Field5
100 Cardinal Way | Informatica Corp | CA 94063         | [email protected] | Redwood City
Redwood City     | 38725            | 100 Cardinal Way | CA 94063          | [email protected]

While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses, depending on the profile of the content. As a result, the above data will be parsed into the following format:


Address1         | Address2         | Address3     | E-mail            | Date       | Company
100 Cardinal Way | CA 94063         | Redwood City | [email protected] |            | Informatica Corp
Redwood City     | 100 Cardinal Way | CA 94063     | [email protected] | 08/01/2006 |

The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by information type. As demonstrated with the address fields in the above example, the address fields are labeled as addresses; the contents are not arranged in a standard address format, but are flagged as addresses in the order in which they were processed in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and an address element are contained in the same field, the General Parser labels the entire field either a name or an address - or leaves it unparsed - depending on the elements in the field it can identify first (if any). While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form containing unparsed data.

The effectiveness of the General Parser in recognizing various information types is a function of the dictionaries used to identify that data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan. Overall, the General Parser is likely to be used only in limited cases, where certain types of information may be mixed together (e.g., telephone and email in the same contact field), or in cases where the data has been badly managed, such as when several files of differing structures have been merged into a single file.
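
The following Python sketch illustrates the general idea of dictionary- and pattern-based type classification. The regular expressions, dictionary entries, and sample row are assumptions made for the example; they are not the rules or reference data that ship with the General Parser plan.

import re

PATTERNS = [
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),     # structure only, not validity
    ("phone", re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")),
    ("address", re.compile(r"^\d+\s+\w+")),                  # leading street number
    ("address", re.compile(r"^[A-Z]{2}\s+\d{5}(-\d{4})?$")), # state plus zip
]
CITY_DICTIONARY = {"redwood city", "chicago"}                # stand-in for a dictionary file
COMPANY_SUFFIXES = {"corp", "inc", "ltd", "llc"}

def classify(value):
    text = value.strip()
    if not text:
        return "unparsed"
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    if text.lower() in CITY_DICTIONARY:
        return "address"                                     # city names are flagged as address data
    if text.lower().split()[-1].rstrip(".") in COMPANY_SUFFIXES:
        return "company"
    return "unparsed"

row = ["100 Cardinal Way", "Informatica Corp", "CA 94063", "Redwood City", "08/01/2006"]
print([(value, classify(value)) for value in row])

As the text notes, a value such as 99/99/9999 would still be labeled a date by logic like this, because only the structure is tested, and overall classification quality depends directly on the dictionary content.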

02 Name Standardization

The Name Standardization plan is designed to take in person name or company name information and apply parsing and standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data that is entered in these fields is assumed to be a valid company name, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters, numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to validate a company name are not likely to yield usable results. Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track is person name standardization. While this track is dedicated to standardizing person names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is applied to identify people whose last name is similar (or equal) to a valid company name (for example, John Sears); inputs that contain an identified first name and a company name are treated as a person name. If the company name track inputs are already fully populated for the record in question, then any company name detected in a person name column is moved to a field for unparsed company name output. If the name is not recognized as a company name (e.g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining data is accepted as being a valid person name and parsed as such.

North American person names are typically entered in one of two styles: either a "firstname middlename surname" format or a "surname, firstname middlename" format. Name parsing algorithms have been built using this assumption. Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name prefixes, name suffixes, firstnames, and any extraneous data ("noise") present. Any remaining details are assumed to be middle name or surname details. A rule is applied to the parsed details to check if the name has been parsed correctly. If not, "best guess" parsing is applied to the field based on the possible assumed formats. When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details, including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated. In cases where no clear gender can be derived from the first name, the gender field is typically left blank or indeterminate. The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed according to person name processing rules. Likewise, some person names may be identified as companies and standardized according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required. Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the fields. For example, an address datum such as "Corporate Parkway" may be standardized as a business name, as "Corporate" is also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on whether or not the field contains a recognizable company suffix in the text.

To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-execution analysis of the data. Based on the following input:

ROW ID | IN NAME1
1      | Steven King
2      | Chris Pope Jr.
3      | Shannon C. Prince
4      | Dean Jones
5      | Mike Judge
6      | Thomas Staples
7      | Eugene F. Sears
8      | Roy Jones Jr.
9      | Thomas Smith, Sr
10     | Eddie Martin III
11     | Martin Luther King, Jr.
12     | Staples Corner
13     | Sears Chicago
14     | Robert Tyre
15     | Chris News

The following outputs are produced by the Name Standardization plan:


The last entry (Chris News) is identified as a company in the current plan configuration – such results can be refined by changing the underlying dictionary entries used to identify company and person names.
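
A rough Python sketch of the routing and parsing behavior described above follows. The tiny suffix and first-name dictionaries stand in for the plan's reference dictionaries, and the logic is a simplified approximation, not the plan's actual implementation.

COMPANY_SUFFIXES = {"corp", "inc", "co", "ltd", "llc"}
FIRST_NAMES = {"steven", "chris", "shannon", "dean", "mike", "thomas",
               "eugene", "roy", "eddie", "martin", "robert", "john"}
NAME_SUFFIXES = {"jr", "sr", "ii", "iii"}

def standardize_name(value):
    tokens = value.replace(",", " ").split()
    lowered = [t.lower().rstrip(".") for t in tokens]
    # Route to the company track if a company suffix is present and the value
    # does not start with a recognized first name.
    if any(t in COMPANY_SUFFIXES for t in lowered) and lowered[0] not in FIRST_NAMES:
        return {"type": "company", "name": " ".join(tokens)}
    # Values containing digits are routed to non-name data.
    if any(ch.isdigit() for ch in value):
        return {"type": "non_name_data", "value": value}
    # Person track: strip a generational suffix, then assume a
    # "firstname [middlename] surname" order.
    suffix = ""
    if lowered and lowered[-1] in NAME_SUFFIXES:
        suffix = tokens[-1]
        tokens = tokens[:-1]
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])
    return {"type": "person", "first": first, "middle": middle,
            "last": last, "suffix": suffix}

for raw in ["Chris Pope Jr.", "Martin Luther King, Jr.", "Informatica Corp"]:
    print(raw, "->", standardize_name(raw))

Gender and salutation derivation would follow the same pattern: look up the parsed first name in a reference dictionary and generate the salutation from the result.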

03 US Canada Standardization

This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key search elements into discrete fields, thereby speeding up the validation process. The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that cannot be parsed into the remaining fields is merged into the non-address data field.

The plan makes a number of assumptions that may or may not suit your data:

●  When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are spelled correctly. Tolerance for variation in town/city names is very limited, and in cases where punctuation differences exist or where town names are commonly misspelled, the standardization plan may not correctly parse the information.

●  Zip codes are all assumed to be five-digit. In some files, zip codes that begin with "0" may lack this first number and so appear as four-digit codes; these may be missed during parsing. Adding four-digit zips to the dictionary is not recommended, as these will conflict with the "Plus 4" element of a zip code.

●  Zip codes may also be confused with other five-digit numbers in an address line, such as street numbers.

●  City names are also commonly found in street names and other address elements. For example, "United" is part of a country name (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates from right to left across the data, so that country name and zip code fields are analyzed before city names and street addresses. Therefore, the word "United" may be parsed and written as the town name for a given address before the actual town name datum is reached.

●  The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there is no need to include any country code field in the address inputs when configuring the plan.

Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding some pre-processing logic to a workflow prior to passing the data into the plan. The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as address lines 1-3.
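
For example, one possible piece of pre-processing logic (an assumption for illustration, not part of the shipped plans) is to restore a leading zero to zip codes that have been truncated to four digits before the data reaches the standardization plan:

def repair_zip(zip_code):
    # Restore the leading zero that files sometimes drop from five-digit zips.
    digits = zip_code.strip()
    if digits.isdigit() and len(digits) == 4:
        return digits.zfill(5)          # "7302" -> "07302"
    return digits

print(repair_zip("7302"))    # 07302
print(repair_zip("94063"))   # unchanged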

04 NA Address Validation

The purposes of the North America Address Validation plan are:

●  To match input addresses against known valid addresses in an address database, and

●  To parse, standardize, and enrich the input addresses.


Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times. The address validation APIs store specific area information in memory and continue to use that information from one record to the next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to maximize the usage of data in memory. In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1 User Guide for information on how to interpret them.

Plans 05-07: Pre-Match Standardization, Grouping, and Matching

These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs: users run either plans 05 and 06 or plans 05 and 07. The plans work as follows:

●  05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on the data prior to matching.

●  06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.

●  07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.

Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English data. Although they work with datasets in other languages, the results may be sub-optimal.

Matching Concepts

To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and group the data. The aim of standardization here is different from that of a classic standardization plan: the intent is to ensure that different spellings, abbreviations, and so on are rendered as similar to each other as possible in order to return a better match set. For example, 123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset. Grouping performs two functions: it sorts the records in a dataset to increase matching plan performance, and it creates new data columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John Smith with the record for Angela Murphy, as they are obviously not going to be considered duplicate entries. The type of grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data are location-based (e.g., city name, zip codes) or person/company-based (surname and company name composites). For more information on grouping strategies that balance results and performance, see the Best Practice Effective Data Matching Techniques.

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It offers a number of grouping options. The plan generates the following group keys:




●  OUT_ZIP_GROUP: first 5 digits of ZIP code

●  OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name

●  OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name

●  OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name

●  OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name

The grouping output used depends on the data contents and data volume.
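
The Python sketch below shows how group keys of this kind can be derived from a record. The field names are assumptions, and the keys are really generated inside the plan rather than in external code; note also that the real plan builds the company keys from the cleansed company name.

def build_group_keys(record):
    zip5 = record.get("zip", "")[:5]
    last_name = record.get("last_name", "").upper()
    company = record.get("company", "").upper()
    return {
        "OUT_ZIP_GROUP": zip5,
        "OUT_ZIP_NAME3_GROUP": zip5 + last_name[:3],
        "OUT_ZIP_NAME5_GROUP": zip5 + last_name[:5],
        "OUT_ZIP_COMPANY3_GROUP": zip5 + company[:3],
        "OUT_ZIP_COMPANY5_GROUP": zip5 + company[:5],
    }

print(build_group_keys({"zip": "94063-1234", "last_name": "Smith",
                        "company": "Informatica Corp"}))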

Plans 06 Single Source Matching and 07 Dual Source Matching

Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used. However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression transformation upstream in the PowerCenter mapping. A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ matching components, consult the Informatica Data Quality 3.1 User Guide. By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The Data Quality Developer can easily adjust this figure in each plan.
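
The sketch below illustrates the weight-based idea in simplified Python terms: per-field similarity scores are combined with weights and compared to the default 85 percent threshold. The field weights and the similarity measure here are assumptions for illustration; the plans themselves use IDQ matching components and a custom rule.

from difflib import SequenceMatcher

FIELD_WEIGHTS = {"name": 0.4, "address": 0.4, "zip": 0.2}   # assumed weights

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec1, rec2):
    # Weighted sum of per-field similarity scores, in the range 0.0 to 1.0.
    return sum(weight * similarity(rec1[field], rec2[field])
               for field, weight in FIELD_WEIGHTS.items())

rec_a = {"name": "John Smith", "address": "123 Main Road", "zip": "94063"}
rec_b = {"name": "Jon Smith", "address": "123 Main Rd", "zip": "94063"}
score = match_score(rec_a, rec_b)
if score >= 0.85:                                           # default output threshold
    print("Candidate duplicate, score =", round(score, 3))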

PowerCenter Mappings

When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.

To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts data according to the group key to be used during matching. This transformation should follow standardization and grouping operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active transformation.

The developer can add a Sequence Generator transformation to the mapping to generate a unique identifier for each input record if one is not present in the source data. (Note that a unique identifier is not required for matching processes.) When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for the Integration transformation. Expression transformations are used to label each input with a source tag of A and B, respectively. The data from the two sources is then joined together using a Union transformation before being passed to the Integration transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single source version.
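
The Python sketch below mirrors that pre-processing flow outside PowerCenter: tag each source, union the rows, and group on a shared key so that only records in the same group are compared across the two sources. The field names and sample rows are assumptions; in the mapping itself this work is done by Expression, Union, and Integration transformations.

from itertools import chain

source_a = [{"id": 1, "name": "John Smith", "zip": "94063"}]
source_b = [{"id": 7, "name": "Jon Smith", "zip": "94063"}]

def tag(rows, source_tag):
    # Equivalent in spirit to an Expression transformation adding a Source tag.
    return [dict(row, source=source_tag) for row in rows]

combined = list(chain(tag(source_a, "A"), tag(source_b, "B")))   # the Union step

# Group on a shared key (here the 5-digit zip) so that matching only compares
# records that fall into the same group.
groups = {}
for row in combined:
    groups.setdefault(row["zip"][:5], []).append(row)

for key, rows in groups.items():
    pairs = [(a, b) for a in rows if a["source"] == "A"
             for b in rows if b["source"] == "B"]
    print(key, "->", len(pairs), "cross-source pair(s) to compare")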

Last updated: 09-Feb-07 13:18


Designing Data Integration Architectures

Challenge

Develop a sound data integration architecture that can serve as a foundation for data integration solutions.

Description

Historically, organizations have approached the development of a "data warehouse" or "data mart" as a departmental effort, without considering an enterprise perspective. The result has been silos of corporate data and analysis, which very often conflict with each other in terms of both detailed data and the business conclusions implied by it. Data integration efforts are often the cornerstone of today's IT initiatives. Taking an enterprise-wide, architectural stance in developing data integration solutions provides many advantages, including:

●  A sound architectural foundation ensures the solution can evolve and scale with the business over time.

●  Proper architecture can isolate the application component (business context) of the data integration solution from the technology.

●  Broader data integration efforts will be simplified by using a holistic, enterprise-based approach.

●  Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.

As the evolution of data integration solutions (and the corresponding nomenclature) has progressed, the necessity of building these solutions on a solid architectural framework has become more and more clear. To understand why, a brief review of the history of data integration solutions and their predecessors is warranted. As businesses become more global, Service Oriented Architecture (SOA) becomes more of an Information Technology standard, and having a solid architecture is paramount to the success of data integration efforts.

Historical Perspective

Online Transaction Processing Systems (OLTPs) have always provided a very detailed, transaction-oriented view of an organization's data. While this view was indispensable for the day-to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for management decision-making, was severely limited. Initial attempts to address this problem took several directions:

●  Reporting directly against the production system. This approach minimized the effort associated with developing management reports, but introduced a number of significant issues:
   - The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the year, month, or even the day, were inconsistent with each other.
   - Ad hoc queries against the production database introduced uncontrolled performance issues, resulting in slow reporting results and degradation of OLTP system performance.
   - Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the OLTP systems.

●  Mirroring the production system in a reporting database. While this approach alleviated the performance degradation of the OLTP system, it did nothing to address the other issues noted above.

●  Reporting databases. To address the fundamental issues associated with reporting against the OLTP schema, organizations began to move toward dedicated reporting databases. These databases were optimized for the types of queries typically run by analysts, rather than those used by systems supporting data entry clerks or customer service representatives. These databases may or may not have included pre-aggregated data, and took several forms, including traditional RDBMS as well as newer technology Online Analytical Processing (OLAP) solutions.

The initial attempts at reporting solutions were typically point solutions; they were developed internally to provide very targeted data to a particular department within the enterprise. For example, the Marketing department might extract sales and demographic data in order to infer customer purchasing habits. Concurrently, the Sales department was also extracting sales data for the purpose of awarding commissions to the sales force. Over time, these isolated silos of information became irreconcilable, since the extracts and business rules applied to the data during the extract process differed for the different departments. The result of this evolution was that the Sales and Marketing departments might report completely different sales figures to executive management, resulting in a lack of confidence in both departments' "data marts." From a technical perspective, the uncoordinated extracts of the same data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would be supported by a single set of periodic extracts of all relevant data into the data warehouse (or Operational Data Store), with the data being cleansed and made consistent as part of the extract process. The problem with this solution was its enormous complexity, typically resulting in project failure. The scale of these failures led many organizations to abandon the concept of the enterprise data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these solutions still had all of the issues discussed previously, they had the clear advantage of providing individual departments with the data they needed without the unmanageability of the enterprise solution.

As individual departments pursued their own data and data integration needs, they not only created data stovepipes, they also created technical islands. The approaches to populating the data marts and performing the data integration tasks varied widely, resulting in a single enterprise evaluating, purchasing, and being trained on multiple tools and adopting multiple methods for performing these tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to face the daunting challenge of integrating the disparate data as well as the widely varying technologies. To deal with these issues, organizations began developing approaches that considered the enterprise-level requirements of a data integration solution.

Centralized Data Warehouse

The first approach to gain popularity was the centralized data warehouse, designed to solve the decision support needs of the entire enterprise at one time, with one effort. The data integration process extracts the data directly from the operational systems, transforms it according to the business rules, and loads it into a single target database serving as the enterprise-wide data warehouse.

Advantages

The centralized model offers a number of benefits to the overall architecture, including:

●  Centralized control. Since a single project drives the entire process, there is centralized control over everything occurring in the data warehouse. This makes it easier to manage a production system while concurrently integrating new components of the warehouse.

●  Consistent metadata. Because the warehouse environment is contained in a single database and the metadata is stored in a single repository, the entire enterprise can be queried whether you are looking at data from Finance, Customers, or Human Resources.

●  Enterprise view. Developing the entire project at one time provides a global view of how data from one workgroup coordinates with data from others. Since the warehouse is highly integrated, different workgroups often share common tables such as customer, employee, and item lists.

●  High data integrity. A single, integrated data repository for the entire enterprise would naturally avoid all data integrity issues that result from duplicate copies and versions of the same business data.

Disadvantages

Of course, the centralized data warehouse also involves a number of drawbacks, including:

●  Lengthy implementation cycle. With the complete warehouse environment developed simultaneously, many components of the warehouse become daunting tasks, such as analyzing all of the source systems and developing the target data model. Even minor tasks, such as defining how to measure profit and establishing naming conventions, snowball into major issues.

●  Substantial up-front costs. Many analysts who have studied the costs of this approach agree that this type of effort nearly always runs into the millions. While this level of investment is often justified, the problem lies in the delay between the investment and the delivery of value back to the business.

●  Scope too broad. The centralized data warehouse requires a single database to satisfy the needs of the entire organization. Attempts to develop an enterprise-wide warehouse using this approach have rarely succeeded, since the goal is simply too ambitious. As a result, this wide scope has been a strong contributor to project failure.

●  Impact on the operational systems. Different tables within the warehouse often read data from the same source tables, but manipulate it differently before loading it into the targets. Since the centralized approach extracts data directly from the operational systems, a source table that feeds into three different target tables is queried three times to load the appropriate target tables in the warehouse. When combined with all the other loads for the warehouse, this can create an unacceptable performance hit on the operational systems.

●  Potential integration challenges. A centralized data warehouse has the disadvantage of limited scalability. As businesses change and consolidate, adding new interfaces and/or merging a potentially disparate data source into the centralized data warehouse can be a challenge.

Independent Data Mart

The second warehousing approach is the independent data mart, which gained popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the same principles as the centralized approach, but it scales down the scope from solving the warehousing needs of the entire company to the needs of a single department or workgroup. Much like the centralized data warehouse, an independent data mart extracts data directly from the operational sources, manipulates the data according to the business rules, and loads a single target database serving as the independent data mart. In some cases, the operational data may be staged in an Operational Data Store (ODS) and then moved to the mart.

Advantages

The independent data mart is the logical opposite of the centralized data warehouse. The disadvantages of the centralized approach are the strengths of the independent data mart:

●  Impact on operational databases localized. Because the independent data mart is trying to solve the DSS needs of a single department or workgroup, only the few operational databases containing the information required need to be analyzed.

●  Reduced scope of the data model. The target data modeling effort is vastly reduced since it only needs to serve a single department or workgroup, rather than the entire company.

●  Lower up-front costs. The data mart is serving only a single department or workgroup; thus hardware and software costs are reduced.

●  Fast implementation. The project can be completed in months, not years. The process of defining business terms and naming conventions is simplified since "players from the same team" are working on the project.

Disadvantages

Of course, independent data marts also have some significant disadvantages:

●  Lack of centralized control. Because several independent data marts are needed to solve the decision support needs of an organization, there is no centralized control. Each data mart or project controls itself, but there is no central control from a single location.

●  Redundant data. After several data marts are in production throughout the organization, all of the problems associated with data redundancy surface, such as inconsistent definitions of the same data object or timing differences that make reconciliation impossible.

●  Metadata integration. Due to their independence, the opportunity to share metadata - for example, the definition and business rules associated with the Invoice data object - is lost. Subsequent projects must repeat the development and deployment of common data objects.

●  Manageability. The independent data marts control their own scheduling routines and therefore store and report their metadata differently, with a negative impact on the manageability of the data warehouse. There is no centralized scheduler to coordinate the individual loads appropriately or metadata browser to maintain the global metadata and share development work among related projects.

Dependent Data Marts (Federated Data Warehouses)

The third warehouse architecture is the dependent data mart approach supported by the hub-and-spoke architecture of PowerCenter and PowerExchange. After studying more than one hundred different warehousing projects, Informatica introduced this approach in 1998, leveraging the benefits of both the centralized data warehouse and the independent data mart. The more general term being adopted to describe this approach is the "federated data warehouse."

Industry analysts have recognized that, in many cases, there is no "one size fits all" solution. Although the goal of true enterprise architecture, with conformed dimensions and strict standards, is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated data warehouse was born. It allows for the relatively independent development of data marts, but leverages a centralized PowerCenter repository for sharing transformations, source and target objects, business rules, etc. Recent literature describes the federated architecture approach as a way to get closer to the goal of a truly centralized architecture while allowing for the practical realities of most organizations. The centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the organization can develop semi-autonomous data marts, so long as they subscribe to a common view of the business. This common business model is the fundamental, underlying basis of the federated architecture, since it ensures consistent use of business terms and meanings throughout the enterprise.

With the exception of the rare case of a truly independent data mart, where no future growth is planned or anticipated and where no opportunities for integration with other business areas exist, the federated data warehouse architecture provides the best framework for building a data integration solution.


Informatica's PowerCenter and PowerExchange products provide an essential capability for supporting the federated architecture: the shared Global Repository. When used in conjunction with one or more Local Repositories, the Global Repository serves as a sort of "federal" governing body, providing a common understanding of core business concepts that can be shared across the semi-autonomous data marts. These data marts each have their own Local Repository, which typically include a combination of purely local metadata and shared metadata by way of links to the Global Repository.

This environment allows for relatively independent development of individual data marts, but also supports metadata sharing without obstacles. The common business model and names described above can be captured in metadata terms and stored in the Global Repository. The data marts use the common business model as a basis, but extend the model by developing departmental metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store (ODS). Although this component is optional, it can be found in many implementations that extract data from multiple source systems and load multiple targets. The ODS was originally designed to extract and hold operational data that would be sent to a centralized data warehouse, working as a time-variant database to support end-user reporting directly from operational systems. A typical ODS had to be organized by data subject area because it did not retain the data model from the operational system. Informatica's approach to the ODS, by contrast, has virtually no change in data model from the operational system, so it need not be organized by subject area. The ODS does not permit direct end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation functions than a traditional ODS.

Advantages

The federated architecture brings together the best features of the centralized data warehouse and the independent data mart:

●  Room for expansion. While the architecture is designed to quickly deploy the initial data mart, it is also easy to share project deliverables across subsequent data marts by migrating local metadata to the Global Repository. Reuse is built in.

●  Centralized control. A single platform controls the environment from development to test to production. Mechanisms to control and monitor the data movement from operational databases into the data integration environment are applied across the data marts, easing the system management task.

●  Consistent metadata. A Global Repository spans all the data marts, providing a consistent view of metadata.

●  Enterprise view. Viewing all the metadata from a central location also provides an enterprise view, easing the maintenance burden for the warehouse administrators. Business users can also access the entire environment when necessary (assuming that security privileges are granted).

●  High data integrity. Using a set of integrated metadata repositories for the entire enterprise removes data integrity issues that result from duplicate copies of data.

●  Minimized impact on operational systems. Frequently accessed source data, such as customer, product, or invoice records, is moved into the decision support environment once, leaving the operational systems unaffected by the number of target data marts.

Disadvantages

Disadvantages of the federated approach include:

●  Data propagation. This approach moves data twice: to the ODS, and then into the individual data mart. This requires extra database space to store the staged data as well as extra time to move the data. However, this disadvantage can be mitigated by not saving the data permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or a rolling three months of data can be saved.

●  Increased development effort during initial installations. For each table in the target, one load needs to be developed from the ODS to the target, in addition to all the loads from the source to the targets.

Operational Data Store

Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is not organized by subject area and is not customized for viewing by end users or even for reporting. The primary focus of the ODS is on providing a clean, consistent set of operational data for creating and refreshing data marts. Separating out this function allows the ODS to provide more reliable and flexible support. Data from the various operational sources is staged in the ODS for subsequent extraction by target systems. In the ODS, data is cleaned and remains normalized, tables from different databases are joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS refreshes, for instance). The ODS and the data marts may reside in a single database or be distributed across several physical databases and servers. Characteristics of the Operational Data Store are:

●  Normalized

●  Detailed (not summarized)

●  Integrated

●  Cleansed

●  Consistent

Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number of ways:

●  Normalizes data where necessary (such as non-relational mainframe data), preparing it for storage in a relational system.

●  Cleans data by enforcing commonalities in dates, names, and other data types that appear across multiple systems.

●  Maintains reference data to help standardize other formats; references might range from zip codes and currency conversion rates to product-code-to-product-name translations.

●  The ODS may apply fundamental transformations to some database tables in order to reconcile common definitions, but the ODS is not intended to be a transformation processor for end-user reporting requirements.

Its role is to consolidate detailed data within common formats. This enables users to create wide varieties of data integration reports, with confidence that those reports will be based on the same detailed data, using common definitions and formats. The following table compares the key differences in the three architectures:

Architecture        | Centralized Data Warehouse | Independent Data Mart | Federated Data Warehouse
Centralized Control | Yes                        | No                    | Yes
Consistent Metadata | Yes                        | No                    | Yes
Cost effective      | No                         | Yes                   | Yes
Enterprise View     | Yes                        | No                    | Yes
Fast Implementation | No                         | Yes                   | Yes
High Data Integrity | Yes                        | No                    | Yes
Immediate ROI       | No                         | Yes                   | Yes
Repeatable Process  | No                         | Yes                   | Yes

The Role of Enterprise Architecture The federated architecture approach allows for the planning and implementation of an enterprise architecture framework that addresses not only short-term departmental needs, but also the longterm enterprise requirements of the business. This does not mean that the entire architectural investment must be made in advance of any application development. However, it does mean that development is approached within the guidelines of the framework, allowing for future growth without significant technological change. The remainder of this chapter will focus on the process of designing and developing a data integration solution architecture using PowerCenter as the platform.

Fitting Into the Corporate Architecture Very few organizations have the luxury of creating a "green field" architecture to support their decision support needs. Rather, the architecture must fit within an existing set of corporate guidelines regarding preferred hardware, operating systems, databases, and other software. The Technical Architect, if not already an employee of the organization, should ensure that he/she has a thorough understanding of the existing (and future vision of) technical infrastructure. Doing so will eliminate the possibility of developing an elegant technical solution that will never be implemented because it defies corporate standards.


Development FAQs Challenge Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution. While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.

Description The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Mapping Design Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?) In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixedwidth files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and custom SQL SELECTs where appropriate. Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?) With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk writing contention and applies to a session writing to multiple targets, and to multiple sessions running simultaneously. Q: What are some considerations for determining how many objects and transformations to include in a single mapping? The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging and better understandability, as well as to create potential partition points. This should be balanced against the fact that more objects means more overhead for the DTM process. It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.

Log File Organization Q: How does PowerCenter handle logs? The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To perform the logging function, the Service Manager runs a Log Manager and a Log Agent. The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain operations and application services. The log events contain operational and error messages for a domain. The Service INFORMATICA CONFIDENTIAL


Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it generates log event files, which can be viewed in the Administration Console. The Log Agent runs on the nodes to collect and process log events for session and workflows. Log events for workflows include information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for sessions include information about the tasks performed by the Integration Service, session errors, and load summary and transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the Workflow Monitor. Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide. Q: Where can I view the logs? Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays domain-level operational and error messages. The Workflow Monitor displays session and workflow level processing and error messages. Q: Where is the best place to maintain Session Logs? One often-recommended location is a shared directory location that is accessible to the gateway node. If you have more than one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed in the Administration Console. If you have more than one PowerCenter domain, you must configure a different directory path for each domain’s Log Manager. Multiple domains can not use the same shared directory path. For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide. Q: What documentation is available for the error codes that appear within the error log files? Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Trouble Shooting Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific errors, consult your Database User Guide.

Scheduling Techniques Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session? Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets. Workflows can be created to run tasks sequentially or concurrently, or have tasks in different paths doing either. ●



A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc. A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multiprocessing (SMP) architecture.

Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler. INFORMATICA CONFIDENTIAL


Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure? No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow fails, first recover and then restart that workflow from the Workflow Monitor. Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor? Start the Workflow Manager and open the corresponding workflow. Find the failed task and right click to "Recover Workflow From Task." Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications? Workflow Execution needs to be planned around two main constraints: ●

Available system resources



Memory and processors

The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The load manager is always running as a process. If bottlenecks with regards to I/O and network are addressed, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive, so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a session needs about 120 percent of a processor for the DTM, reader, and writer in total. For concurrent sessions: One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server. If possible, sessions should run at "off-peak" hours to have as many available resources as possible. Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processors calculation; it tends to vary according to system load and number of PowerCenter sessions running. The first step is to estimate memory usage, accounting for: ●

Operating system kernel and miscellaneous processes



Database engine



Informatica Load Manager

Next, each session being run needs to be examined with regard to the memory usage, including the DTM buffer size and any cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters and joiners. At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently. Load-order dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may become saturated if overloaded; and some target tables may need to be available to end users earlier than others. Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify the Server Administrator? The application level of event notification can be accomplished through post-session email. Post-session email allows you to INFORMATICA CONFIDENTIAL


create two different messages; one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description
%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%d               Name of the folder containing the session
%n               Name of the repository containing the session
%a               Attaches the named file. The file must be local to the Informatica Server. On Windows NT, you can
                 attach a file of any type. On UNIX, you can only attach text files. If you attach a non-text file,
                 the send may fail. Note: The filename cannot include the greater than character (>) or a line break.

The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter Server must have the rmail tool installed in the path in order to send email.

To verify the rmail tool is accessible:

1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0

Status      Rows Loaded   Rows Rejected   ReadThroughput (bytes/sec)   WriteThroughput (rows/sec)   Table Name
Completed   1             0               30                           1                            t_Q3_sales

No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.

Backup Strategy Recommendation Q: Can individual objects within a repository be restored from the backup or from a prior version? At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). But, it is possible to restore the backup repository into a different database and then manually copy the individual objects back into the main repository. It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well. An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of individual objects, mappings, tasks, workflows, etc. Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA, and production environments.

Server Administration Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs? The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the PowerCenter Client tools. Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels? The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes. INFORMATICA CONFIDENTIAL


Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information: ●

CPID - Creator PID (process ID)



LPID - Last PID that accessed the resource



Semaphores - used to sync the reader and writer



0 or 1 - shows slot in LM shared memory

A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT documentation. Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash? If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.

Custom Transformations Q: What is the relationship between the Java or SQL transformation and the Custom transformation? Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. Other transformations that were built using Custom transformations include HTTP, SQL, Union , XML Parser, XML Generator, and many others. Below is a summary of noticeable differences.

Transformation   # of Input Groups   # of Output Groups   Type
Custom           Multiple            Multiple             Active/Passive
HTTP             One                 One                  Passive
Java             One                 One                  Active/Passive
SQL              One                 One                  Active/Passive
Union            Multiple            One                  Active
XML Parser       One                 Multiple             Active
XML Generator    Multiple            One                  Active

For further details, please see the Transformation Guide. Q: What is the main benefit of a Custom transformation over an External Procedure transformation? A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation handles both the input and output simultaneously. Additionally, an External Procedure transformation’s parameters consist of all the ports of the transformation. The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to INFORMATICA CONFIDENTIAL


be processed before outputting any output rows. Q: How do I change a Custom transformation from Active to Passive, or vice versa? After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type, delete and recreate the transformation. Q: What is the difference between active and passive Java transformations? When should one be used over the other? An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive Java transformation only allows for the generation of one output row per input row. Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use passive when you need one output row for each input. Q: What are the advantages of a SQL transformation over a Source Qualifier? A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete, update, and retrieve rows from a database. For example, you might need to create database tables before adding new transactions. The SQL transformation allows for the creation of these tables from within the workflow. Q: What is the difference between the SQL transformation’s Script and Query modes? Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.
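The date-range example above is easiest to see in code. The following standalone Java sketch mimics what an active Java transformation does: one input row carrying a start date and an end date produces one output row per day. It is illustrative logic only, not the exact code you would place in a Java transformation's On Input Row tab, and the port and record names are assumptions.

    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;

    public class DateRangeExpansion {

        // Stand-in for a row with START_DATE and END_DATE ports (names are illustrative).
        record InputRow(LocalDate startDate, LocalDate endDate) {}

        // "Active" behavior: one input row can generate many output rows.
        static List<LocalDate> expand(InputRow in) {
            List<LocalDate> out = new ArrayList<>();
            for (LocalDate d = in.startDate(); !d.isAfter(in.endDate()); d = d.plusDays(1)) {
                out.add(d);   // in a real Java transformation, this is where an output row would be generated
            }
            return out;
        }

        public static void main(String[] args) {
            InputRow row = new InputRow(LocalDate.of(2017, 1, 1), LocalDate.of(2017, 1, 5));
            expand(row).forEach(System.out::println);   // five output rows from one input row
        }
    }

A passive transformation, by contrast, would return exactly one output value for each input row.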

Metadata Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may be extracted from the PowerCenter repository and used in others? With PowerCenter, you can enter description information for all repository objects, sources, targets, transformations, etc, but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and primary keys are stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc, it is also very time consuming to do so. Therefore, this decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata. There are some time-saving tools that are available to better manage a metadata strategy and content, such as third-party metadata software and, for sources and targets, data modeling tools. Q: What procedures exist for extracting metadata from the repository? Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository. Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata. Informatica strongly discourages accessing the repository directly, even for SELECT access because some releases of PowerCenter change the look and feel of the repository tables, resulting in a maintenance task for you. Rather, views have INFORMATICA CONFIDENTIAL


been created to provide access to the metadata stored in the repository. Additionally, Informatica's Metadata Manager and Data Analyzer, allow for more robust reporting against the repository database and are able to present reports to the end-user and/or management.

Versioning Q: How can I keep multiple copies of the same object within PowerCenter? A: With PowerCenter, you can use version control to maintain previous copies of every changed object. You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an object, control development of the object, and track changes. You can configure a repository for versioning when you create it, or you can upgrade an existing repository to support versioned objects. When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object has an active status. You can perform the following tasks when you work with a versioned object: ●







● View object version properties. Each versioned object has a set of version properties and a status. You can also configure the status of a folder to freeze all objects it contains or make them active for editing.
● Track changes to an object. You can view a history that includes all versions of a given object, and compare any version of the object in the history to any other version. This allows you to determine changes made to an object over time.
● Check the object version in and out. You can check out an object to reserve it while you edit the object. When you check in an object, the repository saves a new version of the object and allows you to add comments to the version. You can also find objects checked out by yourself and other users.
● Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from the repository.

Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time on making a list of all changed/affected objects? A: Yes there is. You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can create the following types of deployment groups: ●

Static. You populate the deployment group by manually selecting objects.



Dynamic. You use the result set from an object query to populate the deployment group.

To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of deployment. You can associate an object query with a deployment group when you edit or create a deployment group. If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another. Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.

Performance

Q: Can PowerCenter sessions be load balanced?

A: Yes, if the PowerCenter Enterprise Grid Option is available. The Load Balancer is a component of the Integration Service that dispatches tasks to Integration Service processes running on nodes in a grid. It matches task requirements with resource availability to identify the best Integration Service process to run a task. It can dispatch tasks on a single node or across nodes. Tasks can be dispatched in three ways: Round-robin, Metric-based, and Adaptive. Additionally, you can set the Service Levels to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console's domain properties. For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.

Web Services Q: How does Web Services Hub work in PowerCenter? A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and Repository Service through the Web Services Hub. The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide. The Web Services Hub connects to the Repository Server and the PowerCenter Server through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name and password. You can use the Web Services Hub console to view service information and download Web Services Description Language (WSDL) files necessary for running services and workflows.

Last updated: 06-Dec-07 15:00


Event Based Scheduling Challenge In an operational environment, the beginning of a task often needs to be triggered by some event, either internal or external, to the Informatica environment. In versions of PowerCenter prior to version 6.0, this was achieved through the use of indicator files. In PowerCenter 6.0 and forward, it is achieved through use of the EventRaise and EventWait Workflow and Worklet tasks, as well as indicator files.

Description Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session configuration under advanced options. When the session started, the PowerCenter Server looked for the specified file name; if it wasn't there, it waited until it appeared, then deleted it, and triggered the session. In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and Event-Raise tasks. These tasks can be used to define task execution order within a workflow or worklet. They can even be used to control sessions across workflows.

● An Event-Raise task represents a user-defined event, declared at the workflow or worklet level.
● An Event-Wait task waits for an event to occur within a workflow. After the event triggers, the PowerCenter Server continues executing the workflow from the Event-Wait task forward.

The following paragraphs describe events that can be triggered by an Event-Wait task.

Waiting for Pre-Defined Events To use a pre-defined event, you need a session, shell command, script, or batch file to create an indicator file. You must create the file locally or send it to a directory local to the PowerCenter Server. The file can be any format recognized by the PowerCenter Server operating system. You can choose to have the PowerCenter Server delete the indicator file after it detects the file, or you can manually delete the indicator file. The PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot delete the indicator file.


When you specify the indicator file in the Event-Wait task, specify the directory in which the file will appear and the name of the indicator file. Do not use either a source or target file name as the indicator file name. You must also provide the absolute path for the file and the directory must be local to the PowerCenter Server. If you only specify the file name, and not the directory, Workflow Manager looks for the indicator file in the system directory. For example, on Windows NT, the system directory is C:/winnt/ system32. You can enter the actual name of the file or use server variables to specify the location of the files. The PowerCenter Server writes the time the file appears in the workflow log. Follow these steps to set up a pre-defined event in the workflow: 1. Create an Event-Wait task and double-click the Event-Wait task to open the Edit Tasks dialog box. 2. In the Events tab of the Edit Task dialog box, select Pre-defined. 3. Enter the path of the indicator file. 4. If you want the PowerCenter Server to delete the indicator file after it detects the file, select the Delete Indicator File option in the Properties tab. 5. Click OK.
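The indicator file itself must be produced by some upstream process: a session, a post-session or shell command, a script, or a batch file, as noted above. As a hedged illustration, the following standalone Java sketch writes an indicator file into the directory the Event-Wait task watches; the directory and file names are assumptions and must match the Event-Wait task definition and be local to the PowerCenter Server.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class IndicatorFileWriter {

        public static void main(String[] args) throws IOException {
            // Directory and file name are illustrative; they must match the Event-Wait task definition.
            Path indicatorDir = Path.of("/data/informatica/indicators");
            Path indicator = indicatorDir.resolve("orders_extract.done");

            // Write to a temporary name first, then rename, so the Event-Wait task never
            // sees a half-written file.
            Path temp = indicatorDir.resolve("orders_extract.done.tmp");
            Files.createDirectories(indicatorDir);
            Files.writeString(temp, "extract completed");
            Files.move(temp, indicator, StandardCopyOption.ATOMIC_MOVE);

            System.out.println("Indicator file created: " + indicator);
        }
    }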

Pre-defined Event A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait task to instruct the PowerCenter Server to wait for the specified indicator file to appear before continuing with the rest of the workflow. When the PowerCenter Server locates the indicator file, it starts the task downstream of the Event-Wait.

User-defined Event A user-defined event is defined at the workflow or worklet level and the Event-Raise task triggers the event at one point of the workflow/worklet. If an Event-Wait task is configured in the same workflow/worklet to listen for that event, then execution will continue from the Event-Wait task forward. The following is an example of using user-defined events: Assume that you have four sessions that you want to execute in a workflow. You want P1_session and P2_session to execute concurrently to save time. You also want to execute Q3_session after P1_session completes. You want to execute Q4_session only when P1_session, P2_session, and Q3_session complete. Follow these steps:


1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows the Event-Raise task to trigger the event when P1_session and Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task.

When the PowerCenter Server processes the Event-Wait task, it waits until the Event-Raise task triggers P1Q3_Complete before it executes Q4_session. The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event, P1Q3_Complete, has been triggered.
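For readers who think in general-purpose concurrency terms, the ordering above can be mimicked in plain Java, with a CountDownLatch standing in for the user-defined event: Event-Raise corresponds to countDown() and Event-Wait to await(). This is only an analogy to clarify the semantics; it is not PowerCenter code, and only the session names are taken from the example.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class EventWaitAnalogy {

        static void runSession(String name) {
            System.out.println("running " + name);
            try { Thread.sleep(200); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            System.out.println("finished " + name);
        }

        public static void main(String[] args) {
            CountDownLatch p1q3Complete = new CountDownLatch(1);   // the user-defined event
            ExecutorService workflow = Executors.newFixedThreadPool(2);

            // Branch 1: P1_session, then Q3_session, then the "Event-Raise".
            workflow.submit(() -> {
                runSession("P1_session");
                runSession("Q3_session");
                p1q3Complete.countDown();          // Event-Raise triggers P1Q3_Complete
            });

            // Branch 2: P2_session, then the "Event-Wait", then Q4_session.
            workflow.submit(() -> {
                runSession("P2_session");
                try {
                    p1q3Complete.await();          // Event-Wait blocks until the event is raised
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                runSession("Q4_session");
            });

            workflow.shutdown();
        }
    }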

Be sure to take care in setting the links, though. If they are left as the default and Q3_session fails, the Event-Raise will never happen. The Event-Wait will then wait forever and the workflow will run until it is stopped. To avoid this, check the workflow option 'suspend on error'. With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.

Last updated: 01-Feb-07 18:53


Key Management in Data Warehousing Solutions Challenge Key management refers to the technique that manages key allocation in a decision support RDBMS to create a single view of reference data from multiple sources. Informatica recommends a concept of key management that ensures loading everything extracted from a source system into the data warehouse. This Best Practice provides some tips for employing the Informatica-recommended approach of key management, an approach that deviates from many traditional data warehouse solutions that apply logical and data warehouse (surrogate) key strategies where errors are loaded and transactions rejected from referential integrity issues.

Description Key management in a decision support RDBMS comprises three techniques for handling the following common situations: ●

Key merging/matching



Missing keys



Unknown keys

All three methods are applicable to a Reference Data Store, whereas only the missing and unknown keys are relevant for an Operational Data Store (ODS). Key management should be handled at the data integration level, thereby making it transparent to the Business Intelligence layer.

Key Merging/Matching When companies source data from more than one transaction system of a similar type, the same object may have different, non-unique legacy keys. Additionally, a single key may have several descriptions or attributes in each of the source systems. The independence of these systems can result in incongruent coding, which poses a greater problem than records being sourced from multiple systems.


A business can resolve this inconsistency by undertaking a complete code standardization initiative (often as part of a larger metadata management effort) or applying a Universal Reference Data Store (URDS). Standardizing code requires an object to be uniquely represented in the new system. Alternatively, URDS contains universal codes for common reference values. Most companies adopt this pragmatic approach, while embarking on the longer term solution of code standardization. The bottom line is that nearly every data warehouse project encounters this issue and needs to find a solution in the short term.

Missing Keys

A problem arises when a transaction is sent through without a value in a column where a foreign key should exist (i.e., a reference to a key in a reference table). This normally occurs during the loading of transactional data, although it can also occur when loading reference data into hierarchy structures. In many older data warehouse solutions, this condition would be identified as an error and the transaction row would be rejected. The row would have to be processed through some other mechanism to find the correct code and loaded at a later date. This is often a slow and cumbersome process that leaves the data warehouse incomplete until the issue is resolved.

The more practical way to resolve this situation is to allocate a special key in place of the missing key, which links it with a dummy 'missing key' row in the related table. This enables the transaction to continue through the loading process and end up in the warehouse without further processing. Furthermore, the row ID of the bad transaction can be recorded in an error log, allowing the addition of the correct key value at a later time. The major advantage of this approach is that any aggregate values derived from the transaction table will be correct because the transaction exists in the data warehouse rather than being in some external error processing file waiting to be fixed.

Simple Example:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224               1          35,000

In the transaction above, there is no code in the SALES REP column. As this row is processed, a dummy sales rep key (UNKNOWN) is added to the record to link to a record in the SALES REP table. A data warehouse key (8888888) is also added to the transaction.

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   9999999     1          35,000       8888888

The related sales rep record may look like this:

REP CODE   REP NAME      REP MANAGER
1234567    David Jones   Mark Smith
7654321    Mark Smith
9999999    Missing Rep

An error log entry to identify the missing key on this transaction may look like:

ERROR CODE   TABLE NAME   KEY NAME    KEY
MSGKEY       ORDERS       SALES REP   8888888

This type of error reporting is not usually necessary because the transactions with missing keys can be identified using standard end-user reporting tools against the data warehouse.
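In PowerCenter this substitution is normally implemented with a Lookup and an Expression transformation, but the underlying logic is simple. The standalone Java sketch below, reusing the keys from the example above, shows the intent: never reject the row, substitute the dummy key, and record an error log entry. The class, table, and column names are illustrative assumptions, not PowerCenter objects.

    import java.util.ArrayList;
    import java.util.List;

    public class MissingKeyHandler {

        static final String MISSING_SALES_REP_KEY = "9999999";   // dummy 'missing key' row in SALES REP

        record Transaction(String product, String customer, String salesRep, int quantity, double unitPrice) {}
        record ErrorLogEntry(String errorCode, String tableName, String keyName, String dwKey) {}

        static final List<ErrorLogEntry> errorLog = new ArrayList<>();

        // Returns the sales rep key to load; logs the row when the key is missing.
        static String resolveSalesRep(Transaction t, String dwKey) {
            if (t.salesRep() == null || t.salesRep().isBlank()) {
                errorLog.add(new ErrorLogEntry("MSGKEY", "ORDERS", "SALES REP", dwKey));
                return MISSING_SALES_REP_KEY;    // the row is loaded, not rejected
            }
            return t.salesRep();
        }

        public static void main(String[] args) {
            Transaction t = new Transaction("Audi TT18", "Doe10224", null, 1, 35000.0);
            String dwKey = "8888888";                          // surrogate warehouse key for the row
            System.out.println("SALES REP loaded as: " + resolveSalesRep(t, dwKey));
            System.out.println("Error log: " + errorLog);
        }
    }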

Unknown Keys

Unknown keys need to be treated much like missing keys except that the load process has to add the unknown key value to the referenced table to maintain integrity rather than explicitly allocating a dummy key to the transaction. The process also needs to make two error log entries: the first to log the fact that a new and unknown key has been added to the reference table, and the second to record the transaction in which the unknown key was found.

Simple example: The sales rep reference data record might look like the following:

DWKEY     REP NAME      REP MANAGER
1234567   David Jones   Mark Smith
7654321   Mark Smith
9999999   Missing Rep

A transaction comes into ODS with the record below:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224   2424242     1          35,000

In the transaction above, the code 2424242 appears in the SALES REP column. As this row is processed, a new row has to be added to the Sales Rep reference table. This allows the transaction to be loaded successfully.

DWKEY     REP NAME   REP MANAGER
2424242   Unknown

A data warehouse key (8888889) is also added to the transaction.

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   2424242     1          35,000       8888889

Some warehouse administrators like to have an error log entry generated to identify the addition of a new reference table entry. This can be achieved simply by adding the following entries to an error log.

ERROR CODE   TABLE NAME   KEY NAME    KEY
NEWROW       SALES REP    SALES REP   2424242

A second log entry can be added with the data warehouse key of the transaction in which the unknown key was found.

ERROR CODE   TABLE NAME   KEY NAME    KEY
UNKNKEY      ORDERS       SALES REP   8888889

As with missing keys, error reporting is not essential because the unknown status is clearly visible through the standard end-user reporting. Moreover, regardless of the error logging, the system is self-healing because the newly added reference data entry will be updated with full details as soon as these changes appear in a reference data feed. This would result in the reference data entry looking complete.

DWKEY     REP NAME      REP MANAGER
2424242   David Digby   Mark Smith
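The unknown-key path differs from the missing-key path only in that a placeholder row is also written to the reference table. A minimal Java sketch of that logic, again using the values from the example and purely illustrative names (this is not how the PowerCenter objects themselves are built):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class UnknownKeyHandler {

        record SalesRep(String dwKey, String repName, String repManager) {}
        record ErrorLogEntry(String errorCode, String tableName, String keyName, String key) {}

        // In-memory stand-ins for the SALES REP reference table and the error log.
        static final Map<String, SalesRep> salesRepTable = new HashMap<>();
        static final List<ErrorLogEntry> errorLog = new ArrayList<>();

        static void handleSalesRepKey(String salesRepKey, String transactionDwKey) {
            if (!salesRepTable.containsKey(salesRepKey)) {
                // Add a placeholder reference row so referential integrity is preserved,
                // then log both the new row and the transaction that introduced it.
                salesRepTable.put(salesRepKey, new SalesRep(salesRepKey, "Unknown", null));
                errorLog.add(new ErrorLogEntry("NEWROW", "SALES REP", "SALES REP", salesRepKey));
                errorLog.add(new ErrorLogEntry("UNKNKEY", "ORDERS", "SALES REP", transactionDwKey));
            }
        }

        public static void main(String[] args) {
            salesRepTable.put("1234567", new SalesRep("1234567", "David Jones", "Mark Smith"));
            handleSalesRepKey("2424242", "8888889");     // unknown rep code from the example transaction
            System.out.println(salesRepTable.get("2424242"));
            System.out.println(errorLog);
        }
    }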

Employing the Informatica recommended key management strategy produces the following benefits: ●

All rows can be loaded into the data warehouse



All objects are allocated a unique key



Referential integrity is maintained



Load dependencies are removed

Last updated: 01-Feb-07 18:53


Mapping Auto-Generation Challenge In the course of developing mappings for PowerCenter, situations can arise where a set of similar functions/procedures must be executed for each mapping. The first reaction to this issue is generally to employ a mapplet. These objects are suited to situations where all of the individual fields/data are the same across uses of the mapplet. However, in cases where the fields are different – but the ‘process’ is the same – a requirement emerges to ‘generate’ multiple mappings using a standard template of actions and procedures. The potential benefits of Autogeneration are focused on a reduction in the Total Cost of Ownership (TCO) of the integration application and include: ●

Reduced build time



Reduced requirement for skilled developer resources



Promotion of pattern-based design



Built in quality and consistency



Reduced defect rate through elimination of manual errors



Reduced support overhead

Description From the outset, it should be emphasized that auto-generation should be integrated into the overall development strategy. It is probable that some components will still need to be manually developed and many of the disciplines and best practices that are documented elsewhere in Velocity still apply. It is best to regard autogeneration as a productivity aid in specific situations and not as a technique that works in all situations. Currently, the autogeneration of 100% of the components required is not a realistic objective. All of the techniques discussed here revolve around the generation of an XML file which shares the standard format of exported PowerCenter components as defined in the powrmart.dtd schema definition. After being generated, the resulting XML document is imported into PowerCenter using standard facilities available through the user interface or via command line. With Informatica technology, there are a number of options for XML targeting which can be leveraged to implement autogeneration. Thus you can exploit these features to make the technology self-generating. The stages in implementing an autogeneration strategy are: 1. 2. 3. 4.

1. Establish the Scope for Autogeneration
2. Design the Assembly Line(s)
3. Build the Assembly Line
4. Implement the QA and Testing Strategies

These stages are discussed in more detail in the following sections.

1. Establish the Scope for Autogeneration There are three types of opportunities for manufacturing components: ●

Pattern-Driven



Rules-Driven



Metadata-Driven

A Pattern-Driven build is appropriate when a single pattern of transformation is to be replicated for multiple source-target

combinations. For example, the initial extract in a standard data warehouse load typically extracts some source data with standardized filters, and then adds some load metadata before populating a staging table which essentially replicates the source structure. The potential for Rules-Driven build typically arises when non-technical users are empowered to articulate transformation requirements in a format which is the source for a process generating components. Usually, this is accomplished via a spreadsheet which defines the source-to-target mapping and uses a standardized syntax to define the transformation rules. To implement this type of autogeneration, it is necessary to build an application (typically based on a PowerCenter mapping) which reads the spreadsheet, matches the sources and targets against the metadata in the repository and produces the XML output. Finally, the potential for Metadata-Driven build arises when the import of source and target metadata enables transformation requirements to be inferred which also requires a mechanism for mapping sources to target. For example, when a text source column is mapped to a numeric target column the inferred rule is to test for data type compatibility. The first stage in the implementation of an autogeneration strategy is to decide which of these autogeneration types is applicable and to ensure that the appropriate technology is available. In most case, it is the Pattern-Driven build which is the main area of interest; this is precisely the requirement which the mapping generation license option within PowerCenter is designed to address. This option uses the freely distributed Informatica Data Stencil design tool for Microsoft Visio and freely distributed Informatica Velocity-based mapping templates to accelerate and automate mapping design. Generally speaking, applications which involve a small number of highly-complex flows of data tailored to very specific source/ target attributes are not good candidates for pattern-driven autogeneration. Currently, there is a great deal of product innovation in the areas of Rules-Driven and Metadata-driven autogeneration One option includes using PowerCenter via an XML target to generate the required XML files later used as import mappings.. Depending on the scale and complexity of both the autogeneration-rules and the functionality of the generated components, it may be advisable to acquire a license for the PowerCenter Unstructured Data option. In conclusion, at the end of this stage the type of autogeneration should be identified and all the required technology licenses should be acquired.

2. Design the Assembly Line It is assumed that the standard development activities in the Velocity Architect and Design phases have been undertaken and at this stage, the development team should understand the data and the value to be added to it. It should be possible to identify the patterns of data movement. The main stages in designing the assembly line are: ●

Manually develop a prototype



Distinguish between the generic and the flow-specific components



Establish the boundaries and inter-action between generated and manually built components



Agree the format and syntax for the specification of the rules (usually Excel)



Articulate the rules in the agreed format



Incorporate component generation in the overall development process



Develop the manual components (if any)

It is recommended that a prototype is manually developed for a representative subset of the sources and targets since the adoption of autogeneration techniques does not obviate the need for a re-usability strategy. Even if some components are generated rather than built, it is still necessary to distinguish between the generic and the flow-specific components. This will allow the generic functionality to be mapped onto the appropriate re-usable PowerCenter components – mapplets, transformations, user defined functions etc. The manual development of the prototype also allows the scope of the autogeneration to be established. It is unlikely that every

single required PowerCenter component can be generated; and may be restricted by the current capabilities of the PowerCenter Visio Stencil. It is necessary to establish the demarcation between generated and manually-built components. It will also be necessary to devise a customization strategy if the autogeneration is seen as a repeatable process. How are manual modifications to the generated component to be implemented? Should this be isolated in discrete components which are called from the generated components? If the autogeneration strategy is based on an application rather than the Visio stencil mapping generation option, ensure that the components you are planning to generate are consistent with the restrictions on the XML export file by referring to the product documentation.

TIP If you modify an exported XML file, you need to make sure that the XML file conforms to the structure of powrmart.dtd. You also need to make sure the metadata in the XML file conforms to Designer and Workflow Manager rules. For example, when you define a shortcut to an object, define the folder in which the referenced object resides as a shared folder. Although PowerCenter validates the XML file before importing repository objects from it, it might not catch all invalid changes. If you import into the repository an object that does not conform to Designer or Workflow Manager rules, you may cause data inconsistencies in the repository. Do not modify the powrmart.dtd file. CRCVALUE Codes Informatica restricts which elements you can modify in the XML file. When you export a Designer object, the PowerCenter Client might include a Cyclic Redundancy Checking Value (CRCVALUE) code in one or more elements in the XML file. The CRCVALUE code is another attribute in an element. When the PowerCenter Client includes a CRCVALUE code in the exported XML file, you can modify some attributes and elements before importing the object into a repository. For example, VSAM source objects always contain a CRCVALUE code, so you can only modify some attributes in a VSAM source object. If you modify certain attributes in an element that contains a CRCVALUE code, you cannot import the object

For more information, refer to the Chapter on Exporting and Importing Objects in the PowerCenter Repository Guide.

3. Build the Assembly Line Essentially, the requirements for the autogeneration may be discerned from the XML exports of the manually developed prototype. Autogeneration Based on Visio Data Stencil (Refer to the product documentation for more information on installation, configuration and usage.) It is important to confirm that all the required PowerCenter transformations are supported by the installed version of the Stencil. The use of an external industry-standard interface such as MS Visio allows the tool to be used by Business Analysts rather than PowerCenter specialists. Apart from allowing the mapping patterns to be specified, the Stencil may also be used as a documentation tool. Essentially, there are three usage stages: ●

Implement the Design in a Visio template



Publish the Design



Generate the PC Components


A separate Visio template is defined for every pattern identified in the design phase. A template can be created from scratch or imported from a mapping export; an example is shown below:

The icons for transformation objects should be familiar to PowerCenter users. Less easily understood will be the concept of properties for the links (i.e. relationships) between the objects in the Stencil. These link rules define what ports propagate from one transformation to the next and there may be multiple rules in a single link. Essentially, the process of developing the template consists of identifying the dynamic components in the pattern and parameterizing them such as. ●

Source and target table name



Source primary key, target primary key



Lookup table name and foreign keys



Transformations

Once the template is saved and validated, it needs to be “published” which simply makes it available in formats which the generating mechanisms can understand such as: ●

Mapping template parameter xml



Mapping template xml

One of the outputs from the publishing is the template for the definition of the parameters specified in the template. An example of a modified file is shown below:



This file is only used in scripted generation. The other output from the publishing is the template in XML format. This file is only used in manual generation. There is a choice of either manual or scripted mechanisms for generating components from the published files. The manual mechanism involves the importation of the published XML template through the Mapping Template Import Wizard in the PowerCenter Designer. The parameters defined in the template are entered manually through the user interface. Alternately, the scripted process is based on a supplied command-line utility – mapgen. The first stage is to manually modify the published parameter file to specify values for all the mappings to be generated. The second stage is to use PowerCenter to export source and target definitions for all the objects referenced in the parameter file. These are required in order to generate the ports. Mapgen requires the following syntax : ●

Visio Drawing File

(i.e., mapping source)



ParameterFile

(i.e., parameters)



MappingFile

(i.e., output)



[-d] TableDefinitionDir

(i.e., metadata sources & targets)

The generated output file is imported using the standard import facilities in PowerCenter.

TIP Even if the scripted option is selected as the main generating mechanism, use the Mapping Template Import Wizard in the PC Designer to generate the first mapping; this allows the early identification of any errors or inconsistencies in the template.

Autogeneration Based on Informatica Application This strategy generates PowerCenter XML but can be implemented through either PowerCenter itself or the Unstructured Data option. Essentially, it will require the same build sub-stages as any other data integration application. The following components are anticipated: ●

Specification of the formats for source to target mapping and transformation rules definition



Development of a mapping to load the specification spreadsheets into a table



Development of a mapping to validate the specification and report errors



Development of a mapping to generate the XML output excluding critical errors



Development of a component to automate the importation of the XML output into PowerCenter

One of the main issues to be addressed is whether there is a single generation engine which deals with all of the required patterns, or a series of pattern-specific generation engines. One of the drivers for the design should be the early identification of errors in the specifications. Otherwise the first indication of any problem will be the failure of the XML output to import in PowerCenter. It is very important to define the process around the generation and to allocate responsibilities appropriately.

Autogeneration Based on Java Application

Assuming the appropriate skills are available in the development team, an alternative technique is to develop a Java application to generate the mapping XML files. The PowerCenter Mapping SDK is a Java API that provides all of the elements required to generate mappings. The Mapping SDK can be found in the client installation directory. It contains:

● The javadoc (api directory), which describes all the classes of the Java API
● The API (lib directory), which contains the jar files used for a Mapping SDK application
● Some basic samples, which show how Java development with the Mapping SDK is done

The Java application also requires a mechanism to define the final mapping between source and target structures; the application interprets this data source and combines it with the metadata in the repository in order to output the required mapping XML.

4. Implement the QA and Testing Strategies

Presumably there should be less of a requirement for QA and testing with generated components. This does not mean that the need to test no longer exists; to some extent, the testing effort should be re-directed to the components in the Assembly line itself. There is a great deal of material in Velocity to support QA and test activities. In particular, refer to Naming Conventions. Informatica suggests adopting a naming convention that distinguishes between generated and manually-built components. For more information on the QA strategy refer to Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance.

Last updated: 26-May-08 18:26


Mapping Design

Challenge

Optimizing PowerCenter to create an efficient execution environment.

Description

Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization.

General Suggestions for Optimizing

1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.
2. Consider more shared memory for a large number of transformations. Session shared memory between 12MB and 40MB should suffice.
3. Calculate once, use many times (see the example following this list).
❍ Avoid calculating or testing the same value over and over.
❍ Calculate it once in an expression, and set a True/False flag.
❍ Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.
4. Only connect what is used.
❍ Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier. This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
❍ In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
5. Watch the data types.
❍ The engine automatically converts compatible types. Sometimes data conversion is excessive; data types are automatically converted when types differ between connected ports.
❍ Minimize data type changes between transformations by planning data flow prior to developing the mapping.
6. Facilitate reuse.
❍ Plan for reusable transformations upfront.
❍ Use variables. Use both mapping variables and ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times.
❍ Use mapplets to encapsulate multiple reusable transformations.
❍ Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.
7. Only manipulate data that needs to be moved and transformed.
❍ Reduce the number of non-essential records that are passed through the entire mapping.
❍ Use active transformations that reduce the number of records as early in the mapping as possible (i.e., place filters and aggregators as close to the source as possible).
❍ Select the appropriate driving/master table when using joins. The table with the lesser number of rows should be the driving/master table for a faster join.
8. Utilize single-pass reads.
❍ Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads this source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.
❍ Remove or reduce field-level stored procedures.
9. Utilize Pushdown Optimization.
❍ Design mappings so they can take advantage of the Pushdown Optimization feature. This improves performance by allowing the source and/or target database to perform the mapping logic.
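As an illustration of the “calculate once, use many times” guideline (item 3 above), an Expression transformation can compute a value once in a variable port and reuse it from several output ports. The port names and tax rate below are hypothetical:

    v_NET_AMOUNT   (variable) = GROSS_AMOUNT - DISCOUNT_AMOUNT
    o_NET_AMOUNT   (output)   = v_NET_AMOUNT
    o_TAX_AMOUNT   (output)   = v_NET_AMOUNT * 0.0825
    o_TOTAL_AMOUNT (output)   = v_NET_AMOUNT * 1.0825

The subtraction is evaluated once per row rather than in every output expression, and the variable port itself is never connected downstream.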

Lookup Transformation Optimizing Tips


1. When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10 to 20 percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, then you need to adjust the 500K-row standard down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to between 250K and 300K, so the lookup table should not be cached in this case). This is just a general rule, though; try running the session with a large lookup cached and not cached. Caching is often faster on very large lookup tables.
3. When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator = first in the list of conditions under the condition tab.
4. Cache lookup tables only if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., less than 5,000 rows), cache for more than 5 to 10 lookup calls.
5. Replace a lookup with DECODE or IIF for small sets of values (see the example following this list).
6. If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run, enable the "update else insert" option on the dynamic cache, and the engine never has to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.
8. When handling multiple matches, use the "Return any matching value" setting whenever possible. Also use this setting if the lookup is being performed to determine that a match exists, but the value returned is irrelevant. The lookup creates an index based on the key ports rather than all lookup transformation ports; this simplified indexing process can improve performance.
9. Review complex expressions.
❍ Examine mappings via Repository Reporting and Dependency Reporting within the mapping.
❍ Minimize aggregate function calls.
❍ Replace an Aggregator Transformation object with an Expression Transformation object and an Update Strategy Transformation for certain types of aggregations.
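For tip 5, a small and stable code set can often be resolved inline with DECODE instead of a Lookup transformation. A minimal sketch, assuming a hypothetical STATUS_CODE port with three known values:

    DECODE(STATUS_CODE,
        'A', 'Active',
        'I', 'Inactive',
        'P', 'Pending',
        'Unknown')

This removes the lookup (and its cache) entirely, at the cost of hard-coding the value set in the expression, so it is only appropriate when the set of values is small and rarely changes.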

Operations and Expression Optimizing Tips

1. Numeric operations are faster than string operations.
2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).
3. Operators are faster than functions (i.e., || vs. CONCAT); see the example following this list.
4. Optimize IIF expressions.
5. Avoid date comparisons in lookups; replace with string comparisons.
6. Test expression timing by replacing the expression with a constant.
7. Use flat files.
❍ Using flat files located on the server machine loads faster than a database located on the server machine.
❍ Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
❍ If processing intricate transformations, consider first loading the source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.
8. If working with data that is not able to return sorted data (e.g., web logs), consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter Transformations.
10. Use a Sorter Transformation or hash auto-keys partitioning before an Aggregator Transformation to optimize the aggregate. With a Sorter Transformation, the Sorted Ports option can be used even if the original source cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.
12. Rejected rows from an Update Strategy are logged to the bad file. Consider filtering before the Update Strategy if retaining these rows is not critical, because logging causes extra overhead on the engine. Choose the option in the Update Strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the smallest amount of data the master source.
14. If an update override is necessary in a load, consider using a Lookup transformation just in front of the target to retrieve the primary key. The primary key update is much faster than the non-indexed lookup override.
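A brief sketch of tips 2 and 3, using hypothetical name and code ports; the operator form avoids the function-call overhead of CONCAT, and trimming both sides before comparing prevents false mismatches on padded CHAR data:

    CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)    -- function form
    FIRST_NAME || ' ' || LAST_NAME                -- equivalent operator form

    IIF(LTRIM(RTRIM(CUST_CODE)) = LTRIM(RTRIM(in_CUST_CODE)), 'MATCH', 'NO MATCH')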

Suggestions for Using Mapplets

A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary. Use the Mapplet Designer to create mapplets.


Mapping SDK

Challenge

Understand how to create PowerCenter repository objects such as mappings, sessions and workflows using the Java programming language instead of the PowerCenter client tools.

Description

PowerCenter’s Mapping Software Developer Kit (SDK) is a set of interfaces that can be used to generate PowerCenter XML documents containing mappings, sessions and workflows. The Mapping SDK is a Java API that provides all of the elements needed to set up mappings in the repository where metadata is stored. These elements are the objects usually used in the PowerCenter Designer and Workflow Manager, such as source and target definitions, transformations, mapplets, mappings, tasks, sessions and workflows. The Mapping SDK can be found in the PowerCenter client installation. In the Mapping SDK directory, the following components are available:

● The javadoc (api directory), which describes all the classes of the Java API
● The API (lib directory), which contains the jar files used by a Mapping SDK application
● Some basic samples, which show how Java development with the Mapping SDK can be done

[Figure: simplified class diagram representing the Mapping SDK]


The purpose of the Mapping SDK feature is to improve design and development efficiency for repetitive tasks during the implementation. The Mapping SDK can also be used for mapping auto-generation purposes to complete data flows for repetitive tasks with various structures of data. This can be used to create “on demand” mappings with the same transformations between various sources and targets. A particular advantage for a project that has been designed using mapping auto-generation comes with project maintenance: the project team is able to regenerate mappings quickly using the new source or target structure definitions.

The sections below are an example of a Mapping SDK implementation for mapping auto-generation purposes. Mapping auto-generation is based on a low-level Java API, which means that there are many ways to create mappings. The development of such a tool requires knowledge and skills in PowerCenter object design as well as Java program development. To implement the mapping auto-generation method, the project team should follow these tasks:

● Identify repetitive data mappings which will be common in task and methodology.
● Create samples of these mappings.
● Define where data structures are stored (e.g., database catalog, file, COBOL copybook).
● Develop a Java application using the Mapping SDK which is able to obtain the data structure of the project and to generate the mappings defined.

Identify Repetitive Data Mappings


In most projects there are some tasks or mappings that are similar and vary only in the structure of the data they transform. Examples of these types of mappings include:

● Loading a table from a flat file
● Performing incremental loads on historical and non-historical tables
● Extracting table data to files

During the design phase of the project, the Business Analyst and the Data Integration developer need to identify which tasks or mappings can be designed as repetitive tasks to improve the future design for similar tasks.

Create A Sample Mapping

During the design phase, the Data Integration developer must develop a sample mapping for each repetitive task that has been identified. This helps to outline how the data mapping could be designed: for example, define the transformations, mappings, tasks and processes needed to create the data mapping. A mapping template can be used for this purpose. Frequently, the repetitive tasks correspond to one of the sample data mappings that have been defined as mapping templates on Informatica’s Customer Portal.

Define The Location Where Data Structures are Stored

An important point for the mapping auto-generation method is to define where the data structure can be found that is needed to create the final mapping between the source and target structure. You can build a Java application that will build a PowerCenter mapping with dynamic source and target definitions stored in:

● A set of data files
● A database catalog
● A structured file such as a COBOL copybook or an XML Schema file

The final application may contain a set of functionalities to map the source and the target structure definitions.

Develop A Java Application Using The Mapping SDK

As a final step during the build phase, develop a Java application that will create (according to the source and target structure definition) the final mapping definition that includes all of the column specifications for the source and target. This application will be based on the Mapping SDK, which provides all of the resources to create an XML file containing the mapping, session and workflow definition. This application has to be developed in such a way as to generate all of the types of mappings that were defined during the design phase.
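The skeleton below is a minimal sketch of such an application, patterned on the samples shipped in the Mapping SDK directory. The class and method names (Mapping, TransformHelper, MapFwkOutputContext, and so on) are assumptions taken from those samples and should be checked against the javadoc in the api directory for the installed version; the repository, folder, object and file names are hypothetical.

import com.informatica.powercenter.sdk.mapfwk.core.*;

public class MappingGenerator {

    public static void main(String[] args) throws Exception {
        // Repository and folder that will own the generated objects
        Repository rep = new Repository("PC_REPO", "PC_REPO", "Container for generated mappings");
        Folder folder = new Folder("FLD_GENERATED", "FLD_GENERATED", "Auto-generated mappings");
        rep.addFolder(folder);

        // In a real generator these definitions are built from the project's
        // data structure sources (database catalog, file layouts, copybooks, ...)
        Source customerSrc = buildCustomerSource();
        Target customerTgt = buildCustomerTarget();
        folder.addSource(customerSrc);
        folder.addTarget(customerTgt);

        // Mapping: source (with its qualifier) -> expression -> target
        Mapping mapping = new Mapping("m_STG_CUSTOMER", "m_STG_CUSTOMER", "Generated customer load");
        TransformHelper helper = new TransformHelper(mapping);

        RowSet sqRowSet = (RowSet) mapping.addSource(customerSrc).getRowSets().get(0);
        String expr = "string(80, 0) o_CUST_NAME = UPPER(CUST_NAME)";
        RowSet expRowSet = (RowSet) helper.expression(sqRowSet, expr, "EXP_FORMAT_CUSTOMER")
                                          .getRowSets().get(0);
        mapping.writeTarget(expRowSet, customerTgt);
        folder.addMapping(mapping);

        // Emit PowerCenter XML; false means write the XML file only and
        // do not attempt an automatic repository import
        MapFwkOutputContext context = new MapFwkOutputContext(
                MapFwkOutputContext.OUTPUT_FORMAT_XML,
                MapFwkOutputContext.OUTPUT_TARGET_FILE,
                "m_STG_CUSTOMER.xml");
        rep.save(context, false);
    }

    // Placeholder helpers: a real implementation would assemble Source and Target
    // objects from Field definitions and connection metadata read at run time.
    private static Source buildCustomerSource() {
        throw new UnsupportedOperationException("Build the Source from field definitions here");
    }

    private static Target buildCustomerTarget() {
        throw new UnsupportedOperationException("Build the Target from field definitions here");
    }
}

The generated XML file is then imported with the standard import facilities (Designer or pmrep), exactly as with the template-based approaches described earlier.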


Last updated: 29-May-08 13:18


Mapping Templates

Challenge

Mapping Templates demonstrate proven solutions for tackling challenges that commonly occur during data integration development efforts. Mapping Templates can be used to make the development phase of a project more efficient, and can also serve as a medium for introducing development standards that developers need to follow into the mapping development process. A wide array of Mapping Template examples can be obtained for the most current PowerCenter version from the Informatica Customer Portal. As “templates,” each of the objects in Informatica’s Mapping Template Inventory illustrates the transformation logic and steps required to solve specific data integration requirements. These sample templates, however, are meant to be used as examples, not as means to implement development standards.

Description

Reuse Transformation Logic

Templates can be heavily used in a data integration and warehouse environment when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates guarantees that any transformation logic that is developed and tested correctly, once, can be successfully applied across multiple mappings as needed. In some instances, the process can be further simplified if the source/target structures have the same attributes, by simply creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.

Implementing Development Techniques

When the process is not simple enough to allow usage based on the need to duplicate transformation logic to load the same target, Mapping Templates can help to reproduce transformation techniques. In this case, the implementation process requires more than just replacing source/target transformations. This scenario is most useful when certain logic (i.e., a logical group of transformations) is employed across mappings. In many instances this can be further simplified by making use of mapplets. Additionally, user-defined functions can be utilized for expression logic reuse and to build complex expressions using the transformation language.

Transport mechanism

Once Mapping Templates have been developed, they can be distributed by any of the following procedures:

● Copy the mapping from the development area to the desired repository/folder
● Export the mapping template into XML and import it into the desired repository/folder
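Where the export/import route is scripted rather than performed through the client tools, it can be driven with pmrep. The sketch below uses placeholder repository, folder and object names, and the options shown should be verified against the Command Reference for the PowerCenter version in use:

    pmrep connect -r DEV_REPO -d DOM_DEV -n admin_user -x admin_password
    pmrep objectexport -n m_LOAD_CUSTOMER -o mapping -f DW_SALES_US -u m_LOAD_CUSTOMER.xml

    pmrep connect -r TEST_REPO -d DOM_TEST -n admin_user -x admin_password
    pmrep objectimport -i m_LOAD_CUSTOMER.xml -c import_control.xml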

Mapping template examples

The following Mapping Templates can be downloaded from the Informatica Customer Portal and are listed by subject area:

Common Data Warehousing Techniques

● Aggregation using Sorted Input
● Tracking Dimension History
● Constraint-Based Loading
● Loading Incremental Updates
● Tracking History and Current
● Inserts or Updates

Transformation Techniques

● Error Handling Strategy
● Flat File Creation with Headers and Footers
● Removing Duplicate Source Records
● Transforming One Record into Multiple Records
● Dynamic Caching
● Sequence Generator Alternative
● Streamline a Mapping with a Mapplet
● Reusable Transformations (Customers)
● Using a Sorter




● Pipeline Partitioning Mapping Template
● Using Update Strategy to Delete Rows
● Loading Heterogenous Targets
● Load Using External Procedure

Advanced Mapping Concepts

● Aggregation Using Expression Transformation
● Building a Parameter File
● Best Build Logic
● Comparing Values Between Records
● Transaction Control Transformation

Source-Specific Requirements

● Processing VSAM Source Files
● Processing Data from an XML Source
● Joining a Flat File with a Relational Table

Industry-Specific Requirements

● Loading SWIFT 942 Messages
● Loading SWIFT 950 Messages

Last updated: 01-Feb-07 18:53


Naming Conventions

Challenge

A variety of factors are considered when assessing the success of a project. Naming standards are an important, but often overlooked component. The application and enforcement of naming standards not only establishes consistency in the repository, but provides for a developer friendly environment. Choose a good naming standard and adhere to it to ensure that the repository can be easily understood by all developers.

Description

Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the former. Choosing a convention and sticking with it is the key. Having a good naming convention facilitates smooth migrations and improves readability for anyone reviewing or carrying out maintenance on the repository objects. It helps them to understand the processes being affected. If consistent names and descriptions are not used, significant time may be needed to understand the workings of mappings and transformation objects. If no description is provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective.

The following pages offer suggested naming conventions for various repository objects. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding processes to check conventions both to test plans and to test execution documents.

Suggested Naming Conventions

Designer objects and their suggested naming conventions:

● Mapping: m_{PROCESS}_{SOURCE_SYSTEM}_{TARGET_NAME}, or suffix with _{descriptor} if there are multiple mappings for that single target table
● Mapplet: mplt_{DESCRIPTION}
● Target: {update_type(s)}_{TARGET_NAME}; this naming convention should only occur within a mapping, as the actual target name object affects the actual table that PowerCenter will access
● Aggregator Transformation: AGG_{FUNCTION} that leverages the expression and/or a name that describes the processing being done
● Application Source Qualifier Transformation: ASQ_{TRANSFORMATION}_{SOURCE_TABLE1}_{SOURCE_TABLE2}; represents data from an application source
● Custom Transformation: CT_{TRANSFORMATION} name that describes the processing being done
● Data Quality Transform: IDQ_{descriptor}_{plan}, with the descriptor describing what this plan is doing and the optional plan name included if desired
● Expression Transformation: EXP_{FUNCTION} that leverages the expression and/or a name that describes the processing being done
● External Procedure Transformation: EXT_{PROCEDURE_NAME}
● Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression or a name that describes the processing being done
● Flexible Target Key: Fkey{descriptor}
● HTTP: http_{descriptor}
● Idoc Interpreter: idoci_{Descriptor}_{IDOC Type}, defining what the idoc does and possibly the idoc message
● Idoc Prepare: idocp_{Descriptor}_{IDOC Type}, defining what the idoc does and possibly the idoc message
● Java Transformation: JV_{FUNCTION} that leverages the expression or a name that describes the processing being done
● Joiner Transformation: JNR_{DESCRIPTION}
● Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple lookups on a single table. For unconnected look-ups, use ULKP in place of LKP
● Mapplet Input Transformation: MPLTI_{DESCRIPTOR}, indicating the data going into the mapplet
● Mapplet Output Transformation: MPLTO_{DESCRIPTOR}, indicating the data coming out of the mapplet
● MQ Source Qualifier Transformation: MQSQ_{DESCRIPTOR}, defining the messaging being selected
● Normalizer Transformation: NRM_{FUNCTION} that leverages the expression or a name that describes the processing being done
● Rank Transformation: RNK_{FUNCTION} that leverages the expression or a name that describes the processing being done
● Router Transformation: RTR_{DESCRIPTOR}
● SAP DMI Prepare: dmi_{Entity Descriptor}_{Secondary Descriptor}, defining what entity is being loaded and a secondary description if multiple DMI objects are being leveraged in a mapping
● Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if using keys for a target table entity, then refer to that entity
● Sorter Transformation: SRT_{DESCRIPTOR}
● Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product: SQ_SALES_INSURANCE_PRODUCTS
● Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}
● Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR}, indicating the function of the transaction control
● Union Transformation: UN_{DESCRIPTOR}
● Unstructured Data Transform: UDO_{descriptor}, with the descriptor identifying the kind of data being parsed by the UDO transform
● Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping (e.g., UPD_UPDATE_EXISTING_EMPLOYEES)
● Web Service Consumer: WSC_{descriptor}
● XML Generator Transformation: XMG_{DESCRIPTOR}, defining the target message
● XML Parser Transformation: XMP_{DESCRIPTOR}, defining the messaging being selected
● XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR}, defining the data being selected

Port Names

Port names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name. When the developer brings a source port into a lookup, the port should be prefixed with ‘in_’. This helps the user immediately identify the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port is transformed into an output port with the same name, prefix the input port with ‘in_’.

Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many other transformations. If it is intended to be able to use the autolink feature based on names, then outputs may be better left as the name of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix ‘v’, ‘var_’ or ‘v_’ plus a meaningful name.

With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data from the database. Other transformations to which the port standards are not applicable are:

● Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
● Sequence Generator - The ports are reserved words.
● Router - Because output ports are created automatically, prefixing the input ports with an I_ prefixes the output ports with I_ as well. Port names should not have any prefix.
● Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to rename them unless they are prefixed. Prefixed port names should be removed.
● Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in both the input and output. The port names should not have any prefix.

All other transformation object ports can be prefixed or suffixed with:

● ‘in_’ or ‘i_’ for Input ports
● ‘o_’ or ‘_out’ for Output ports
● ‘io_’ for Input/Output ports
● ‘v’, ‘v_’ or ‘var_’ for variable ports
● ‘lkp_’ for returns from look ups
● ‘mplt_’ for returns from mapplets

Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for longer port names.

Transformation object ports can also:

● Have the Source Qualifier port name.
● Be unique.
● Be meaningful.
● Be given the target port name.

Transformation Descriptions

This section defines the standards to be used for transformation descriptions in the Designer.




● Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select. Should also indicate if any overrides are used; if so, describe the filters or settings used. Some projects prefer items such as the SQL statement to be included in the description as well.

● Lookup Transformation Descriptions. Describe the lookup along the lines of: the [lookup attribute] obtained from [lookup table name] to retrieve the [lookup attribute name]. Where:
❍ Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
❍ Lookup table name is the table on which the lookup is being performed.
❍ Lookup attribute name is the name of the attribute being returned from the lookup.
If appropriate, specify the condition when the lookup is actually executed. It is also important to note lookup features such as persistent cache or dynamic lookup.

● Expression Transformation Descriptions. Must adhere to the following format: “This expression … [explanation of what the transformation does].” Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed. Within each Expression, transformation ports have their own description in the format: “This port … [explanation of what the port is used for].”

● Aggregator Transformation Descriptions. Must adhere to the following format: “This Aggregator … [explanation of what the transformation does].” Aggregators can be distinctly different, depending on the situation; therefore the explanation should be specific to the actions being performed. Within each Aggregator, transformation ports have their own description in the format: “This port … [explanation of what the port is used for].”

● Sequence Generator Transformation Descriptions. Must adhere to the following format: “This Sequence Generator provides the next value for the [column name] on the [table name].” Where:
❍ Table name is the table being populated by the sequence number, and the
❍ Column name is the column within that table being populated.

● Joiner Transformation Descriptions. Must adhere to the following format: “This Joiner uses … [joining field names] from [joining table names].” Where:
❍ Joining field names are the names of the columns on which the join is done, and the
❍ Joining table names are the tables being joined.

● Normalizer Transformation Descriptions. Must adhere to the following format: “This Normalizer … [explanation].” Where:
❍ explanation describes what the Normalizer does.

● Filter Transformation Descriptions. Must adhere to the following format: “This Filter processes … [explanation].” Where:
❍ explanation describes what the filter criteria are and what they do.

● Stored Procedure Transformation Descriptions. Explain the stored procedure’s functionality within the mapping (i.e., what does it return in relation to the input ports?).

● Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.

● Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions like: is the currency fixed or based on other data? What kind of rate is used? Is it a fixed inter-company rate, an inter-bank rate, a business rate or a tourist rate? Has the conversion gone through an intermediate currency?

● Update Strategy Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or determined by a calculation.

● Sorter Transformation Descriptions. Explain the port(s) that are being sorted and their sort direction.

● Router Transformation Descriptions. Describe the groups and their functions.

● Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if any) is expected to take place in later transformations in the mapping.

● Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of the control to commit or rollback.

● Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes and what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure which is used.

● External Procedure Transformation Descriptions. Describe the function of the external procedure and what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure that is used.

● Java Transformation Descriptions. Describe the function of the Java code and what data is expected as input and what data is generated as output. Also indicate whether the Java code determines the object to be an Active or Passive transformation.

● Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the rank, the rank direction, and the purpose of the transformation.

● XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the purpose of the XML being generated.

● XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser, and indicate the purpose of the transformation.

Mapping Comments

These comments describe the source data obtained and the structure file, table or facts and dimensions that it populates. Remember to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues arise that need to be discussed with business analysts.

Mapplet Comments

These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions for the input and output transformation.

Repository Objects

Repositories, as well as repository-level objects, should also have meaningful names. Repositories should be prefixed with either ‘L_’ for local or ‘G’ for global, plus a descriptor. Descriptors usually include information about the project and/or level of the environment (e.g., PROD, TEST, DEV).

Folders and Groups

Working folder names should be meaningful and include the project name and, if there are multiple folders for that one project, a descriptor. User groups should also include project name and descriptors, as necessary. For example, folders DW_SALES_US and DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should be prefixed with ‘z_’ so that they are grouped together and not confused with working production folders.

Shared Objects and Folders

Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets, mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to facilitating maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of copies.

Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects.

If the developer has an object that he or she wants to use in several mappings or across multiple folders, like an Expression transformation that calculates sales tax, the developer can place the object in a shared folder and then use the object in other folders by creating a shortcut to the object. In this case, the naming convention is ‘sc_’ (e.g., sc_EXP_CALC_SALES_TAX). The folder should be prefixed with ‘SC_’ to identify it as a shared folder and keep all shared folders grouped together in the repository.

Workflow Manager Objects

Workflow objects and their suggested naming conventions:

● Session: s_{MappingName}
● Command Object: cmd_{DESCRIPTOR}
● Worklet: wk or wklt_{DESCRIPTOR}
● Workflow: wkf or wf_{DESCRIPTOR}
● Email Task: email_ or eml_{DESCRIPTOR}
● Decision Task: dcn_ or dt_{DESCRIPTOR}
● Assign Task: asgn_{DESCRIPTOR}
● Timer Task: timer_ or tmr_{DESCRIPTOR}
● Control Task: ctl_{DESCRIPTOR}. Specify when and how the PowerCenter Server is to stop or abort a workflow by using the Control task in the workflow.
● Event Wait Task: wait_ or ew_{DESCRIPTOR}. Waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
● Event Raise Task: raise_ or er_{DESCRIPTOR}. Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, the Event-Raise task triggers the event. Use the Event-Raise task with the Event-Wait task to define events.

ODBC Data Source Names

All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines. PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN since the PowerCenter Client talks to all databases through ODBC. Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach ensures that there is less chance of a discrepancy occurring among users when they use different (i.e., colleagues’) machines and have to recreate a new DSN when they use a separate machine.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine 1 has ODBC DSN Name0 that points to database1; TableA gets analyzed on machine 1 and is uniquely identified as Name0.TableA in the repository. Machine 2 has ODBC DSN Name1 that points to database1; TableA gets analyzed on machine 2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information

Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or environment tokens in the database connection name. Database connection names must be very generic to be understandable and ensure a smooth migration. The naming convention should be applied across all development, test, and production environments. This allows seamless migration of sessions when migrating between environments.

If an administrator uses the Copy Folder function for migration, session information is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually wind up in the test, and even the production, repositories as the folders are migrated. Manual intervention is then necessary to change connection names, user names, passwords, and possibly even connect strings. Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the development environment to the test environment, the sessions automatically use the existing connection in the test repository. With the right naming convention, you can migrate sessions from the test to production repository without manual intervention.


TIP: At the beginning of a project, have the Repository Administrator or DBA set up all connections in all environments based on the issues discussed in this Best Practice. Then use permission options to protect these connections so that only specified individuals can modify them. Whenever possible, avoid having developers create their own connections using different conventions and possibly duplicating connections.

Administration Console Objects

Administration console objects such as domains, nodes, and services should also have meaningful names.

● Domain: DOM_ or DMN_[PROJECT]_[ENVIRONMENT] (e.g., DOM_PROCURE_DEV)
● Node: NODE[#]_[SERVER_NAME]_[optional_descriptor] (e.g., NODE02_SERVER_rs_b, a backup node for the repository service)
● Integration Service: INT_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., INT_SVC_DEV_primary)
● Repository Service: REPO_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., REPO_SVC_TEST)
● Web Services Hub: WEB_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., WEB_SVC_PROD)

PowerCenter PowerExchange Application/Relational Connections

Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager. When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target databases. Connections are saved in the repository. For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_[Instance_Name]). The following table shows some examples.

Source Type / Extraction Mode | Application or Relational Connection | Connection Type | Recommended Naming Convention
DB2/390 Bulk Mode | Relational | PWX DB2390 | PWXB_DB2_Instance_Name
DB2/390 Change Mode | Application | PWX DB2390 CDC Change | PWXC_DB2_Instance_Name
DB2/390 Real Time Mode | Application | PWX DB2390 CDC Real Time | PWXR_DB2_Instance_Name
IMS Batch Mode | Application | PWX NRDB Batch | PWXB_IMS_Instance_Name
IMS Change Mode | Application | PWX NRDB CDC Change | PWXC_IMS_Instance_Name
IMS Real Time | Application | PWX NRDB CDC Real Time | PWXR_IMS_Instance_Name
Oracle Change Mode | Application | PWX Oracle CDC Change | PWXC_ORA_Instance_Name
Oracle Real Time | Application | PWX Oracle CDC Real Time | PWXR_ORA_Instance_Name

PowerCenter PowerExchange Target Connections

The connection you configure depends on the type of target data you want to load.

Target Type | Connection Type | Recommended Naming Convention
DB2/390 | PWX DB2390 relational database connection | PWXT_DB2_Instance_Name
DB2/400 | PWX DB2400 relational database connection | PWXT_DB2_Instance_Name

Last updated: 05-Dec-07 16:20


Naming Conventions - B2B Data Transformation

Challenge

As with any development process, the use of clear, consistent, and documented naming conventions contributes to the effective use of Informatica B2B Data Transformation. The purpose of this document is to provide suggested naming conventions for the major structural elements of B2B Data Transformation solutions.

Description

The process of creating a B2B Data Transformation solution consists of several logical phases, each of which has implications for naming conventions. Some of these naming conventions are based upon best practices discovered during the creation of B2B Data Transformation solutions; others are restrictions imposed on the naming of solution artifacts that are due to both the use of the underlying file system and the need to make solutions callable from a wide variety of host runtime and development environments. The main phases involved in the construction of a B2B Data Transformation solution are:

1. The creation of one or more transformation projects using the B2B Data Transformation Studio (formerly known as ContentMaster Studio) authoring environment. A typical solution may involve the creation of many transformation projects.
2. The publication of the transformation projects as transformation services.
3. The deployment of the transformation services.
4. The creation/configuration of the host integration environment to invoke the published transformation services.

Each of these phases has implications for the naming of transformation solution components and artifacts (i.e., projects, TGP scripts, schemas, published services). Several common patterns occur in B2B Data Transformation solutions that have implications for naming:







● Many components are realized physically as file system objects such as files and directories. For maximum compatibility and portability, it is desirable to name these objects so that they can be transferred between Windows, UNIX and other platforms without having to rename the objects so as to conform to different file system conventions.
● Inputs and outputs to and from B2B Data Transformation services are often files or entities designated by URLs. Again, restrictions of underlying file systems play an important role here.
● B2B Data Transformation solutions are designed to be embeddable – that is, callable from a host application or environment through the use of scripts, programming language APIs provided for languages such as C, C# and Java, and through the use of agents for PowerCenter and other platforms. Hence some of the naming conventions are based on maximizing usability of transformation services from within various host environments or APIs.
● Within B2B Data Transformation projects, most names and artifacts are global – the scope of names is global to the project.

B2B Data Transformation Studio Designer

B2B Data Transformation Studio is the user interface for the development of B2B Data Transformation solutions. It is based on the open source Eclipse environment and inherits many of its characteristics regarding project naming and structure. The workspace is organized as a set of sub-directories, with one sub-directory representing each project. A specially designated directory named “.metadata” is used to hold metadata about the current workspace. For more information about Studio Designer and the workspace, refer to Establishing a B2B Data Transformation Development Architecture.


At any common level of visibility, B2B Data Transformation requires that all elements have distinct names. Thus no two projects within a repository or workspace may share the same name. Likewise, no two TGP script files, XML schemas, global parser, mapper, serializer or variable definitions may share the same name.

Within a transformation (such as a parser, mapper or serializer), groupings, actions or subsections of a transformation may be assigned names. In this context, the name does not strictly identify the section but is used both as a developer convenience and as a way to identify the section in the event file. In this case, names are allowed to be duplicated, and often the name serves as a shorthand comment about the section. In these cases, there are no restrictions on the name, although it is recommended that the name is unique, short and intuitively identifies the section. Often the name may be used to refer to elements in the specification (such as Map 835 ISA Segment). Contrary to the convention for global names, spaces are often used for readability.

To distinguish between sub-element names that are only used within transformations, and the names of entry points, scripts and variables that are used as service parameters and so on, we refer to the latter as public names.

B2B Data Transformation Studio Best Practices

As B2B Data Transformation Studio will load all projects in the current workspace into the studio environment, keeping all projects under design in a single workspace leads to both excessive memory usage and logical clutter between transformations belonging to different, possibly unrelated, solutions.

Note: B2B Data Transformation Studio allows for the closing of projects to reduce memory consumption. While this aids with memory consumption, it does not address the logical organization aspects of using separate workspaces.

Use Separate Workspaces for Separate Solutions

For distinct logical solutions, it is recommended to use separate logical workspaces to organize projects relating to separate solutions. Refer to Establishing a B2B Data Transformation Development Architecture for more information.

Create Separate Transformation Projects for Each Distinct Service

From a logical organization perspective, it is easier to manage data transformation solutions if only one primary service is published from each project. Secondary services from the same project should be reserved for the publication of test or troubleshooting variations of the same primary service. The one exception to this should be where multiple services are substantially the same, with the same transformation code but with minor differences to inputs. One alternative to publishing multiple services from the same project is to publish a shared service which is then called by the other services in order to perform the common transformation routines. For ease of maintenance, it is often desirable to name the project after the primary service which it publishes. While these do not have to be the same, it is a useful convention and simplifies the management of projects.

Use Names Compatible with Command Line Argument Formats

When a transformation service is invoked at runtime, it may be invoked on the command line (via cm_console), via .NET or Java APIs, via integration agents that invoke a service from a hosting platform such as WebMethods, BizTalk or IBM ProcessServer, or from PowerCenter via the UDO option for PowerCenter.

Use Names Compatible with Programming Language Function Names

While the programming APIs allow for the use of any string as the name, to simplify interoperability with future APIs and command line tools, the service name should be compatible with the naming rules for C# and Java variable names, and with argument names for Windows, UNIX and other OS command lines.


Use Names Compatible with File System Naming on Unix and Windows

Due to the files produced behind the scenes, the published service name and project names need to be compatible with the naming conventions for file and directory names on their target platforms. To allow for optimal cross-platform migration in the future, names should be chosen so as to be compatible with file naming restrictions on Windows, Unix and other platforms.

Do Not Include Version or Date Information in Public Names

It is recommended that project names, published service names, names of publicly accessible transformations and other public names do not include version numbers or date-of-creation information. Due to the way in which B2B Data Transformation operates, the use of dates or version numbers would make it difficult to use common source code control systems to track changes to projects. Unless the version corresponds to a different version of a business problem – such as dealing with two different versions of an HL7 specification – it is recommended that names do not include version or date information.

Naming B2B Data Transformation Projects

When a project is created, the user is prompted for the project name.

Project names will be used by default as the published service name. Both the directory for the project within a workspace and the main “cmw” project file name will be based on the project name. Due to the recommendation that a project name is used to define the published service name, the project name should not conflict with the name of an existing service unless the project publishes that service.

Note: B2B Data Transformation disallows the use of $, ", ~, ', ^, *, ?, >, <, comma, `, \, /, ;, | in project names.

Project naming should be clear and consistent within both a repository and workspace. The exact approach to naming will vary depending on an organization’s needs.

Project Naming Best Practices

Project Names Must Be Unique Across Workspaces in Which They Occur

Also, if project-generated services will be deployed onto separate production environments, the naming of services will need to be unique on those environments.

Do Not Name a Project after a Published Service, unless the Project Produces that Published Service

This requirement can be relaxed if service names distinct from project names are being used.

Do Not Name a Project “.metadata”

This will conflict with the underlying Eclipse metadata.

Do Not Include Version or Date Information in Project Names

While it may be appealing to use version or date indicators in project names, the ideal solution for version tracking of services is to use a source control system such as CVS, Visual Studio SourceSafe, Source Depot or one of the many other commercially available or open-source source control systems.

Consider Including the Source Format in the Name

If transformations within a project will operate predominantly on one primary data source format, including the data source in the project name may be helpful. For example: TranslateHipaa837ToXml

Consider Including the Target Format in the Name

If transformations within a project will produce predominantly one target data format, including the data format in the project name may be helpful. For example: TranslateCobolCopybookToSwift

Use Short, Descriptive Project Names

Include enough descriptive information within the project name to indicate its function. Remember that the project name will also determine the default published service name. For ease of readability in B2B Data Transformation Studio, it is also recommended to keep project names to 80 characters or less. Consider also conforming to C identifier names (combinations of a-z, A-Z, 0-9, _), which should provide maximum conformance.

Keep Project Names Compatible with File and Directory Naming Restrictions on Unix, Windows and other Platforms

As project names determine file and directory names for a variety of solution artifacts, it is highly recommended that project names conform to file name restrictions across a variety of file systems. While it is possible to use invalid Unix file names as project names on Windows, and invalid Windows file names on Unix projects, it is recommended to avoid OS file system conflicts where possible to maximize future portability. More detailed file system restrictions are identified in the appendix. Briefly, these include:

● Do not use system file names such as CON, PRN, AUX, CLOCK$, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9
● Do not use reserved Eclipse names such as .metadata
● Do not use characters such as |\?*+[]/ or control characters
● Optionally exclude spaces and other whitespace characters from service names

Naming Transformation Script Files (TGP Scripts)

TGP scripts have the naming restrictions common to all files on the platform on which they are being deployed.


Naming Transformation Components

Transformation components such as parsers, mappers, serializers, variables, etc., must be unique within a project. There is a single global namespace within any B2B Data Transformation project for all transformation components. One exception exists to this global namespace for component names: sequences of actions within a component such as a mapper or parser may be given a name. In this case the name is only used for commentary purposes and to assist in matching events to the sequence of the script actions that produced the event. For these sub-component names, no restrictions apply – although it is recommended that the names are kept short to ease browsing in the events viewer. The remarks attribute should be used for longer descriptive commentary on the actions taken.

Transformation Component Naming Best Practices

Use Short Descriptive Names

Names of components will show up in event logs, error output and other tracing and logging mechanisms. Keeping names short will ease the need for large amounts of screen real estate when browsing the event view for debugging.

Incorporate Source and Target Formats in the Name

Optionally Use Prefixes to Annotate Components Used for Internal Use Only

When a component such as a parser or mapper is used for internal purposes only, it may be useful to use component names that are prefixed with a letter sequence indicating the type of component.

Type of Component | Prefix | Notes
Variable | v | Do not adorn variables used for external service parameters
Mapper | map | Alternatively use a descriptive name, e.g., “MapXToY”
Parser | psr | Alternatively use a descriptive name, e.g., “ParseMortgageApplication”
Transformer | tr | Alternatively use a descriptive name, e.g., “RemoveWhitespace”
Serializer | ser | Alternatively use a descriptive name, e.g., “Serialize837”
Preprocessor | pr | Alternatively use a name XToY describing the preprocessing

In addition, names for components should take into account the following suggested rules: 1. Limit names to a reasonably short length. A limit of 40 characters is suggested. 2. Consider using the name of the input and or output data. 3. Consider limiting names to alphabetic characters, underscores, and numbers. INFORMATICA CONFIDENTIAL

BEST PRACTICES

381 of 954

Variables Exposed as Service Parameters Should Be Unadorned
When a variable is being used to hold a service parameter, no prefix should be used. Use a reasonably short descriptive name instead.

XML Schema Naming
In many B2B Data Transformation solution scenarios, the XML schemas that are the source or target of transformations are defined externally, and control over the naming and style of the schema definition is limited. However, a transformation project may sometimes require one or more intermediate schemas. The following best practices may help with the use of newly created XML schemas in B2B Data Transformation projects.

Use a Target Namespace
Using only no-namespace schemas leads to a proliferation of types within the B2B Data Transformation Studio environment under a single default namespace. Using namespaces on intermediate schemas reduces the logical clutter and makes intermediate schemas more re-usable.

Always Qualify the XML Schema Namespace
Qualify the XML Schema namespace even when using qualified elements and attributes for the domain namespace. It makes schema inclusion and import simpler.

Consider the Use of Explicit Named Complex Types vs. Anonymous Complex Types
The use of anonymous complex types reduces namespace clutter in PowerExchange Studio. However, when multiple copies of schema elements are needed, having the ability to define variables of a complex type simplifies the creation of many transformations. By default, a transformation project allows for the existence of one copy of a schema at a time. Through the use of global complex types, additional variables may be defined to hold secondary copies for interim processing. Example: an element can be declared with an anonymous complex type, or a global (named) complex type such as Publication can be declared and referenced.
Through the use of the second form of the definition, we can create a variable of the type Publication.

Appendix: File Name Restrictions on Different Platforms

Reserved Characters and Words
Many operating systems prohibit control characters from appearing in file names. Unix-like systems are an exception: the only control character forbidden in file names is the null character, since that is the end-of-string indicator in C. Trivially, Unix also excludes the path separator / from appearing in file names. Some operating systems prohibit particular characters from appearing in file names:

Character | Name | Reason
/  | slash          | Used as a path name component separator in Unix-like systems, MS-DOS, and Windows.
\  | backslash      | Treated the same as slash in MS-DOS and Windows, and as the escape character in Unix shells (see Note below).
?  | question mark  | Used as a wildcard in Unix and Windows; marks a single character.
%  | percent sign   | Used as a wildcard in RT-11; marks a single character.
*  | asterisk       | Used as a wildcard in Unix, MS-DOS, RT-11, VMS, and Windows. Marks any sequence of characters (Unix, Windows, later versions of MS-DOS) or any sequence of characters in either the basename or extension (thus "*.*" in early versions of MS-DOS means "all files").
:  | colon          | Used to determine the mount point / drive on Windows; used to determine the virtual or physical device such as a drive on RT-11 and VMS; used as a pathname separator in classic Mac OS. Doubled after a name on VMS, it indicates the DECnet node name (equivalent to a NetBIOS (Windows networking) hostname preceded by "\\").
|  | vertical bar   | Designates software pipelining in Windows.
"  | quotation mark | Used to mark the beginning and end of filenames containing spaces in Windows.
<  | less than      | Used to redirect input; allowed in Unix filenames.
>  | greater than   | Used to redirect output; allowed in Unix filenames.
.  | period         | Allowed, but the last occurrence is interpreted as the extension separator in VMS, MS-DOS, and Windows. In other operating systems it is usually considered part of the filename, and more than one full stop may be allowed.

Note: Some applications on Unix-like systems might allow certain characters but require them to be quoted or escaped; for example, the shell requires spaces, <, >, |, \, and some other characters such as : to be quoted or escaped, as in: five\ and\ six
Report Name | Location | Description
6. Todays Logins | Public Folders>Data Analyzer Metadata Reporting>Todays Logins | Provides the login count and average login duration for users who logged in today.
7. Todays Report Usage by Hour | Public Folders>Data Analyzer Metadata Reporting>Todays Report Usage by Hour | Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and the users who accessed them during the selected hour.
8. Top 10 Most Accessed Reports this Year | Public Folders>Data Analyzer Metadata Reporting>Top 10 Most Accessed Reports this Year | Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
9. Top 5 Logins (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>Top 5 Logins (Month To Date) | Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
10. Top 5 Longest Running On-Demand Reports (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date) | Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and average Data Analyzer response time (all in seconds) for each report shown.
11. Top 5 Longest Running Scheduled Reports (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date) | Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.
12. Total Schedule Errors for Today | Public Folders>Data Analyzer Metadata Reporting>Total Schedule Errors for Today | Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
13. User Logins (Month To Date) | Public Folders>Data Analyzer Metadata Reporting>User Logins (Month To Date) | Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
14. Users Who Have Never Logged On | Public Folders>Data Analyzer Metadata Reporting>Users Who Have Never Logged On | Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.

Customizing a Report or Creating New Reports
Once you select a report, you can customize it by setting the parameter values and/or creating new attributes or metrics. Data Analyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter, consult the product documentation.

Wildcards
The Metadata Reporter supports two wildcard characters:
● Percent symbol (%) - represents any number of characters and spaces.
● Underscore (_) - represents one character or space.
You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is the same as using %. The following examples show how you can use the wildcards to set parameters. Suppose you have the following values available to select: items, items_in_promotions, order_items, promotions
The following list shows the return values for some wildcard combinations you can use:

Wildcard Combination | Return Values
%                    | items, items_in_promotions, order_items, promotions
(blank)              | items, items_in_promotions, order_items, promotions
%items               | items, order_items
item_                | items
item%                | items, items_in_promotions
___m%                | items, items_in_promotions, promotions
%pr_mo%              | items_in_promotions, promotions
A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document. For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the PowerCenter documentation.

Security Awareness for Metadata Reporter
Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter/Data Analyzer repository. Data Analyzer has a robust security mechanism that is inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for users based on their profiles. Since the information in the PowerCenter repository does not change often after it goes to production, the Administrator can create some reports and export them to files that can be distributed to the user community. If the number of users for Metadata Reporter is limited, you can implement security using report filters or the data restriction feature. For example, if a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's profile. For more information on the ways in which you can implement security in Data Analyzer, refer to the Data Analyzer documentation.

Metadata Exchange: the Second Generation (MX2)
The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the underlying repository tables while exposing the metadata in several categories that were more suitable for external parties. Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy are effectively using the MX views to report and query the Informatica metadata.
Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2 supersede those of MX. The primary requirements and features of MX2 are: Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based API for accessing and manipulating the PowerCenter Repository from various programming languages. Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus can be used independent of any of the Informatica software products. The same requirement also holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products. Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces. Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository. Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution. Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools. Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository. 
Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the PowerCenter Repository by means of MX2.

Last updated: 27-May-08 12:03


Repository Tables & Metadata Management

Challenge
Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.

Description
Regular actions such as backups, testing backup and restore procedures, and deleting unwanted information keep the repository performing well.

Managing the Repository
The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The role involves tasks such as securing the repository, managing users and roles, maintaining backups, and managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating statistics.

Repository Backup
Repository backup can be performed using the Repository Server Administration Console client tool or the pmrep command line program. Backups using pmrep can be automated and scheduled to run regularly.
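The script below is a minimal sketch of a pmrep-driven backup, assuming a PowerCenter 8.x domain. The repository name, domain name, user, password handling, and backup directory are all placeholders, and the pmrep option letters should be verified against the Command Line Reference for your release.

#!/bin/sh
# Sketch: nightly repository backup with pmrep.
# All connection values are placeholders - adjust for your environment.
REP_NAME=DEV_REPO              # repository service name (placeholder)
DOMAIN=Domain_Dev              # PowerCenter domain name (placeholder)
REP_USER=Administrator         # repository user (placeholder)
REP_PWD=$PMREP_PASSWORD        # read the password from the environment
BACKUP_DIR=/infa/backups
STAMP=`date +%Y%m%d`
BACKUP_FILE=$BACKUP_DIR/${REP_NAME}_${STAMP}.rep

# Connect to the repository, then write the backup file (-f overwrites,
# -d adds a description). Confirm these options with "pmrep help backup".
pmrep connect -r $REP_NAME -d $DOMAIN -n $REP_USER -x $REP_PWD || exit 1
pmrep backup -o $BACKUP_FILE -f -d "Scheduled backup $STAMP"   || exit 1

# Compress the backup file to save space, as recommended below.
gzip -f $BACKUP_FILE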

This shell script can be scheduled to run as a cron job for regular backups. Alternatively, it can be called from PowerCenter via a Command task; the Command task can be placed in a workflow and scheduled to run daily.


The following practices are useful for maintaining backups:
● Frequency: Backup frequency depends on the activity in the repository. For production repositories, a backup is recommended once a month or prior to a major release. For development repositories, a backup is recommended once a week or once a day, depending upon the team size.
● Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a utility such as WinZip or gzip.
● Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than the repository itself.
● Move backups offline: Review the backups on a regular basis to determine how long they need to remain online. Any that are not required online should be moved offline, to tape, as soon as possible (a housekeeping sketch follows this list).
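As a housekeeping sketch for the last two points, a simple command such as the following can be scheduled alongside the backup script; the retention window and archive path are placeholders:

#!/bin/sh
# Move compressed backup files older than 30 days off the backup device.
find /infa/backups -name "*.rep.gz" -mtime +30 -exec mv {} /archive/infa \;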

Restore Repository
Although the repository restore function is used primarily as part of disaster recovery, it can also be useful for testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica recommends testing the backup files and recovery process at least once each quarter. The repository can be restored using the Repository Server Administration Console client tool or the pmrepagent command line program.

Restore Folders
There is no easy way to restore only one particular folder from a backup. First restore the backup into a new repository, then use Repository Manager to copy the entire folder from the restored repository into the target repository.

Remove Older Versions
Use the purge command to remove older versions of objects from the repository. To purge a specific version of an object, view the history of the object, select the version, and purge it.

Finding Deleted Objects and Removing Them from the Repository
If a PowerCenter repository is enabled for versioning through the Team Based Development option, objects that have been deleted from the repository are no longer visible in the client tools. To list or view deleted objects, use the find checkouts command in the client tools, a query generated in the Repository Manager, or a specific query.

After an object has been deleted from the repository, you cannot create another object with the same name unless the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects. Keep in mind, however, that you must remove all versions of a deleted object to completely remove it from the repository.

Truncating Logs
You can truncate the log information (for sessions and workflows) stored in the repository either by using Repository Manager or the pmrep command line program. Logs can be truncated for the entire repository or for a particular folder. Options allow truncating all log entries or selected entries based on date and time.
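As a sketch only, a scheduled cleanup might look like the following; the folder name and cut-off date are placeholders, and the option letters and date format should be confirmed with "pmrep help truncatelog" for your release.

#!/bin/sh
# Assumes a pmrep connection has already been established (see the backup
# script above). Both commands below are illustrative placeholders.

# Remove all session/workflow log entries for one folder.
pmrep truncatelog -t all -f DEV_SALES_FOLDER

# Or keep recent history and remove only entries older than a given end time.
pmrep truncatelog -t "01/01/2008 00:00:00"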


Repository Performance
Analyzing (or updating the statistics on) repository tables can help to improve repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a Command task to call the script.
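For an Oracle-hosted repository, a minimal sketch of such a script is shown below; the schema credentials and connect string are placeholders, and sites on other databases would substitute the equivalent update-statistics command.

#!/bin/sh
# Sketch: refresh optimizer statistics for every table in the repository
# schema (Oracle). REPO_USER, REPO_PWD and ORACLE_SID are placeholders.
sqlplus -s $REPO_USER/$REPO_PWD@$ORACLE_SID <<EOF
EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => USER, cascade => TRUE);
EXIT
EOF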

Repository Agent and Repository Server Performance
Factors such as team size, the network, the number of objects involved in a specific operation, the number of old locks on repository objects, and so on may reduce the efficiency of the Repository Server (or Agent). In such cases, analyze the various causes and modify the Repository Server (or Agent) configuration file to improve performance.

Managing Metadata
The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may be required for PowerCenter repositories residing on other databases.

Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last 'n' days, replace SYSDATE-1 with SYSDATE-n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Last_Error AS Error_Message,
       DECODE(Run_Status_Code, 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Status,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code != 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)

Long Running Sessions
The following query lists long running sessions in the last day. To make it work for the last 'n' days, replace SYSDATE-1 with SYSDATE-n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Successful_Source_Rows AS Source_Rows,
       Successful_Rows AS Target_Rows,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code = 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
AND    (Session_TimeStamp - Actual_Start) > (10/(24*60))
ORDER BY Session_TimeStamp

Invalid Tasks
The following query lists the folder name, task name, version number, and last-saved date for all invalid tasks.

SELECT SUBJECT_AREA AS FOLDER_NAME,
       DECODE(IS_REUSABLE, 1, 'Reusable', ' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
       TASK_NAME AS OBJECT_NAME,
       VERSION_NUMBER,        -- comment out for V6
       LAST_SAVED
FROM   REP_ALL_TASKS
WHERE  IS_VALID = 0
AND    IS_ENABLED = 1
--AND  CHECKOUT_USER_ID = 0   -- comment out for V6
--AND  IS_VISIBLE = 1         -- comment out for V6
ORDER BY SUBJECT_AREA, TASK_NAME

Load Counts
The following query lists the load counts (number of rows loaded) and status for sessions run in the last day.

SELECT subject_area,
       workflow_name,
       session_name,
       DECODE(Run_Status_Code, 1, 'Succeeded', 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Session_Status,
       successful_rows,
       failed_rows,
       actual_start
FROM   REP_SESS_LOG
WHERE  TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
ORDER BY subject_area, workflow_name, session_name, Session_Status

Using Metadata Extensions

Challenge
To provide for efficient documentation and achieve extended metadata reporting through the use of metadata extensions in repository objects.

Description
Metadata Extensions, as the name implies, help you to extend the metadata stored in the repository by associating information with individual objects in the repository. Informatica client applications can contain two types of metadata extensions: vendor-defined and user-defined.



● Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.
● User-defined. You create user-defined metadata extensions using PowerCenter clients. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.

You can create reusable or non-reusable metadata extensions. You associate reusable metadata extensions with all repository objects of a certain type. So, when you create a reusable extension for a mapping, it is available for all mappings. Vendor-defined metadata extensions are always reusable. Non-reusable extensions are associated with a single repository object. Therefore, if you edit a target and create a non-reusable extension for it, that extension is available only for the target you edit. It is not available for other targets. You can promote a nonreusable metadata extension to reusable, but you cannot change a reusable metadata extension to non-reusable. Metadata extensions can be created for the following repository objects: ●

Source definitions



Target definitions




Transformations (Expressions, Filters, etc.)



Mappings



Mapplets



Sessions



Tasks



Workflows



Worklets

Metadata Extensions offer a very easy and efficient method of documenting important information associated with repository objects. For example, when you create a mapping, you can store the mapping owners name and contact information with the mapping OR when you create a source definition, you can enter the name of the person who created/imported the source. The power of metadata extensions is most evident in the reusable type. When you create a reusable metadata extension for any type of repository object, that metadata extension becomes part of the properties of that type of object. For example, suppose you create a reusable metadata extension for source definitions called SourceCreator. When you create or edit any source definition in the Designer, the SourceCreator extension appears on the Metadata Extensions tab. Anyone who creates or edits a source can enter the name of the person that created the source into this field. You can create, edit, and delete non-reusable metadata extensions for sources, targets, transformations, mappings, and mapplets in the Designer. You can create, edit, and delete non-reusable metadata extensions for sessions, workflows, and worklets in the Workflow Manager. You can also promote non-reusable metadata extensions to reusable extensions using the Designer or the Workflow Manager. You can also create reusable metadata extensions in the Workflow Manager or Designer. You can create, edit, and delete reusable metadata extensions for all types of repository objects using the Repository Manager. If you want to create, edit, or delete metadata extensions for multiple objects at one time, use the Repository Manager. When you edit a reusable metadata extension, you can modify the properties Default Value, Permissions and Description. Note: You cannot create non-reusable metadata extensions in the Repository Manager. All metadata extensions created in the Repository Manager are reusable. Reusable metadata extensions are repository wide. You can also migrate Metadata Extensions from one environment to another. When


you do a copy folder operation, the Copy Folder Wizard copies the metadata extension values associated with those objects to the target repository. A non-reusable metadata extension will be copied as a non-reusable metadata extension in the target repository. A reusable metadata extension is copied as reusable in the target repository, and the object retains the individual values. You can edit and delete those extensions, as well as modify the values. Metadata Extensions provide for extended metadata reporting capabilities. Using Informatica MX2 API, you can create useful reports on metadata extensions. For example, you can create and view a report on all the mappings owned by a specific team member. You can use various programming environments such as Visual Basic, Visual C++, C++ and Java SDK to write API modules. The Informatica Metadata Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++ applications. Additionally, Metadata Extensions can also be populated via data modeling tools such as ERWin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange for Data Models. With the Informatica Metadata Exchange for Data Models, the Informatica Repository interface can retrieve and update the extended properties of source and target definitions in PowerCenter repositories. Extended Properties are the descriptive, user defined, and other properties derived from your Data Modeling tool and you can map any of these properties to the metadata extensions that are already defined in the source or target object in the Informatica repository.

Last updated: 27-May-08 12:04


Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Challenge
The role that the PowerCenter repository can play in an automated QA strategy is often overlooked and under-appreciated. This repository is essentially a database about the transformation process and the software developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes. To address this challenge, Informatica PowerCenter provides several pre-packaged reports (PowerCenter Repository Reports) that can be installed on a Data Analyzer or Metadata Manager installation. These reports provide useful information about PowerCenter object metadata and operational metadata that can be used for quality assurance.

Description
Before considering the mechanics of an automated QA strategy, it is worth emphasizing that quality should be built in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate development and enforce consistency through the use of the following aids:
● A shared template for each type of mapping.
● Checklists to guide the developer through the process of adapting the template to the mapping requirements.
● Macros/scripts to generate productivity aids such as SQL overrides.

It is easier to ensure quality from a standardized base than to rely on developers to repeat the same basic keystrokes accurately. Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards that categorize components. By running the appropriate query on the repository, it is possible to identify those components whose attributes differ from those predicted for the category. Thus, it is quite possible to automate some aspects of QA. Clearly, the function of naming conventions is not just to standardize, but also to provide logical access paths into the information in the repository; names can be used to identify patterns and/or categories and thus allow assumptions to be made about object attributes. Along with the facilities provided to query the repository, such as the Metadata Exchange (MX) Views and PowerCenter Metadata Manager, this opens the door to an automated QA strategy.

For example, suppose the standard is that the EXTRACT mapping/session should always truncate the target table before loading, while the TRANSFORM and LOAD phases should never truncate a target. Possible code errors in this respect can be identified as follows:
● Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.
● Develop a query on the repository to search for sessions named EXTRACT which do not have the truncate target option set.
● Develop a query on the repository to search for sessions named TRANSFORM or LOAD which do have the truncate target option set.
● Provide a facility to allow developers to run both queries before releasing code to the test environment.

Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such as Expressions) in a mapping. These can be identified very easily from the MX view REP_MAPPING_UNCONN_PORTS (the utility sketch later in this section queries this view).

The following bullets represent a high-level overview of the steps involved in automating QA:
● Review the transformations/mappings/sessions/workflows and allocate them to broadly representative categories.
● Identify the key attributes of each category.
● Define naming standards to identify the category for transformations/mappings/sessions/workflows.
● Analyze the MX views to source the key attributes.
● Develop the query to compare actual and expected attributes for each category.

After you have completed these steps, it is possible to develop a utility that compares actual and expected attributes for developers to run before releasing code into any test environment (a minimal sketch of such a utility appears after the tip below). Such a utility may incorporate the following processing stages:
● Execute a profile to assign environment variables (e.g., repository schema user, password, etc.).
● Select the folder to be reviewed.
● Execute the query to find exceptions.
● Report the exceptions in an accessible format.
● Exit with failure if exceptions are found.

TIP: Any queries that bypass the MX views and read the repository tables directly will require modification if PowerCenter is subsequently upgraded; Informatica therefore recommends against them.
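The sketch below illustrates one such pre-release check for the unconnected-port standard described above, assuming an Oracle-hosted repository. The connection variables are placeholders, and the column names selected from REP_MAPPING_UNCONN_PORTS are assumptions that should be confirmed against the MX view documentation for your PowerCenter version.

#!/bin/sh
# Sketch: fail the release if a folder contains unconnected output ports.
# REPO_USER, REPO_PWD and ORACLE_SID are placeholders; the view columns
# referenced below are assumptions to confirm against the MX view reference.
FOLDER=$1
EXCEPTIONS=/tmp/qa_exceptions_$$.lst

sqlplus -s $REPO_USER/$REPO_PWD@$ORACLE_SID <<EOF | grep -v '^ *$' > $EXCEPTIONS
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT subject_area || '  ' || mapping_name || '  ' || field_name
FROM   rep_mapping_unconn_ports
WHERE  subject_area = '$FOLDER';
EXIT
EOF

if [ -s $EXCEPTIONS ]; then
    echo "QA check failed - unconnected output ports found in folder $FOLDER:"
    cat $EXCEPTIONS
    exit 1
fi
echo "QA check passed for folder $FOLDER"
exit 0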

The principal objective of any QA strategy is to ensure that developed components adhere to standards and to identify defects before incurring overhead during the migration from development to test/production environments. Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this process.

Using Metadata Manager and PowerCenter Repository Reports for Quality Assurance
The need for the Informatica Metadata Reporter was identified from the number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations. In this section, we focus primarily on how these reports, and custom reports, can help ease the QA process. The following reports can help identify regressions in load performance:

● Session Run Details
● Workflow Run Details
● Worklet Run Details
● Server Load by Day of the Week, which can help determine the load on the server before and after QA migrations and may help balance the loads through the week by modifying the schedules.
● Target Table Load Analysis, which can help identify any data regressions in the number of records loaded in each target (if a baseline was established before the migration/upgrade).
● Failed Session, which lists failed sessions at a glance and is very helpful after a major QA migration or QA of an Informatica upgrade process.

During huge deployments to QA, the Code review team can look at the following reports to determine if the standards (i.e., Naming standards, Comments for repository objects, metadata extensions usage, etc.) were followed. Accessing this information from PowerCenter Repository Reports typically reduces the time required for review because the reviewer doesn’t need to open each mapping and check for these details. All of the following are out-of-the-box reports provided by Informatica: ●

Label report



Mappings list



Mapping shortcuts



Mapping lookup transformation



Mapplet list



Mapplet shortcuts



Mapplet lookup transformation



Metadata extensions usage



Sessions list



Worklets list



Workflows list



Source list



Target list



Custom reports based on the review requirements

In addition, note that the following reports are also useful during migration and upgrade processes: ●





Invalid object reports and deployment group report in the QA repository help to determine which deployments caused the invalidations. Invalid object report against Development repository helps to identify the invalid objects that are part of deployment before QA migration. Invalid object report helps in QA of an Informatica upgrade process.

The following table summarizes some of the reports that Informatica ships with a PowerCenter Repository Reports installation:

Report Name

Description

1

Deployment Group Displays deployment groups by repository

2

Deployment Group History

Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates.

3

Labels

Displays labels created in the repository for any versioned object by repository.

4

All Object Version History: Displays all versions of an object by the date the object is saved in the repository.

5

Server Load by Day of Week

Displays the total number of sessions that ran, and the total session run duration for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays

6

Session Run Details

Displays session run details for any start date by repository by folder.

7

Target Table Load Analysis (Last Month): Displays the load statistics for each table for the last month, by repository by folder.

8

Workflow Run Details

Displays the run statistics of all workflows by repository by folder.

9

Worklet Run Details

Displays the run statistics of all worklets by repository by folder.

10

Mapping List

Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets.

11

Mapping Lookup Transformations

Displays Lookup transformations used in a mapping by repository and folder.

12

Mapping Shortcuts Displays mappings defined as a shortcut by repository and folder.


13

Source to Target Dependency

Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.

14

Mapplet List

Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets.

15

Mapplet Lookup Transformations

Displays all Lookup transformations used in a mapplet by folder and repository.

16

Mapplet Shortcuts Displays mapplets defined as a shortcut by repository and folder.

17

Unused Mapplets in Mappings: Displays mapplets defined in a folder but not used in any mapping in that folder.

18

Metadata Extensions Usage: Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.

19

Server Grid List

Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.

20

Session List

Displays all sessions and their properties by repository by folder. This is a primary report in a data integration workflow.

21

Source List

Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in a data integration workflow.

22

Source Shortcuts Displays sources that are defined as shortcuts by repository and folder

23

Target List

Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in a data integration workflow.

24

Target Shortcuts

Displays targets that are defined as shortcuts by repository and folder.

25

Transformation List

Displays transformations defined by repository and folder. This is a primary report in a data integration workflow.


26

Transformation Shortcuts

Displays transformations that are defined as shortcuts by repository and folder.

27

Scheduler (Reusable) List

Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder.

28

Workflow List

Displays workflows and workflow properties by repository by folder.

29

Worklet List

Displays worklets and worklet properties by repository by folder.

Last updated: 05-Jun-08 13:27


Configuring Standard Metadata Resources

Challenge
Metadata that is derived from a variety of sources and tools is often disparate and fragmented. To be of value, metadata needs to be consolidated into a central repository. Informatica's Metadata Manager provides a central repository for the capture and analysis of critical metadata. Before you can browse and search metadata in the Metadata Manager warehouse, you must configure Metadata Manager, create resources, and then load the resource metadata.

Description Informatica Metadata Manager is a web-based metadata management tool that you can use to browse and analyze metadata from disparate metadata repositories. Metadata Manager helps you understand and manage how information and processes are derived. It also helps you understand the fundamental relationships between information and processes, and how they are used. Metadata Manager extracts metadata from application, business intelligence, data integration, data modeling, and relational metadata sources. Metadata Manager uses PowerCenter workflows to extract metadata from metadata sources and load it into a centralized metadata warehouse called the Metadata Manager warehouse. Metadata Manager uses resources to represent the metadata in the Metadata Manager. Each resource represents metadata from a metadata source. Metadata Manager shows the metadata for each resource in the metadata catalog. The metadata catalog is a hierarchical representation of the metadata in the Metadata Manager warehouse. There are several steps to configure a standard resource in Metadata Manager. It is very important to identify, setup, and test your resource connections before configuring a resource into Metadata Manager. Informatica recommends creating naming standards usually prefixed by the metadata source type for the Metadata Manager Application (i.e. for a SQL Server relational database use ‘SS_databasename_schemaname’). The steps below describe how to load metadata from a metadata source into the Metadata Manager warehouse. Each detailed section shows information needed for individual standard resource types.

Loading Metadata Resources into the Metadata Manager Warehouse
The Load page in the Metadata Manager application is used to create and load resources into the Metadata Manager warehouse. Use the Load page to monitor and schedule resource loads, purge metadata from the Metadata Manager warehouse, and manage the search index. Complete the following steps to load metadata from a metadata source into the Metadata Manager warehouse:
1. Set up Metadata Manager and metadata sources. Create a Metadata Manager Service, install Metadata Manager, and configure the metadata sources from which you want to extract metadata.
2. Create resources. Create resources that represent the metadata sources from which you want to extract metadata.
3. Configure resources. Configure the resources, including metadata source files and direct source connections, parameters, and connection assignments. You can also purge metadata for a previously loaded resource and update the index for resources.
4. Load and monitor resources. Load a resource to load its metadata into the Metadata Manager warehouse. When you load a resource, Metadata Manager extracts and loads the metadata for the resource. You can monitor the status of all resources and the status of individual resources. You can also schedule resource loads.
5. Manage resource and object permissions for Metadata Manager users. You can configure the resources and metadata objects in the warehouse to which Metadata Manager users have access.
Use the Metadata Manager command line programs to load resources, monitor the status of resource loads and PowerCenter workflows, and back up and restore the Metadata Manager repository.


Configure Metadata Resources
Before you configure resources and load metadata into the Metadata Manager warehouse, you must configure the metadata sources. For metadata sources that use a source file, you select the source file when you configure the resource. If you do not correctly configure the metadata sources, the metadata load can fail or the metadata can be loaded incorrectly into the Metadata Manager warehouse. Table 2-1 describes the configuration tasks for the metadata sources:

Table 2-1. Metadata Source Configuration Tasks
Metadata Source Type | Metadata Source | Tasks
Application | SAP | Install SAP transports and configure permissions. For more information, see SAP.
Business Intelligence | Business Objects | Export documents, universes, and Crystal Reports to a repository. For more information, see Business Objects.
Business Intelligence | Cognos ReportNet Content Manager | Verify that you have access to the ReportNet URL. Metadata Manager uses the ReportNet URL to access the source repository metadata.
Business Intelligence | Cognos Impromptu | Use the Cognos client tool to export the metadata to a .cat file.
Business Intelligence | Hyperion Essbase | Export metadata to an XML file. For more information, see Hyperion Essbase.
Business Intelligence | IBM DB2 Cube Views | Export metadata to an XML file. For more information, see IBM DB2 Cube Views.
Business Intelligence | Microstrategy | Configure the database user account and projects. For more information, see Microstrategy.
Data Integration | PowerCenter | Metadata Manager extracts the latest version of objects that are checked into the PowerCenter repository. Check in all metadata objects that you want to extract from the PowerCenter repository. For more information, see PowerCenter.
Database Management | IBM DB2 UDB, Informix, Microsoft SQL Server, Oracle, Sybase, Teradata | Configure the permissions for the database user account. For more information, see Relational Database Sources.
Data Modeling* | Embarcadero ERStudio | Use the ERStudio client tool to export the metadata to a .dm1 file.
Data Modeling* | ERwin | Export metadata. For more information, see ERwin.
Data Modeling* | Oracle Designer | Use the Oracle Designer client tool to export the metadata to a .dat file.
Data Modeling* | Rational Rose ER | Use the Rational Rose client tool to export the metadata to an .mdl file.
Data Modeling* | Sybase PowerDesigner | Use the Sybase PowerDesigner client tool to save the model to a .pdm file in XML format.
Data Modeling* | Visio | Use the Visio client tool to export the metadata to an .erx file.
Custom | Custom | Export metadata to a .csv or .txt file. For more information, see Custom Metadata Sources.

* You can load multiple models from the same data modeling tool source. For more information, see Data Modeling Tool Sources.

Standard Resource Types

Business Objects
The Business Objects resource requires you to install Business Objects Designer on the machine hosting the Metadata Manager console and to provide a user name and password for the Business Objects repository. Export the Business Objects universes, documents, and Crystal Reports to the Business Objects source repository: you can extract only documents, universes, and Crystal Reports that have been exported to the source repository; you cannot extract metadata from documents or universes that exist only locally. Exporting from source repositories makes sure that the metadata in the Metadata Manager warehouse is consistent with the metadata that is distributed to Business Objects users. Use Business Objects Designer to export a universe to the Business Objects source repository. For example, to begin the export process in Business Objects Designer, click File > Export. You must use a secured connection type to export a universe to a Business Objects source repository.


Use Business Objects to export a document to the Business Objects repository. For example, to begin the export process in Business Objects, click File > Publish To > Corporate Documents. Use the Business Objects Central Manager Console to export Crystal Reports to the Business Objects repository. The screenshot below displays the information you will need to add the resource.

Custom Metadata Sources If you create a custom resource and use a metadata source file, you must export the metadata to a metadata file with a .csv or .txt file extension. When you configure the custom resource, you specify the metadata file.

Data Modeling Tool Sources You can load multiple models from a data modeling tool into the Metadata Manager warehouse. After you load the metadata, the Metadata Manager catalog shows models from the same modeling tool under the resource name. This requirement applies to all data modeling tool resource types.

Erwin / ER-Studio
Metadata Manager extracts ERwin metadata from a metadata file. When you configure the connection to the ERwin source repository in Metadata Manager, you specify the metadata file. The required format for the metadata file depends on the version of the ERwin source repository. The following table specifies the required file type for each supported version:

Version              | File Type
ERwin 3.0 to 3.5.2   | .erx
ERwin 4.0 SP1 to 4.1 | .er1 or .xml
ERwin 7.x            | .erwin or .xml

The screenshot below displays the information you will need to add the resource.

Hyperion Essbase
Use the Hyperion Essbase client tool to export the metadata to an .xml file. Metadata Manager extracts Hyperion Essbase metadata from a metadata file with an .xml file extension. When you set up the resource for Hyperion Essbase in Metadata Manager, you specify the metadata file. Use the Hyperion Essbase Integration Server to export the source metadata to an XML file. Export one model to each metadata file. To export the Hyperion model to an XML file:
1. Log in to Hyperion Essbase Integration Server.
2. Create the Hyperion source or open an existing model.
3. Click File > Save to save the model if you created or updated it.
4. Click File > XML Import/Export.
5. On the Export tab, select the model.
6. Click Save As XML File. A pop-up window appears.
7. Select the location where you want to store the XML file.

The screenshot below displays the information you will need to add the resource.

IBM DB2 Cube Views
Use the IBM DB2 Cube Views OLAP Center GUI to export cube models to .xml files. When you configure the resource for DB2 Cube Views in Metadata Manager, you specify the metadata files.
TIP: You can load multiple cube models into the Metadata Manager warehouse. Export each cube model into a separate .xml file and name the file with the same name as the cube model. If you export multiple cube models into an .xml file, export the same cube models into the same .xml file each time you export them.
The screenshot below displays the information you will need to add the resource.


Microstrategy To configure Microstrategy, complete the following tasks: ●

Configure permissions.



Configure multiple projects (optional).

The screenshot below displays the information you will need to add the resource.


Configure Permissions The Microstrategy project user account for which you provide the user name and password must have the Bypass All Object Security Access Checks administration privilege. You set this privilege in the Microstrategy Desktop client tool. Note: Although Microstrategy allows you to connect to a project source using database or network authentication, Metadata Manager uses project source authentication. Configure Multiple Projects in the Same Metadata File Microstrategy projects can be from different project sources. You can load multiple Microstrategy projects under the same Microstrategy resource. You must provide the user name and password for each project source. Project names must be unique. When you configure the Microstrategy resource, you specify the project source, project, user name, and password for each project.

PowerCenter The screenshot below displays the information you will need to add the resource.


Relational Database Sources
Configure the permissions for the IBM DB2 UDB, Informix, Microsoft SQL Server, Oracle, Sybase ASE, and Teradata database user account. The database user account you use to connect to the metadata source must have SELECT permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Note: For Oracle resources, the user account must also have the SELECT_CATALOG_ROLE role.

DB2 Resource
jdbc:informatica:db2://host_name:port;DatabaseName=database_name

Informix Resource
jdbc:informatica:informix://host_name:port;InformixServer=server_name;DatabaseName=database_name

SQL Server Resource
jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name
Connection String: For a default instance, SQL Server Name@Database Name; for a named instance, Server Name\Instance Name@Database Name.

Oracle Resource
jdbc:informatica:oracle://host_name:port;SID=sid
Connect String: Oracle instance name. If the metadata in the Oracle source database contains Unicode characters, set the NLS_LENGTH_SEMANTICS parameter to CHAR from BYTE. Specify a user name and password to access the Oracle database metadata. Be sure that the user has the Select Any Table privilege and SELECT permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Also ensure that the user has SELECT permission on SYS.v_$instance. One resource is needed for each Oracle instance.
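Before configuring an Oracle resource, it can save a failed load to confirm the grants up front. The sketch below is one way to do that from a shell; the connection values are placeholders, and while USER_ROLE_PRIVS and SESSION_PRIVS are standard Oracle dictionary views, adapt the checks to your site's security model.

#!/bin/sh
# Sketch: list the roles and privileges held by the Metadata Manager
# source-connection user before creating the Oracle resource.
# MM_SRC_USER, MM_SRC_PWD and ORACLE_SID are placeholders.
sqlplus -s $MM_SRC_USER/$MM_SRC_PWD@$ORACLE_SID <<EOF
PROMPT Granted roles (expect SELECT_CATALOG_ROLE):
SELECT granted_role FROM user_role_privs;
PROMPT Session privileges (expect SELECT ANY TABLE):
SELECT privilege FROM session_privs;
EXIT
EOF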

Teradata Resource
jdbc:teradata://database_server_name/Database=database_name
Connect String: Be sure that the user has access to all the system "DBC" tables.

SAP
To configure SAP, complete the following tasks:
● Install PowerCenter transports.
● Configure the user authorization profile.

Installing Transports
To extract metadata from SAP, you must install PowerCenter transports. The transports are located in the following folder in the location where you downloaded PowerCenter: \saptrans\mySAP. Table 2-2 describes the transports you must install:

Table 2-2. SAP Transports for Metadata Manager and SAP
SAP Version | Data and Cofile Names | Transport Request | Functionality
4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and higher | XCONNECT_DESIGN_R900116.R46, XCONNECT_DESIGN_K900116.R46 | R46K900084 | For mySAP R/3 ECC (R/3) and mySAP add-on components, including CRM, BW, and APO: supports table metadata extraction for SAP in Metadata Manager.
Unicode versions 4.7 and higher | XCONNECT_DESIGN_R900109.U47, XCONNECT_DESIGN_K900109.U47 | U47K900109 | For mySAP R/3 ECC (R/3) and mySAP add-on components, including CRM, BW, and APO: supports table metadata extraction for SAP in Metadata Manager.

You must install the other mySAP transports before you install the transports for Metadata Manager.

Configure User Authorization Profile
The SAP administrator needs to create the product and development user authorization profile. Table 2-3 describes the user authorization profile:

Table 2-3. SAP User Authorization Profile
Authorization Object: S_RFC
Description: Authorization check for RFC access.
Class: Cross Application Authorization Objects
Field Values: Activity: 16 (Execute); Name of RFC to be protected: *; Type of RFC object to be protected: FUGR

Last updated: 02-Jun-08 22:53


Custom XConnect Implementation

Challenge
Metadata Manager uses XConnects to extract source repository metadata and load it into the Metadata Manager warehouse. The Metadata Manager Configuration Console is used to run each XConnect. A custom XConnect is needed to load metadata from a source repository for which Metadata Manager does not prepackage an out-of-the-box XConnect.

Description This document organizes all steps into phases, where each phase and step must be performed in the order presented. To integrate custom metadata, complete tasks for the following phases: ●

Design the Metamodel.



Implement the Metamodel Design.



Set-up and run the custom XConnect.



Configure the reports and schema.

Prerequisites for Integrating Custom Metadata
To integrate custom metadata, install Metadata Manager and the other required applications. The custom metadata integration process assumes knowledge of the following topics:
● Common Warehouse Metamodel (CWM) and Informatica-defined metamodels. The CWM metamodel includes industry-standard packages, classes, and class associations. The Informatica metamodel components supplement the CWM metamodel by providing repository-specific packages, classes, and class associations. For more information about the Informatica-defined metamodel components, run and review the metamodel reports.
● PowerCenter functionality. During the metadata integration process, XConnects are configured and run. The XConnects run PowerCenter workflows that extract custom metadata and load it into the Metadata Manager Warehouse.
● Data Analyzer functionality. Metadata Manager embeds Data Analyzer functionality to create, run, and maintain a metadata reporting environment. Knowledge of creating, modifying, and deleting reports, dashboards, and analytic workflows in Data Analyzer is required. Knowledge of creating, modifying, and deleting table definitions, metrics, and attributes is required to update the schema with new or changed objects.

Design the Metamodel
During this planning phase, the metamodel is designed; the metamodel is implemented in the next phase. A metamodel is the logical structure that classifies the metadata from a particular repository type. Metamodels consist of classes, class associations, and packages, which group related classes and class associations. An XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations. This task consists of the following steps:
1. Identify Custom Classes. To identify custom classes, determine the various types of metadata in the source repository that need to be loaded into the Metadata Manager Warehouse. Each type of metadata corresponds to one class.
2. Identify Custom Class Properties. After identifying the custom classes, populate each custom class with properties (i.e., attributes) so that Metadata Manager can track and report values belonging to class instances.
3. Map Custom Classes to CWM Classes. Metadata Manager prepackages all CWM classes, class properties, and class associations. To quickly develop a custom metamodel and reduce redundancy, reuse the predefined class properties and associations instead of recreating them. To determine which custom classes can inherit properties from CWM classes, map custom classes to the packaged CWM classes. Define any properties that cannot be inherited in Metadata Manager.
4. Determine the Metadata Tree Structure. Configure the way the metadata tree displays objects. Determine the groups of metadata objects in the metadata tree, then determine the hierarchy of the objects in the tree. Assign the TreeElement class as a base class to each custom class.
5. Identify Custom Class Associations. The metadata browser uses class associations to display metadata. For each identified class association, determine whether a predefined association from a CWM base class can be reused or whether an association must be defined manually in Metadata Manager.
6. Identify Custom Packages. A package contains related classes and class associations. Multiple packages can be assigned to a repository type to define the structure of the metadata contained in the source repositories of the given repository type. Create packages to group related classes and class associations.
To see an example of sample metamodel design specifications, see Appendix A in the Metadata Manager Custom Metadata Integration Guide.

Implement the Metamodel Design
Using the metamodel design specifications from the previous task, implement the metamodel in Metadata Manager. This task includes the following steps:
1. Create the originator (aka owner) of the metamodel. When creating a new metamodel, specify the originator of each metamodel. An originator is the organization that creates and owns the metamodel. When defining a new custom originator in Metadata Manager, select ‘Customer’ as the originator type.
● Go to the Administration tab.
● Click Originators under Metamodel Management.
● Click Add to add a new originator.
● Fill out the requested information (Note: Domain Name, Name, and Type are mandatory fields). Click OK when you are finished.

2. Create the packages that contain the classes and associations of the subject metamodel. Define the packages to which custom classes and associations are assigned. Packages contain classes and their class associations. Packages have a hierarchical structure, where one package can be the parent of another package. Parent packages are generally used to group child packages together.
● Go to the Administration tab.
● Click Packages under Metamodel Management.
● Click Add to add a new package.
● Fill out the requested information (Note: Name and Originator are mandatory fields). Choose the originator created above. Click OK when you are finished.

3. Create Custom Classes. In this step, create the custom classes identified in the metamodel design task.
● Go to the Administration tab.
● Click Classes under Metamodel Management.
● From the drop-down menu, select the package that you created in the step above.
● Click Add to create a new class.
● Fill out the requested information (Note: Name, Package, and Class Label are mandatory fields).
● Base Classes: In order to see the metadata in the Metadata Manager metadata browser, you need to add at least the base class TreeElement. To do this:
  a. Click Add under Base Classes.
  b. Select the package.
  c. Under Classes, select TreeElement.
  d. Click OK. (You should now see the class properties in the Properties section.)
● To add custom properties to your class, click Add. Fill out the property information (Name, Data Type, and Display Label are mandatory fields). Click OK when you are done.
● Click OK at the top of the page to create the class.

Repeat the above steps for additional classes.
4. Create Custom Class Associations. In this step, implement the custom class associations identified in the metamodel design phase. In the previous step, CWM classes were added as base classes, and any of the class associations from those CWM base classes can be reused. Define those custom class associations that cannot be reused. If you only need the ElementOwnership association, skip this step.
● Go to the Administration tab.
● Click Associations under Metamodel Management.
● Click Add to add a new association.
● Fill out the requested information (all bold fields are required).
● Click OK when you are finished.

5. Create the Repository Type. Each type of repository contains unique metadata. For example, a PowerCenter data integration repository type contains workflows and mappings, but a Data Analyzer business intelligence repository type does not. Repository types maintain the uniqueness of each repository.
● Go to the Administration tab.
● Click Repository Types under Metamodel Management.
● Click Add to add a new repository type.
● Fill out the requested information (Note: Name and Product Type are mandatory fields). Click OK when you are finished.

6. Configure a Repository Type Root Class. Root classes display under the source repository in the metadata tree. All other objects appear under the root class. To configure a repository root class:
● Go to the Administration tab.
● Click Custom Repository Type Root Classes under Metamodel Management.
● Select the custom repository type.
● Optionally, select a package to limit the number of classes that display.
● Select the Root Class option for all applicable classes.
● Click Apply to apply the changes.

Set Up and Run the XConnect
The objective of this task is to set up and run the custom XConnect. Custom XConnects involve a set of mappings that transform source metadata into the required format specified in the Informatica Metadata Extraction (IME) files. The custom XConnect extracts the metadata from the IME files and loads it into the Metadata Manager Warehouse. This task includes the following steps:
1. Determine which Metadata Manager Warehouse tables to load. Although you do not have to load all Metadata Manager Warehouse tables, you must load the following tables:
● IMW_ELEMENT: The IME_ELEMENT interface file loads the element names from the source repository into the IMW_ELEMENT table. Note that element is used generically to mean packages, classes, or properties.
● IMW_ELMNT_ATTR: The IME_ELMNT_ATTR interface file loads the attributes belonging to elements from the source repository into the IMW_ELMNT_ATTR table.
● IMW_ELMNT_ASSOC: The IME_ELMNT_ASSOC interface file loads the associations between elements of a source repository into the IMW_ELMNT_ASSOC table.

To stop the metadata load into particular Metadata Manager Warehouse tables, disable the worklets that load those tables.
2. Reformat the source metadata. In this step, reformat the source metadata so that it conforms to the format specified in each required IME interface file. (The IME files are packaged with the Metadata Manager documentation.) Present the reformatted metadata in a valid source type format. To extract the reformatted metadata, the integration workflows require that the reformatted metadata be in one or more of the following source type formats: database table, database view, or flat file. Note that you can load metadata into a Metadata Manager Warehouse table using more than one of the accepted source type formats. (A sketch of this reformatting step appears after the registration steps below.)
3. Register the Source Repository Instance in Metadata Manager. Before the Custom XConnect can extract metadata, the source repository must be registered in Metadata Manager. When registering the source repository, the Metadata Manager application assigns a unique repository ID that identifies the source repository. Once the repository is registered, Metadata Manager adds an XConnect in the Configuration Console for it. To register the source repository, go to the Metadata Manager web interface and register the repository under the custom repository type created above. All packages, classes, and class associations defined for the custom repository type apply to all repository instances registered to the repository type. When defining the repository, provide descriptive information about the repository instance. To create the repository that will hold the metadata extracted from the source system:

● Go to the Administration tab.
● Click Repositories under Repository Management.
● Click Add to add a new repository.
● Fill out the requested information (Note: Name and Repository Type are mandatory fields). Choose the repository type created above.
● Click OK when finished.
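As a rough illustration of the reformatting step described above, the Python sketch below writes rows extracted from a custom source into a comma-delimited flat file. The column set, file name, and sample values shown are simplified assumptions; the authoritative layout for each interface comes from the IME files packaged with the Metadata Manager documentation.

import csv

# Hypothetical rows extracted from a custom metadata source.
source_rows = [
    {"repository_id": "CUSTOM_REPO", "class_id": "custom.table",
     "element_id": "ORDERS", "description": "Order header table"},
    {"repository_id": "CUSTOM_REPO", "class_id": "custom.column",
     "element_id": "ORDERS.ORDER_ID", "description": "Order identifier"},
]

# Write a flat file in a simplified IME-style layout (assumed columns).
# The real column list and order must match the packaged IME interface file;
# drop the header row if the interface expects data rows only.
columns = ["REPOSITORY_ID", "CLASS_ID", "ELEMENT_ID", "DESCRIPTION"]
with open("ime_element.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(columns)
    for row in source_rows:
        writer.writerow([row["repository_id"], row["class_id"],
                         row["element_id"], row["description"]])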

4. Configure the Custom Parameter Files. Custom XConnects require that the parameter file be updated by specifying the following information:
● The source type (database table, database view, or flat file).
● The name of the database views or tables used to load the Metadata Manager Warehouse, if applicable.
● The list of all flat files used to load a particular Metadata Manager Warehouse table, if applicable.
● The worklets you want to enable and disable.

Understanding Metadata Manager Workflows for Custom Metadata
● wf_Load_IME. Custom workflow to extract and transform metadata from the source repository into IME format. This is created by a developer.
Metadata Manager prepackages the following integration workflows for custom metadata. These workflows read the IME files mentioned above and load them into the Metadata Manager Warehouse:
❍ WF_STATUS: Extracts and transforms statuses from any source repository and loads them into the Metadata Manager Warehouse. To resolve status IDs correctly, the workflow is configured to run before the WF_CUSTOM workflow.
❍ WF_CUSTOM: Extracts and transforms custom metadata from IME files and loads that metadata into the Metadata Manager Warehouse.

5. Configure the Custom XConnect. The XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations specified in the custom metamodel. When the custom repository type is defined, Metadata Manager registers the corresponding XConnect in the Configuration Console. The following information in the Configuration Console configures the XConnect:
● Under the Administration tab, select Custom Workflow Configuration and choose the repository type to which the custom repository belongs.
● Workflows to load the metadata:
❍ Custom XConnect - wf_Load_IME workflow
❍ Metadata Manager - WF_CUSTOM workflow (prepackages all worklets and sessions required to populate all Metadata Manager Warehouse tables, except the IMW_STATUS table)
❍ Metadata Manager - WF_STATUS workflow (populates the IMW_STATUS table)
Note: Metadata Manager Server does not load Metadata Manager Warehouse tables that have disabled worklets.
● Under the Administration tab, select Custom Workflow Configuration and choose the parameter file used by the workflows to load the metadata (the parameter file name is assigned at first data load). This parameter file name has the form nnnnn.par, where nnnnn is a five-digit integer assigned at the time of the first load of this source repository. The script promoting Metadata Manager from the development environment to test and from the test environment to production preserves this file name.
6. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata load, reset the $$SRC_INCR_DATE parameter to extract metadata in shorter intervals, such as every few days. The value depends on how often the Metadata Manager Warehouse needs to be updated. If the source does not provide the date when the records were last updated, records are extracted regardless of the $$SRC_INCR_DATE parameter setting.
7. Run the Custom XConnect. Using the Configuration Console, Metadata Manager Administrators can run the custom XConnect and ensure that the metadata loads correctly.

Note: When loading metadata with Effective From and Effective To Dates, Metadata Manager does not validate whether the Effective From Date is less than the Effective To Date. Ensure that each Effective To Date is greater than the Effective From Date. If you do not supply Effective From and Effective To Dates, Metadata Manager sets the Effective From Date to 1/1/1899 and the Effective To Date to 1/1/3714.
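Because Metadata Manager does not perform this check, it can be worth validating the dates before the load. The following sketch assumes the effective dates are held in two columns of a comma-delimited source file; the file name, column names, and date format are hypothetical assumptions for illustration.

import csv
from datetime import datetime

# Flag rows whose Effective To date is not greater than the Effective From date.
# File name, column names, and date format are assumptions for illustration.
with open("custom_metadata.csv", newline="") as src:
    for line_no, row in enumerate(csv.DictReader(src), start=2):  # start=2 accounts for the header row
        eff_from = datetime.strptime(row["EFFECTIVE_FROM_DATE"], "%m/%d/%Y")
        eff_to = datetime.strptime(row["EFFECTIVE_TO_DATE"], "%m/%d/%Y")
        if eff_to <= eff_from:
            print(f"Row {line_no}: Effective To {eff_to:%m/%d/%Y} "
                  f"is not after Effective From {eff_from:%m/%d/%Y}")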

To Run a Custom XConnect
● Log in to the Configuration Console.
● Click Source Repository Management.
● Click Load next to the custom XConnect you want to run.

Configure the Reports and Schema
The objective of this task is to set up the reporting environment needed to run reports on the metadata stored in the Metadata Manager Warehouse. The setup of the reporting environment depends on the reporting requirements. The following options are available for creating reports:
● Use the existing schema and reports. Metadata Manager contains prepackaged reports that can be used to analyze business intelligence metadata, data integration metadata, data modeling tool metadata, and database catalog metadata. Metadata Manager also provides impact analysis and lineage reports that provide information on any type of metadata.
● Create new reports using the existing schema. Build new reports using the existing Metadata Manager metrics and attributes.
● Create new Metadata Manager Warehouse tables and views to support the schema and reports. If the prepackaged Metadata Manager schema does not meet the reporting requirements, create new Metadata Manager Warehouse tables and views. Prefix the name of custom-built tables with Z_IMW_ and custom-built views with Z_IMA_. If you build new Metadata Manager Warehouse tables or views, register the tables in the Metadata Manager schema and create new metrics and attributes in the Metadata Manager schema. Note that the Metadata Manager schema is built on the Metadata Manager views.

After the environment setup is complete, test all schema objects, such as dashboards, analytic workflows, reports, metrics, attributes, and alerts.

Last updated: 05-Jun-08 14:15


Customizing the Metadata Manager Interface

Challenge
Customizing the Metadata Manager Presentation layer to meet specific business needs.

Description
There are several areas in which the Metadata Manager Application interface can be customized to meet specific business needs. Customizations can be done by configuring security as well as the Metadata Manager Application interface. The first step to customization is configuring security according to business needs. By configuring security, only certain users will be able to access, search, and customize specific areas of Metadata Manager. Use the PowerCenter Administration Console to first create different roles, groups, and users. After users have been created, use the Security page in the Metadata Manager Application to manage permissions. The sections below cover some of the areas to configure when customizing Metadata Manager to meet specific business needs.

Metadata Manager Interface
The Metadata Manager Application interface consists of the following pages:
Browse. Browse and search the metadata catalog, create and view shortcuts and shared folders, view information about metadata objects, run data lineage and where-used analysis, and add information about metadata objects.
Model. Create and edit custom models, add custom attributes to packaged and custom models, and import and export custom models.
Load. Create and load resources to load metadata into the Metadata Manager warehouse. Use the Load page to monitor and schedule resource loads, purge metadata from the Metadata Manager warehouse, and manage the search index.
Security. Manage permissions on resources and metadata objects in the Metadata Manager warehouse.


The Metadata Manager Custom Metadata Integration Guide provides methodology and procedures for integrating custom metadata into the Metadata Manager warehouse. The Metadata Manager Custom Metadata Integration Guide is written for system administrators who want to load metadata from a repository type for which Metadata Manager does not package a model. This guide assumes that system administrators have knowledge of relational database concepts, models, and PowerCenter. Metadata Manager uses models to define the metadata it extracts from metadata sources. The following types of custom metadata can be added into the Metadata Manager warehouse:
● Metadata for a custom metadata source. Load or add metadata from a source for which Metadata Manager does not package a resource type (e.g., from a Microsoft Access database). Metadata Manager does not package a resource type for Microsoft Access. A custom model can be created for the source metadata and then loaded into the Metadata Manager warehouse. This is also called creating a custom resource.
● Attributes. Add custom attributes to the existing metadata in the Metadata Manager warehouse. For example, to add an additional attribute to a report for Cognos ReportNet, the Cognos ReportNet model can be edited in Metadata Manager. Add the attribute and then add the metadata for the attribute on the Browse page.
● Relationships. Add relationships from custom metadata classes to model classes for which Metadata Manager packages a resource type. For example, a column in a custom metadata source is also used in an Oracle table. A class-level relationship can be created between the custom source column and the Oracle table column. Then create the object-level relationship on the Browse page. The relationship can be created to run data lineage and where-used analysis on the custom metadata.

The Model page in Metadata Manager is where models for Metadata Manager are created or edited. After you create or edit the model for the custom metadata, you add the metadata to the Metadata Manager warehouse. You can add the metadata using the metadata catalog. You can also create a custom resource, create a template and generate PowerCenter workflows using the Custom Metadata Configurator, and load the metadata on the Load page in Metadata Manager. After you add the custom metadata into the Metadata Manager warehouse, use the Metadata Manager or the packaged Metadata Manager reports to analyze the metadata. You can create new reports to analyze additional information. You can also export and import the models, or export and import the metadata that you added to the metadata catalog.

Adding and Loading Metadata for a Custom Metadata Source

When you add metadata for a custom metadata source, you define a model for the source metadata to define the type of metadata that Metadata Manager extracts. You create the model and add classes, attributes, and relationships. After you define the model, you can add the metadata to the metadata catalog using the Browse page, or you can use the Custom Metadata Configurator to create a template and then load the metadata into the Metadata Manager warehouse from metadata source files. When you create a template, you use the Custom Metadata Configurator to create the template and the PowerCenter objects, including mappings, sessions, and workflows that Metadata Manager uses to extract metadata from metadata source files. You can export the metadata from the metadata source to a metadata source file, create a custom resource in Metadata Manager, and load the metadata from the metadata source files.

Adding Custom Metadata
To add metadata for a custom metadata source, complete the following steps:
Create the model. Create the model to represent the metadata in the metadata source using the Model page in Metadata Manager.
Add classes, attributes, and relationships. Add custom classes, attributes, and relationships to the model using the Model page.
Add the metadata to the Metadata Manager warehouse. Create a resource in the metadata catalog that represents the source metadata using the Browse page. Add custom metadata objects based on the classes you create.

Metadata Manager Reporting
You can access Metadata Manager Reporting from Metadata Manager to run reports. To access Data Analyzer from Metadata Manager, complete the following steps:
Create a Reporting Service. Create a Reporting Service in the PowerCenter Administration Console and use the Metadata Manager repository as the data source.
Launch Metadata Manager Reporting. On the Metadata Manager Browse page, click Reports in the toolbar. If you have the required privileges on the Reporting Service, Metadata Manager logs you into the Data Analyzer instance being used for Metadata Manager. You can then run the Metadata Manager reports.

Metadata Manager includes the following types of reports:
Primary reports. This is the top-level report in an analytic workflow. To access all lower-level reports in the analytic workflow, first run this report on the Analyze tab.
Standalone reports. Unlike analytic workflow reports, you run these reports independently of other reports.
Workflow reports. These are the lower-level reports in an analytic workflow. To access a workflow report, first run the associated primary report and all workflow reports that precede the given workflow report.
You can use these reports to perform several types of analysis on metadata stored in the Metadata Manager warehouse. Metadata Manager prepackages reports for Business intelligence, Data modeling, Data integration, Database management, and Metamodel.

Customizing Metadata Manager Reporting
You can create new reporting elements and attributes under ‘Schema Design’. These elements can be used in new reports or in extensions of existing reports. You can also extend or customize out-of-the-box reports, indicators, or dashboards. Informatica recommends using the ‘Save As’ new report option for such changes in order to avoid any conflicts during upgrades. The Metadata Manager Reports Reference provides a guide to the reports and attributes being used. Further, you can use Data Analyzer's 1-2-3-4 report creation wizard to create new reports. Informatica recommends saving such reports in a new report folder to avoid conflicts during upgrades.

Customizing Metadata Manager ODS Reports
Use the operational data store (ODS) report templates to analyze metadata stored in a particular repository. Although these reports can be used as is, they can also be customized to suit particular business requirements. Out-of-the-box reports can be used as a guideline for creating reports for other types of source repositories, such as a repository for which Metadata Manager does not prepackage a standard resource.

Last updated: 02-Jun-08 23:22


Estimating Metadata Manager Volume Requirements

Challenge
Understanding the relationship between various inputs for the Metadata Manager solution in order to estimate data volumes for the Metadata Manager Warehouse.

Description
The size of the Metadata Manager warehouse is directly proportional to the size of the metadata being loaded into it. The size depends on the number of element attributes captured in the source metadata and the associations defined in the metamodel. When estimating volume requirements for a Metadata Manager implementation, consider the following Metadata Manager components:
● Metadata Manager Service - Manages the source repository metadata stored in the Metadata Manager Warehouse. You can use Metadata Manager, which uses the Metadata Manager Service, to search, view, and configure source repository metadata and run reports.
● Metadata Manager Integration Repository - This PowerCenter repository stores the workflows, which are resource components that extract source metadata and load it into the Metadata Manager Warehouse.
● Metadata Manager Warehouse - The Metadata Manager Warehouse stores the Metadata Manager metadata. It also stores source repository metadata and metamodels.

Considerations
Volume estimation for Metadata Manager is an iterative process. Use the Metadata Manager in the development environment to get accurate size estimates for the Metadata Manager in the production environment. The required steps are as follows:
1. Identify the source metadata that needs to be loaded in the Metadata Manager production warehouse.
2. Size the Metadata Manager development warehouse based on the initial sizing estimates (as explained under the Sizing Estimate Example section of this document).
3. Run the resource loads and monitor the disk usage. The development metadata loaded during the initial run of the resources should be used as a baseline for further sizing estimates.


Sizing Estimate Example
The following table is an initial estimation matrix that should be helpful in deriving a reasonable initial estimate. For increased input sizes, expect the Metadata Manager Warehouse target size to increase in direct proportion.

Resource | Input Size | Expected Metadata Manager Warehouse Target Size
Metamodel and other tables | - | 50MB
PowerCenter | 1MB | 10MB
Data Analyzer | 1MB | 4MB
Database | 1MB | 5MB
Other Resource | 1MB | 4.5MB
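As a simple worked example of how these ratios can be applied, the Python sketch below multiplies assumed source metadata sizes by the expansion factors from the table. The input sizes are hypothetical, and the result should only be treated as a starting point to be validated against the development environment as described above.

# Rough first-cut estimate using the expansion ratios from the table above.
# The source sizes are assumed values for illustration only.
expansion_factor = {"PowerCenter": 10, "Data Analyzer": 4, "Database": 5, "Other Resource": 4.5}
source_size_mb = {"PowerCenter": 25, "Data Analyzer": 10, "Database": 40}  # assumed inputs in MB
base_mb = 50  # metamodel and other tables

estimate_mb = base_mb + sum(source_size_mb[name] * expansion_factor[name]
                            for name in source_size_mb)
print(f"Initial Metadata Manager warehouse estimate: {estimate_mb} MB")
# 50 + (25 * 10) + (10 * 4) + (40 * 5) = 540 MB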

Last updated: 02-Jun-08 23:31


Metadata Manager Business Glossary

Challenge
A group of people working towards a common goal needs shared definitions for the information they are dealing with. Implementing a Business Glossary with Metadata Manager provides a vocabulary to facilitate better understanding between business and IT.

Description
Data values and data object names, such as names of entities and attributes, can be interpreted differently by various groups. In many organizations, Business Analysts create spreadsheets or word documents to manage business terms. Without a common repository to store these business terms, it is often a challenge to communicate them to other groups. By creating a Business Glossary, Metadata Manager can be leveraged to associate business terms to IT assets. This is achieved by creating and configuring a custom model in Metadata Manager. The custom metadata model is used for building the Business Glossary as a searchable online catalog. The Business Glossary can also be printed and published as a report. In order to capture the association between the business terms and actual IT implementations, it is necessary to utilize the Classes from the Business Nomenclature package of the Common Warehouse Model (CWM).
[Figure: major elements of the Business Nomenclature package (Poole et al., 2002b)]

For an example implementation of a Business Glossary, it is assumed that business terms are currently maintained in spreadsheets. By building a metadata adapter, the business terms can be loaded from the spreadsheets to Metadata Manager. Predefined PowerCenter workflows that come with Metadata Manager can use the spreadsheets as source and load into the Metadata Manager warehouse. The main steps involved to implement a Business Glossary with Metadata Manager are:
1. Create the Model
2. Configure Classes
3. Configure the Class Attributes
4. Configure the Class Relationships
5. Load Custom Resource using Custom Metadata Configurator
6. Add Custom Resource to the Browse page
Before creating the Model, it is necessary to identify the model name/description, parent classes and subclasses, attributes and relationships. Subsequently, a new model for the Business Glossary should be created and classes for this model should be configured. The following table is a summary of the repository and class model definition:

CLASS NAME | DESCRIPTION | BASE CLASSES | ATTRIBUTES
Category | Represents a category in the glossary | CWM:Business Nomenclature:Glossary, IMM Tree Element | Name
Term | Represents a term in the business glossary | CWM:Business Nomenclature:Term | Name, subtype, version, supplier

An association called “ConceptToimplementation” will be added to link the Term class to physical implementations such as database tables. Once the class is opened on the Model page from the Model navigator, attributes and relationships can be configured.
The next step is to load the spreadsheets to Metadata Manager. Element metadata and association metadata files should be created in order to map attributes and associations to classes. The ime_element interface should be used to load every entity identified as an element in the source repository. Any named entity in the source system can be classified as an element. IME_ELEMENT has some predefined attributes, such as Description, which can be mapped to this interface. The value of the element class identifier, CLASS_ID, must exist in the IMW_CLASS table. If not, the ETL process rejects the record. The combination of REPOSITORY_ID, CLASS_ID, and ELEMENT_ID must be unique for each element. The following is an example element metadata file:

REPOSITORY_ID | CLASS_ID | ELEMENT_NAME | DESCRIPTION
BUS_GLOSSARY | Businessglossary.category | CRM | CRM related terms
BUS_GLOSSARY | Businessglossary.term | Acquisition Cost | Total dollar amount spent to acquire a new customer
BUS_GLOSSARY | Businessglossary.term | Advertisement | A paid, public, promotional message for an identified sponsor promoting a company’s products and services
BUS_GLOSSARY | Businessglossary.term | Campaign Management | A marketing process to effectively promote the products and services of a given organization to prospects

The ime_elmnt_assoc interface should be used to load associations between two elements in a source repository. An association is the way in which two elements are related. Examples of associations in a database environment include associations between an index and the table on which it is defined, a table and its corresponding columns, and a column and its constraints. There are two elements in each association. The From element is the element from which the association is defined. The TO_ELEMENT is the element to which the association is defined. While mapping associations to this interface, ensure that one of the following is true:
● The values in the FROM_CLASS_ID and TO_CLASS_ID are the same as the From and To classes that are defined in the association in the IMW_ASSOCIATION table.
● The classes specified in the FROM_CLASS_ID and TO_CLASS_ID of the interface have base classes that are the From and To classes defined in the association in the IMW_ASSOCIATION table.

In other words, you cannot load an association for an object if the association is not defined for its class or the base class of its class. This interface loads the IMW_ELMNT_ASSOC table. The Metadata Manager server uses this table to show metadata in the browser, and the ETL process uses this table to load other Metadata Manager Warehouse tables. The following is an outline of the necessary fields for the association metadata file:

LABEL | COLUMN NAME | DESCRIPTION
Repository | REPOSITORY_ID | Uniquely identifies the repository which stores the association between two elements.
From Element Class | FROM_CLASS_ID | Uniquely identifies the class of the From element in the association being loaded.
From Element | FROM_ELEMENT_ID | Uniquely identifies the From element in the association being loaded.
From Element Repository | FROM_REPO_ID | Uniquely identifies the repository to which the From element belongs.
To Element Class | TO_CLASS_ID | Uniquely identifies the class of the To element in the association being loaded.
To Element | TO_ELEMENT_ID | Uniquely identifies the To element of the association being loaded.
To Element Repository | TO_REPO_ID | Uniquely identifies the repository to which the To element belongs.
Association | ASSOCIATION_ID | Uniquely identifies the association between the From element and To element in the source repository.

The last step before generating the PowerCenter mappings and workflows to load the spreadsheets into the Metadata Manager warehouse is configuring a template to store information about how to map the metadata object attributes to the class attributes. In order to load the source spreadsheets, it is necessary to generate a comma-delimited file. This can be done by saving the spreadsheets in .csv format. The names of the .csv files must be entered in the indirect files that PowerCenter will use as source, as illustrated in the sketch below.
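A minimal Python sketch of this conversion step follows. It assumes the business terms have already been saved from the spreadsheet as a .csv file whose columns match the element metadata example above; the file names and the one-path-per-line indirect file layout are assumptions for illustration.

import csv

# Read glossary terms exported from the spreadsheet (assumed column layout).
with open("business_terms.csv", newline="") as src:
    terms = list(csv.DictReader(src))

# Write the element metadata file using the columns from the example above.
with open("ime_element_glossary.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["REPOSITORY_ID", "CLASS_ID", "ELEMENT_NAME", "DESCRIPTION"])
    for term in terms:
        writer.writerow(["BUS_GLOSSARY", "Businessglossary.term",
                         term["ELEMENT_NAME"], term["DESCRIPTION"]])

# The indirect file PowerCenter reads as its source lists one data file per line.
with open("glossary_filelist.txt", "w") as filelist:
    filelist.write("ime_element_glossary.csv\n")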


Once the workflows complete and the spreadsheets are loaded into the Metadata Manager Warehouse, the Business Glossary information can be searched and browsed like any other metadata resource. Finally, a Business Glossary implementation with Metadata Manager not only captures the business terms, but also provides a way to relate technical and business concepts. This allows Business Users, Data Stewards, Business Analysts, and Data Analysts to describe key concepts of their business and communicate them to other groups.

Last updated: 27-May-08 12:20


Metadata Manager Load Validation

Challenge
Just as it is essential to know that all data for the current load cycle has loaded correctly, it is important to ensure that all metadata extractions (Metadata Resources) loaded correctly into the Metadata Manager warehouse. If metadata extractions do not execute successfully, the Metadata Manager warehouse will not be current with the most up-to-date metadata.

Description
The process for validating Metadata Manager metadata loads is very simple using the Metadata Manager Application interface. In the Metadata Manager Application interface, you can view the run history for each of the resources. For load validation, use the Load page in the Metadata Manager Application interface, PowerCenter Workflow Monitor, and PowerCenter Administration Console logs. The Workflow Monitor in PowerCenter will also have a workflow and session log for the resource load. Resources can fail for a variety of reasons common in IT such as unavailability of the database, network failure, improper configuration, etc. More detailed error messages can be found in the activity log or in the workflow log files. The following installation directories will also have additional log files that are used for the resource load process:
…\server\tomcat\mm_files\MM_PC851\mm_load
…\server\tomcat\mm_files\MM_PC851\mm_index
…\server\tomcat\logs

Loading and Monitoring Resources Overview
After you configure the metadata source and create a resource, you can load the resource. When you load a resource, Metadata Manager uses the connection information for the resource to extract the metadata from the metadata source. Metadata Manager converts the extracted metadata into IME files and runs PowerCenter workflows to load the metadata from the IME files into the Metadata Manager warehouse. You can use the Load page to perform the following resource tasks:
Load a resource. Load the source metadata for a resource into the Metadata Manager warehouse. Metadata Manager extracts metadata and profiling information, and indexes the resource.


Monitor a resource. Use the Metadata Manager activity log, resource output log, and PowerCenter Workflow Monitor to monitor and troubleshoot resource loads.
Schedule a resource. Create a schedule to select the time and frequency that Metadata Manager loads a resource. You can attach the schedule to a resource.
[Figure: the Load page for Metadata Manager]

Loading Resources
You can load a resource for Metadata Manager immediately in the Load page. Metadata Manager loads the resource and displays the results of the resource load in the Resource List. When Metadata Manager loads a resource, it completes the following tasks:
Loads metadata. Loads the metadata for the resource into the Metadata Manager warehouse.
Extracts profiling information. Extracts profiling information from the source database. If you load a relational database resource, you can extract profiling information from tables and columns in the database.
Indexes the resource. Creates or updates the index files for the resource.
You can start the load process from the Resource List section of the Load page.


To load a resource:
1. On the Load page, select the resource you want to load in the Resource List.
2. Click Load. Metadata Manager adds the resource to the load queue and starts the load process. If Metadata Manager finds an unassigned connection to another metadata source, Metadata Manager pauses the load. You must configure the connection assignment to proceed. Configure the connection assignments for the resource in the Resource Properties section and click Resume. For more information about configuring connection assignments, see Configuring Connection Assignments.
3. To cancel the load, click Cancel.
When the resource load completes, Metadata Manager updates the Last Status Date and Last Status for the resource. You can use the activity log and the output log to view more information about the resource load.

Resuming a Failed Resource Load
If a resource load fails when PowerCenter runs the workflows that load the metadata into the warehouse, you can resume the resource load. You can use the output log in Metadata Manager and the workflow and session logs in the PowerCenter Workflow Manager to troubleshoot the error and resume the resource load.
To resume a failed load:
1. On the Load page, select the resource in the Resource List for which you want to resume the resource load.
2. Click Resume. Metadata Manager continues loading the resource from the previous point of failure and completes any profiling or indexing operations.

Load Queue
When you load a resource, Metadata Manager places the resource in a load queue. The load queue controls the order in which Metadata Manager loads resources. Metadata Manager places resources in the load queue when you start the resource load from the Load page or when a scheduled resource load begins. If a resource load fails, Metadata Manager keeps the resource in the load queue until the timeout interval for the resource load is exceeded. When the timeout interval is exceeded, Metadata Manager removes the resource from the load queue and begins loading the next resource in the queue. You can configure the number of resources that Metadata Manager loads simultaneously and the timeout interval for resource loads when you configure the Metadata Manager Service in the PowerCenter Administration Console.

Loading Metadata Sources in Order
To view data lineage or where-used analysis across source repositories and databases, you configure the connection assignments for the resource, load the metadata for the database or other source repository, and then load the resource that contains the connections. For example, suppose you want to run data lineage analysis between a PowerCenter repository and an Oracle database. You must load the Oracle database, configure the connection assignments for the PowerCenter resource, and then load the PowerCenter resource.

Monitoring Resources
You can monitor resource load runs to determine if they are successful. If a resource load fails, troubleshoot the failure and load the resource again. You can use the following logs in Metadata Manager to view information about resource loads and troubleshoot errors:
Activity log. Contains the status of resource load operations for all resources.
Output log. Contains detailed information about each resource load operation.
You can also use the PowerCenter Workflow Monitor to view the workflows as they load the metadata. Use session and workflow logs to troubleshoot errors. If you load multiple resources of the same resource type concurrently, the Integration Service runs multiple instances of the workflow that corresponds to the resource type. Each workflow instance includes separate workflow and session logs. You can also use mmcmd and mmwfdrundetails to get more information about the status of a resource load and about the PowerCenter workflows and sessions that load metadata.
Note: Profiling may show as successful although some of the PowerCenter sessions that load profiling information fail. Sessions can fail because of run-time resource constraints. If one or more sessions fail but the other profiling sessions complete successfully, profiling displays as successful on the Load page.
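Where many resources are loaded on a schedule, a small script can automate part of this check by scanning the output logs for workflows that started but never completed. The Python sketch below is illustrative only: the log directory, file extension, and the "Started workflow"/"Completed workflow" message format are assumptions based on the output log excerpt shown later in this section, and should be adjusted to the actual installation.

import re
from pathlib import Path

# Assumed location of Metadata Manager load logs; adjust to the installation.
log_dir = Path(r"\server\tomcat\mm_files\MM_PC851\mm_load")

for log_file in log_dir.glob("*.log"):
    text = log_file.read_text(errors="ignore")
    started = set(re.findall(r"Started workflow (\S+)", text))
    completed = set(re.findall(r"Completed workflow (\S+)", text))
    for workflow in sorted(started - completed):
        print(f"{log_file.name}: workflow {workflow} started but did not complete")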

Activity Log
The activity log contains details on each resource load. Use the activity log to get more details on a specific resource load.
[Figure: sample activity log]

The following table describes the contents of the activity log: Property

Description

Resource Name of the resource. Task Type

Type of task performed by Metadata Manager. Metadata Manager performs the following tasks: - Metadata Load. Loads metadata into the Metadata Manager warehouse. - Profiling. Extracts profiling information from the source database. - Indexing. Creates or updates index files for the resource.

User

Metadata Manager user that started the resource load.

Start Date The date and time the resource load started.

INFORMATICA CONFIDENTIAL

BEST PRACTICES

646 of 954

End Date The date and time the resource load completed. Status

The status of the metadata load, profiling, and indexing operations.

To view the contents of the activity log:
1. On the Load page, click Activity Log. The Activity Log window appears.
2. To filter the contents of the Activity Log window, select the time frame in the Show logs list.
3. To sort by column, click the column name.
4. To refresh the log to see recent changes, click Refresh.

Output Log
The output log displays the results of the most recent resource load for a resource. Use the output log for detailed information about the operations performed by Metadata Manager when it loads the resource. The following example shows an excerpt from an output log:
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Starting metadata load...
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Resource Name: PowerCenter_85
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Resource Type: PowerCenter
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Resource Group: Data Integration
MetadataLoad [Sun Sep 16 09:20:54 PDT 2007] : Metadata load started...
MetadataLoad [Sun Sep 16 09:20:55 PDT 2007] : Task started: ETLTaskHandler.
MetadataLoad [Sun Sep 16 09:20:55 PDT 2007] : Opened connection to PowerCenter repository
MetadataLoad [Sun Sep 16 09:20:55 PDT 2007] : Connected to the Repository Service
MetadataLoad [Sun Sep 16 09:21:20 PDT 2007] : Started workflow WF_PC8X_STAGE
MetadataLoad [Sun Sep 16 09:21:20 PDT 2007] : Waiting for workflow to complete...
MetadataLoad [Sun Sep 16 09:24:06 PDT 2007] : Completed workflow WF_PC8X_STAGE
MetadataLoad [Sun Sep 16 09:24:09 PDT 2007] : Started workflow WF_PC
MetadataLoad [Sun Sep 16 09:24:09 PDT 2007] : Waiting for workflow to complete...

Last updated: 02-Jun-08 23:54


Metadata Manager Migration Procedures

Challenge
This Best Practice describes the processes to follow when Metadata Manager is deployed in multiple environments and out-of-the-box Metadata Manager components are customized or configured, or new components are added to Metadata Manager.
● Reports: This includes changes to the reporting schema and the out-of-the-box reports. This also includes any newly created reports or schema elements generated to cater to custom reporting needs and located at a specific implementation instance of the product.
● Metamodel: This includes additionally created metamodel components needed to help associate any custom metadata against repository types and domains that are not covered by the out-of-the-box Metadata Manager repository types.
● Metadata: This includes additionally created metadata objects, their properties, or associations against repository instances configured within Metadata Manager. These repository instances could either belong to the repository types supported out of the box by Metadata Manager or to any new repository types configured through custom additions to the metamodels.
● Integration Repository: This includes changes to the out-of-the-box PowerCenter workflows or mappings. This would also include any new PowerCenter objects (mappings, transformations, etc.) or associated workflows.

Description

Report Changes
The following chart depicts the various scenarios related to the reporting area and the actions that need to be taken for the migration of the changed components. It is always advisable to create new schema elements (metrics, attributes, etc.) or new reports in a new (migration target) Data Analyzer folder to facilitate exporting or importing the Data Analyzer objects across development, test, and production.

Nature of Report Change: Modify schema component (metric, attribute, etc.)
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed components.
Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed schema components. Test and verify the changes within the Test environment.
Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed schema components. Test and verify the changes within the Production environment.

Nature of Report Change: Modify an existing report (add or delete metrics, attributes, or filters; change formatting; etc.)
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed report.
Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed report. Test and verify the changes within the Test environment.
Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed report. Test and verify the changes within the Production environment.

Nature of Report Change: Add new schema component (metric, attribute, etc.)
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new schema components.
Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Test environment.
Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Production environment.

Nature of Report Change: Add new report
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new report.
Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Test environment.
Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Production environment.

Metamodel Changes
The following chart depicts the various scenarios related to the metamodel area and the actions that need to be taken for the migration of the changed components.

Nature of the Change: Add new metamodel component
Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new metamodel components (export can be done at three levels: Originators, Repository Types, and Entry Points) using the “Export Metamodel” option.
Test: Back up the Test environment as a failsafe. Import the XML exported from the development environment using the “Import metamodel” option. Test and verify the changes within the Test environment.
Production: Back up the Production environment as a failsafe. Import the XML exported from the development environment using the “Import metamodel” option. Test and verify the changes within the Production environment.

Nature of the Change: Modify an existing mapping, transformation and/or the associated workflows etc. Development Perform the change in development, test the same and certify it for deployment. Do an XML export of the changed objects.

Test

Production

Backup up the Test environment as a failsafe.

Backup the Production environment as a failsafe.

Import the XML exported from the development environment.

Import the XML exported from the development environment.

Answer ‘Yes’ to overriding the definitions that already exist for the changed object.

Answer ‘Yes’ to overriding the definitions that already exist for the changed object.

Test and verify the changes within the TEST environment.

Test and verify the changes within the Production environment.

Nature of the Change: Add new ETL object (mapping, transformation etc.) Development

INFORMATICA CONFIDENTIAL

Test

BEST PRACTICES

Production

650 of 954

Perform the change in development, test the same and certify it for deployment.

Backup up the Test environment as a failsafe.

Import the XML exported from the Do an XML export of the new objects. development environment. Test and verify the changes within the TEST environment.

Backup the Production environment as a failsafe. Import the XML exported from the development environment. Test and verify the changes within the Production environment.

Last updated: 03-Jun-08 00:05

INFORMATICA CONFIDENTIAL

BEST PRACTICES

651 of 954

Metadata Manager Repository Administration Challenge The task of administering the Metadata Manager Repository involves taking care of both the integration repository and the Metadata Manager warehouse. This requires knowledge of both PowerCenter administrative features (i.e., the integration repository used in Metadata Manager) and Metadata Manager administration features.

Description A Metadata Manager administrator needs to be involved in the following areas to ensure that the Metadata Manager warehouse is fulfilling the end-user needs: ●



Migration of Metadata Manager objects created in the Development environment to QA or the Production environment Creation and maintenance of access and privileges of Metadata Manager objects



Repository backups



Job monitoring



Metamodel creation.

Migration from Development to QA or Production In cases where a client has modified out-of-the-box objects provided in Metadata Manager or created a custom metamodel for custom metadata, the objects must be tested in the Development environment prior to being migrated to the QA or Production environments. The Metadata Manager Administrator needs to do the following to ensure that the objects are in sync between the two environments: ●



Install a new Metadata Manager instance for the QA/Production environment. This involves creating a new integration repository and Metadata Manager warehouse Export the metamodel from the Development environment and import it to QA or production via XML Import/Export functionality (in the Metadata Manager Administration tab) or via the Metadata Manager command line utility.

INFORMATICA CONFIDENTIAL

BEST PRACTICES

652 of 954



Export the custom or modified reports created or configured in the Development environment and import them to QA or Production via XML Import/Export functionality. This functionality is identical to the function in Data Analyzer.

Providing Access and Privileges
Users can perform a variety of Metadata Manager tasks based on their privileges. The Metadata Manager Administrator can assign privileges to users by assigning them roles. Each role has a set of privileges that allow the associated users to perform specific tasks. The Administrator can also create groups of users so that all users in a particular group have the same functions. When an Administrator assigns a role to a group, all users of that group receive the privileges assigned to the role. The Metadata Manager Administrator can assign privileges to users to enable them to perform any of the following tasks in Metadata Manager:
● Configure reports. Users can view particular reports, create reports, and/or modify the reporting schema.
● Configure the Metadata Manager Warehouse. Users can add, edit, and delete repository objects using Metadata Manager.
● Configure metamodels. Users can add, edit, and delete metamodels.

Metadata Manager also allows the Administrator to create access permissions on specific source repository objects for specific users. Users can be restricted to reading, writing, or deleting source repository objects that appear in Metadata Manager. Similarly, the Administrator can establish access permissions for source repository objects in the Metadata Manager warehouse. Access permissions determine the tasks that users can perform on specific objects. When the Administrator sets access permissions, he or she determines which users have access to the source repository objects that appear in Metadata Manager. The Administrator can assign the following types of access permissions to objects:
● Read - Grants permission to view the details of an object and the names of any objects it contains.
● Write - Grants permission to edit an object and create new repository objects in the Metadata Manager warehouse.
● Delete - Grants permission to delete an object from a repository.
● Change permission - Grants permission to change the access permissions for an object.


When a repository is first loaded into the Metadata Manager warehouse, Metadata Manager provides all permissions to users with the System Administrator role. All other users receive read permissions. The Administrator can then set inclusive and exclusive access permissions.

Metamodel Creation

In cases where a client needs to create custom metamodels for sourcing custom metadata, the Metadata Manager Administrator needs to create new packages, originators, repository types and class associations.

Job Monitoring

When Metadata Manager Resources are running in the Production environment, Informatica recommends monitoring loads through the Metadata Manager console. The Load page in the Metadata Manager application interface has an Activity Log that can identify the total time it takes for a Resource to complete. The console maintains a history of all runs of a Resource, enabling a Metadata Manager Administrator to ensure that load times are meeting the SLA agreed upon with end users and that the load times are not increasing inordinately as data increases in the Metadata Manager warehouse. The Activity Log provides the following details about each repository load:

● Repository Name - name of the source repository defined in Metadata Manager
● Run Start Date - day of week and date the Resource run began
● Start Time - time the Resource run started
● End Time - time the Resource run completed
● Duration - number of seconds the Resource run took to complete
● Ran From - machine hosting the source repository
● Last Refresh Status - status of the Resource run, and whether it completed successfully or failed

Repository Backups

When Metadata Manager is running in either the Production or QA environment, Informatica recommends taking periodic backups of the following areas:


● Database backups of the Metadata Manager warehouse
● Integration repository; Informatica recommends either of two methods for this backup:
❍ The PowerCenter Repository Server Administration Console or the pmrep command line utility
❍ The traditional, native database backup method

The native PowerCenter backup is required but Informatica recommends using both methods because, if database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.
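The pmrep backup can be scripted and scheduled so that it runs without manual intervention. The lines below are only a sketch: the repository, domain, user, and output path are hypothetical, and the exact option letters vary slightly across PowerCenter versions, so confirm them with pmrep help before use.

#!/bin/sh
# Hypothetical names and paths -- adjust to the local environment.
pmrep connect -r MM_PCREPO -d Domain_MM -n Administrator -x admin_password
pmrep backup -o /backup/mm_pcrepo_`date +%Y%m%d`.rep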

Last updated: 03-Jun-08 00:20


Upgrading Metadata Manager

Challenge

This best practices document summarizes the instructions for a Metadata Manager upgrade.

Description

Install and configure PowerCenter Services before you start the Metadata Manager upgrade process. To upgrade a Metadata Manager repository, create a Metadata Manager Service and associate it with the Metadata Manager repository you wish to upgrade. Then use the Metadata Manager options to upgrade the Metadata Manager repository. Use the Administration Console to perform the following Metadata Manager upgrade tasks. Refer to the PowerCenter Configuration Guide for detailed instructions on upgrading Metadata Manager.

Before you start the upgrade process, be sure to check the Informatica support information for the Metadata Manager upgrade path. For instance, Superglue 2.1 (as Metadata Manager was previously called) should first be upgraded to Metadata Manager 8.1 and then to Metadata Manager 8.5. Superglue 2.2 or Metadata Manager 8.1 can be upgraded to Metadata Manager 8.5 in one step.

Also verify the requirements for the following Metadata Manager 8.5 components:

● Metadata Manager and Metadata Manager Client
● Web browser
● Databases
● Third-party software
● Code pages
● Application server

For more information about requirements for the PowerCenter components, see Chapter 3, “PowerCenter Prerequisites,” in the PowerCenter Installation Guide. For information about requirements for the Metadata Manager components, see Chapter 2, “Verify Prerequisites After You Install,” in the PowerCenter Configuration Guide.

● Disk space for the Metadata Manager repository. 1 GB is considered the starting size; disk usage can grow considerably beyond this when many or large metadata resources are loaded.



● Flash 9 plug-in from Adobe. The Flash 9 plug-in from Adobe is required to properly display data lineage. To run data lineage analysis in Metadata Manager or from the Designer, download and install the Flash 9 plug-in for the web browser; it can be obtained from the Adobe web site and can be downloaded after the upgrade. When a data lineage display is started, Metadata Manager prompts for the Adobe installation and points to the correct web site. To check whether Adobe Flash Player 9 is installed on a Windows client, check Start > Control Panel > Add or Remove Programs for an Adobe Flash Player 9 entry (usually one of the first entries).

As we already know from the existing installation, Metadata Manager is made up of various components. Except for the Metadata Manager repository, all other Metadata Manager components (i.e., Metadata Manager Server, PowerCenter Repository, PowerCenter Clients and Metadata Manager Clients) should be uninstalled and then reinstalled with the latest version of Metadata Manager. Keep in mind that all modifications and/or customizations to the standard version of Metadata Manager will be lost and will need to be re-created and re-tested after the upgrade process.

Upgrade Steps

1. Set up the new repository database and user account.

● Set up a new database/schema for the PowerCenter Metadata Manager repository. For Oracle, set the appropriate storage parameters. For IBM DB2, use a single-node tablespace to optimize PowerCenter performance. For IBM DB2, configure the system temporary table spaces and update the heap sizes.
● Create a database user account for the PowerCenter Metadata Manager repository. The database user must have permissions to create and drop tables and indexes, and to select, insert, update, and delete data from tables.

2. Make a copy of the existing Metadata Manager repository.

You can use any backup or copy utility provided with the database to make a copy of the working Metadata Manager repository prior to upgrading the Metadata Manager. Use the copy of the Metadata Manager repository for the new Metadata Manager installation.
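For example, on an Oracle-hosted Metadata Manager repository the copy could be taken with the classic export/import utilities. This is only an illustrative sketch; the connection details, schema names, and file paths are assumptions, and any copy method supported by your database is equally valid.

#!/bin/sh
# Hypothetical connection and schema names -- adjust to the local environment.
exp userid=system/manager owner=MM_REPO file=/backup/mm_repo_copy.dmp log=/backup/mm_repo_exp.log
imp userid=system/manager fromuser=MM_REPO touser=MM_REPO_COPY file=/backup/mm_repo_copy.dmp log=/backup/mm_repo_imp.log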

3. Back up the existing parameter files.

Make a copy of the existing parameter files. If you have custom Resources and the parameter, attribute and data files of these custom Resources are in a different location, do not forget to take a backup of them too. You may need to refer to these files when you later configure the parameters for the custom Resources as part of the Metadata Manager client upgrade.
For PowerCenter 8.5, you can find the parameter files in the following directory: PowerCenter_Home\server\infa_shared\SrcFiles
For Metadata Manager 8.5, you can find the parameter files in the following directory: PowerCenter_Home\Server\SrcFiles
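On a Unix/Linux host this backup can be as simple as archiving the directories above. The paths in the sketch below are assumptions and should be adjusted to the local install.

#!/bin/sh
# Hypothetical install path -- substitute the actual PowerCenter_Home.
PC_HOME=/opt/informatica/PowerCenter
tar -cvf /backup/mm_srcfiles_`date +%Y%m%d`.tar $PC_HOME/server/infa_shared/SrcFiles
# Repeat for any directories that hold custom Resource parameter, attribute, or data files.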

4. Export the Metadata Manager mappings that you customized or created for your environment.



If you made any changes to the standard Metadata Manager mappings, or created new mappings within the Metadata Manager Integration repository, make an export of these mappings, workflows and/or sessions. If you created any additional reports, make an export of these reports too.

5. Install Metadata Manager.

Select the Custom installation set and install Metadata Manager. The installer creates a Repository Service and Integration Service in the PowerCenter domain and creates a PowerCenter repository for Metadata Manager.


6. Stop the Metadata Manager server.

You must stop the Metadata Manager server before you upgrade the Metadata Manager repository contents.

7. Upgrade the Metadata Manager repository.

Use the Metadata Manager upgrade utility shipped with the latest version of Metadata Manager to upgrade the Metadata Manager repository.

8. Complete the Metadata Manager post-upgrade tasks.

After you upgrade the Metadata Manager repository, perform the following tasks:

Update metamodels for Business Objects and Cognos ReportNet Content Manager.



Delete obsolete Metadata Manager objects.



Refresh Metadata Manager views.



For a DB2 Metadata Manager repository, import metamodels.

9. Upgrade the Metadata Manager Client.



For instructions on upgrading the Metadata Manager Client, refer to the PowerCenter Configuration Guide. After you complete the upgrade steps, verify that all dashboards and reports are working correctly in Metadata Manager. When you are sure that the new version is working properly, you can delete the old instance of Metadata Manager and switch to the new version.

10. Compare and redeploy the exported Metadata Manager mappings that were customized or created for your environment.

If you had any modified Metadata Manager mappings in the previous release of Metadata Manager, check whether the modifications are still necessary. If the modifications are still needed, override or rebuild the changes into the new PowerCenter mappings.




Import the customized reports into the new environment and check that the reports are still working with the new Metadata Manager environment. If not, make the necessary modifications to make them compatible with the new structure.

11. Upgrade the custom Resources.

If you have any custom Resources in your environment, you need to regenerate the Resource mappings that were generated by the previous version of the custom Resource configuration wizard. Before starting the regeneration process, ensure that the absolute paths to the .csv files are the same as the previous version. If all the paths are the same, no further actions are required after the regeneration of the workflows and mappings.

12. Uninstall the previous version of Metadata Manager.

Verify that the browser and all reports are working correctly in the new version of Metadata Manager. If the upgrade is successful, you can uninstall the previous version of Metadata Manager.

Last updated: 05-Jun-08 14:26


Daily Operations

Challenge

Once the data warehouse has been moved to production, the most important task is keeping the system running and available for the end users.

Description

In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.

Monitoring the System

Monitoring the system is useful for identifying any problems or outages before the users notice. The Production Support team must know what failed, where it failed, when it failed, and who needs to be working on the solution. Identifying outages and/or bottlenecks can help to identify trends associated with various technologies. The goal of monitoring is to reduce downtime for the business user. Comparing the monitoring data against threshold violations, service level agreements, and other organizational requirements helps to determine the effectiveness of the data warehouse and any need for changes.

Service Level Agreement

The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained. This is a high-level document that discusses system maintenance and the components of the system, and identifies the groups responsible for monitoring the various components. The SLA should be able to be measured against key performance indicators. At a minimum, it should contain the following information:

Times when the system should be available to users.



Scheduled maintenance window.



Who is expected to monitor the operating system.



Who is expected to monitor the database.



Who is expected to monitor the PowerCenter sessions.



How quickly the support team is expected to respond to notifications of system failures.



Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure.

Operations Manual

The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the data warehouse system maintenance. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that can arise. This manual should contain information on how to maintain all data warehouse system components. At a minimum, the Operations Manual should contain:

Information on how to stop and re-start the various components of the system.



Ids and passwords (or how to obtain passwords) for the system components.




Information on how to re-start failed PowerCenter sessions and recovery procedures.



A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times.



Error handling strategies.



Who to call in the event of a component failure that cannot be resolved by the Production Support team.

PowerExchange Operations Manual

The need to maintain archive logs and listener logs, use started tasks, perform recovery, and carry out other operational functions on MVS are challenges that need to be addressed in the Operations Manual. If listener logs are not cleaned up on a regular basis, operations is likely to face space issues. Setting up archive logs on MVS requires datasets to be allocated and sized. Recovery after failure requires operations intervention to restart workflows and set the restart tokens. For Change Data Capture, operations are required to start the started tasks in a scheduler and/or after an IPL. There are certain commands that need to be executed by operations. The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provide detailed information on the operation of PowerExchange Change Data Capture.

Archive/Listener Log Maintenance

The archive log should be controlled by using the Retention Period specified in the EDMUPARM ARCHIVE_OPTIONS parameter ARCHIVE_RTPD=. The default supplied in the install (in RUNLIB member SETUPCC2) is 9999, which is generally longer than most organizations need. To change it, just rerun the first step (and only the first step) in SETUPCC2 after making the appropriate changes. Any new archive log datasets will be created with the new retention period. This does not, however, fix the old archive datasets; to do that, use SMS to override the specification, removing the need to change the EDMUPARM.

The listener's default log is part of the joblog of the running listener. If the listener job runs continuously, there is a potential risk of the spool file reaching the maximum and causing issues with the listener. If, for example, the listener started task is scheduled to restart every weekend, the log will be refreshed and a new spool file will be created. If necessary, change the started task listener jobs from
//DTLLOG DD SYSOUT=*
to
//DTLLOG DD DSN=&HLQ..LOG
This will log the file to the member LOG in the HLQ..RUNLIB.

Recovery After Failure

The last-resort recovery procedure is to re-execute your initial extraction and load, and restart the CDC process from the new initial load start point. Fortunately there are other solutions. In any case, if you do need every change, reinitializing may not be an option.

Application ID

PowerExchange documentation talks about “consuming” applications: the processes that extract changes, whether they are realtime or change (periodic batch extraction). Each “consuming” application must identify itself to PowerExchange. Realistically, this means that each session must have an application id parameter containing a unique “label”.

Restart Tokens

PowerExchange remembers each time that a consuming application successfully extracts changes. The end-point of the extraction (the address in the database log, an RBA or SCN) is stored in a file on the server hosting the Listener that reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is possible, using the Navigator interface directly, or by updating the restart file, to force the next extraction to restart from any of these points. If you are using the ODBC interface for PowerExchange, this is the best solution to implement.

If you are running periodic extractions of changes and everything finishes cleanly, the restart token history is a good approach to recovering back to a previous extraction: you simply choose the recovery point from the list and re-use it. There are more likely scenarios though. If you are running realtime extractions, potentially never-ending or until there is a failure, there are no end-points to memorize for restarts. If your batch extraction fails, you may already have processed and committed many changes. You cannot afford to “miss” any changes and you do not want to reapply the same changes you have just processed, but the previous restart token does not correspond to the reality of what you have processed.

If you are using the PowerExchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies with PowerCenter, which has historically been able to deal with restarting this type of process through Guaranteed Message Delivery. This functionality is applicable to both realtime and change CDC options. The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each Application Id in files on the PowerCenter Server. The directory and file name are required parameters when configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures compared to using the ODBC interface to PowerExchange.

To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the session properties. During normal session execution, the PowerCenter Server stores recovery information in cache files in the directory specified by $PMCacheDir.

Normal CDC Execution

If the session ends "cleanly" (i.e., zero return code), PowerCenter writes tokens to the restart file, and the GMD cache is purged. If the session fails, you are left with unprocessed changes in the GMD cache and a Restart Token corresponding to the point in time of the last of the unprocessed changes. This information is useful for recovery.

Recovery

If a CDC session fails, and it was executed with recovery enabled, you can restart it in recovery mode, either from the PowerCenter Client interfaces or using the pmcmd command line instruction. Obviously, this assumes that you are able to identify that the session failed previously. The recovery run then works as follows:

1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends “cleanly”. The CDC session is now ready for you to execute in normal mode again.
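Where the restart is scripted rather than run from the client tools, a pmcmd call along the following lines can be used. The domain, service, credential, folder, and workflow names here are hypothetical, and the exact flags vary by PowerCenter version, so treat this as a sketch and verify against pmcmd help.

#!/bin/sh
# Hypothetical domain/service/credential values -- adjust to the local environment.
pmcmd recoverworkflow -sv IS_CDC -d Domain_DI -u Administrator -p admin_password -f CDC_FOLDER wf_cdc_orders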

Recovery Using PWX ODBC Interface

You can, of course, successfully recover if you are using the ODBC connectivity to PowerCenter, but you have to build in some things yourself, coping with processing all the changes from the last restart token, even if you have already processed some of them. When you re-execute a failed CDC session, you receive all the changed data since the last PowerExchange restart token. Your session has to cope with processing some of the same changes you already processed at the start of the failed execution, either by using lookups/joins to the target to see if you have already applied the change you are processing, or by simply ignoring database error messages such as trying to delete a record you already deleted.

If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction and save the results, you can use the generated restart token to force a recovery at a more recent point in time than the last session-end restart token. This is especially useful if you are running realtime extractions using ODBC; otherwise you may find yourself re-processing several days of changes you have already processed.

Finally, you can always re-initialize the target and the CDC processing:

Take an image copy of the tablespace containing the table to be captured, with QUIESCE option.



Monitor the EDMMSG output from the PowerExchange Logger job.





Look for message DTLEDM172774I, which identifies the PowerExchange Logger sequence number corresponding to the QUIESCE event. The Logger output shows detail in the following format:

DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 000849C56185
EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E0000000000
Sequence number . . . . . . . . : 000000084E0000000000
Edition number . . . . . . . . . : B93C4F9C2A79B000
Source EDMNAME(s) . . . . . . . : DB2DSN1CAPTNAME1



Take note of the log sequence number



Repeat for all tables that form part of the same PowerExchange Application.





Run the DTLUAPPL utility specifying the application name and the registration name for each table in the application. Alter the SYSIN as follows:

MOD APPL REGDEMO DSN1                               (where REGDEMO is the Registration name from Navigator)
add RSTTKN CAPDEMO                                  (where CAPDEMO is the Capture name from Navigator)
SEQUENCE 000000084E0000000000000000084E0000000000
RESTART D5D3D3D34040000000084E0000000000
END APPL REGDEMO                                    (where REGDEMO is the Registration name from Navigator)



Note how the sequence number is a repeated string from the sequence number found in the Logger messages after the Copy/Quiesce.

Note that the Restart parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the QUIESCE was done above. The image copy obtained above can be used for the initial materialization of the target tables.

PowerExchange Tasks: MVS Start and Stop Command Summary

Listener
Start command: /S DTLLST
Stop commands: /F DTLLST,CLOSE (preferred method); /F DTLLST,CLOSE,FORCE if CLOSE does not work; /P DTLLST if FORCE does not work; /C DTLLST if STOP does not work
Description: The PowerExchange Listener is used for bulk data movement and for registering sources for Change Data Capture.

Agent
Start command: /S DTLA
Stop commands: /DTLA DRAIN and /DTLA SHUTDOWN
Notes: SHUTDOWN COMPLETELY can be used only at the request of Informatica Support.
Description: The PowerExchange Agent is used to manage connections to the PowerExchange Logger and to handle repository and other tasks. It must be started before the Logger.

Logger
Start command: /S DTLL (if you are installing, you need to run setup2 here prior to starting the Logger)
Stop commands: /P DTLL; /F DTLL,STOP. /F DTLL,DISPLAY displays Logger information.
Description: The PowerExchange Logger is used to manage the linear datasets and hiperspace that hold change capture data.

ECCR (DB2)
Start command: /S DTLDB2EC
Stop commands: /F DTLDB2EC,STOP or /F DTLDB2EC,QUIESCE or /P DTLDB2EC
Notes: The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. /F DTLDB2EC,DISPLAY will publish stats into the ECCR sysout. There must be registrations present prior to bringing up most adaptor ECCRs.

Condense
Start command: /S DTLC
Stop command: /F DTLC,SHUTDOWN
Description: The PowerExchange Condenser is used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.

Apply
Start command: Submit JCL or /S DTLAPP
Stop commands: (1) To identify all tasks running through a certain listener, issue F <listener>,D A. (2) Then, to stop the Apply, issue F DTLLST,STOPTASK <name>. If the CAPX access and apply is running locally, not through a listener, then issue F <apply task>,CLOSE.
Description: The PowerExchange Apply process is used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target.

Notes:
1. /P is an MVS STOP command; /F is an MVS MODIFY command.
2. Remove the / if the command is issued from the console rather than SDSF.

If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and that the Logger will come down after the ECCRs go away. You can shut the Listener and the ECCR(s) down at the same time.

The Listener:
1. F <listener>,CLOSE
2. If this is not coming down fast enough for you, issue F <listener>,CLOSE FORCE
3. If it still is not coming down fast enough, issue C <listener>
Note that these commands are listed in the order of most to least desirable method for bringing the Listener down.

The DB2 ECCR:
1. F <eccr>,QUIESCE - this waits for all open UOWs to finish, which can be a while if a long-running batch job is running.
2. F <eccr>,STOP - this terminates immediately.
3. P <eccr> - this also terminates immediately.

Once the ECCR(s) are down, you can then bring the Logger down.
The Logger: P <logger>
The Agent: <CMDPREFIX> SHUTDOWN

If you know that you are headed for an IPL, you can issue all these commands at the same time. The Listener and ECCR(s) should start down; if you are looking for speed, issue F <listener>,CLOSE FORCE to shut down the Listener, then issue F <eccr>,STOP to terminate the DB2 ECCR, then shut down the Logger and the Agent.

Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new file/DB2 table/IMS database is being updated during this shutdown process and the Agent is not available, the call to see if the source is registered returns a “Not being captured” answer. The update, therefore, occurs without you capturing it, leaving your target in a broken state (which you won't know about until too late!).

Sizing the Logger

When you install PWX-CHANGE, up to two active log data sets are allocated with minimum size requirements. The information in this section can help to determine if you need to increase the size of the data sets, and if you should allocate additional log data sets. When you define your active log data sets, consider your system’s capacity and your changed data requirements, including archiving and performance issues. After the PWX Logger is active, you can change the log data set configuration as necessary. In general, remember that you must balance the following variables:

Data set size




Number of data sets



Amount of archiving

The choices you make depend on the following factors:

Resource availability requirements



Performance requirements



Whether you are running near-realtime or batch replication



Data recovery requirements

An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger data sets need to be archived less often than smaller data sets.

Note: Although smaller data sets require more frequent archiving, the archiving process requires less time.

Use the following formulas to estimate the total space you need for each active log data set. For an example of the calculated data set size, refer to the PowerExchange Reference Guide.

active log data set size in bytes = (average size of captured change record * number of changes captured per hour * desired number of hours between archives) * (1 + overhead rate)



active log data set size in cylinders = active log data set size in tracks/number of tracks per cylinder



active log data set size in tracks = active log data set size in bytes/number of usable bytes per track

When determining the average size of your captured change records, note the following information:





● PWX Change Capture captures the full object that is changed. For example, if one field in an IMS segment has changed, the product captures the entire segment.
● The PWX header adds overhead to the size of the change record. Per record, the overhead is approximately 300 bytes plus the key length.
● The type of change transaction affects whether PWX Change Capture includes a before-image, after-image, or both:

DELETE includes a before-image.



INSERT includes an after-image.



UPDATE includes both.

Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:

Overhead for control information



Overhead for writing recovery-related information, such as system checkpoints.

You have some control over the frequency of system checkpoints when you define your PWX Logger parameters. See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter.

DASD Capacity Conversion Table

Space Information         Model 3390    Model 3380
usable bytes per track    49,152        40,960
tracks per cylinder       15            15

This example is based on the following assumptions:

estimated average size of a changed record = 600 bytes



estimated rate of captured changes = 40,000 changes per hour



desired number of hours between archives = 12



overhead rate = 5 percent



DASD model = 3390

The estimated size of each active log data set in bytes is calculated as follows:

600 * 40,000 * 12 * 1.05 = 302,400,000

The number of cylinders to allocate is calculated as follows:

302,400,000 / 49,152 = approximately 6152 tracks
6152 / 15 = approximately 410 cylinders

The following example shows an IDCAMS DEFINE statement that uses the above calculations:

DEFINE CLUSTER (NAME(HLQ.EDML.PRILOG.DS01) -
  LINEAR -
  VOLUMES(volser) -
  SHAREOPTIONS(2,3) -
  CYL(410)) -
DATA (NAME(HLQ.EDML.PRILOG.DS01.DATA))

The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.
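The sizing arithmetic above is easy to script when trying different assumptions. The helper below is purely illustrative (it is not an Informatica utility); all of the input values are the assumptions from the example and should be replaced with your own estimates and DASD geometry.

#!/bin/sh
# Sizing sketch for a PowerExchange Logger active log data set (model 3390 geometry assumed).
awk -v rec=600 -v chg=40000 -v hrs=12 -v ovhd=1.05 -v bpt=49152 -v tpc=15 'BEGIN {
  bytes  = rec * chg * hrs * ovhd      # avg record size * changes/hour * hours between archives * overhead
  tracks = bytes / bpt                 # usable bytes per track
  cyls   = tracks / tpc                # tracks per cylinder
  printf "bytes=%.0f  tracks=%.0f  cylinders=%.0f\n", bytes, tracks, cyls
}'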

Additional Logger Tips

The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the Logger does not use secondary allocation. This includes Candidate Volumes and Space, such as that allocated by SMS when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should be defined through IDCAMS with:

No secondary allocation.



A single VOLSER in the VOLUME parameter.



An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.

PowerExchange Agent Commands

You can use commands from the MVS system to control certain aspects of PowerExchange Agent processing. To issue a PowerExchange Agent command, enter the PowerExchange Agent command prefix (as specified by CmdPrefix in your configuration parameters), followed by the command. For example, if CmdPrefix=AG01, issue the following command to close the Agent's message log:

AG01 LOGCLOSE

The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in the agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any PowerExchange Agent commands that you issue. If the PowerExchange Agent has not been started during the current IPL, or if you issue the command with the wrong prefix, MVS generates the following message:

IEE305I command COMMAND INVALID

See the PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.

PowerExchange Logger Commands

The PowerExchange Logger uses two types of commands: interactive and batch. You run interactive commands from the MVS console when the PowerExchange Logger is running. You can use PowerExchange Logger interactive commands to:

Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections.



Resolve in-doubt UOWs.



Stop a PowerExchange Logger.



Print the contents of the PowerExchange active log file (in hexadecimal format).

You use batch commands primarily in batch change utility jobs to make changes to parameters and configurations when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:

Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger names, archive log options, buffer options, and mode (single or dual).



Add log definitions to the restart data set.



Delete data set records from the restart data set.



Display log data sets, UOWs, and reader/writer connections.

See PowerExchange Reference Guide (8.1.1) for detailed information on Logger Commands (Chapter 4, Page 59)

Last updated: 05-Jun-08 14:43


Data Integration Load Traceability

Challenge

Load management is one of the major difficulties facing a data integration or data warehouse operations team. This Best Practice tries to answer the following questions:

How can the team keep track of what has been loaded?



What order should the data be loaded in?



What happens when there is a load failure?



How can bad data be removed and replaced?



How can the source of data be identified?



When was it loaded?

Description

Load management provides an architecture to allow all of the above questions to be answered with minimal operational effort.

Benefits of a Load Management Architecture

Data Lineage

The term Data Lineage is used to describe the ability to track data from its final resting place in the target back to its original source. This requires the tagging of every row of data in the target with an ID from the load management metadata model. This serves as a direct link between the actual data in the target and the original source data. To give an example of the usefulness of this ID, a data warehouse or integration competency center operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back to see when it was loaded, where it came from, any other metadata about the set it was loaded with, validation check results, number of other rows loaded at the same time, and so forth.

It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time. This can be useful when a data issue is detected in one row and the operations team needs to see if the same error exists in all of the other rows. More than this, it is the ability to easily identify the source data for a specific row in the target, enabling the operations team to quickly identify where a data issue may lie. It is often assumed that data issues are produced by the transformation processes executed as part of the target schema load. Using the source ID to link back to the source data makes it easy to identify whether the issues were in the source data when it was first encountered by the target schema load processes or if those load processes caused the issue. This ability can save a huge amount of time, expense, and frustration -- particularly in the initial launch of any new subject areas.

Process Lineage

Tracking the order that data was actually processed in is often the key to resolving processing and data issues. Because choices are often made during the processing of data based on business rules and logic, the order and path of processing differs from one run to the next. Only by actually tracking these processes as they act upon the data can issue resolution be simplified.

Process Dependency Management

Having a metadata structure in place provides an environment to facilitate the application and maintenance of business dependency rules. Once a structure is in place that identifies every process, it becomes very simple to add the necessary metadata and validation processes required to ensure enforcement of the dependencies among processes. Such enforcement resolves many of the scheduling issues that operations teams typically face. Process dependency metadata needs to exist because it is often not possible to rely on the source systems to deliver the correct data at the correct time. Moreover, in some cases, transactions are split across multiple systems and must be loaded into the target schema in a specific order. This is usually difficult to manage because the various source systems have no way of coordinating the release of data to the target schema.

Robustness

Using load management metadata to control the loading process also offers two other big advantages, both of which fall under the heading of robustness because they allow for a degree of resilience to load failure.

Load Ordering

Load ordering is a set of processes that use the load management metadata to identify the order in which the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence it arrives, or as complex as having a pre-defined load sequence planned in the metadata.

There are a number of techniques used to manage these processes. The most common is an automated process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list after the load is complete. This process can use embedded data in file names or can read header records to identify the correct ordering of the data. Alternatively, the correct order can be pre-defined in the load management metadata using load calendars.

Either way, load ordering should be employed in any data integration or data warehousing implementation because it allows the load process to be automatically paused when there is a load failure, and ensures that the data that has been put on hold is loaded in the correct order as soon as possible after a failure. The essential part of the load management process is that it operates without human intervention, helping to make the system self-healing!
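A minimal sketch of such an automated process is shown below. It assumes a simple convention (flat files landing in an inbound directory, loaded oldest-first through a PowerCenter indirect file list); the directory names, file pattern, and list file name are all hypothetical and must be adapted to the local design.

#!/bin/sh
# Build a PowerCenter indirect (file list) source from waiting flat files, oldest first,
# then archive the processed files once the load has completed successfully.
INBOUND=/data/inbound                 # assumed landing directory
LIST=/data/srcfiles/sales_files.lst   # hypothetical indirect file read by the session
ARCHIVE=/data/archive

ls -1tr $INBOUND/*.dat > $LIST 2>/dev/null || exit 0   # nothing to load

# ... trigger the PowerCenter workflow that reads $LIST (scheduler or pmcmd) ...

# After a successful load, move the files named in the list to the archive area.
while read f; do
  mv "$f" "$ARCHIVE/"
done < "$LIST"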

Rollback

If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove all of the data loaded as one set. Load management metadata allows the operations team to selectively roll back a specific set of source data, the data processed by a specific process, or a combination of both. This can be done through manual intervention or by an automated feature developed for the purpose.

Simple Load Management Metadata Model

As you can see from the simple load management metadata model above, there are two sets of data linked to every transaction in the target tables. These represent the two major types of load management metadata:

Source tracking



Process tracking


Source Tracking

Source tracking looks at how the target schema validates and controls the loading of source data. The aim is to automate as much of the load processing as possible and track every load from the source through to the target schema.

Source Definitions

Most data integration projects use batch load operations for the majority of data loading. The sources for these come in a variety of forms, including flat file formats (ASCII, XML, etc.), relational databases, ERP systems, and legacy mainframe systems. The first control point for the target schema is to maintain a definition of how each source is structured, as well as other validation parameters. These definitions should be held in a Source Master table like the one shown in the data model above. These definitions can and should be used to validate that the structure of the source data has not changed. A great example of this practice is the use of DTD files in the validation of XML feeds. In the case of flat files, it is usual to hold details like:

Header information (if any)



How many columns



Data types for each column



Expected number of rows

For RDBMS sources, the Source Master record might hold the definition of the source tables or store the structure of the SQL statement used to extract the data (i.e., the SELECT, FROM and ORDER BY clauses). These definitions can be used to manage and understand the initial validation of the source data structures. Quite simply, if the system is validating the source against a definition, there is an inherent control point at which problem notifications and recovery processes can be implemented. It’s better to catch a bad data structure than to start loading bad data.

Source Instances

A Source Instance table (as shown in the load management metadata model) is designed to hold one record for each separate set of data of a specific source type being loaded. It should have a direct key link back to the Source Master table which defines its type.


The various source types may need slightly different source instance metadata to enable optimal control over each individual load. Unlike the source definitions, this metadata will change every time a new extract and load is performed. In the case of flat files, this would be a new file name and possibly date/time information from its header record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract, and the date and time it was executed. This metadata needs to be stored in the source tracking tables so that the operations team can identify a specific set of source data if the need arises. This need may arise if the data needs to be removed and reloaded after an error has been spotted in the target schema.

Process Tracking

Process tracking describes the use of load management metadata to track and control the loading processes rather than the specific data sets themselves. There can often be many load processes acting upon a single source instance set of data. While it is not always necessary to be able to identify when each individual process completes, it is very beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all sessions are tracked this way because, in most cases, the individual processes are simply storing data into temporary tables that will be flushed at a later date. Since load management process IDs are intended to track back from a record in the target schema to the process used to load it, it only makes sense to generate a new process ID if the data is being stored permanently in one of the major staging areas.

Process Definition

Process definition metadata is held in the Process Master table (as shown in the load management metadata model). This, in its basic form, holds a description of the process and its overall status. It can also be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as processing holidays.

Process Instances

A process instance is represented by an individual row in the load management metadata Process Instance table. This represents each instance of a load process that is actually run. This holds metadata about when the process started and stopped, as well as its current status. Most importantly, this table allocates a unique ID to each instance. The unique ID allocated in the process instance table is used to tag every row of source data. This ID is then stored with each row of data in the target table.

Integrating Source and Process Tracking


Integrating source and process tracking can produce an extremely powerful investigative and control tool for the administrators of data warehouses and integrated schemas. This is achieved by simply linking every process ID with the source instance ID of the source it is processing. This requires that a write-back facility be built into every process to update its process instance record with the ID of the source instance being processed. The effect is a one-to-many relationship between the Source Instance table and the Process Instance table, with several Process Instance rows for each set of source data loaded into a target schema. For example, in a data warehousing project, there might be a row for loading the extract into a staging area, a row for the move from the staging area to an ODS, and a final row for the move from the ODS to the warehouse.

Integrated Load Management Flow Diagram

Tracking Transactions

This is the simplest data to track since it is loaded incrementally and not updated. This means that the process and source tracking discussed earlier in this document can be applied as is.

Tracking Reference Data


This task is complicated by the fact that reference data, by its nature, is not static. This means that if you simply update the data in a row any time there is a change, there is no way that the change can be backed out using the load management practice described earlier. Instead, Informatica recommends always using slowly changing dimension processing on every reference data and dimension table to accomplish source and process tracking. Updating the reference data as a ‘slowly changing table’ retains the previous versions of updated records, thus allowing any changes to be backed out.

Tracking Aggregations

Aggregation also causes additional complexity for load management because the resulting aggregate row very often contains the aggregation across many source data sets. As with reference data, this means that the aggregated row cannot be backed out in the same way as transactions. This problem is managed by treating the source of the aggregate as if it was an original source. This means that rather than trying to track the original source, the load management metadata only tracks back to the transactions in the target that have been aggregated. So, the mechanism is the same as used for transactions but the resulting load management metadata only tracks back from the aggregate to the fact table in the target schema.

Last updated: 20-Dec-07 15:44


Disaster Recovery Planning with PowerCenter HA Option

Challenge

Develop a Disaster Recovery (DR) Plan for PowerCenter running on Unix/Linux platforms. Design a PowerCenter data integration platform for high availability (HA) and disaster recovery that can support a variety of mission-critical and time-sensitive operational applications across multiple business and functional areas.

Description

To enable maximum resilience, the data integration platform design should provide redundancy and remoteness. The target architecture proposed in this document is based upon the following assumptions:

● A PowerCenter HA option license is present.
● A Cluster File System will be used to provide concurrent file access from multiple servers in order to provide a flexible, high-performance, and highly available platform for shared data in a SAN environment.
● Four servers will be available for installing PowerCenter components.
● PowerCenter binaries, repository/domain database, and shared file system for PowerCenter working files are considered in a failover scenario.

The DR plan does not take into consideration source and target databases, FTP servers or scheduling tools. A standby database server (which requires replicated logs for recovery) will be used as the disaster recovery solution for the database tier. It will provide disaster tolerance for both the PowerCenter repository and the domain database. As this server will be used to achieve high availability, it should have performance characteristics in parity with the primary repository database server. Recovery time for storage can be reduced using near real-time replication of data-over-distance from the primary SAN to a mirror SAN. Storage vendors should be consulted for optimal SAN and mirror SAN configuration.

Primary Data Center During Normal Operation


PowerCenter Domain During Normal Operation: the Informatica Service Manager on Node 1 and Node 2 is running; the Informatica Service Manager on Node 3 and Node 4 is shut down.

A node is a logical representation of a physical machine. Each node runs a Service Manager (SM) process to control the services running on that node. A node is considered unavailable if the SM process is not up and running. For example, the SM process may not be running if the administrator has shut down the machine or the SM process. SM processes periodically exchange a heartbeat signal amongst themselves to detect any node/network failure. Upon detecting a primary (or backup) node failure, the remaining nodes determine the new primary (or backup) node via a distributed voting algorithm. Typically, the administrator will configure the OS to automatically start the SM whenever the OS boots up or in the event the SM fails unexpectedly. For unexpected failures of the SM, monitoring scripts should be used because the SM is the primary point of control for PowerCenter services on a node (a sample watchdog script is sketched later in this section).

When PowerCenter is installed on a Unix/Linux platform, the same user id (uid) and group id (gid) should be created for all Unix/Linux users on Node1, Node2, Node3 and Node4. When the infa_shared directory is placed on a shared file system like CFS, all Unix/Linux users should be granted read/write access to the same files. For example, if a workflow running on Node1 creates a log file in the log directory, Node2, Node3 and Node4 should be able to read and update this file.

To install and configure PowerCenter services on four nodes:

1. For the Node1 installation, choose the option to “create domain”.
2. For the Node2, Node3 and Node4 installations, choose the option to “join the domain”.
3. Node1 will be the master gateway. For Node2, Node3 and Node4 choose “Serves as Gateway: Yes”.
4. For Node1, use the following URL to confirm that it is the Master Gateway:

http://node1_hostname:6001/coreservices/DomainService

The result should look like this:

/coreservices/AlertService : enabled
/coreservices/AuthenticationService : initialized
/coreservices/AuthorizationService : enabled
/coreservices/DomainConfigurationService : enabled
/coreservices/DomainService : [DOM_10004] Domain service is currently master gateway node and enabled.
/coreservices/DomainService/InitTime : Fri Aug 03 09:59:03 EDT 2007
/coreservices/LicensingService : enabled
/coreservices/LogService : enabled
/coreservices/LogServiceAgent : initialized
/coreservices/NodeConfigurationService : enabled

5. For Node2, Node3 and Node4 respectively, use the following URL to confirm that they are not Master Gateways:

http://node2_hostname:6001/coreservices/DomainService

The result should look like this:

/coreservices/AlertService : uninitialized
/coreservices/AuthenticationService : initialized
/coreservices/AuthorizationService : initialized
/coreservices/DomainConfigurationService : initialized
/coreservices/DomainService : [DOM_10005] Domain service is currently non-master gateway node and listening.
/coreservices/DomainService/InitTime : Fri Aug 03 09:59:03 EDT 2007
/coreservices/LicensingService : initialized
/coreservices/LogService : initialized
/coreservices/LogServiceAgent : initialized
/coreservices/NodeConfigurationService : enabled

6. Confirm the following settings:

a. For Node1, the Repository Service should be created as primary.
b. For Node1, “Acts as backup Integration Service” should be checked.
c. For Node2, the Integration Service should be created as primary.
d. For Node2, “Acts as backup Repository Service” should be checked.
e. Node3 and Node4 should be assigned as backup nodes for the Repository Service and Integration Service.

Note: During failover, in order for Node3 and Node4 to act as primary repository services, they will need to have access to the standby repository database.
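The URL checks in steps 4 and 5 can also be scripted from a shell prompt so that all four nodes are verified in one pass. This is only a sketch: the hostnames follow the placeholders above, and the grep pattern simply looks for the master/non-master wording shown in the sample output.

#!/bin/sh
# Quick check of which node currently reports the master gateway role.
for node in node1_hostname node2_hostname node3_hostname node4_hostname; do
  echo "== $node =="
  curl -s "http://$node:6001/coreservices/DomainService" | grep -i "gateway node"
done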


After the installation, persistent cache files, parameter files, logs, and other run-time files should be configured to use the directory created on the shared file system by pointing the $PMRootDir variable to this directory. Otherwise, a symbolic link can be created from the default infa_shared location to the infa_shared directory created on the shared file system. After the initial set up, Node3 and Node4 should be shut down from the Administration Console. During normal operations Node3 and Node4 will be unavailable.

In the event of a failover to the secondary data center, it is assumed that the servers for Node1 and Node2 will become unavailable. By rebooting the hosts for Node3 and Node4, the following script, placed in init.d, will start the Service Manager process:

TOMCAT_HOME=/u01/app/informatica/pc8.0.0/server/tomcat/bin
case "$1" in
'start')
    # Start the PowerCenter daemons:
    su - pmuser -c "$TOMCAT_HOME/infaservice.sh startup"
    exit
    ;;
'stop')
    # Stop the PowerCenter daemons:
    su - pmuser -c "$TOMCAT_HOME/infaservice.sh shutdown"
    exit
    ;;
esac

Every node in the domain sends a heartbeat to the primary gateway at a periodic interval. The default value for this interval is 15 seconds (this may change in a future release). The heartbeat is a message sent over an HTTP connection. As part of the heartbeat, each node also updates the gateway with the service processes currently running on the node. If a node fails to send a heartbeat during the default timeout value, which is a multiple of the heartbeat interval (the default value is 90 seconds), then the primary gateway node marks the node unavailable and fails over any of the services running on that node. Six chances are given for the node to update the master before it is marked as down. This avoids any false alarms for a single packet loss or in cases of heavy network load where packet delivery could take longer.

When Node3 and Node4 are started in the backup data center, they will try to establish a connection to the Master Gateway Node1. After failing to reach Node1, one of them will establish itself as the new Master Gateway. When normal operations resume, Node1 and Node2 will be rebooted and the Informatica Service Manager process will start on these nodes. Since the Informatica Service Manager process on Node3 and Node4 will be shut down, Node1 will try to become the Master Gateway.

The change in configuration required for the DR servers (there will be two servers, as in production) can be set up as a script to automate the switchover to DR. For example, the database connectivity should be configured such that failover to the standby database is transparent to the PowerCenter repository and the domain database. All database connectivity information should be identical in both data centers to make sure that the same source and target databases are used. For scheduling tools, FTP servers and message queues, additional steps are required to switch to the ETL platform in the backup data center.

As a result of using the PowerCenter HA option, redundancy in the primary data center is achieved. By using SAN mirroring, a standby repository database, and PowerCenter installations at the backup data center, remoteness is achieved. A further scale-out approach is recommended using the PowerCenter grid option to leverage resources on all of the servers. A single cluster file system across all nodes is essential to coordinate read/write access to the storage pool, ensure data integrity, and attain performance.
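For the monitoring scripts mentioned earlier, a simple cron-driven watchdog is often enough to restart the Service Manager if it dies unexpectedly. The sketch below is illustrative only: the install path, service user, log file, and process-matching pattern are assumptions that must be adapted to the local environment (verify what the Service Manager process actually looks like in ps on your platform before relying on the pattern).

#!/bin/sh
# Watchdog: restart the Informatica Service Manager if its process is no longer running.
INFA_HOME=/u01/app/informatica/pc8.0.0      # assumed install path (matches the sample above)
SM_USER=pmuser                              # assumed service account

if ! ps -ef | grep "[j]ava" | grep -q "$INFA_HOME/server/tomcat"; then
  echo "`date`: Service Manager not found, restarting" >> /var/log/infa_watchdog.log
  su - $SM_USER -c "$INFA_HOME/server/tomcat/bin/infaservice.sh startup"
fi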

Backup Data Center After Failover From Primary Data Center


PowerCenter Domain During DR Operation: the Informatica Service Manager on Node 3 and Node 4 is running; the Informatica Service Manager on Node 1 and Node 2 is shut down.

Last updated: 04-Dec-07 18:00


High Availability

Challenge

Increasingly, a number of customers find that their Data Integration implementation must be available 24x7 without interruption or failure. This Best Practice describes the High Availability (HA) capabilities incorporated in PowerCenter and explains why it is critical to address both architectural (e.g., systems, hardware, firmware) and procedural (e.g., application design, code implementation, session/workflow features) recovery with HA.

Description

One of the common requirements of high volume data environments with non-stop operations is to minimize the risk exposure from system failures. PowerCenter’s High Availability Option provides failover, recovery and resilience for business critical, always-on data integration processes. When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:

External Resilience

External resilience has to do with the integration and specification of domain name servers, database servers, FTP servers, and network access servers in a defined, tested 24x7 configuration. The nature of Informatica's data integration setup places it at many interface points in system integration. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:

● Is the pre-existing set of servers already in a sustained HA configuration? Is there a schematic with applicable settings to use for reference? If so, is there a unit test or system test to exercise before installing PowerCenter products? It is important to remember that, as a prerequisite for the PowerCenter architecture, the external systems must be HA.
● What are the bottlenecks or perceived failure points of the existing system? Are these bottlenecks likely to be exposed or heightened by placing PowerCenter in the infrastructure (e.g., five times the amount of Oracle traffic, ten times the amount of DB2 traffic, a UNIX server that always shows 10% idle may now have twice as many processes running)?
● Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for Windows) has been implemented with success at a customer site, this sets a different expectation. The customer may merely want the grid capability of multiple PowerCenter nodes to recover Informatica tasks, and expect their O/S-level HA capabilities to provide file system or server bootstrap recovery upon a fundamental failure of those back-end systems. If these back-end systems have a script/command capability to, for example, restart a repository service, PowerCenter can be installed in this fashion. However, PowerCenter's HA capability extends only as far as the PowerCenter components.


Internal Resilience

In an HA PowerCenter environment, key elements to keep in mind are:

● Rapid and constant connectivity to the repository metadata.
● Rapid and constant network connectivity between all gateway and worker nodes in the PowerCenter domain.
● A common highly-available storage system accessible to all PowerCenter domain nodes with one service name and one file protocol. Only domain nodes on the same operating system can share gateway and log files (see Admin Console->Domain->Properties->Log and Gateway Configuration).

Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal resilience can be configured at the following levels:

● Domain. Configure service connection resilience at the domain level in the general properties for the domain. The domain resilience timeout determines how long services attempt to connect as clients to application services or the Service Manager. The domain resilience properties are the default values for all services in the domain.
● Service. It is possible to configure service connection resilience in the advanced properties for an application service. When configuring connection resilience for an application service, this overrides the resilience values from the domain settings.
● Gateway. The master gateway node maintains a connection to the domain configuration database. If the domain configuration database becomes unavailable, the master gateway node tries to reconnect. The resilience timeout period depends on user activity and whether the domain has one or multiple gateway nodes:
  ❍ Single gateway node. If the domain has one gateway node, the gateway node tries to reconnect until a user or service tries to perform a domain operation. When a user tries to perform a domain operation, the master gateway node shuts down.
  ❍ Multiple gateway nodes. If the domain has multiple gateway nodes and the master gateway node cannot reconnect, then the master gateway node shuts down. If a user tries to perform a domain operation while the master gateway node is trying to connect, the master gateway node shuts down. If another gateway node is available, the domain elects a new master gateway node. The domain tries to connect to the domain configuration database with each gateway node. If none of the gateway nodes can connect, the domain shuts down and all domain operations fail.

Common Elements of Concern in an HA Configuration

Restart and Failover

Restart and failover have to do with the domain services (Integration and Repository). If these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP, etc.) and


artifacts of the ETL process cannot be highly available. If a service process becomes unavailable, the Service Manager can restart the process or fail it over to a backup node based on the availability of the node. When a service process restarts or fails over, the service restores the state of operation and begins recovery from the point of interruption. Backup nodes can be configured for services with the high availability option. If an application service is configured to run on primary and backup nodes, one service process can run at a time. The following situations describe restart and failover for an application service:

● If the primary node running the service process becomes unavailable, the service fails over to a backup node. The primary node may be unavailable if it shuts down or if the connection to the node becomes unavailable.
● If the primary node running the service process is available, the domain tries to restart the process based on the restart options configured in the domain properties. If the process does not restart, the Service Manager can mark the process as failed. The service then fails over to a backup node and starts another process. If the Service Manager marks the process as failed, the administrator must enable the process after addressing any configuration problem.

If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. The service process can be disabled on the backup node to cause it to fail back to the primary node.

Recovery

Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption. The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:

● Service Manager. The Service Manager for each node in the domain maintains the state of service processes running on that node. If the master gateway shuts down, the newly elected master gateway collects the state information from each node to restore the state of the domain.
● Repository Service. The Repository Service maintains the state of operation in the repository. This includes information about repository locks, requests in progress, and connected clients.
● Integration Service. The Integration Service maintains the state of operation in the shared storage configured for the service. This includes information about scheduled, running, and completed tasks for the service. The Integration Service maintains the session and workflow state of operations based on the recovery strategy configured for the session and workflow.


When designing a system that has HA recovery as a core component, be sure to include architectural and procedural recovery. Architectural recovery for a PowerCenter domain involves the Service Manager, Repository Service, and Integration Service restarting in a complete, sustainable, and traceable manner. If the Service Manager and Repository Service recover but the Integration Service cannot, the restart is not successful and has little value to a production environment. Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:

● A PowerCenter domain cannot be established without at least one gateway node running. Even if a domain consists of ten worker nodes and one gateway node, none of the worker nodes can run ETL jobs without a gateway node managing the domain.
● An Integration Service cannot run without its associated Repository Service being started and connected to its metadata repository.
● A Repository Service cannot run without its metadata repository DBMS being started and accepting database connections. Often database connections are established on periodic windows that expire, which puts the repository offline.
● If the installed domain configuration is running from Authentication Module Configuration and the LDAP Principal User account becomes corrupt or inactive, all PowerCenter repository access is lost. If the installation uses any additional authentication outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.

Procedural recovery is supported with many features of PowerCenter. Consider the following very simple mapping that might run in production for many ETL applications:

Suppose there is a situation where the FTP server sending this ff_customer file is inconsistent. Many times the file is not there, but the processes depending on this must always run. The process is always insert only. You do not want the succession of ETL processes that follows this small process to fail; they can run to customer_stg with current records only. The following setting in the Workflow Manager (Session, Properties) would fit this need:


Since it is not critical that the ff_customer records run each time, record the failure but continue the process. Now say the situation has changed: sessions are failing on a PowerCenter server due to target database timeouts, and a requirement is given that the session must recover from this:


Resuming from the last checkpoint restarts the process from its prior commit, so no ETL work is lost. To finish this second case, consider three basic items on the workflow side when the HA option is implemented:


An Integration Service in an HA environment can only recover those workflows marked with “Enable HA recovery”. For all critical workflows, this should be considered. For a mature set of ETL code running in QA or Production, consider the following workflow property:


This would automatically recover tasks from where they failed in a workflow upon an application or system-wide failure. Consider carefully the use of this feature, however. Remember, automated restart of critical ETL processes without interaction can have vast unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to different objects than the original intent. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature.

In an HA environment, certain components of the domain can go offline while the domain stays up to execute ETL jobs. This is a time to use the "Suspend On Error" feature from the General tab of the workflow settings. The backup Integration Service would then pick up this workflow and resume processing based on the resume settings of this workflow:


Features

A variety of HA features exist in PowerCenter. Specifically, they include:

● Integration Service HA option
● PowerCenter Enterprise Grid option
● Repository Service HA option

First, proceed from an assumption that nodes have been provided such that a basic HA configuration of PowerCenter can take place. A lab-tested version completed by Informatica is configured as below with an HP solution. Your solution can be completed with any reliable clustered file system. Your first step would always be implementing and thoroughly exercising a


clustered file system:

Now, let’s address the options in order:

Integration Service HA Option You must have the HA option on the license key for this to be available on install. Note that once the base PowerCenter install is configured, all nodes are available from the Admin Console>Domain->Integration Services->Grid/Node Assignments. From the above example, you would see Node 1, Node 2, Node 3 as dropdown options on that browse page. With the HA (Primary/Backup) install complete, Integration Services are then displayed with both “P” and “B” in a configuration, with the current operating node highlighted:


If a failure were to occur on this HA configuration, the Integration Service INT_SVCS_DEV would poll the Domain: Domain_Corp_RD for another Gateway Node, then assign INT_SVCS_DEV over to that Node, in this case Node_Corp_RD02. Then the "B" button would highlight, showing this Node as providing INT_SVCS_DEV.

A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent environment. The paths for Integration Service files must be specified for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If an Integration Service is configured to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When an Integration Service is enabled, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption. All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.

By default, the installation program creates a set of Integration Service directories in the server


\infa_shared directory. The shared location for these directories can be set by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern of this is $PMRootDir should be on the highly-available clustered file system mentioned above.

Integration Service Grid Option

The Grid option provides implicit HA since the Integration Service can be configured as active/active to provide redundancy. The Server Grid option should be included on the license key for this to be available upon install. In configuring the $PMRootDir files for the Integration Service, retain the methodology described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration, the log and directory paths should be on the clustered file system mentioned above. A grid must be created before it can be used in a PowerCenter domain. Be sure to remember these key points:

● PowerCenter supports nodes from heterogeneous operating systems, bit modes, and other characteristics within the same domain. However, if the nodes in a grid are heterogeneous, you can run Workflow on Grid.
● For the Session on Grid option, a homogeneous grid is required. A homogeneous grid is necessary for Session on Grid because a session may have a shared cache file and other objects that may not be compatible with all of the operating systems.

If you have a large volume of disparate hardware, it is certainly possible to make perhaps two grids centered on two different operating systems. In either case, the performance of your clustered file system is going to affect the performance of your server grid, and should be considered as part of your performance/maintenance strategy.

Repository Service HA Option

You must have the HA option on the license key for this to be available on install. There are two ways to include the Repository Service HA capability when configuring PowerCenter:

● The first is during install. When the Install Program prompts for your nodes to do a Repository install (after answering "Yes" to Create Repository), you can enter a second node where the Install Program can create and invoke the PowerCenter service and Repository Service for a backup repository node. Keep in mind that all of the database, OS, and server preparation steps referred to in the PowerCenter Installation and Configuration Guide still hold true for this backup node. When the install is complete, the Repository Service displays a "P"/"B" link similar to that illustrated above for the INT_SVCS_DEV example Integration Service.
● A second method for configuring Repository Service HA allows for measured, incremental implementation of HA from a tested base configuration. After ensuring that your initial Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and the DBMS repository containing the metadata are running and stable, you can add a


second node and make it the Repository Backup. Install the PowerCenter Service on this second server following the PowerCenter Installation and Configuration Guide. In particular, skip creating Repository Content or an Integration Service on the node. Following this, go to Admin Console->Domain and select "Create->Node". The server to contain this node should be of the exact same configuration/clustered file system/OS as the Primary Repository Service. The following dialog should appear:

Assign a logical name to the node to describe its place, and select “Create”. The node should now be running as part of your domain, but if it isn't, refer to the PowerCenter Command Line Reference with the infaservice and infacmd commands to ensure the node is running on the domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:


Click "OK" and the Repository Service is now configured in a Primary/Backup setup for the domain. To ensure the P/B setting, test the following elements of the configuration (a short verification sketch follows this list):

1. Be certain the same version of the DBMS client is installed on the server and can access the metadata.
2. Both nodes must be on the same clustered file system.
3. Log onto the OS for the Backup Repository Service and ping the Domain Master Gateway Node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5 seconds).
4. Take the Primary Repository Service Node offline and validate that the polling, failover, restart process takes place in a methodical, traceable manner for the Repository Service on the Domain. This should be clearly visible from the node logs on the Primary and Secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin Console->Repository->Logs.

Note: Remember that when a node is taken offline, you cannot access the Admin Console from that node.
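As referenced above, the OS-level ping in step 3 and the domain check in step 4 can be scripted. This is a minimal sketch only; the gateway host, domain, and service names are examples, and infacmd option spellings should be confirmed against the Command Line Reference for the installed release:

#!/bin/sh
# Run from the backup Repository Service host; host/domain/service names are examples.
GATEWAY_HOST=nodehost01
INFA_HOME=/u01/app/informatica/pc8.0.0/server

# Step 3: confirm the master gateway answers quickly at the OS level.
ping -c 3 $GATEWAY_HOST

# Step 4: confirm the domain and the Repository Service respond.
$INFA_HOME/bin/infacmd.sh ping -dn Domain_Corp_RD -sn REP_SVCS_DEV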

Using a Script to Monitor Informatica Services

A script can be used with the High Availability Option to check all the Informatica services in the domain as well as the domain itself. If any of the services are down, the script can bring them back up. To implement the HA Option using a script, the domain, repository, and Integration Service details need to be provided as input to the script, and the script needs to be scheduled to run at regular intervals. The script can be developed with eight functions (and one main function to check and bring up the services). The script can be reused in any environment by changing only the input section. Comments should be provided for each function to make them easy to understand. Below is a brief description of the eight functions (a minimal sketch of two of them follows the list):




print_msg: Called to print output to the I/O and also writes to the log file.



domain_service_lst: Accepts the list of services to be checked for in the domain.





check_service: Calls the service manager, repository, and Integration Service functions internally to check if they are up and running.

check_repo_service: Checks if the repository is up or down. If it is down, it calls another function to bring it up.

enable_repo_service: Called to enable the Repository Service.

check_int_service: Checks if the Integration Service is up or down. If it is down, it calls another function to bring it up.



enable_int_service: Called to enable the integration service.



disable_int_service: Called to disable the integration service.
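A minimal sketch of two of these functions is shown below, using infacmd to test and enable the Repository Service. The domain, service, and credential values are placeholders, and the exact infacmd options should be verified against the Command Line Reference for the installed release:

#!/bin/sh
# Input section: placeholder values; edit per environment.
INFA_HOME=/u01/app/informatica/pc8.0.0/server
DOMAIN=Domain_Corp_RD
REPO_SVC=REP_SVCS_DEV
INFA_USER=Administrator
INFA_PWD=changeme
LOG=/tmp/infa_monitor.log

print_msg() {
    # Write a timestamped message to the console and the log file.
    echo "`date '+%Y-%m-%d %H:%M:%S'` $1" | tee -a $LOG
}

check_repo_service() {
    # Ping the Repository Service; if the ping fails, try to enable it.
    $INFA_HOME/bin/infacmd.sh ping -dn $DOMAIN -sn $REPO_SVC > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        print_msg "$REPO_SVC is down - attempting to enable it"
        enable_repo_service
    else
        print_msg "$REPO_SVC is up"
    fi
}

enable_repo_service() {
    $INFA_HOME/bin/infacmd.sh EnableService -dn $DOMAIN -un $INFA_USER -pd $INFA_PWD -sn $REPO_SVC
}

check_repo_service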

Last updated: 25-May-08 19:00


Load Validation

Challenge

Knowing that all data for the current load cycle has loaded correctly is essential for effective data warehouse management. However, the need for load validation varies depending on the extent of error checking, data validation, and data cleansing functionalities inherent in your mappings. For large data integration projects with thousands of mappings, the task of reporting load statuses becomes overwhelming without a well-planned load validation process.

Description

Methods for validating the load process range from simple to complex. Use the following steps to plan a load validation process:

1. Determine what information you need for load validation (e.g., workflow names, session names, session start times, session completion times, successful rows and failed rows).
2. Determine the source of the information. All of this information is stored as metadata in the PowerCenter repository, but you must have a means of extracting it.
3. Determine how you want the information presented to you. Should the information be delivered in a report? Do you want it emailed to you? Do you want it available in a relational table so that history is easily preserved? Do you want it stored as a flat file?

Weigh all of these factors to find the correct solution for your project. Below are descriptions of five possible load validation solutions, ranging from fairly simple to increasingly complex:

1. Post-session Emails on Success or Failure

One practical application of the post-session email functionality is the situation in which a key business user waits for completion of a session to run a report. Email is configured to notify the user when the session was successful so that the report can be run. Another practical application is the situation in which a production support analyst needs to be notified immediately of any failures. Configure the session to send an email to the analyst upon failure. For round-the-clock support, a pager number that has the ability to receive email can be used in place of an email address. Post-session email is configured in the session, under the General tab and 'Session Commands'. A number of variables are available to simplify the text of the email (an example message using these variables follows the list):

%s Session name




%e Session status



%b Session start time



%c Session completion time



%i Session elapsed time



%l Total records loaded



%r Total records rejected



%t Target table details



%m Name of the mapping used in the session



%n Name of the folder containing the session



%d Name of the repository containing the session



%g Attach the session log to the message



%a Attach a file to the message
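For illustration, a notification body built from these variables might read as follows (the wording is only a suggestion, not a required format):

Session %s in folder %n finished with status %e.
Started: %b   Completed: %c   Elapsed: %i
Rows loaded: %l   Rows rejected: %r
Target details: %t
(Session log attached via %g)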

2. Other Workflow Manager Features

In addition to post-session email messages, there are other features available in the Workflow Manager to help validate loads. Control, Decision, Event, and Timer tasks are some of the features that can be used to place multiple controls on the behavior of loads. Another solution is to place conditions within links. Links are used to connect tasks within a workflow or worklet. Use the pre-defined or user-defined variables in the link conditions. In the example below, upon the 'Successful' completion of both sessions A and B, the PowerCenter Server executes session C (for example, a link condition such as $s_Session_A.Status = SUCCEEDED on each link into session C).

3. PowerCenter Reports (PCR)

The PowerCenter Reports (PCR) is a web-based business intelligence (BI) tool that is included with every PowerCenter license to provide visibility into metadata stored in the PowerCenter repository in a manner that is easy to comprehend and distribute. The PCR includes more than 130 pre-packaged metadata reports and dashboards delivered through Data Analyzer, Informatica's BI offering. These pre-packaged reports enable PowerCenter customers to extract extensive business and technical metadata through easy-to-read reports including:




Load statistics and operational metadata that enable load validation.



Table dependencies and impact analysis that enable change management.



PowerCenter object statistics to aid in development assistance.



Historical load statistics that enable planning for growth.

In addition to the 130 pre-packaged reports and dashboards that come standard with PCR, you can develop additional custom reports and dashboards that are based upon the PCR limited-use license that allows you to source reports from the PowerCenter repository. Examples of custom components that can be created include: ●

Repository-wide reports and/or dashboards with indicators of daily load success/failure.



Customized project-based dashboard with visual indicators of daily load success/failure.





Detailed daily load statistics report for each project that can be exported to Microsoft Excel or PDF.

Error handling reports that deliver error messages and source data for row-level errors that may have occurred during a load.

Below is an example of a custom dashboard that gives instant insight into the load validation across an entire repository through four custom indicators.


4. Query Informatica Metadata Exchange (MX) Views

Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter repository. The Repository Manager generates these views when you create or upgrade a repository. Almost any query can be put together to retrieve metadata related to the load execution from the repository. The MX view REP_SESS_LOG is a great place to start. This view is likely to contain all the information you need. The following sample query shows how to extract folder name, session name, session end time, successful rows, and session duration:

select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from rep_sess_log a
where session_timestamp = (select max(session_timestamp)
                           from rep_sess_log
                           where session_name = a.session_name)
order by subject_area, session_name

The sample output would look like this:


TIP Informatica strongly advises against querying directly from the repository tables. Because future versions of PowerCenter are likely to alter the underlying repository tables, PowerCenter supports queries from the unaltered MX views, not the repository tables.

5. Mapping Approach

A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or a flat file with desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information. The following graphic illustrates a sample mapping:

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute


minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations. Note: Unless you have acquired additional licensing, a customized metadata data mart cannot be a source for a PCR report. However, you can use a business intelligence tool of your choice instead.

Last updated: 06-Dec-07 15:10


Repository Administration

Challenge

Defining the role of the PowerCenter Administrator and understanding the tasks required to properly manage the domain and repository.

Description

The PowerCenter Administrator has many responsibilities. In addition to regularly backing up the domain and repository, truncating logs, and updating the database statistics, he or she also typically performs the following functions:

Determines metadata strategy



Installs/configures client/server software



Migrates development to test and production



Maintains PowerCenter servers



Upgrades software



Administers security and folder organization



Monitors and tunes environment

Note: The Administrator is also typically responsible for maintaining domain and repository passwords; changing them on a regular basis and keeping a record of them in a secure place.

Determine Metadata Strategy The PowerCenter Administrator is responsible for developing the structure and standard for metadata in the PowerCenter Repository. This includes developing naming conventions for all objects in the repository, creating a folder organization, and maintaining the repository. The Administrator is also responsible for modifying the metadata strategies to suit changing business needs or to fit the needs of a particular project. Such changes may include new folder names and/or a different security setup.

Install/Configure Client/Server Software


This responsibility includes installing and configuring the application servers in all applicable environments (e.g., development, QA, production, etc.). The Administrator must have a thorough understanding of the working environment, along with access to resources such as a Windows 2000/2003 or UNIX Admin and a DBA. The Administrator is also responsible for installing and configuring the client tools. Although end users can generally install the client software, the configuration of the client tool connections benefits from being consistent throughout the repository environment. The Administrator, therefore, needs to enforce this consistency in order to maintain an organized repository.

Migrate Development to Production When the time comes for content in the development environment to be moved to the test and production environments, it is the responsibility of the Administrator to schedule, track, and copy folder changes. Also, it is crucial to keep track of the changes that have taken place. It is the role of the Administrator to track these changes through a change control process. The Administrator should be the only individual able to physically move folders from one environment to another. If a versioned repository is used, the Administrator should set up labels and instruct the developers on the labels that they must apply to their repository objects (i.e., reusable transformations, mappings, workflows and sessions). This task also requires close communication with project staff to review the status of items of work to ensure, for example, that only tested or approved work is migrated.

Maintain PowerCenter Servers The Administrator must also be able to understand and troubleshoot the server environment. He or she should have a good understanding of PowerCenter’s Service Oriented Architecture and how the domain and application services interact with each other. The Administrator should also understand what the Integration Service does when a session is running and be able to identify those processes. Additionally, certain mappings may produce files in addition to the standard session and workflow logs. The Administrator should be familiar with these files and know how and where to maintain them.

Upgrade Software

If and when the time comes to upgrade software, the Administrator is responsible for overseeing the installation and upgrade process.


Security and Folder Administration Security administration consists of both the PowerCenter domain and repository. For the domain, it involves creating, maintaining, and updating all domain users and their associated rights and privileges to services and alerts. For the repository, it involves creating, maintaining, and updating all users within the repository, including creating and assigning groups based on new and changing projects and defining which folders are to be shared, and at what level. Folder administration involves creating and maintaining the security of all folders. The Administrator should be the only user with privileges to edit folder properties.

Monitor and Tune Environment

Proactively monitoring the domain and user activity helps ensure a healthy functioning PowerCenter environment. The Administrator should review user activity for the domain to verify that the appropriate rights and privileges have been applied. Reviewing domain activity also ensures correct CPU and license usage. The Administrator should have sole responsibility for implementing performance changes to the server environment. He or she should observe server performance throughout development so as to identify any bottlenecks in the system. In the production environment, the Repository Administrator should monitor the jobs and any growth (e.g., increases in data or throughput time) and communicate such change to appropriate staff to address bottlenecks, accommodate growth, and ensure that the required data is loaded within the prescribed load window.

Last updated: 06-Dec-07 15:10


Third Party Scheduler

Challenge

Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes the various levels at which a third-party scheduler can be integrated.

Description

Tasks such as getting server and session properties, session status, or starting or stopping a workflow or a task can be performed either through the Workflow Monitor or by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be integrated with PowerCenter at any of several levels. The level of integration depends on the complexity of the workflow/schedule and the skill sets of production support personnel. Many companies want to automate the scheduling process by using scripts or third-party schedulers. In some cases, they are using a standard scheduler and want to continue using it to drive the scheduling process. A third-party scheduler can start or stop a workflow or task, obtain session statistics, and get server details using pmcmd commands. pmcmd is a program used to communicate with the PowerCenter server.
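For reference, a typical pmcmd call that a scheduler might issue is sketched below. The host:port connection style matches the sample script later in this section; server, credential, folder, and workflow names are placeholders, and the exact options should be confirmed against the Command Line Reference for the PowerCenter version in use:

#!/bin/sh
# Placeholder server, credentials, folder, and workflow names.
PMSERVER=ah-hp9:4001

# Start a workflow and wait for it to complete.
pmcmd startworkflow -s $PMSERVER -u Administrator -p informat01 \
    -f FINDW_SRC_STG -wait wf_stg_tmp_product_xref_table

# Report the status returned to the scheduler.
echo "pmcmd returned $?"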

Third Party Scheduler Integration Levels In general, there are three levels of integration between a third-party scheduler and PowerCenter: Low, Medium, and High.

Low Level

Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter workflow. This process subsequently kicks off the rest of the tasks or sessions. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial workflow. In this level of integration, nearly all control lies with the PowerCenter scheduler. This type of integration is very simple to implement because the third-party scheduler


kicks off only one process. It is only used as a loophole to fulfil a corporate mandate on a standard scheduler. This type of integration also takes advantage of the robust functionality offered by the Workflow Monitor. Low-level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because Production Support personnel in many companies are only knowledgeable about the company’s standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.

Medium Level

With Medium-level integration, a third-party scheduler kicks off some, but not all, workflows or tasks. Within the tasks, many sessions may be defined with dependencies. PowerCenter controls the dependencies within the tasks. With this level of integration, control is shared between PowerCenter and the third-party scheduler, which requires more integration between the third-party scheduler and PowerCenter. Medium-level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not have in-depth knowledge about the tool, they may be unable to fix problems that arise, so the production support burden is shared between the Project Development team and the Production Support team.

High Level

With High-level integration, the third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible for controlling all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter. Production Support personnel may have limited knowledge of PowerCenter but must have thorough knowledge of the scheduling tool. Because Production Support personnel in many companies are knowledgeable only about the company's standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.


Sample Scheduler Script

There are many independent scheduling tools on the market. The following is an example of an AutoSys script that can be used to start tasks; it is included here simply as an illustration of how a scheduler can be implemented in the PowerCenter environment. This script can also capture the return codes and abort on error, returning a success or failure (with associated return codes) to the command line or the AutoSys GUI monitor.

#!/bin/sh
# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------
# jobstart/jobend are site-specific AutoSys wrapper functions.
. jobstart $0 $*

# set variables
ERR_DIR=/tmp
# A temporary file is created to store all the error information.
# The file format is TDDHHMISS.lst
curDayTime=`date +%d%H%M%S`
FName=T$curDayTime$$.lst

if [ $STEP -le 1 ]
then
    echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
    cd /dbvol03/vendor/informatica/pmserver/
    #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
    #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
    # The above lines need to be edited to include the name of the workflow or the
    # task that you are attempting to start.
fi

# Checking whether to abort the current process or not
RetVal=$?
echo "Status = $RetVal"
if [ $RetVal -ge 1 ]
then
    jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
    exit 1
fi
echo "Step 1: Successful"

jobend normal
exit 0

Last updated: 06-Dec-07 15:10


Updating Repository Statistics

Challenge

The PowerCenter repository has more than 170 tables, and most have one or more indexes to speed up queries. Most databases use column distribution statistics to determine which index to use to optimize performance. It can be important, especially in large or high-use repositories, to update these statistics regularly to avoid performance degradation.

Description

For PowerCenter, statistics are updated during copy, backup, or restore operations. In addition, the pmrep command has an option to update statistics that can be scheduled as part of a regularly-run script. For PowerCenter 6 and earlier, there are specific strategies for Oracle, Sybase, SQL Server, DB2, and Informix discussed below. Each example shows how to extract the information out of the PowerCenter repository and incorporate it into a custom stored procedure.

Features in PowerCenter version 7 and later

Copy, Backup and Restore Repositories

PowerCenter automatically identifies and updates all statistics of all repository tables and indexes when a repository is copied, backed up, or restored. If you follow a strategy of regular repository backups, the statistics will also be updated.

PMREP Command

PowerCenter also has a command line option to update statistics in the database, which allows the command to be put in a Windows batch file or UNIX shell script. The format of the command is:

pmrep updatestatistics {-s filelistfile}

The -s option allows you to skip tables whose statistics you do not want to update.

Example of Automating the Process

One approach to automating this is to use a UNIX shell script that runs the pmrep updatestatistics command, incorporated into a special workflow in PowerCenter and run on a scheduled basis. Note: the Workflow Manager supports command tasks as well as scheduling. Below is an example of the command line object, followed by a sketch of such a script.
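A minimal sketch of such a script is shown below. The repository name, domain, and credentials are placeholders, and the pmrep connect options vary by PowerCenter version (host/port in 7.x, domain name in 8.x), so confirm them against the Command Line Reference:

#!/bin/sh
# Placeholder connection values; edit per environment.
PMREP=/u01/app/informatica/pc8.0.0/server/bin/pmrep

# Connect to the repository, then refresh the distribution statistics.
$PMREP connect -r REP_SVCS_DEV -d Domain_Corp_RD -n Administrator -x changeme
$PMREP updatestatistics

# Return pmrep's status to the calling Command task or scheduler.
exit $?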


In addition, this workflow can be scheduled to run on a daily, weekly, or monthly basis. This allows the statistics to be updated regularly so that performance does not degrade.

Tuning Strategies for PowerCenter version 6 and earlier

The following are strategies for generating scripts to update distribution statistics. Note that all PowerCenter repository table and index names begin with "OPB_" or "REP_".

Oracle

Run the following queries:

select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'

select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'

This will produce output like:


'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;

Save the output to a file. Then, edit the file and remove all the headers (i.e., the lines that look like 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'). Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server

Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT

Save the output to a file, then edit the file, remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.


Sybase

Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT

Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.

Informix

Run the following query:

select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:

(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;

Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like: (constant) tabname (constant)). Run this as a SQL script. This updates statistics for the repository tables.


DB2

Run the following query:

select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;' from sysstat.tables where tabname like 'OPB_%'

This will produce output like:

runstats on table PARTH.OPB_ANALYZE_DEP and indexes all;
runstats on table PARTH.OPB_ATTR and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT and indexes all;

Save the output to a file. Run this as a SQL script to update statistics for the repository tables.

Last updated: 06-Dec-07 15:10


Determining Bottlenecks

Challenge

Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.

Description

The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist, using a process of elimination and investigating each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System

Best Practice Considerations

Use Thread Statistics to Identify Target, Source, and Mapping Bottlenecks

Use thread statistics to identify source, target, or mapping (transformation) bottlenecks. By default, an Integration Service uses one reader, one transformation, and one target thread to process a session. Within each session log, the following thread statistics are available:

● Run time – Amount of time the thread was running.
● Idle time – Amount of time the thread was idle due to other threads within the application or Integration Service. This value does not include time the thread is blocked due to the operating system.
● Busy – Percentage of the overall run time the thread is not idle. This percentage is calculated using the following formula:


(run time – idle time) / run time x 100

For example, a thread with 100 seconds of run time and 20 seconds of idle time is (100 – 20) / 100 x 100 = 80 percent busy.

By analyzing the thread statistics found in an Integration Service session log, it is possible to determine which thread is being used the most. If a transformation thread is 100 percent busy and there are additional resources (e.g., CPU cycles and memory) available on the Integration Service server, add a partition point in the segment. If the reader or writer thread is 100 percent busy, consider using string data types in source or target ports, since non-string ports require more processing.

Use the Swap Method to Test Changes in Isolation

Attempt to isolate performance problems by running test sessions. You should be able to compare the session's original performance with that of the tuned session. The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:

1. Make a temporary copy of the mapping, session and/or workflow that is to be tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.
3. Document the change made to the mapping, session and/or workflow and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.

Evaluating the Five Areas of Consideration

Target Bottlenecks

Relational Targets

The most common performance bottleneck occurs when the Integration Service writes to a target database. This type of bottleneck can easily be identified with the following


procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.

If session performance increases significantly when writing to a flat file, you have a write bottleneck. Consider performing the following tasks to improve performance:

Drop indexes and key constraints



Increase checkpoint intervals



Use bulk loading



Use external loading



Minimize deadlocks



Increase database network packet size



Optimize target databases

Flat file targets

If the session targets a flat file, you probably do not have a write bottleneck. If the session is writing to a SAN or a non-local file system, performance may be slower than writing to a local file system. If possible, a session can be optimized by writing to a flat file target local to the Integration Service. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives. If the SAN or non-local file system is significantly slower than the local file system, work with the appropriate network/storage group to determine if there are configuration issues within the SAN.

Source Bottlenecks

Relational sources

If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation


Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.

Using a Read Test Session

You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing any transformation logic from the mapping. Use the following steps to create a read test mapping:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.

Using a Database Query

You can also identify source bottlenecks by executing a read query directly against the source database. To do so, perform the following steps (a minimal sketch follows these steps):

● Copy the read query directly from the session log.
● Run the query against the source database with a query tool such as SQL*Plus.
● Measure the query execution time and the time it takes for the query to return the first row.
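As referenced above, a minimal way to time the query from a UNIX shell is sketched below. The connect string and query are placeholders; SQL*Plus reports elapsed time when timing is turned on:

#!/bin/sh
# Placeholder connect string; paste the read query from the session log where indicated.
sqlplus -s scott/tiger@SRCDB <<'EOF'
set timing on
-- Paste the full read query copied from the session log here, e.g.:
select * from customer_stg;
-- Re-running the same query wrapped in "where rownum = 1" gives a rough
-- first-row timing for comparison.
EOF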

If there is a long delay between the two time measurements, you have a source bottleneck. If your session reads from a relational source and is constrained by a source bottleneck, review the following suggestions for improving performance: ●

Optimize the query.



Create tempdb as in-memory database.




Use conditional filters.



Increase database network packet size.



Connect to Oracle databases using IPC protocol.

Flat file sources

If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the line sequential buffer length to a size large enough to hold approximately four to eight rows of data at a time (for flat files) may improve performance when reading flat file sources. Also, ensure the flat file source is local to the Integration Service.

Mapping Bottlenecks

If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping. Begin by adding a Filter transformation in the mapping immediately before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck.

You can also use the performance details to identify mapping bottlenecks: high Rowsinlookupcache and high Errorrows counters indicate mapping bottlenecks. Follow these steps to identify mapping bottlenecks:

Create a test mapping without transformations

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

Check for High Rowsinlookupcache counters

Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.


Check for High Errorrows counters

Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors. For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.

Session Bottlenecks

Session performance details can be used to flag other problem areas. Create performance details by selecting "Collect Performance Data" in the session properties before running the session. View the performance details through the Workflow Monitor as the session runs, or view the resulting file. The performance details provide counters about each source qualifier, target definition, and individual transformation within the mapping to help you understand session and mapping efficiency. To view the performance details during the session run:

Right-click the session in the Workflow Monitor.



Choose Properties.



Click the Properties tab in the details dialog box.

To view the resulting performance data file, look for the file session_name.perf in the same directory as the session log and open the file in any text editor. All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional counters indicating the efficiency of data moving into and out of buffers. Some transformations have counters specific to their functionality. When reading performance details, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage.

Low buffer input and buffer output counters


If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, Rank, and Joiner readfromdisk and writetodisk counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes. If the session performs incremental aggregation, the Aggregator_readtodisk and writetodisk counters display a number other than zero because the Integration Service reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. Evaluate the incremental Aggregator_readtodisk and writetodisk counters during the session. If the counters show any numbers other than zero during the session run, you can increase performance by tuning the index and data cache sizes.

Note: PowerCenter versions 6.x and above include the ability to assign memory allocation per object. In versions earlier than 6.x, aggregator, rank, and joiner caches were assigned at a global/session level.

For further details on eliminating session bottlenecks, refer to the Best Practices: Tuning Sessions for Better Performance and Tuning SQL Overrides and Environment for Better Performance.
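A quick way to scan the performance file for non-zero disk counters from the command line is sketched below; the session name and directory are placeholders, and the counter names follow the convention described above:

#!/bin/sh
# Placeholder session name; the .perf file sits in the same directory as the session log.
PERF_FILE=/u01/app/informatica/pc8.0.0/server/infa_shared/SessLogs/s_m_load_customer.perf

# List any readfromdisk/writetodisk counters so non-zero values stand out.
grep -iE "readfromdisk|writetodisk" $PERF_FILE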

System Bottlenecks

After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the Integration Service. The Integration Service uses system resources to process transformations, session execution, and the reading and writing of data. The Integration Service also uses system memory for other data tasks such as creating aggregator, joiner, rank, and lookup table caches. You can use system performance monitoring tools to monitor the amount of system resources the Server uses and identify system bottlenecks.

● Windows NT/2000. Use system tools such as the Performance and Processes tabs in the Task Manager to view CPU usage and total memory usage. You can also view more detailed performance information by using the Performance Monitor in the Administrative Tools on Windows.

● UNIX. Use the following system tools to monitor system performance and identify system bottlenecks:
❍ lsattr -E -l sys0 - To view current system settings
❍ iostat - To monitor loading operation for every disk attached to the database server
❍ vmstat or sar -w - To monitor disk swapping actions
❍ sar -u - To monitor CPU loading.

For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX Systems and Performance Tuning Windows 2000/2003 Systems.

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (Oracle) Challenge Database tuning can result in a tremendous improvement in loading performance. This Best Practice covers tips on tuning Oracle.

Description Performance Tuning Tools Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we’ve included only a short description of some of the major ones here.

V$ Views V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them. Keep in mind that querying these views impacts database performance; each query carries an immediate performance cost. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the ‘SELECT’ privilege, which allows a user to query individual V$ views, or the ‘SELECT ANY TABLE’ privilege, which allows the user to query all V$ views. Using the SELECT ANY TABLE option requires the ‘O7_DICTIONARY_ACCESSIBILITY’ parameter to be set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to SYS owned objects.
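For example (the grantee name perf_monitor is hypothetical, and the wait-event query is simply one illustration of the kind of information the V$ views expose):

GRANT SELECT ANY TABLE TO perf_monitor;   -- assumes O7_DICTIONARY_ACCESSIBILITY = TRUE

SELECT event, total_waits, time_waited
FROM v$system_event
ORDER BY time_waited DESC;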

Explain Plan Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them. Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated, copied to SQL*Plus or another SQL tool, and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested.
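For example, a quick way to check the plan for a suspect source qualifier or lookup query from SQL*Plus (the table and columns here are illustrative; DBMS_XPLAN is available in Oracle 9i and later, while earlier releases require querying PLAN_TABLE directly):

EXPLAIN PLAN FOR
SELECT c.customer_key, c.customer_name
FROM customer_dim c
WHERE c.customer_key = 12345;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);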

SQL Trace SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.

TKPROF The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.

UTLBSTAT & UTLESTAT Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying). ‘UTLESTAT’ ends the statistics collection process and generates an output file called ‘report.txt.’ This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.

Disk I/O Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes so that when queries run indexes and tables, they are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains.

Dynamic Sampling Dynamic sampling enables the server to improve performance by: ●

Estimating single-table predicate statistics where available statistics are missing or may lead to bad estimations.



Estimating statistics for tables and indexes with missing statistics.



Estimating statistics for tables and indexes with out of date statistics.

Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter, which accepts values from "0" (off) to "10" (aggressive sampling) with a default value of "2". At compile-time, Oracle determines if dynamic sampling can improve query performance. If so, it issues recursive statements to estimate the necessary statistics. Dynamic sampling can be beneficial when (an example follows the list below): ●

The sample time is small compared to the overall query execution time.



Dynamic sampling results in a better performing query.

The query can be executed multiple times.
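A minimal sketch of applying dynamic sampling, either for the whole session or for a single statement (the sampling level of 4, the table, and the column are illustrative, not recommendations):

ALTER SESSION SET OPTIMIZER_DYNAMIC_SAMPLING = 4;

SELECT /*+ DYNAMIC_SAMPLING(f 4) */ COUNT(*)
FROM customer_sales_fact f
WHERE load_date > SYSDATE - 1;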

Automatic SQL Tuning in Oracle Database 10g In its normal mode, the query optimizer needs to make decisions about execution plans in a very short time. As a result, it may not always be able to obtain enough information to make the best decision. Oracle 10g allows the optimizer to run in tuning mode, where it can gather additional information and make recommendations about how specific statements can be tuned further. This process may take several minutes for a single statement so it is intended to be used on high-load, resource-intensive statements. In tuning mode, the optimizer performs the following analysis: ●







● Statistics Analysis. The optimizer recommends the gathering of statistics on objects with missing or stale statistics. Additional statistics for these objects are stored in an SQL profile.

● SQL Profiling. The optimizer may be able to improve performance by gathering additional statistics and altering session-specific parameters such as the OPTIMIZER_MODE. If such improvements are possible, the information is stored in an SQL profile. If accepted, this information can then be used by the optimizer when running in normal mode. Unlike a stored outline, which fixes the execution plan, an SQL profile may still be of benefit when the contents of the table alter drastically. Even so, it's sensible to update profiles periodically. SQL profiling is not performed when the tuning optimizer is run in limited mode.

● Access Path Analysis. The optimizer investigates the effect of new or modified indexes on the access path. Because its index recommendations relate to a specific statement, where practical, it also suggests the use of the SQL Access Advisor to check the impact of these indexes on a representative SQL workload.

● SQL Structure Analysis. The optimizer suggests alternatives for SQL statements that contain structures that may affect performance. Be aware that implementing these suggestions requires human intervention to check their validity.
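A minimal sketch of invoking the tuning optimizer through the DBMS_SQLTUNE package (the task name and SQL text are illustrative, and appropriate advisor privileges are assumed):

DECLARE
  l_task_name VARCHAR2(30);
BEGIN
  l_task_name := DBMS_SQLTUNE.CREATE_TUNING_TASK(
                   sql_text  => 'SELECT * FROM customer_sales_fact WHERE customer_key = 12345',
                   task_name => 'tune_cust_sales');
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'tune_cust_sales');
END;
/

SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('tune_cust_sales') FROM dual;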

TIP The automatic SQL tuning features are accessible from Enterprise Manager on the "Advisor Central" page.

Useful Views Useful views related to automatic SQL tuning include: ●

DBA_ADVISOR_TASKS



DBA_ADVISOR_FINDINGS



DBA_ADVISOR_RECOMMENDATIONS



DBA_ADVISOR_RATIONALE



DBA_SQLTUNE_STATISTICS



DBA_SQLTUNE_BINDS



DBA_SQLTUNE_PLANS



DBA_SQLSET



DBA_SQLSET_BINDS



DBA_SQLSET_STATEMENTS



DBA_SQLSET_REFERENCES



DBA_SQL_PROFILES



V$SQL



V$SQLAREA



V$ACTIVE_SESSION_HISTORY

Memory and Processing Memory and processing configuration is performed in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and is not likely to ever exist.


TIP Changes made in the init.ora file take effect after a restart of the instance. Use svrmgr to issue the commands “shutdown” and “startup” (or “shutdown immediate” if necessary) to the instance. Note that svrmgr is no longer available as of Oracle 9i because Oracle is moving to a web-based Server Manager in Oracle 10g. If you are using Oracle 9i, install the Oracle client tools and log onto Oracle Enterprise Manager. Some other tools, like DBArtisan, expose the initialization parameters. The settings presented here are those used in a four-CPU AIX server running Oracle 7.3.4, set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We’ve also included the descriptions and documentation from Oracle for each setting to help DBAs of other (i.e., non-Oracle) systems determine what the commands do in the Oracle environment to facilitate setting their native database commands and settings in a similar fashion.

HASH_AREA_SIZE = 16777216 ●

Default value: 2 times the value of SORT_AREA_SIZE



Range of values: any integer







This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt). HASH_JOIN_ENABLED ❍





In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to true. In Oracle 8i and above hash_join_enabled=true is the default value

HASH_MULTIBLOCK_IO_COUNT
❍ Allows multiblock reads against the TEMP tablespace.
❍ It is advisable to set the NEXT extent size to a value greater than hash_multiblock_io_count to reduce disk I/O. This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces, except that this parameter applies only to multiblock access of segments of the TEMP tablespace.

Allows multiblock reads against the TEMP tablespace It is advisable to set the NEXT extentsize to greater than the value for hash_multiblock_io_count to reduce disk I/O This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces except this one applies only to multiblock access of segments of TEMP Tablespace




STAR_TRANSFORMATION_ENABLED
❍ Determines whether a cost-based query transformation will be applied to star queries.
❍ When set to TRUE, the optimizer will consider performing a cost-based query transformation on the n-way join table.

OPTIMIZER_INDEX_COST_ADJ
❍ Numeric parameter set between 0 and 1000 (default 1000).
❍ This parameter lets you tune the optimizer behavior for access path selection to be more or less index friendly.

Optimizer_percent_parallel=33 This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full-table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans. Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL. parallel_max_servers=40 ●

Used to enable parallel query.



Initially not set on Install.



Maximum number of query servers or parallel recovery processes for an instance.

Parallel_min_servers=8 ●

Used to enable parallel query.



Initially not set on Install.



Minimum number of query server processes for an instance. Also the number of query-server processes Oracle creates when the instance is started.


SORT_AREA_SIZE=8388608 ●

Default value: operating system-dependent



Minimum value: the value equivalent to two database blocks







This parameter specifies the maximum amount, in bytes, of program global area (PGA) memory to use for a sort. After the sort is complete, and all that remains to do is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system. Increasing SORT_AREA_SIZE size improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time. The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.

Automatic Shared Memory Management in Oracle 10g Automatic Shared Memory Management puts Oracle in control of allocating memory within the SGA. The SGA_TARGET parameter sets the amount of memory available to the SGA. This parameter can be altered dynamically up to a maximum of the SGA_MAX_SIZE parameter value. Provided the STATISTICS_LEVEL is set to TYPICAL or ALL, and the SGA_TARGET is set to a value other than "0", Oracle will control the memory pools that would otherwise be controlled by the following parameters: ●

DB_CACHE_SIZE (default block size)



SHARED_POOL_SIZE



LARGE_POOL_SIZE



JAVA_POOL_SIZE

If these parameters are set to a non-zero value, they represent the minimum size for the pool. These minimum values may be necessary if you experience application errors when certain pool sizes drop below a specific threshold. The following parameters must be set manually and take memory from the quota allocated by the SGA_TARGET parameter (an example follows the list below): ●

DB_KEEP_CACHE_SIZE



DB_RECYCLE_CACHE_SIZE



DB_nK_CACHE_SIZE (non-default block size)



STREAMS_POOL_SIZE



LOG_BUFFER
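A minimal sketch of enabling Automatic Shared Memory Management (the sizes shown are placeholders, not recommendations; SGA_MAX_SIZE is static and requires an instance restart to take effect):

ALTER SYSTEM SET SGA_MAX_SIZE = 2G SCOPE = SPFILE;
ALTER SYSTEM SET SGA_TARGET = 1600M SCOPE = BOTH;
ALTER SYSTEM SET SHARED_POOL_SIZE = 256M;   -- optional floor for the shared pool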

IPC as an Alternative to TCP/IP on UNIX

On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000 row write (array inserts), and primary key with unique index in place. Performance went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = armafix)(PORT = 1526))
    )
    (CONNECT_DATA = (SID = DW))
  )

Make a new entry in the tnsnames like this, and use it for connection to the local Oracle instance:

DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = IPC)(KEY = DW))
    (CONNECT_DATA = (SID = DW))
  )

Improving Data Load Performance

Alternative to Dropping and Reloading Indexes

Experts often recommend dropping and reloading indexes during very large loads to a data warehouse but there is no easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it, can be a very tedious process. Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table, then generate SQL statements as output to disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME ||
       ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;


ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys also speeds loads. Run the results of this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME ||
       ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME ||
       ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like ‘DISABLE.SQL’. To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE’. Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command. Re-enable constraints in the reverse order that you disabled them: re-enable the unique constraints first, and re-enable primary keys before foreign keys.

TIP Dropping or disabling foreign keys often boosts loading, but also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.

Optimizing Query Performance Oracle Bitmap Indexing With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but is not much help for low cardinality/ highly-duplicated data and may even increase query time. A typical example of a low cardinality field is gender – it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance.


Keep in mind however, that b-tree indexing is still the Oracle default. If you don’t specify an index type when creating an index, Oracle defaults to b-tree. Also note that for certain columns, bitmaps are likely to be smaller and faster to create than a b-tree index on the same column. Bitmap indexes are suited to data warehousing because of their performance, size, and ability to create and drop very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bit-map indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and recreate or re-enable them after the load. The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created that accesses the Fact table first followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’ access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora file and if there are single column bitmapped indexes on the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is identical.

Bitmap Indexes

drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);

B-tree Indexes

drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);

Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word ‘BITMAP’ in the Uniqueness column rather than the word ‘UNIQUE.’ Bitmap indexes cannot be unique.

To enable bitmap indexes, you must set the following items in the instance initialization file: ●

compatible = 7.3.2.0.0 # or higher



event = "10111 trace name context forever"



event = "10112 trace name context forever"



event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error appears in the SQL statement; the keyword ‘bitmap’ won't be recognized.

TIP To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears in the banner text.

Index Statistics Table method Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as normal DBA procedures. The following should improve query results on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse: The following SQL statement can be used to analyze the tables in the database:


SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following result:

ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

Schema method

Another way to update index statistics is to compute indexes by schema rather than by table. If data warehouse indexes are the only indexes located in a single schema, you can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');


In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command.

TIP These SQL statements can be very resource intensive, especially for very large tables. For this reason, Informatica recommends running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.

Parallelism Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree.

SQL Level Parallelism Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ …;
SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …;

TIP When using a table alias in the SQL statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message.

Example of improper use of an alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME FROM EMP A

Here, the parallel hint will not be used because of the alias “A” for table EMP. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A

Table Level Parallelism Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table’s degree of parallelism to four for all eligible SQL statements on this table: ALTER TABLE order_fact PARALLEL 4; Ensure that Oracle is not contending with other processes for these resources or you may end up with degraded performance due to resource contention.

Additional Tips

Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX

You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @ script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named ‘infadb’), you would execute the following as a post-session command:

sqlplus -s user_id/password@infadb @ enable.sql

In some environments, this may be a security issue since both username and password are hard-coded and unencrypted. To avoid this, use the operating system’s authentication to log onto the database instance. In the following example, the Informatica id “pmuser” is used to log onto the Oracle database. Create the Oracle user “pmuser” with the following SQL statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .

In the following pre-session command, “pmuser” (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

You may want to use the init.ora parameter “os_authent_prefix” to distinguish between “normal” Oracle users and “externally-identified” ones.

DRIVING_SITE ‘Hint’

If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance. For example, you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the ‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the SQL statement as:

SELECT /*+ DRIVING_SITE */ …;
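A slightly fuller sketch of the hint (the remote table, database link, and column names are illustrative; DRIVING_SITE normally names the table or alias whose site should drive the join):

SELECT /*+ DRIVING_SITE(s) */ s.order_id, c.customer_name
FROM orders@source_db s,
     customer_dim c
WHERE s.customer_key = c.customer_key;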

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (SQL Server) Challenge Database tuning can result in tremendous improvement in loading performance. This Best Practice offers tips on tuning SQL Server.

Description Proper tuning of the source and target database is a very important consideration in the scalability and usability of a business data integration environment. Managing performance on an SQL Server involves the following points. ●

Manage system memory usage (RAM caching).



Create and maintain good indexes.



Partition large data sets and indexes.



Monitor disk I/O subsystem performance.



Tune applications and queries.



Optimize active data.

Taking advantage of grid computing is another option for improving the overall SQL Server performance. To set up a SQL Server cluster environment, you need to set up a cluster where the databases are split among the nodes. This provides the ability to distribute the load across multiple nodes. To achieve high performance, Informatica recommends using a fibre-attached SAN device for shared storage.

Manage RAM Caching Managing RAM buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same information from disk. If database I/O can be reduced to the minimal required set of data and index pages, the pages stay in RAM longer. Too much unnecessary data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is used effectively.


Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage: ●



Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.

SQL Server allows several selectable models for database recovery; these include: ❍

Full Recovery



Bulk-Logged Recovery



Simple Recovery

Create and Maintain Good Indexes Creating and maintaining good indexes is key to maintaining minimal I/O for all database queries.

Partition Large Data Sets and Indexes To reduce overall I/O contention and improve parallel operations, consider partitioning table data and indexes. Multiple techniques for achieving and managing partitions using SQL Server 2000 are addressed in this document.

Tune Applications and Queries Tuning applications and queries is especially important when a database server is likely to be servicing requests from hundreds or thousands of connections through a given application. Because applications typically determine the SQL queries that are executed on a database server, it is very important for application developers to understand SQL Server architectural basics and know how to take full advantage of SQL Server indexes to minimize I/O.

Partitioning for Performance The simplest technique for creating disk I/O parallelism is to use hardware partitioning and create a single "pool of drives" that serves all SQL Server database files except transaction log files, which should always be stored on physically-separate disk drives dedicated to log files. (See Microsoft documentation for installation procedures.)


Objects For Partitioning Consideration The following areas of SQL Server activity can be separated across different hard drives, RAID controllers, and PCI channels (or combinations of the three): ●

Transaction logs



Tempdb



Database



Tables



Nonclustered Indexes

Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned views that enable the creation of federated databases (commonly referred to as scale-out), which spread resource load and I/O activity across multiple servers. Federated databases are appropriate for some high-end online transaction processing (OLTP) applications, but this approach is not recommended for addressing the needs of a data warehouse.

Segregating the Transaction Log Transaction log files should be maintained on a storage device that is physically separate from devices that contain data files. Depending on your database recovery model setting, most update activity generates both data device activity and log activity. If both are set up to share the same device, the operations to be performed compete for the same limited resources. Most installations benefit from separating these competing I/O activities.

Segregating tempdb SQL Server creates a database, tempdb, on every server instance to be used by the server as a shared working area for various activities, including temporary tables, sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY clauses, queries using DISTINCT (temporary worktables have to be created to remove duplicate rows), cursors, and hash joins. To move the tempdb database, use the ALTER DATABASE command to change the physical file location of the SQL Server logical file name associated with tempdb. For example, to move tempdb and its associated log to the new file locations E:\mssql7 and C:\temp, use the following commands:


ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'e:\mssql7\tempnew_location.mDF')

ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'c:\temp\tempnew_loglocation.mDF')

The master database, msdb, and model databases are not used much during production (as compared to user databases), so it is generally not necessary to consider them in I/O performance tuning considerations. The master database is usually used only for adding new logins, databases, devices, and other system objects.

Database Partitioning Databases can be partitioned using files and/or filegroups. A filegroup is simply a named collection of individual files grouped together for administration purposes. A file cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and image data can all be associated with a specific filegroup. This means that all their pages are allocated from the files in that filegroup. The three types of filegroups are: ●





Primary filegroup. Contains the primary data file and any other files not placed into another filegroup. All pages for the system tables are allocated from the primary filegroup. User-defined filegroup. Any filegroup specified using the FILEGROUP keyword in a CREATE DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL Server Enterprise Manager. Default filegroup. Contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be the default filegroup. If no default filegroup is specified, the default is the primary filegroup.

Files and filegroups are useful for controlling the placement of data and indexes and eliminating device contention. Quite a few installations also leverage files and filegroups as a mechanism that is more granular than a database in order to exercise more control over their database backup/recovery strategy.
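As a minimal sketch (the database, filegroup, file path, and table names are illustrative, not part of any standard configuration), a filegroup might be added and used as follows:

ALTER DATABASE SalesDW ADD FILEGROUP FG_FACT;

ALTER DATABASE SalesDW
ADD FILE (NAME = 'fact_data1', FILENAME = 'F:\mssql\fact_data1.ndf', SIZE = 1024MB)
TO FILEGROUP FG_FACT;

CREATE TABLE dbo.customer_sales_fact
(
    customer_key INT NOT NULL,
    sale_date    DATETIME NOT NULL,
    sale_amount  MONEY NOT NULL
) ON FG_FACT;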

Horizontal Partitioning (Table)


Horizontal partitioning segments a table into multiple tables, each containing the same number of columns but fewer rows. Determining how to partition tables horizontally depends on how data is analyzed. A general rule of thumb is to partition tables so queries reference as few tables as possible. Otherwise, excessive UNION queries, used to merge the tables logically at query time, can impair performance. When you partition data across multiple tables or multiple servers, queries accessing only a fraction of the data can run faster because there is less data to scan. If the tables are located on different servers, or on a computer with multiple processors, each table involved in the query can also be scanned in parallel, thereby improving query performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up a table, can execute more quickly. By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.
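A minimal sketch of a local partitioned view (table and column names are illustrative; the CHECK constraints are what allow the optimizer to touch only the relevant member table):

CREATE TABLE sales_2006
(
    sale_date   DATETIME NOT NULL CHECK (sale_date >= '20060101' AND sale_date < '20070101'),
    sale_amount MONEY NOT NULL
);

CREATE TABLE sales_2007
(
    sale_date   DATETIME NOT NULL CHECK (sale_date >= '20070101' AND sale_date < '20080101'),
    sale_amount MONEY NOT NULL
);

CREATE VIEW all_sales
AS
SELECT sale_date, sale_amount FROM sales_2006
UNION ALL
SELECT sale_date, sale_amount FROM sales_2007;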

Cost Threshold for Parallelism Option Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).

Max Degree of Parallelism Option Use this option to limit the number of processors (from a maximum of 32) to use in parallel plan execution. The default value is zero, which uses the actual number of available CPUs. Set this option to one to suppress parallel plan generation. Set the value to a number greater than one to restrict the maximum number of processors used by a single query execution.

Priority Boost Option Use this option to specify whether SQL Server should run at a higher scheduling priority than other processors on the same computer. If you set this option to one, SQL Server runs at a priority base of 13. The default is zero, which is a priority base of seven.

Set Working Set Size Option


Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources. It can vary dynamically among minimum server memory and maximum server memory. Setting ‘set working set’ size means the operating system does not attempt to swap out SQL Server pages, even if they can be used more readily by another process when SQL Server is idle.
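As a minimal illustration of how several of the options above are typically set through sp_configure (the values shown are placeholders, not recommendations):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 15;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;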

Optimizing Disk I/O Performance When configuring a SQL Server that contains only a few gigabytes of data and does not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for optimal performance. To build larger SQL Server databases however, which can contain hundreds of gigabytes or even terabytes of data and/or that sustain heavy read/ write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.

Partitioning for Performance For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism. Partitioning can be performed using a variety of techniques. Methods for creating and managing partitions include configuring the storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, file groups, tables and views. Some possible candidates for partitioning include: ●

Transaction log



Tempdb



Database



Tables



Non-clustered indexes

Using bcp and BULK INSERT Two mechanisms exist inside SQL Server to address the need for bulk movement of data: the bcp utility and the BULK INSERT statement.


● Bcp is a command prompt utility that copies data into or out of SQL Server.

● BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.

TIP Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, you attempt to load 1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000 you could have saved significant recovery time, because SQL Server would have only had to roll back 9,999 rows instead of 999,999.
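A minimal sketch of a BULK INSERT with an explicit batch size (the table, file path, and delimiters are illustrative):

BULK INSERT dbo.customer_sales_fact
FROM 'D:\loads\sales_extract.dat'
WITH
(
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 10000,
    TABLOCK
);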

General Guidelines for Initial Data Loads While loading data: ●

Remove indexes.



Use Bulk INSERT or bcp.



Parallel load using partitioned data files into partitioned tables.



Run one load stream for each available CPU.



Set Bulk-Logged or Simple Recovery model.



Use the TABLOCK option.



Create indexes.



Switch to the appropriate recovery model.



Perform backups

General Guidelines for Incremental Data Loads ●

Load data with indexes in place.






● Use performance and concurrency requirements to determine locking granularity (sp_indexoption).

● Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve a point-in-time recovery, such as online users modifying the database during bulk loads.

● Read operations should not affect bulk loads.

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (Teradata) Challenge Database tuning can result in tremendous improvement in loading performance. This Best Practice provides tips on tuning Teradata.

Description Teradata offers several bulk load utilities including:

● MultiLoad which supports inserts, updates, deletes, and “upserts” to any table.
● FastExport which is a high-performance bulk export utility.
● BTEQ which allows you to export data to a flat file but is suitable for smaller volumes than FastExport.
● FastLoad which is used for loading inserts into an empty table.
● TPump which is a light-weight utility that does not lock the table that is being loaded.

MultiLoad parameters Below are the MultiLoad-specific parameters that are available in PowerCenter: ● ●



TDPID. A client based operand that is part of the logon string. Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database. Checkpoint. A checkpoint interval is similar to a commit interval for other

INFORMATICA CONFIDENTIAL

BEST PRACTICES

748 of 954

databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed. ●









Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running. Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table. Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables. Max Sessions. This parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working amp (Access Module Process). Sleep. This parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.

Estimating Space Requirements for MultiLoad Jobs Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for: ●

Work tables



Error tables



Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table. Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:


PERM = (using data size + 38) x (number of rows processed) x (number of apply conditions satisfied) x (number of Teradata SQL statements within the applied DML)

Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.
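As an illustrative calculation only (the figures are placeholders, not sizing guidance): for rows of roughly 200 bytes, 1,000,000 rows processed, one apply condition satisfied, and one Teradata SQL statement in the applied DML, the estimate would be

PERM = (200 + 38) x 1,000,000 x 1 x 1 = 238,000,000 bytes, or roughly 227 MB of permanent space for that work table.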

Monitoring MultiLoad Performance Below are tips for analyzing MultiLoad performance: 1. Determine which phase of the MultiLoad job is causing poor performance. ●



If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system. The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.

2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.
8. Poor performance can happen when the input data is skewed with respect to the Primary Index of the database. Teradata depends upon random and well distributed data for data input and retrieval. For example, a file containing a million rows with a single value 'AAAAAA' for the Primary Index will take an infinite time to load.
9. One common tool used for determining load issues/skewed data/locks is Performance Monitor (PMON). PMON requires MONITOR access on the Teradata system. If you do not have Monitor access, then the DBA can help you to look at the system.
10. SQL against the system catalog can also be used to determine any performance bottlenecks. The following query is used to see if the load is inserting data into the system. Spool space (a type of work space) is built up as data is transferred to the database, so if the load is going well, the spool will grow rapidly in the database. Use the following query to check (substitute the user ID that is running the load):

SELECT SUM(CurrentSpool) FROM DBC.DiskSpace WHERE DatabaseName = '<load user id>';

After the spool has reached its peak, spool will fall rapidly as data is inserted from spool into the table. If the spool grows slowly, then the input data is probably skewed.

FastExport FastExport is a bulk export Teradata utility. One way to pull up data for lookups/sources is by using ODBC since there is no native connectivity to Teradata. However, ODBC is slow. For higher performance, use FastExport if the number of rows to be pulled is in the order of a million rows. FastExport writes to a file. The lookup or source qualifier then reads this file. FastExport is integrated within PowerCenter.

BTEQ BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you to export data to a flat file, but is suitable for smaller volumes of data. This provides faster performance than ODBC but doesn't tax Teradata system resources the way FastExport can. A possible use for BTEQ with PowerCenter is to export smaller volumes of data to a flat file (i.e., less than 1 million rows). The flat file is then read by PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-session script.

TPump TPump was a load utility primarily intended for streaming data (think of loading bundles of messages arriving from MQ using PowerCenter Real Time). TPump can also load from a file or a named pipe. While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility. Another important difference between MultiLoad and TPump is that TPump locks at the row-hash level instead of the table level, thus providing users read access to fresher data. Teradata also says that it has improved the speed of TPump for loading files so that it is comparable with MultiLoad, so try a test load using TPump first. Also, be cautious with the use of TPump to load streaming data if the data throughput is large.

Push Down Optimization PowerCenter embeds a powerful engine that has a memory management system built in, along with the algorithms needed to perform various transformation operations such as aggregation, sorting, joining, and lookups. This is typically referred to as an ETL architecture, where Extracts, Transformations and Loads are performed: data is extracted from the data source to the PowerCenter engine (which can be on the same machine as the source or a separate machine), where all the transformations are applied, and then pushed to the target. Some of the performance considerations for this type of architecture are: ●

● Is the network fast enough and tuned effectively to support the necessary data transfer?

● Is the hardware on which PowerCenter is running sufficiently robust, with high processing capability and high memory capacity?

ELT (Extract, Load, Transform) is a relatively new design or runtime paradigm that became popular with the advent of high-performance RDBMS systems such as DSS and OLTP. Because Teradata typically runs on well tuned operating systems and well tuned hardware, the ELT paradigm tries to push as much of the transformation logic as possible onto the Teradata system. The ELT design paradigm can be achieved through the Pushdown Optimization option offered with PowerCenter.

ETL or ELT Because many database vendors and consultants advocate using ELT (Extract, Load and Transform) over ETL (Extract, Transform and Load), the use of Pushdown Optimization can be somewhat controversial. Informatica advocates using Pushdown Optimization as an option to solve specific performance situations rather than as the default design of a mapping.


The following scenarios can help in deciding when to use ETL with PowerCenter and when to use ELT (i.e., Pushdown Optimization):

1. When the load needs to look up only dimension tables then there may be no need to use Pushdown Optimization. In this context, PowerCenter's ability to build dynamic, persistent caching is significant. If a daily load involves 10s or 100s of fact files to be loaded throughout the day, then dimension surrogate keys can be easily obtained from PowerCenter's cache in memory. Compare this with the cost of running the same dimension lookup queries on the database.

2. In many cases large Teradata systems contain only a small amount of data. In such cases there may be no need to push down.

3. When only simple filters or expressions need to be applied on the data then there may be no need to push down. The special case is that of applying filters or expression logic to non-unique columns in incoming data in PowerCenter. Compare this to loading the same data into the database and then applying a WHERE clause on a non-unique column, which is highly inefficient for a large table. The principle here is: filter and resolve the data as it gets loaded instead of loading it into a database, querying the RDBMS to filter/resolve, and re-loading it into the database. In other words, ETL instead of ELT.

4. Pushdown Optimization needs to be considered only if a large set of data needs to be merged or queried for getting to your final load set.

Maximizing Performance using Pushdown Optimization You can push transformation logic to either the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration.

When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and processes any transformation logic that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to Pushdown Optimization.

Known Issues with Teradata You may encounter the following problems using ODBC drivers with a Teradata database: ●









Teradata sessions fail if the session requires a conversion to a numeric data type and the precision is greater than 18. Teradata sessions fail when you use full pushdown optimization for a session containing a Sorter transformation. A sort on a distinct key may give inconsistent results if the sort is not case sensitive and one port is a character port. A session containing an Aggregator transformation may produce different results from PowerCenter if the group by port is a string data type and it is not case-sensitive. A session containing a Lookup transformation fails if it is configured for targetside pushdown optimization.



A session that requires type casting fails if the casting is from x to date/time.



A session that contains a date-to-string conversion fails.

Working with SQL Overrides

You can configure the Integration Service to perform an SQL override with Pushdown Optimization. To perform an SQL override, you configure the session to create a view. When you use an SQL override for a Source Qualifier transformation in a session configured for source or full Pushdown Optimization with a view, the Integration Service creates a view in the source database based on the override. After it creates the view in the database, the Integration Service generates an SQL query that it can push to the database. The Integration Service runs the SQL query against the view to perform Pushdown Optimization.

Note: To use an SQL override with pushdown optimization, you must configure the session for pushdown optimization with a view.

Running a Query

If the Integration Service did not successfully drop the view, you can run a query against the source database to search for the views generated by the Integration Service. When the Integration Service creates a view, it uses a prefix of PM_V. You can search for views with this prefix to locate the views created during pushdown optimization.

Teradata-specific SQL:

SELECT TableName
FROM   DBC.Tables
WHERE  CreatorName = USER
AND    TableKind = 'V'
AND    TableName LIKE 'PM\_V%' ESCAPE '\'
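If the Teradata BTEQ command-line client is available, one way to run this check from a shell is sketched below; the TDPID, user name, and password are placeholders for your own connection values.

    # List leftover pushdown-optimization views (PM_V prefix) via BTEQ
    bteq <<'EOF'
    .LOGON mytdpid/pc_user,pc_password
    SELECT TableName
    FROM   DBC.Tables
    WHERE  CreatorName = USER
    AND    TableKind = 'V'
    AND    TableName LIKE 'PM\_V%' ESCAPE '\';
    .LOGOFF
    .QUIT
    EOF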

Rules and Guidelines for SQL Override

Use the following rules and guidelines when you configure pushdown optimization for a session containing an SQL override:

Last updated: 01-Feb-07 18:54


Performance Tuning in a Real-Time Environment

Challenge

As Data Integration becomes a broader and more service-oriented Information Technology initiative, real-time and right-time solutions become critical to the success of the overall architecture. Tuning real-time processes is often different from tuning batch processes.

Description

To remain agile and flexible in increasingly competitive environments, today's companies are dealing with sophisticated operational scenarios such as consolidation of customer data in real time to support a call center, or the delivery of precise forecasts for supply chain optimization. To support such highly demanding operational environments, data integration platforms must do more than serve analytical data needs. They must also support real-time, 24x7, mission-critical operations that involve live or current information available across the enterprise and beyond. They must access, cleanse, integrate, and deliver data in real time to ensure up-to-the-second information availability. Data integration platforms must also scale intelligently to meet both increasing data volumes and increasing numbers of concurrent requests, which are typical of shared-services Integration Competency Center (ICC) environments. Finally, they must be extremely reliable, providing high availability to minimize outages and ensure seamless failover and recovery, because every minute of downtime can have a major impact on business operations.

PowerCenter can be used to process data in real time. Real-time processing is on-demand processing of data from real-time sources. A real-time session reads, processes, and writes data to targets continuously. By default, a session reads and writes bulk data at scheduled intervals unless it is configured for real-time processing. To process data in real time, the data must originate from a real-time source. Real-time sources include JMS, WebSphere MQ, TIBCO, webMethods, MSMQ, SAP, and web services. Real-time processing can also be used for processes that require immediate access to dynamic data (i.e., financial data).

Latency Impact on Performance

Use the Real-Time Flush Latency session condition to control the target commit latency when running in real-time mode. PWXPC commits source data to the target at the end of the specified maximum latency period. This parameter requires a valid value and has a valid default value.

When the session runs, PWXPC begins to read data from the source. After data is provided to the source qualifier, the Real-Time Flush Latency interval begins. At the end of each Real-Time Flush Latency interval, once an end-UOW boundary is reached, PWXPC issues a commit to the target. The following message appears in the session log to indicate that this has occurred:

[PWXPC_10082] [INFO] [CDCDispatcher] raising real-time flush with restart tokens [restart1_token], [restart2_token] because Real-time Flush Latency [RTF_millisecs] occurred

Only complete UOWs are committed during real-time flush processing. The commit to the target when reading CDC data is not strictly controlled by the Real-Time Flush Latency specification; the UOW Count and Commit Threshold values also determine the commit frequency.

The value specified for Real-Time Flush Latency also controls the PowerExchange Consumer API (CAPI) interface timeout value (PowerExchange latency) on the source platform. The CAPI interface timeout value is displayed in the following PowerExchange message on the source platform (and in the session log if "Retrieve PWX Log Entries" is specified in the Connection Attributes):

PWX-09957 CAPI i/f: Read times out after seconds

The CAPI interface timeout also affects latency, because it determines how quickly changes are returned to the PWXPC reader by PowerExchange. PowerExchange returns control to PWXPC at least once every CAPI interface timeout period, which allows PWXPC to regain control and, if necessary, perform the real-time flush of the data returned. A high Real-Time Flush Latency value also slows the handling of stop requests from PowerCenter, because the PWXPC CDC reader must wait for PowerExchange to return control before it can handle the stop request.

TIP: Use the PowerExchange STOPTASK command to shut down more quickly when using a high Real-Time Flush Latency value.

For example, if the value for Real-Time Flush Latency is 10 seconds, PWXPC issues a commit for all data read after 10 seconds have elapsed and the next end-UOW boundary is received. The lower the value, the faster data is committed to the target. If the lowest possible latency is required for the application of changes to the target, specify a low Real-Time Flush Latency value.

Warning: When you specify a low Real-Time Flush Latency interval, the session might consume more system resources on the source and target platforms. This is because:

● The session commits to the target more frequently, consuming more target resources.
● PowerExchange returns control to the PWXPC reader more frequently, passing fewer rows on each iteration and consuming more resources on the source PowerExchange platform.

Balance performance and resource consumption with latency requirements when choosing the UOW Count and Real-Time Flush Latency values.

Commit Interval Impact on Performance

Commit Threshold is only applicable to real-time CDC sessions. Use the Commit Threshold session condition to cause commits before reaching the end of the UOW when processing large UOWs. This parameter requires a valid value and has a valid default value.

Commit Threshold can be used to cause a commit before the end of a UOW is received, a process also referred to as sub-packet commit. The value specified for Commit Threshold is the number of records within a source UOW to process before inserting a commit into the change stream. This attribute differs from the UOW Count attribute in that it counts records within a UOW rather than complete UOWs. The Commit Threshold counter is reset when either the number of records specified or the end of the UOW is reached. This attribute is useful when there are extremely large UOWs in the change stream that might cause locking issues on the target database or resource issues on the PowerCenter Integration Service.

The Commit Threshold count is cumulative across all sources in the group. This means that sub-packet commits are inserted into the change stream when the specified count is reached, regardless of the number of sources to which the changes actually apply. For example, a UOW contains 900 changes for one source followed by 100 changes for a second source and then 500 changes for the first source. If the Commit Threshold is set to 1000, the commit record is inserted after the 1000th change record, which is after the 100 changes for the second source.

Warning: A UOW may contain changes for multiple source tables. Using Commit Threshold can cause commits to be generated at points in the change stream where the relationship between these tables is inconsistent. This may then result in target commit failures.

If 0 or no value is specified, commits occur on UOW boundaries only. Otherwise, the value specified is used to insert commit records into the change stream between UOW boundaries, where applicable. The value of this attribute overrides the value specified in the PowerExchange DBMOVER configuration file parameter SUBCOMMIT_THRESHOLD. For more information on this PowerExchange parameter, refer to the PowerExchange Reference Manual.

The commit to the target when reading CDC data is not strictly controlled by the Commit Threshold specification. The commit records inserted into the change stream as a result of the Commit Threshold value affect the UOW Count counter; the UOW Count and Real-Time Flush Latency values determine the target commit frequency. For example, a UOW contains 1,000 change records (any combination of inserts, updates, and deletes). If 100 is specified for the Commit Threshold and 5 for the UOW Count, then a commit record is inserted after each 100 records and a target commit is issued after every 500 records.

Last updated: 29-May-08 18:40


Performance Tuning UNIX Systems

Challenge

Identify opportunities for performance improvement within the complexities of the UNIX operating environment.

Description

This section provides an overview of the subject area, followed by discussion of the use of specific tools.

Overview

All system performance issues are fundamentally resource contention issues. In any computer system, there are three essential resources: CPU, memory, and I/O (both disk and network I/O). From this standpoint, performance tuning for PowerCenter means ensuring that PowerCenter and its sub-processes have adequate resources to execute in a timely and efficient manner.

Each resource has its own particular set of problems, and resource problems are complicated because all resources interact with each other. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. Your best approach is to take a baseline measurement and obtain a good understanding of how the system behaves, then evaluate any bottleneck revealed on each system resource during your load window, and remove whichever resource contention offers the greatest opportunity for performance enhancement.

Here is a summary of each system resource area and the problems it can have.

CPU

● On any multiprocessing and multi-user system, many processes want to use the CPUs at the same time. The UNIX kernel is responsible for allocating a finite number of CPU cycles across all running processes. If the total demand on the CPU exceeds its finite capacity, all processing is likely to suffer; the system scheduler puts each process in a queue to wait for CPU availability.
● An average of the count of active processes in the system for the last 1, 5, and 15 minutes is reported as the load average when you execute the command uptime. The load average provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat command provides the average usage of all the CPUs, along with the number of processes contending for CPU (the value under the r column).
● On SMP (symmetric multiprocessing) servers, watch for even utilization of all the CPUs. How well all the CPUs are utilized depends on how well an application can be parallelized. If a process incurs a high degree of involuntary context switching by the kernel, binding the process to a specific CPU may improve performance.
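For example, a quick CPU check might look like the following; exact command options vary by UNIX flavor, so treat these as representative rather than definitive.

    # Load average for the last 1, 5, and 15 minutes
    uptime

    # Overall CPU utilization and run-queue depth (r column), sampled every 5 seconds, 5 times
    vmstat 5 5

    # Per-CPU utilization on an SMP server (Solaris syntax; on Linux use: mpstat -P ALL 5 5)
    mpstat 5 5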

Memory

● Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. When this happens, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system cannot provide enough memory for the processes that are currently running. Commands such as vmstat and pstat show whether the system is paging; ps, prstat, and sar can report the memory requirements of each process.

Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. When this happens, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system can't provide enough memory for the processes that are currently running. Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar can report the memory requirements of each process.

Disk I/O

● The I/O subsystem is a common source of resource contention problems. A finite amount of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that are currently running. The system's I/O buses can transfer only so many megabytes per second; individual devices are even more limited. Each type of device has its own peculiarities and, therefore, its own problems. Tools are available to evaluate specific parts of the I/O subsystem:
  ❍ iostat can give you information about the transfer rates for each disk drive.
  ❍ ps and vmstat can give some information about how many processes are blocked waiting for I/O.
  ❍ sar can provide voluminous information about I/O efficiency.
  ❍ sadp can give detailed information about disk access patterns.
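A simple disk I/O snapshot using these tools might look like the following; the extended options shown are typical of Solaris and Linux and differ on other platforms.

    # Per-device transfer rates, sampled every 5 seconds, 5 times
    iostat 5 5

    # Extended statistics, including device busy percentage and service times
    iostat -x 5 5

    # Block-device activity from the system activity reporter
    sar -d 5 5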


Network I/O

● The source data, the target data, or both are likely to be connected through an Ethernet channel to the system where PowerCenter resides. Be sure to consider the number of Ethernet channels and the bandwidth available to avoid congestion.
  ❍ netstat shows packet activity on a network; watch for a high collision rate on output packets on each interface.
  ❍ nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the NFS server) and watch for a high timeout rate on total calls and for "not responding" messages.
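For example, the following commands give a first look at interface health and NFS client behavior; interface names and output columns vary by platform.

    # Per-interface packet counts, input/output errors, and collisions
    netstat -i

    # NFS client statistics; watch the retrans and timeout counts
    nfsstat -c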

Given that these issues all boil down to access to some computing resource, mitigation of each issue consists of making some adjustment to the environment to provide more (or preferential) access to the resource; for instance:

● Adjusting execution schedules to leverage low-usage times may improve availability of memory, disk, network bandwidth, CPU cycles, etc.
● Migrating other applications to other hardware is likely to reduce demand on the hardware hosting PowerCenter.
● For CPU-intensive sessions, raising CPU priority (or lowering the priority of competing processes) provides more CPU time to the PowerCenter sessions.
● Adding hardware resources, such as memory, makes more resource available to all processes.
● Re-configuring existing resources may provide for more efficient usage, such as assigning different disk devices for input and output, striping disk devices, or adjusting network packet sizes.

Detailed Usage

The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips are likely to be more helpful than others in a particular environment, all are worthy of consideration. Availability, syntax, and format of each varies across UNIX versions.

Running ps -axu


Run ps -axu to check for the following items:

● Are there any processes waiting for disk access or for paging? If so, check the I/O and memory subsystems.
● What processes are using most of the CPU? This may help to distribute the workload better.
● What processes are using most of the memory? This may help to distribute the workload better.
● Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.
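One way to answer these questions quickly, assuming a BSD-style ps (as on Linux, or /usr/ucb/ps on Solaris), is to sort the process list by CPU and memory usage:

    # Top 10 CPU consumers (%CPU is the third column of ps aux output)
    ps aux | sort -rn -k 3,3 | head -10

    # Top 10 memory consumers (%MEM is the fourth column)
    ps aux | sort -rn -k 4,4 | head -10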

Identifying and Resolving Memory Issues

Use vmstat or sar to check for paging/swapping activity. Check the system to ensure that excessive paging/swapping does not occur at any time during session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/swapping. If paging or excessive swapping does occur at any time, increase memory to prevent it. Paging/swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server.

Some swapping may occur normally regardless of the tuning settings, because some processes use the swap space by design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased.

Run vmstat 5 (sar -wpgr on SunOS, or vmstat -S 5) to detect and confirm memory problems, and check for the following:

● Are page-outs occurring consistently? If so, you are short of memory.
● Are there a high number of address translation faults? (System V only.) This suggests a memory shortage.
● Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate an extreme memory shortage.
● If you don't have vmstat -S, look at the w and de fields of vmstat. These should always be zero.

If memory seems to be the bottleneck, try the following remedial steps:











● Reduce the size of the buffer cache (if your system has one) by decreasing BUFPAGES.
● If you have statically allocated STREAMS buffers, reduce the number of large (e.g., 2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.
● Reduce the size of your kernel's tables. This may limit the system's capacity (i.e., number of files, number of processes, etc.).
● Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.
● Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.
● Try to limit the time spent running sendmail, which is a memory hog.
● If you don't see any significant improvement, add more memory.
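A minimal paging and swap-space check using the commands discussed above might look like the following; the flags are shown for Solaris, so substitute the equivalents for your platform (for example, swapon -s on Linux or lsps -a on AIX).

    # Watch the page-out (po) and scan-rate (sr) columns for sustained non-zero values
    vmstat 5 10

    # Paging activity from the system activity reporter
    sar -g 5 10

    # Configured swap devices and free swap space
    swap -l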

Identifying and Resolving Disk I/O Issues

Use iostat to check I/O load and utilization, as well as CPU load. iostat can be used to monitor the I/O load on the disks on the UNIX server, and permits monitoring the load on specific disks. Take notice of how evenly disk activity is distributed among the system disks. If it is not, are the most active disks also the fastest disks?

Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)?

● Reorganize your file systems and disks to distribute I/O activity as evenly as possible. Using symbolic links helps to keep the directory structure the same throughout while still moving the data files that are causing I/O contention.
● Use your fastest disk drive and controller for your root file system; this almost certainly has the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one file system and use the fastest drive for that file system.
● Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD).
● Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system's memory performance.
● Rebuild your file systems periodically to eliminate fragmentation (i.e., back up, build a new file system, and restore).
● If you are using NFS and remote files, look at your network situation; you don't have local disk I/O problems.
● Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problem first, because swapping makes performance worse.

If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:

● Write a find script that detects old core dumps, editor backup and auto-save files, and other trash, and deletes it automatically; run the script through cron (see the example below).
● Use the disk quota system (if your system has one) to prevent individual users from gathering too much storage.
● Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).
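A sketch of such a cleanup script is shown below; the directories, file patterns, and seven-day age threshold are examples to adapt to your site.

    #!/bin/sh
    # cleanup_trash.sh - remove old core dumps, editor backup and auto-save files
    TARGET_DIRS="/home /data/staging"     # example directories only

    find $TARGET_DIRS \( -name core -o -name '*.bak' -o -name '*~' -o -name '#*#' \) \
        -type f -mtime +7 -exec rm -f {} \;

    # Example crontab entry to run the script nightly at 2:00 a.m.:
    # 0 2 * * * /usr/local/bin/cleanup_trash.sh >/dev/null 2>&1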

Identifying and Resolving CPU Overload Issues

Use uptime or sar -u to check for CPU loading. sar provides more detail, including %usr (user), %sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy %sys load together with a high %idle, this is indicative of memory contention and swapping/paging problems; in this case, it is necessary to make memory changes to reduce the load on the server.

When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere. This points to CPU overload.

● Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help.
● Get users to run jobs at night with at or any queuing system that's available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.
● Using nice to lower the priority of CPU-bound jobs improves interactive performance. Also, using nice to raise the priority of CPU-bound jobs expedites them, but may hurt interactive performance. In general, though, using nice is only a temporary solution; if your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.
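For example (the process IDs and job name below are placeholders):

    # Start a CPU-bound batch job at reduced priority
    nice -n 10 ./heavy_batch_job.sh &

    # Lower the priority of an already-running process
    renice 10 -p 12345

    # Raise the priority of a critical PowerCenter session process (requires root)
    renice -5 -p 23456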

Identifying and Resolving Network I/O Issues

Suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network; a large number of output errors suggests problems with your system and its interface to the network.

If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory, or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads. Try to reorganize the network so that this system is not a file server.

A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors.

Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.

Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program; instead, have users log into the remote system to do their work.

Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system. Use systems with good network performance as file servers.

lsattr -E -l sys0 is used to determine some current settings on some UNIX environments. (In Solaris, you execute prtenv.) Of particular interest is maxuproc, the setting that determines the maximum number of user background processes. On most UNIX environments, this defaults to 40, but it should be increased to 250 on most systems; an example follows.

Choose a file system. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw devices that, in reality, are not a file system at all. Additionally, for the PowerCenter Enterprise Grid Option, cluster file system (CFS) products such as GFS for Red Hat Linux, Veritas CFS, and GPFS for IBM AIX are some of the available choices.
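On AIX, for example, the current maxuproc value can be checked and raised as follows (run as root):

    # Display the current maximum number of processes per user
    lsattr -E -l sys0 | grep maxuproc

    # Increase the limit to 250
    chdev -l sys0 -a maxuproc=250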

Cluster File System Tuning

In order to take full advantage of the PowerCenter Enterprise Grid Option, a cluster file system (CFS) is recommended. The PowerCenter Grid option requires that the directories for each Integration Service be shared with other servers. This allows Integration Services to share files, such as cache files, between different session runs. CFS performance is a result of tuning parameters and tuning the infrastructure; therefore, using the parameters recommended by each CFS vendor is the best approach for CFS tuning.

PowerCenter Options

The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about the running tasks, including CPU%, memory, and swap usage. The PowerCenter 64-bit option can allocate more memory to sessions and achieve higher throughput compared to the 32-bit version of PowerCenter.

Last updated: 06-Dec-07 15:16

Performance Tuning Windows 2000/2003 Systems

Challenge

Windows Server is designed as a self-tuning operating system. A standard installation of Windows Server provides good performance out of the box, but optimal performance can be achieved by tuning.

Note: Tuning is essentially the same for both Windows 2000 and Windows 2003-based systems.

Description

The following tips have proven useful in performance-tuning Windows Servers. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration. The two places to begin tuning a Windows server are:

● Performance Monitor.
● Performance tab (press Ctrl+Alt+Del, choose Task Manager, and click on the Performance tab).

Although the Performance Monitor can be tracked in real-time, creating a result-set representative of a full day is more likely to render an accurate view of system performance.

Resolving Typical Windows Server Problems

The following paragraphs describe some common performance problems in a Windows Server environment and suggest tuning solutions.

Server Load: Assume that some software will not be well coded, and some background processes (e.g., a mail server or web server) running on the same machine can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the only recourse.


Device Drivers: The device drivers for some types of hardware are notorious for consuming CPU clock cycles inefficiently. Be sure to obtain the latest drivers from the hardware vendor to minimize this problem.

Memory and Services: Although adding memory to Windows Server is always a good solution, it is also expensive and usually must be planned in advance. Before adding memory, check the Services in Control Panel, because many background applications do not uninstall the old service when installing a new version. Thus, both the unused old service and the new service may be using valuable CPU and memory resources.

I/O Optimization: This is, by far, the best tuning option for database applications in the Windows Server environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too. Using electrostatic devices and fast-wide SCSI can also help to increase performance. Further, fragmentation can usually be eliminated by using a Windows Server disk defragmentation product. Finally, on Windows Servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the Windows Server; Windows Server, by default, sets the disk device priority low.

Monitoring System Performance in Windows Server

In Windows Server, PowerCenter uses system resources to process transformations, execute sessions, and read and write data. The PowerCenter Integration Service also uses system memory for other data, such as aggregate, joiner, rank, and cached lookup tables. With Windows Server, you can use the System Monitor in the Performance Console of the administrative tools, or the system tools in Task Manager, to monitor the amount of system resources used by PowerCenter and to identify system bottlenecks.

Windows Server provides the following tools (accessible under Control Panel/Administrative Tools/Performance) for monitoring resource usage on your computer:

● System Monitor
● Performance Logs and Alerts

These Windows Server monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level.

System Monitor

The System Monitor displays a graph that is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste counter paths from web pages or other sources into the System Monitor display. Because the System Monitor is portable, it is useful in monitoring other systems that require administration.

Performance Logs and Alerts

The Performance Logs and Alerts tool provides two types of performance-related logs (counter logs and trace logs) and an alerting function.

Counter logs record sampled data about hardware resources and system services based on performance objects and counters, in the same manner as System Monitor, so they can be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel.

Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity.

The alerting function allows you to define a counter value that triggers actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You may want to set alerts based on established performance baseline values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.)

The predefined log settings under Counter Logs (i.e., System Overview) are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it reaches a maximum size. If you start logging with the default settings, data is saved to the Perflogs folder on the root directory and includes the counters Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\% Processor Time.

If you want to create your own log setting, right-click one of the log types.

PowerCenter Options

The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about running tasks, including CPU%, memory, and swap usage.

PowerCenter's 64-bit option, running on Intel Itanium processor-based machines and 64-bit Windows Server 2003, can allocate more memory to sessions and achieve higher throughput than the 32-bit version of PowerCenter on Windows Server.

Using the PowerCenter Grid option on Windows Server enables distribution of a session or sessions in a workflow to multiple servers and reduces the processing load window. The PowerCenter Grid option requires that the directories for each Integration Service be shared with other servers. This allows Integration Services to share files such as cache files among various session runs. With a Cluster File System (CFS), Integration Services running on various servers can perform concurrent reads and writes to the same block of data.

Last updated: 01-Feb-07 18:54


Recommended Performance Tuning Procedures

Challenge

To optimize PowerCenter load times by employing a series of performance tuning procedures.

Description

When a PowerCenter session or workflow is not performing at the expected or desired speed, there is a methodology that can help to diagnose problems that may be adversely affecting various components of the data integration architecture. While PowerCenter has its own performance settings that can be tuned, you must consider the entire data integration architecture, including the UNIX/Windows servers, network, disk array, and the source and target databases, to achieve optimal performance. More often than not, an issue external to PowerCenter is the cause of the performance problem.

In order to correctly and scientifically determine the most likely cause of the performance problem, you need to execute the performance tuning steps in a specific order. This enables you to methodically rule out individual components and narrow down the specific areas on which to focus your tuning efforts.

1. Perform Benchmarking

You should always have a baseline of current load times for a given workflow or session with a similar row count. Perhaps you are not achieving your required load window, or you simply think your processes could run more efficiently based on comparison with other, similar tasks that run faster. Use the benchmark to estimate what your desired performance goal should be and tune to that goal. Begin with the problem mapping that you created, along with a session and workflow that use all default settings. This helps to identify which changes have a positive impact on performance.

2. Identify the Performance Bottleneck Area

This step helps to narrow down the areas on which to focus further. Follow the areas and sequence below when attempting to identify the bottleneck:

● Target
● Source
● Mapping
● Session/Workflow
● System

The methodology steps you through a series of tests using PowerCenter to identify trends that point to where to focus next. Remember to go through these tests in a scientific manner, running them multiple times before reaching any conclusion, and always keep in mind that fixing one bottleneck may create a different bottleneck. For more information, see Determining Bottlenecks.

3. "Inside" or "Outside" PowerCenter Depending on the results of the bottleneck tests, optimize “inside” or “outside” PowerCenter. Be sure to perform the bottleneck test in the order prescribed in Determining Bottlenecks, since this is also the order in which you should make any performance changes. Problems “outside” PowerCenter refers to anything that indicates the source of the performance problem is external to PowerCenter. The most common performance problems “outside” PowerCenter are source/target database problem, network bottleneck, server, or operating system problem. ●





For source database related bottlenecks, refer to Tuning SQL Overrides and Environment for Better Performance For target database related problems, refer to Performance Tuning Databases - Oracle, SQL Server, or Teradata For operating system problems, refer to Performance Tuning UNIX Systems or Performance Tuning Windows 2000/2003 Systems for more information.

Problems "inside" PowerCenter refer to anything that PowerCenter controls, such as the actual transformation logic and the PowerCenter workflow/session settings. The session settings contain quite a few memory settings and partitioning options that can greatly improve performance; refer to Tuning Sessions for Better Performance for more information.

Although there are certain procedures to follow to optimize mappings, keep in mind that, in most cases, the mapping design is dictated by business logic; there may be a more efficient way to perform the business logic within the mapping, but you cannot ignore the necessary business logic to improve performance. Refer to Tuning Mappings for Better Performance for more information.

4. Re-Execute the Problem Workflow or Session

After you have completed the recommended steps for each relevant performance bottleneck, re-run the problem workflow or session and compare its load performance against the benchmark. This step is iterative, and should be performed after any performance-based setting is changed. You are trying to answer the question, "Did the performance change have a positive impact?" If so, move on to the next bottleneck. Be sure to prepare detailed documentation at every step along the way so you have a clear record of what was and wasn't tried.

While it may seem like there are an enormous number of areas where a performance problem can arise, if you follow the steps for finding the bottleneck(s) and apply the tuning techniques specific to each, you are likely to improve performance and achieve your desired goals.

Last updated: 01-Feb-07 18:54


Tuning and Configuring Data Analyzer and Data Analyzer Reports

Challenge

A Data Analyzer report that is slow to return data means lag time for a manager or business analyst, and it can be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning Data Analyzer and Data Analyzer reports.

Description

Performance tuning of reports occurs both at the environment level and at the report level. Often report performance can be enhanced by looking closely at the objective of the report rather than its suggested appearance. The following guidelines should help with tuning the environment and the report itself.

1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform benchmarks at various points throughout the day and evening hours to account for inconsistencies in network traffic, database server load, and application server load. This provides a baseline to measure changes against.

2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider whether the report can be broken into multiple reports or presented at a higher level. These are often ways to create more visually appealing reports and allow for linked detail reports or drill-down to the detail level.

3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the report to run during hours when system use is minimized. Consider scheduling large numbers of reports to run overnight. If mid-day updates are required, test the performance at lunch hours and consider scheduling for that time period. Reports that require filters by users can often be copied and the filters pre-created to allow for scheduling of the report.

4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the creation of secondary indices to improve the efficiency of the query. Next, execute reports while a DBA monitors the database environment; this gives the DBA the opportunity to tune the database for querying. Finally, look into changes in database settings. Increasing the database memory in the initialization file often improves Data Analyzer performance significantly.

5. Investigate Network. Reports are simply database queries, which can be found by clicking the "View SQL" button on the report. Run the query from the report against the database using a client tool on the server where the database resides. One caveat is that even the database tool on the server may contact the outside network. Work with the DBA during this test to use a local database connection (e.g., Bequeath/IPC, Oracle's local database communication protocol) and monitor the database throughout this process. This test may pinpoint whether the bottleneck is occurring on the network or in the database. If, for instance, the query performs well regardless of where it is executed, but the report continues to be slow, this indicates an application server bottleneck. Common locations for network bottlenecks include router tables, web server demand, and server input/output. Informatica recommends installing Data Analyzer on a dedicated application server.

6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of tuning involves changes to the database tables. Review the under-performing reports. Can any of these be generated from aggregate tables instead of from base tables? Data Analyzer makes efficient use of linked aggregate tables by determining on a report-by-report basis whether the report can utilize an aggregate table. By studying the existing reports and future requirements, you can determine what key aggregates can be created in the ETL tool and stored in the database. Calculated metrics can also be created in an ETL tool and stored in the database instead of created in Data Analyzer. Each time a calculation must be done in Data Analyzer, it is performed as part of the query process. To determine if a query can be improved by building these elements in the database, try removing them from the report and comparing report performance. Consider whether these elements appear in a multitude of reports or only a few.

7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the SQL into a query utility and execute it. (DBA assistance may be beneficial here.) If the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a query utility. Additional options include utilizing database views to cache data prior to report generation; reports are then built based on the view.

Note: Editing the report query requires query editing for each report change and may require editing during migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning. The Data Analyzer repository database should be tuned for an OLTP workload.

Tuning Java Virtual Machine (JVM)

JVM Layout

The Java Virtual Machine (JVM) is the repository for all live objects, dead objects, and free memory. It has the following primary jobs:

● Execute code
● Manage memory
● Remove garbage objects

The size of the JVM determines how often and how long garbage collection runs. The JVM parameters can be set in startWebLogic.cmd or startWebLogic.sh if you are using the WebLogic application server.

Parameters of the JVM

1. The -Xms and -Xmx parameters define the minimum and maximum heap size.
2. For large applications like Data Analyzer, the two values should be set equal to each other.
3. Start with -Xms512m -Xmx512m; as needed, increase the JVM heap by 128m or 256m to reduce garbage collection.
4. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line parameter controls the permanent generation's size.
5. The NewSize and MaxNewSize parameters control the new generation's minimum and maximum size. -XX:NewRatio=5 divides old-to-new in the ratio 5:1 (i.e., the old generation occupies 5/6 of the heap while the new generation occupies 1/6 of the heap).
  ❍ When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the old generation.
  ❍ When the old generation fills up, it triggers a major collection, which involves the entire object heap. This is more expensive in terms of resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two to three CPUs per JVM.
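A sketch of how these settings might appear in startWebLogic.sh is shown below; the exact variable name and the MaxPermSize/NewRatio values depend on your WebLogic release and available memory, so treat them as a starting point rather than the definitive configuration.

    # Excerpt from startWebLogic.sh: equal minimum/maximum heap, explicit permanent
    # generation size, and a new generation sized at roughly 1/6 of the heap
    JAVA_OPTIONS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewRatio=5"
    export JAVA_OPTIONS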

Other Areas to Tune

Execute Threads

● Threads available to process simultaneous operations in WebLogic.
● Too few threads means CPUs are under-utilized and jobs are waiting for threads to become available.
● Too many threads means the system is wasting resources managing threads, and the OS performs unnecessary context switching.
● The default is 15 threads. Informatica recommends using the default value, but you may need to experiment to determine the optimal value for your environment.

Connection Pooling

The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.

● Initial capacity = 15
● Maximum capacity = 15
● The sum of connections of all pools should be equal to the number of execution threads.

Connection pooling avoids the overhead of growing and shrinking the pool size dynamically by setting the initial and maximum pool size at the same level.

Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available on Windows NT/2000 (installed by default), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.

● Check Enable Native I/O on the server attribute tab.
● This adds NativeIOEnabled=true to config.xml.

For WebSphere, use the Performance Tuner to modify the configurable parameters. For optimal configuration, separate the application server, the data warehouse database, and the repository database onto separate dedicated machines.

Application Server-Specific Tuning Details

JBoss Application Server

Web Container. Tune the web container by modifying the following configuration file so that it accepts a reasonable number of HTTP requests, as required by the Data Analyzer installation. Ensure that the web container has an optimal number of threads available so that it can accept and process more HTTP requests.

/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml

The following parameters may need tuning:

● minProcessors. Number of threads created initially in the pool.
● maxProcessors. Maximum number of threads that can ever be created in the pool.
● acceptCount. Controls the length of the queue of waiting requests when no more threads are available from the pool to process the request.
● connectionTimeout. Amount of time to wait before a URI is received from the stream. The default is 20 seconds. This avoids problems where a client opens a connection and does not send any data.
● tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer to be full. This reduces latency at the cost of more packets being sent over the network. The default is true.
● enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled to prevent IP spoofing. Enabling this parameter can cause problems when a DNS is misbehaving. The enableLookups parameter can be turned off when you implicitly trust all clients.
● connectionLinger. How long connections should linger after they are closed. Informatica recommends using the default value: -1 (no linger).

In the Data Analyzer application, each web page can potentially generate more than one request to the application server. Hence, maxProcessors should always be more than the actual number of concurrent users. For an installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values. If the number of threads is too low, the following message may appear in the log files:

ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads

JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first time, Informatica ships Data Analyzer with pre-compiled JSPs. A typical configuration in the following file defines the jsp servlet using org.apache.jasper.servlet.JspServlet, with the init parameters logVerbosityLevel set to WARNING and development set to false, and a load-on-startup value of 3:

/server/informatica/deploy/jbossweb-tomcat.sar/web.xml

The following parameter may need tuning:

● Set the development parameter to false in a production installation.

Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata information. When it runs reports, it accesses the data sources to get the report information. Data Analyzer keeps a pool of database connections for the repository. It also keeps a separate database connection pool for each data source. To optimize Data Analyzer database connections, you can tune the database connection pools.

Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss configuration file:

/server/informatica/deploy/_ds.xml

The name of the file includes the database type, which can be Oracle, DB2, or another database. For example, for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data Analyzer, the configuration file may simply be named DataAnalyzer-ds.xml. The following is a typical configuration:


    jndi-name: jdbc/IASDataSource
    connection-url: jdbc:informatica:oracle://aries:1521;SID=prfbase8
    driver-class: com.informatica.jdbc.oracle.OracleDriver
    user-name: powera
    password: powera
    exception-sorter-class-name: org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter
    min-pool-size: 5
    max-pool-size: 50
    blocking-timeout-millis: 5000
    idle-timeout-minutes: 1500

The following parameters may need tuning:

● min-pool-size. The minimum number of connections in the pool. (The pool is lazily constructed, that is, it is empty until it is first accessed. Once used, it always has at least min-pool-size connections.)
● max-pool-size. The strict maximum size of the connection pool.
● blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a connection when no more free connections are available in the pool.

● idle-timeout-minutes. The length of time an idle connection remains in the pool before it is closed.

The max-pool-size value is recommended to be at least five more than the maximum number of concurrent users, because there may be several scheduled reports running in the background and each of them needs a database connection. A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository very frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out; checking for idle connections may block other threads that require new connections.

Data Source Database Connection Pool. Similar to the repository database connection pool, each data source also has a pool of connections that Data Analyzer dynamically creates as soon as the first client requests a connection. The tuning parameters for these dynamic pools are present in the following file:

/bin/IAS.properties

The following is a typical configuration:

    # Datasource definition
    dynapool.initialCapacity=5
    dynapool.maxCapacity=50
    dynapool.capacityIncrement=2
    dynapool.allowShrinking=true
    dynapool.shrinkPeriodMins=20
    dynapool.waitForConnection=true
    dynapool.waitSec=1
    dynapool.poolNamePrefix=IAS_
    dynapool.refreshTestMinutes=60
    datamart.defaultRowPrefetch=20

The following JBoss-specific parameters may need tuning:

● dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
● dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.
● dynapool.poolNamePrefix. A prefix added to the dynamic JDBC pool name for identification purposes.
● dynapool.waitSec. The maximum amount of time (in seconds) a client waits to grab a connection from the pool if none is readily available.
● dynapool.refreshTestMinutes. Determines the frequency at which a health check is performed on the idle connections in the pool. This should not be performed too frequently, because it locks up the connection pool and may prevent other clients from grabbing connections from the pool.
● dynapool.shrinkPeriodMins. Determines the amount of time (in minutes) an idle connection is allowed to remain in the pool. After this period, the number of connections in the pool shrinks back to the value of the initialCapacity parameter. This is done only if the allowShrinking parameter is set to true.

EJB Container Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time functionalities. Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune the EJB pool parameters in the following file: /server/Informatica/conf/standardjboss.xml. The following is a typical configuration: Standard Stateless SessionBean false stateless-rmi-invoker org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor org.jboss.ejb.plugins.LogInterceptor INFORMATICA CONFIDENTIAL

BEST PRACTICES

780 of 954

org.jboss.ejb.plugins.SecurityInterceptor org.jboss.ejb.plugins.TxInterceptorCMT org.jboss.ejb.plugins.MetricsInterceptor org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor org.jboss.ejb.plugins.TxInterceptorBMT org.jboss.ejb.plugins.MetricsInterceptor org.jboss.resource.connectionmanager.CachedConnectionInterceptor org.jboss.ejb.plugins.StatelessSessionInstancePool 100 The following parameter may need tuning: ●

MaximumSize. Represents the maximum number of objects in the pool. If is set to true, then is a strict upper limit for the number of objects that can be created. If is set to false, the number of active objects can exceed the if there are requests for more objects. However, only the number of objects can be returned to the pool. ●

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for highconcurrency installations. ●

strictMaximumSize. When the value is set to true, the enforces a rule that only number of objects can be active. Any subsequent requests must wait for an object to be returned to the pool. ●

strictTimeout. If you set to true, then is the amount of time that requests wait for an object to be made available in the pool. Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the MDB when they are available.


To tune the MDB parameters, modify the following configuration file: /server/informatica/conf/standardjboss.xml

The following is a typical configuration:

Standard Message Driven Bean
false
message-driven-bean
org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
org.jboss.ejb.plugins.LogInterceptor
org.jboss.ejb.plugins.RunAsSecurityInterceptor
org.jboss.ejb.plugins.TxInterceptorCMT
org.jboss.ejb.plugins.MetricsInterceptor
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT
org.jboss.ejb.plugins.MetricsInterceptor
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
org.jboss.ejb.plugins.MessageDrivenInstancePool
100

The following parameter may need tuning:

● MaximumSize. Represents the maximum number of objects in the pool. If strictMaximumSize is set to true, then MaximumSize is a strict upper limit for the number of objects that can be created. If strictMaximumSize is set to false, the number of active objects can exceed MaximumSize if there are requests for more objects. However, only MaximumSize objects can be returned to the pool.

Additionally, there are two other parameters that you can set to fine-tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.

● strictMaximumSize. When set to true, this parameter enforces a rule that only MaximumSize objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
● strictTimeout. If you set strictMaximumSize to true, then strictTimeout is the amount of time that requests wait for an object to be made available in the pool.

Enterprise Java Beans (EJB). Data Analyzer EJBs use BMP (bean-managed persistence) as opposed to CMP (container-managed persistence). The EJB tuning parameters are very similar to the stateless session bean tuning parameters and are in the following configuration file: /server/informatica/conf/standardjboss.xml

The following is a typical configuration:

Standard BMP EntityBean
false
entity-rmi-invoker
false
org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
org.jboss.ejb.plugins.LogInterceptor
org.jboss.ejb.plugins.SecurityInterceptor
org.jboss.ejb.plugins.TxInterceptorCMT
org.jboss.ejb.plugins.MetricsInterceptor
org.jboss.ejb.plugins.EntityCreationInterceptor
org.jboss.ejb.plugins.EntityLockInterceptor
org.jboss.ejb.plugins.EntityInstanceInterceptor
org.jboss.ejb.plugins.EntityReentranceInterceptor
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
org.jboss.ejb.plugins.EntitySynchronizationInterceptor
org.jboss.ejb.plugins.EntityInstancePool
org.jboss.ejb.plugins.EntityInstanceCache
org.jboss.ejb.plugins.BMPPersistenceManager
org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock
org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy
50 1000000 300 600 400 60 1 0.75 100 A

The following parameter may need tuning:

● MaximumSize. Represents the maximum number of objects in the pool. If strictMaximumSize is set to true, then MaximumSize is a strict upper limit for the number of objects that can be created. If strictMaximumSize is set to false, the number of active objects can exceed MaximumSize if there are requests for more objects. However, only MaximumSize objects are returned to the pool.

Additionally, there are two other parameters that you can set to fine-tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.

● strictMaximumSize. When set to true, this parameter enforces a rule that only MaximumSize objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
● strictTimeout. If you set strictMaximumSize to true, then strictTimeout is the amount of time that requests will wait for an object to be made available in the pool.

RMI Pool The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other custom applications, you can optimize the RMI thread pool parameters. To optimize the RMI pool, modify the following configuration file: /server/informatica/conf/jboss-service.xml The following is a typical configuration:


1 300 300 60000 0 0 false jboss:service=TransactionManager

The following parameters may need tuning:

● NumAcceptThreads. The number of threads used to accept connections from clients.
● MaxPoolSize. A strict maximum size for the pool of threads that service requests on the server.
● ClientMaxPoolSize. A strict maximum size for the pool of threads that service requests on the client.
● Backlog. The number of requests held in the queue when all the processing threads are in use.
● EnableTcpDelay. Indicates whether information should be sent before the buffer is full. Setting it to true may increase network traffic because more packets are sent across the network.

WebSphere Application Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior of some of these parameters and to arrive at good settings.

Web Container

Navigate to “Application Servers > [your_server_instance] > Web Container > Thread Pool” to tune the following parameters.

● Minimum Size: Specifies the minimum number of threads to allow in the pool. The default value of 10 is appropriate.
● Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3-VM load-balanced configuration), a value of 50-60 has been determined to be optimal.
● Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a thread is reclaimed. The default of 3500ms is considered optimal.
● Is Growable: Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Be sure to leave this option unchecked, so that the number of threads is hard-limited to the value given in “Maximum Size”.

Note: In a load-balanced environment, there is likely to be more than one server instance, possibly spread across multiple machines. In such a scenario, be sure that the changes have been properly propagated to all of the server instances.

Transaction Services

Total transaction lifetime timeout: In certain circumstances (e.g., the import of large XML files), the default value of 120 seconds may not be sufficient and should be increased. This parameter can also be modified at runtime.

Diagnostic Trace Services

● Disable the trace in a production environment.
● Navigate to “Application Servers > [your_server_instance] > Administration Services > Diagnostic Trace Service” and make sure “Enable Tracing” is not checked.

Debugging Services

Ensure that tracing is disabled in a production environment. Navigate to “Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service > Debugging Service” and make sure “Startup” is not checked.

Performance Monitoring Services

This set of parameters is for monitoring the health of the application server. The monitoring service pings the application server at a set interval; if the server is found to be dead, it then tries to restart the server. Navigate to “Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy” and tune the parameters according to a policy determined for each Data Analyzer installation.

Note: The “Ping Timeout” parameter determines the time after which a lack of response from the server implies that it is faulty. The monitoring service then attempts to kill the server and restart it if “Automatic restart” is checked. Take care that “Ping Timeout” is not set to too small a value.

Process Definitions (JVM Parameters)

For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the minimum and the maximum heap size be set to the same value. This avoids the heap allocation-reallocation expense during a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the minimum and maximum heap sizes to at least 1000MB. Further tuning of the heap size is recommended after carefully studying the garbage collection behavior by turning on the verbosegc option.

The following is a list of Java parameters (for IBM JVM 1.4.1) that should not be modified from their default values for a Data Analyzer installation:

● -Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap compaction results in heap fragmentation and, since Data Analyzer frequently allocates large objects, heap fragmentation can result in OutOfMemory exceptions.
● -Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out compaction, regardless of whether it is useful.
● -Xgcthreads. This controls the number of garbage collection helper threads created by the JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during garbage collection.
● -Xclassnogc. This disables collection of class objects.
● -Xinitsh. This sets the initial size of the application-class system heap. The system heap is expanded as needed and is never garbage collected.

You may want to alter the following parameters after carefully examining the application server processes. Navigate to “Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine”:

● Verbose garbage collection. Check this option to turn on verbose garbage collection. This can help in understanding the behavior of garbage collection for the application. It has a very low overhead on performance and can be turned on even in a production environment.
● Initial heap size. This is the -ms value. Only the numeric value (without MB) needs to be specified. For concurrent usage, the initial heap size should be started at 1000 and, depending on the garbage collection behavior, can potentially be increased up to 2000. A value beyond 2000 may actually reduce throughput because the garbage collection cycles take more time to go through the larger heap, even though the cycles may occur less frequently.
● Maximum heap size. This is the -mx value. It should be equal to the “Initial heap size” value.
● RunHProf. This should remain unchecked in production mode, because it slows down the VM considerably.
● Debug Mode. This should remain unchecked in production mode, because it slows down the VM considerably.
● Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).
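For reference, a minimal sketch of how these heap settings translate into generic JVM arguments is shown below. This is an illustration only: the console fields above set these values for you, the 1000MB figure simply follows the recommendation in this section, and the exact flag spellings (for example -Xms/-Xmx versus -ms/-mx) should be verified against the IBM JVM version in use.

-ms1000m -mx1000m -verbosegc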

Performance Monitoring Services Be sure that performance monitoring services are not enabled in a production environment. Navigate to “Application Servers > [your_server_instance] > Performance Monitoring Services“ and be sure “Startup” is not checked.

Database Connection Pool

The repository database connection pool can be configured by navigating to “JDBC Providers > User-defined JDBC Provider > Data Sources > IASDataSource > Connection Pools”. The various parameters that may need tuning are:

● Connection Timeout. The default value of 180 seconds should be adequate. This implies that after 180 seconds, a request to grab a connection from the pool times out and Data Analyzer throws an exception. If this happens, the pool size may need to be increased.
● Max Connections. The maximum number of connections in the pool. Informatica recommends a value of 50.
● Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10.
● Reap Time. Specifies the interval between runs of the pool maintenance thread. The maintenance thread should not run too frequently because, while it is running, it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, this should have a very high value (e.g., 1000).
● Unused Timeout. Specifies the time in seconds after which an unused connection is discarded, until the pool size reaches the minimum size. In a highly concurrent usage scenario, this should be a high value. The default of 1800 seconds should be fine.
● Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database and the network are stable, there should not be a reason for an aged timeout. The default is 0 (i.e., connections do not age). If the database or the network connection to the repository database frequently goes down (compared to the life of the application server), this setting can be used to age out stale connections.

Much like the repository database connection pool, the data source or data warehouse databases also have a pool of connections that are created dynamically by Data Analyzer as soon as the first client makes a request. The tuning parameters for these dynamic pools are in the /AppServer/IAS.properties file. The following is a typical configuration:

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20


The various parameters that may need tuning are:

● dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
● dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.
● dynapool.poolNamePrefix. A prefix added to the dynamic JDBC pool name for identification purposes.
● dynapool.waitSec. The maximum amount of time (in seconds) that a client waits to grab a connection from the pool if none is readily available.
● dynapool.refreshTestMinutes. Determines the frequency at which a health check is performed on the idle connections in the pool. Such checks should not be performed too frequently because they lock up the connection pool and may prevent other clients from grabbing connections from the pool.
● dynapool.shrinkPeriodMins. Determines the amount of time (in minutes) an idle connection is allowed to remain in the pool. After this period, the number of connections in the pool decreases back to its initialCapacity. This is done only if allowShrinking is set to true.

Message Listener Services

To process scheduled reports, Data Analyzer uses message-driven beans. It is possible to run multiple reports within one schedule in parallel by increasing the number of instances of the MDB that serves the scheduler (InfScheduleMDB). Take care, however, not to increase this to an arbitrarily high value: each report consumes considerable resources (e.g., database connections and CPU processing at both the application server and database server levels), and setting it very high may actually be detrimental to the whole system. Navigate to “Application Servers > [your_server_instance] > Message Listener Service > Listener Ports > IAS_ScheduleMDB_ListenerPort”. The parameters that can be tuned are:

● Maximum sessions. The default value is one. In a highly concurrent user scenario, Informatica does not recommend going beyond five.
● Maximum messages. This should remain at one. This implies that each report in a schedule is executed in a separate transaction instead of in a batch. Setting it to more than one may have unwanted effects, such as transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.

Plug-in Retry Intervals and Connect Timeouts

When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform load-balancing between the servers in the cluster. The proxy HTTP server sends each request to the plug-in, and the plug-in then routes the request to the proper application server. The plug-in file can be generated automatically by navigating to “Environment > Update web server plugin configuration”. The default plug-in file contains ConnectTimeout=0, which means that it relies on the TCP timeout setting of the server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies that if the server does not respond within the given number of seconds, it is marked as down and the request is sent to the next available member of the cluster. The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down. The default value is 10 seconds, which means that if a cluster member is marked as down, the plug-in does not try to send a request to that member for 10 seconds.
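As an illustration, a fragment of a WebSphere plugin-cfg.xml using these two settings might look like the sketch below. The cluster, server, and host names are hypothetical; only the ConnectTimeout and RetryInterval attributes relate to the settings discussed here, and the generated file for a real installation contains many more elements.

<ServerCluster Name="DataAnalyzerCluster" RetryInterval="60">
   <Server Name="Server1" ConnectTimeout="10">
      <Transport Hostname="apphost1" Port="9080" Protocol="http"/>
   </Server>
</ServerCluster>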


Tuning Mappings for Better Performance

Challenge

In general, mapping-level optimization takes time to implement, but it can significantly boost performance. Sometimes the mapping is the biggest bottleneck in the load process because business rules determine the number and complexity of transformations in a mapping. Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic issues. Mapping tuning techniques fall into two groups. The first group can be of assistance almost universally, bringing about a performance increase in most scenarios. The second group of tuning processes may yield only a small performance increase, or can be of significant value, depending on the situation. Factors to consider when choosing tuning processes at the mapping level include the specific environment, software/hardware limitations, and the number of rows going through a mapping. This Best Practice offers some guidelines for tuning mappings.

Description Analyze mappings for tuning only after you have tuned the target and source for peak performance. To optimize mappings, you generally reduce the number of transformations in the mapping and delete unnecessary links between transformations. For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Doing so can reduce the amount of data the transformations store in the data cache. Having too many Lookups and Aggregators can encumber performance because each requires index cache and data cache. Since both are fighting for memory space, decreasing the number of these transformations in a mapping can help improve speed. Splitting them up into different mappings is another option. Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.

Consider Single-Pass Reading If several mappings use the same data source, consider a single-pass reading. If you have several sessions that use the same sources, consolidate the separate mappings with either a single Source Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the separate data flows. Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that function is called in the session. For example, if you need to subtract percentage from the PRICE ports for both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage before splitting the pipeline.


Optimize SQL Overrides

When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which, and how, the SQL can be tuned depends on the underlying source or target database system. See Tuning SQL Overrides and Environment for Better Performance for more information.

Scrutinize Datatype Conversions PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary. In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.

Eliminate Transformation Errors Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log. Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. If you need to run a session that generates a large number of transformation errors, you might improve performance by setting a lower tracing level. However, this is not a long-term response to transformation errors. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations

There are several ways to optimize Lookup transformations that are set up in a mapping.

When to Cache Lookups

Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis.

Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches will not page to disk. Information regarding memory and cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for Better Performance.

A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard to the number of rows expected to be processed. Consider the following example.


In Mapping X, the source and lookup tables contain the following number of records:

ITEMS (source):    5000 records
MANUFACTURER:       200 records
DIM_ITEMS:       100000 records

Number of Disk Reads

                            Cached Lookup    Un-cached Lookup
LKP_Manufacturer
   Build Cache                        200                   0
   Read Source Records               5000                5000
   Execute Lookup                       0                5000
   Total # of Disk Reads             5200               10000
LKP_DIM_ITEMS
   Build Cache                     100000                   0
   Read Source Records               5000                5000
   Execute Lookup                       0                5000
   Total # of Disk Reads           105000               10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5,200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed, so this lookup should be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup is executed, so the lookup should not be cached.

Use the following eight-step method to determine if a lookup should be cached:

1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a "where" clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point, where X is the breakeven point: (LS*NRS*CRS)/(CRS-NRS) = X

If the number of expected source records is less than X, it is better not to cache the lookup. If the number of expected source records is more than X, it is better to cache the lookup.

For example, assume the lookup takes 166 seconds to cache (LS=166), the load with a cached lookup runs at 232 rows per second (CRS=232), and the load with a non-cached lookup runs at 147 rows per second (NRS=147). The formula gives: (166*147*232)/(232-147) = 66,603. Thus, if the source has fewer than 66,603 records, the lookup should not be cached; if it has more than 66,603 records, the lookup should be cached.

Sharing Lookup Caches

There are a number of methods for sharing lookup caches:

● Within a specific session run for a mapping, if the same lookup is used multiple times, the PowerCenter Server re-uses the cache for the multiple instances of the lookup. Even so, using the same lookup multiple times in the mapping is more resource intensive with each successive instance.
● If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to set up the multiple lookups to bring back the same columns, even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.
● Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of from the database. This feature should only be used when the lookup table is not expected to change between session runs.
● Across different mappings and sessions, the use of a named persistent cache allows sharing an existing cache file.

Reducing the Number of Cached Rows There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the WHERE clause to reduce the set of records included in the resulting cache. Note: If you use a SQL override in a lookup, the lookup must be cached.
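For instance, a lookup SQL override that restricts the cache to current records might look like the sketch below. The table and column names are hypothetical, the lookup must remain cached, and the selected columns must still match the lookup ports.

SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM CUSTOMER_DIM
WHERE ACTIVE_FLAG = 'Y'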

Optimizing the Lookup Condition In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first in order to optimize lookup performance.

Indexing the Lookup Table

The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and un-cached lookups.

● In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log contains the ORDER BY statement.
● In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.
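As a hypothetical illustration (the table, column, and index names below are invented for the example), indexing the lookup condition column of a customer lookup could be as simple as:

CREATE INDEX IDX_CUSTOMER_DIM_ID ON CUSTOMER_DIM (CUSTOMER_ID);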

Use a Persistent Lookup Cache for Static Lookups If the lookup source does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup source.

Optimize Filter and Router Transformations

Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance.

Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition.

Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved.
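As a simple illustration (the port names here are hypothetical), a filter condition that tests a numeric flag is cheaper to evaluate for every row than one that manipulates strings:

Instead of:  UPPER(LTRIM(RTRIM(ORDER_STATUS))) = 'CLOSED'
Prefer:      CLOSED_FLAG = 1

where CLOSED_FLAG is a numeric port populated upstream, for example in the Source Qualifier SQL or an Expression transformation.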


Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.

Optimize Aggregator Transformations

Aggregator Transformations often slow performance because they must group data before processing it. Use simple columns in the group by condition to make the Aggregator Transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an Aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the Sorted Input option is usually accompanied by a Source Qualifier that uses the Number of Sorted Ports option.

Use an Expression and an Update Strategy instead of an Aggregator Transformation (see the sketch below). This technique can only be used if the source data can be sorted. Further, using this option assumes that a mapping is using an Aggregator with the Sorted Input option. In the Expression Transformation, variable ports are required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is a part of the current group or is the beginning of a new group. If the row is a part of the current group, its data is used to continue calculating the current group function. An Update Strategy Transformation follows the Expression Transformation and sets the first row of a new group to insert and the following rows to update.

Use incremental aggregation if the changes captured from the source affect less than half of the target. When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The PowerCenter Server updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.
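A minimal sketch of the Expression/Update Strategy technique referenced above, assuming the data is sorted by a hypothetical group key DEPT_ID and that SALES is the value being summed (all port names are invented for the example):

Expression transformation (variable ports evaluated in order, so v_PREV_DEPT still holds the previous row's key when v_NEW_GROUP is calculated):
   v_NEW_GROUP   (variable) = IIF(DEPT_ID = v_PREV_DEPT, 0, 1)
   v_RUN_TOTAL   (variable) = IIF(v_NEW_GROUP = 1, SALES, v_RUN_TOTAL + SALES)
   v_PREV_DEPT   (variable) = DEPT_ID
   o_NEW_GROUP   (output)   = v_NEW_GROUP
   o_RUN_TOTAL   (output)   = v_RUN_TOTAL

Update Strategy expression:
   IIF(o_NEW_GROUP = 1, DD_INSERT, DD_UPDATE)

The first row of each group is inserted into the target, and each subsequent row of the same group updates that target row with the new running total.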

Joiner Transformation

Joining Data from the Same Source

You can join data from the same source in the following ways:

● Join two branches of the same pipeline.
● Create two instances of the same source and join pipelines from these source instances.

You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping.

When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input. If you want to join unsorted data, you must create two instances of the same source and join the pipelines.

For example, you may have a source with the following ports:

● Employee
● Department
● Total Sales

In the target table, you want to view the employees who generated sales that were greater than the average sales for their respective departments. To accomplish this, you create a mapping with the following transformations:

● Sorter transformation. Sort the data.
● Sorted Aggregator transformation. Average the sales data and group by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.
● Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted aggregated data with the original data.
● Filter transformation. Compare the average sales data against the sales data for each employee and filter out employees whose sales are not above the department average.

Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformations. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.

Joining two branches can affect performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.

You can also join same-source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances.

Note: When you join data using this method, the PowerCenter Server reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:

● Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.
● Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
● Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
● Join two instances of a source if one pipeline may process much more slowly than the other pipeline.

Performance Tips

Use the database to do the join when sourcing data from the same database schema. Database systems can usually perform the join more quickly than the PowerCenter Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.

Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.

Join sorted data when possible. You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.

For an unsorted Joiner transformation, designate the source with fewer rows as the master source. For optimal performance and disk storage, designate the master source as the source with fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.

For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master source. For optimal performance and disk storage, designate the master source as the source with fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.

Optimize sorted Joiner transformations with partitions. When you use partitions with a sorted Joiner transformation, you may optimize performance by grouping data and using n:n partitions.

Add a hash auto-keys partition upstream of the sort origin

To obtain expected results and get the best performance when partitioning a sorted Joiner transformation, you must group and sort data. To group data, ensure that rows with the same key value are routed to the same partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort the data ensures that you maintain grouping and sort the data within each group.

Use n:n partitions

You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not need to cache all of the master data. This reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline, it must then read the data from disk to compare the master and detail pipelines.

Optimize Sequence Generator Transformations

Sequence Generator transformations need to determine the next available sequence number; thus, increasing the Number of Cached Values property can increase performance. This property determines the number of values the PowerCenter Server caches at one time. If it is set to cache no values, the PowerCenter Server must query the repository each time to determine the next number to be used. Consider configuring the Number of Cached Values to a value greater than 1000. Note that any cached values not used in the course of a session are lost, since the sequence value stored in the repository is advanced each time a new block of values is cached.

Avoid External Procedure Transformations For the most part, making calls to external procedures slows a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures, and Advanced External Procedures.

Field-Level Transformation Optimization

As a final step in the tuning process, you can tune expressions used in transformations. When examining expressions, focus on complex expressions and try to simplify them when possible. To help isolate slow expressions, do the following:

1. Time the session with the original expressions.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.

Processing field level transformations takes time. If the transformation expressions are complex, then processing is even slower. It’s often possible to get a 10 to 20 percent performance improvement by optimizing complex field level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.

Factoring Out Common Logic Factoring out common logic can reduce the number of times a mapping performs the same logic. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be performed just once. For example, a mapping has five target tables. Each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.

Minimize Function Calls

Anytime a function is called it takes resources to process. There are several common examples where function calls can be reduced or eliminated.

Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the PowerCenter Server must search and group the data. Thus, the following expression:

SUM(Column A) + SUM(Column B)

can be optimized to:

SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used whenever possible. For example, an expression that uses a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))

can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression has 8 IIFs, 16 ANDs, and 24 comparisons. The optimized expression has 3 IIFs, 3 comparisons, and 2 additions.


Be creative in making expressions more efficient. The following is an example of reworking an expression to reduce three comparisons to one:

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

Calculate Once, Use Many Times Avoid calculating or testing the same value multiple times. If the same sub-expression is used several times in a transformation, consider making the sub-expression a local variable. The local variable can be used only within the transformation in which it was created. Calculating the variable only once and then referencing the variable in following sub-expressions improves performance.

Choose Numeric vs. String Operations The PowerCenter Server processes numeric operations faster than string operations. For example, if a lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing Char-Char and Char-Varchar Comparisons

When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. To resolve this, use the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that the Server does not trim trailing spaces from the end of CHAR source fields.

Use DECODE Instead of LOOKUP When a LOOKUP function is used, the PowerCenter Server must lookup a table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself so the server does not need to lookup a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.
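For example, a small, static code set can be translated inline with DECODE rather than with a lookup against a reference table (the codes and descriptions below are purely illustrative):

DECODE(STATE_CODE,
       'CA', 'California',
       'NY', 'New York',
       'TX', 'Texas',
       'Unknown')

Because the values are embedded in the expression, no database query or lookup cache is needed for this translation.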

Reduce the Number of Transformations in a Mapping Because there is always overhead involved in moving data among transformations, try, whenever possible, to reduce the number of transformations. Also, resolve unnecessary links between transformations to minimize the amount of data moved. This is especially important with data being pulled from the Source Qualifier Transformation.

Use Pre- and Post-Session SQL Commands

You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier transformation and in the Properties tab of the target instance in a mapping. To increase the load speed, use these commands to drop indexes on the target before the session runs, then recreate them when the session completes.

Apply the following guidelines when using SQL statements:

● You can use any command that is valid for the database type. However, the PowerCenter Server does not allow nested comments, even though the database may.
● You can use mapping parameters and variables in SQL executed against the source, but not against the target.
● Use a semi-colon (;) to separate multiple statements.
● The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ...*/.
● If you need to use a semi-colon outside of quotes or comments, you can escape it with a back slash (\).
● The Workflow Manager does not validate the SQL.
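As a sketch of the drop/recreate pattern described above (the index and table names are hypothetical, and the exact DDL syntax depends on the target database):

Pre-session SQL:
DROP INDEX IDX_SALES_FACT_DATE;

Post-session SQL:
CREATE INDEX IDX_SALES_FACT_DATE ON SALES_FACT (SALE_DATE);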

Use Environmental SQL For relational databases, you can execute SQL commands in the database environment when connecting to the database. You can use this for source, target, lookup, and stored procedure connections. For instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the guidelines listed above for using the SQL statements.
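For example, on an Oracle connection, environment SQL along the lines of the statement below could be used to set the session isolation level. This is an illustration only; the appropriate statement and isolation level depend on the database in use and the deadlock scenario being addressed.

ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED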

Use Local Variables You can use local variables in Aggregator, Expression, and Rank transformations.

Temporarily Store Data and Simplify Complex Expressions

Rather than parsing and validating the same expression each time, you can define these components as variables. This also allows you to simplify complex expressions. For example, the following expressions:

AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )

can use variables to simplify complex expressions and temporarily store data:

Port            Value
V_CONDITION1    JOB_STATUS = 'Full-time'
V_CONDITION2    OFFICE_ID = 1000
AVG_SALARY      AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )
SUM_SALARY      SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )


Store Values Across Rows

You can use variables to store data from prior rows. This can help you perform procedural calculations. For example, to compare the previous state to the state just read:

IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )

Capture Values from Stored Procedures Variables also provide a way to capture multiple columns of return values from stored procedures.

Last updated: 13-Feb-07 17:43


Tuning Sessions for Better Performance

Challenge

Running sessions is where the pedal hits the metal. A common misconception is that this is the area where most tuning should occur. While it is true that various session options can be modified to improve performance, PowerCenter 8 also provides the PowerCenter Enterprise Grid Option and pushdown optimization, which can improve performance tremendously.

Description

Once you have optimized the source and target databases and the mapping, you can focus on optimizing the session. The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter, and Lookup transformations (with caching enabled) use caches. The PowerCenter Server uses index and data caches for each of these transformations. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.

You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The transformation_readfromdisk or transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation. Index and data caches should both be sized according to the requirements of the individual lookup. The sizing can be done using the estimation tools provided in the Transformation Guide, or through observation of actual cache sizes in the session caching directory.

The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of transformation] [generated session instance id number] _ [transformation instance id number] _ [partition index].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. A RAID 0 arrangement, which gives maximum performance with no redundancy, is recommended for volatile cache file directories (i.e., no persistent caches).

If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, the RAM allocated needs to be available on the server. If the server does not have available RAM and uses paged memory, your session is again accessing the hard disk. In this case, it is more efficient to allow PowerCenter to page the data rather than the operating system. Adding additional memory to the server is, of course, the best solution. Refer to Session Caches in the Workflow Administration Guide for detailed information on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in the following cases:

● The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.
● The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the PowerCenter Server runs the session for the first time.
● The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.
● The Data Transformation Manager (DTM) process in a session runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.

When a session is running, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.

Configuring Automatic Memory Settings

PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you can configure the Integration Service to automatically calculate cache memory settings at run time. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. The values stored in the data and index caches depend upon the requirements of the transformation. For example, the Aggregator index cache stores group values as configured in the group by ports, and the data cache stores calculations based on the group by ports. When the Integration Service processes a Sorter transformation or writes data to an XML target, it also creates a cache.

Configuring Session Cache Memory

The Integration Service can determine cache memory requirements for the Lookup, Aggregator, Rank, Joiner, and Sorter transformations, as well as for XML targets. You can configure Auto for the index and data cache sizes in the transformation properties or on the Mapping tab of the session properties.

Max Memory Limits

Configuring maximum memory limits allows you to ensure that you reserve a designated amount or percentage of memory for other processes. You can configure the memory limit as a numeric value and as a percent of total memory. Because available memory varies, the Integration Service bases the percentage value on the total memory on the Integration Service process machine.

For example, suppose you configure automatic caching for three Lookup transformations in a session and then configure a maximum memory limit of 500MB for the session. When you run the session, the Integration Service divides the 500MB of allocated memory among the index and data caches for the Lookup transformations. When you configure a maximum memory value, the Integration Service divides memory among transformation caches based on the transformation type. When you configure both a numeric value and a percentage, the Integration Service compares the values and uses the lower value as the maximum memory limit.

When you configure automatic memory settings, the Integration Service specifies a minimum memory allocation for the index and data caches. The Integration Service allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for each transformation instance. If you configure a maximum memory limit that is less than the minimum value for an index or data cache, the Integration Service overrides this value. For example, if you configure a maximum memory value of 500 bytes for a session containing a Lookup transformation, the Integration Service overrides the automatic memory settings and uses the default values.

When you run a session on a grid and you configure Maximum Memory Allowed for Auto Memory Attributes, the Integration Service divides the allocated memory among all the nodes in the grid. When you configure Maximum Percentage of Total Memory Allowed for Auto Memory Attributes, the Integration Service allocates the specified percentage of memory on each node in the grid.

Aggregator Caches

Keep the following items in mind when configuring the aggregate memory cache sizes:

● Allocate at least enough space to hold at least one row in each aggregate group.
● Remember that you only need to configure cache memory for an Aggregator transformation that does not use sorted ports. The PowerCenter Server uses Session Process memory to process an Aggregator transformation with sorted ports, not cache memory.
● Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run (see the sketch of an incremental source filter after this list).
● When configuring the Aggregator data cache size, remember that the data cache holds row data for variable ports and connected output ports only. As a result, the data cache is generally larger than the index cache. To reduce the data cache size, connect only the necessary output ports to subsequent transformations.
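A minimal sketch of such an incremental read, assuming a hypothetical source table SALES_ORDER with an UPDATE_DATE column and an audit table similar to the one used in the SQL examples later in this document (the object names are illustrative, not prescribed):

SELECT S.ORDER_ID, S.PRODUCT_ID, S.ORDER_AMT, S.UPDATE_DATE
FROM SALES_ORDER S
WHERE S.UPDATE_DATE > (SELECT MAX(L.LOAD_DATE)
FROM ETL_AUDIT_LOG L
WHERE L.LOAD_STATUS = 'P')

Only the rows changed since the last successful load reach the Aggregator, so the saved PMAGG* caches are updated incrementally rather than rebuilt.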

Joiner Caches

When a session is run with a Joiner transformation, the PowerCenter Server reads from the master and detail sources concurrently and builds index and data caches based on the master rows. The PowerCenter Server then performs the join based on the detail source data and the cache data. The number of rows the PowerCenter Server stores in the cache depends on the partitioning scheme, the data in the master source, and whether or not you use sorted input.

After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. The PowerCenter Server uses the index cache to test the join condition. When it finds source data and cache data that match, it retrieves row values from the data cache.

Lookup Caches

Several options can be explored when dealing with Lookup transformation caches:

● Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session with a persistent cache lookup is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, be sure to set the Recache from Database option to ensure that the lookup cache files are rebuilt. You can also delete the cache files before the session run to force the session to rebuild the caches.
● Lookup caching should be enabled for relatively small tables. Refer to the Best Practice Tuning Mappings for Better Performance to determine when lookups should be cached.
● When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured not to cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can usually increase session performance.
● Just as for a Joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.

Allocating Buffer Memory

The Integration Service can determine the memory requirements for the buffer memory:

● DTM Buffer Size
● Default Buffer Block Size

You can also configure the DTM buffer size and the default buffer block size in the session properties. When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks. To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size based on the default settings, to create the required number of session blocks. If there are XML sources or targets in the mappings, use the number of groups in the XML source or target in the calculation of the total number of sources and targets.
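A rough worked example of that calculation (hedged: the guideline formulas here follow the PowerCenter Workflow Administration Guide for this product generation, so confirm them against the documentation for your release): a mapping with two sources and three targets needs about (2 + 3) * 2 = 10 buffer blocks. The guide relates blocks to memory as (session buffer blocks) = 0.9 * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions), so with the default 64KB block size and one partition, 10 blocks call for a DTM buffer size of at least 10 * 64KB / 0.9, or roughly 712KB, which is comfortably within the default DTM buffer size.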

Increasing the DTM Buffer Pool Size

The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns.

If a session's performance details show low numbers for the source and target BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance. Increasing the DTM buffer memory allocation generally causes performance to improve initially and then level off. (Conversely, it may have no impact at all on source- or target-bottlenecked sessions and may not have an impact on DTM-bottlenecked sessions.)

When the DTM buffer memory allocation is increased, you need to evaluate the total memory available on the PowerCenter Server. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system. You can increase the DTM buffer size in the Performance settings of the Properties tab.

Running Workflows and Sessions Concurrently

The PowerCenter Server can process multiple sessions in parallel and can also process multiple partitions of a pipeline within a session. If you have a symmetric multiprocessing (SMP) platform, you can use multiple CPUs to concurrently process session data or partitions of data. This provides improved performance since true parallelism is achieved. On a single-processor platform, these tasks share the CPU, so there is no parallelism.


To achieve better performance, you can create a workflow that runs several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available.

Partitioning Sessions

Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. If you have PowerCenter partitioning available, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently.

When you create or edit a session, you can change the partitioning information for each pipeline in a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some pipelines and single partitions in others. Keep the following attributes in mind when specifying partitioning information for a pipeline:

● Location of partition points. The PowerCenter Server sets partition points at several transformations in a pipeline by default. If you have PowerCenter partitioning available, you can define other partition points. Select those transformations where you think redistributing the rows in a different way is likely to increase the performance considerably.
● Number of partitions. By default, the PowerCenter Server sets the number of partitions to one. You can generally define up to 64 partitions at any partition point. When you increase the number of partitions, you increase the number of processing threads, which can improve session performance. Increasing the number of partitions or partition points also increases the load on the server. If the server contains ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system. You can also overload source and target systems, so that is another consideration.
● Partition types. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:
1. Round-robin partitioning. PowerCenter distributes rows of data evenly to all partitions. Each partition processes approximately the same number of rows. In a pipeline that reads data from file sources of different sizes, you can use round-robin partitioning to ensure that each partition receives approximately the same number of rows.
2. Hash keys. The PowerCenter Server uses a hash function to group rows of data among partitions. The Server groups the data based on a partition key. There are two types of hash partitioning:
❍ Hash auto-keys. The PowerCenter Server uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.
❍ Hash user keys. The PowerCenter Server uses a hash function to group rows of data among partitions based on a user-defined partition key. You choose the ports that define the partition key.
3. Key range. The PowerCenter Server distributes rows of data based on a port or set of ports that you specify as the partition key. For each port, you define a range of values. The PowerCenter Server uses the key and ranges to send rows to the appropriate partition. Choose key range partitioning where the sources or targets in the pipeline are partitioned by key range.
4. Pass-through partitioning. The PowerCenter Server processes data without redistributing rows among partitions. Therefore, all rows in a single partition stay in that partition after crossing a pass-through partition point.
5. Database partitioning. You can optimize session performance by using the database partitioning partition type instead of the pass-through partition type for IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application, databases, and system for maximum single-partition performance, you can reconfigure your session to have two or more partitions to make your session utilize more of the hardware. Use the following tips when you add partitions to a session (a sketch of the per-partition SQL override technique mentioned below follows this list):

● Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.
● Set DTM buffer memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.
● Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the Number of Cached Values property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.
● Partition the source data evenly. Configure each partition to extract the same number of rows, or redistribute the data among partitions early using a partition point with round-robin partitioning. This is also a good way to prevent hammering of the source system: you could have a session with multiple partitions where one partition returns all the data and the override SQL in the other partitions is set to return zero rows (a "where 1 = 2" condition in the WHERE clause prevents any rows from being returned). Some source systems react better to multiple concurrent SQL queries; others prefer a smaller number of queries.
● Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), then performance may improve for this session by adding a partition.
● Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, check the system for hardware bottlenecks. Otherwise, check the database configuration.
● Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.
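A minimal sketch of the per-partition SQL override technique mentioned in the tips above, assuming a hypothetical ORDERS source read by a Source Qualifier with two partitions (names are illustrative only):

-- Partition 1 override: returns all of the data
SELECT ORDER_ID, CUSTOMER_ID, ORDER_AMT FROM ORDERS

-- Partition 2 override: the impossible condition returns zero rows,
-- so only one query actually hits the source system
SELECT ORDER_ID, CUSTOMER_ID, ORDER_AMT FROM ORDERS WHERE 1 = 2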

Increasing the Target Commit Interval

One method of resolving target database bottlenecks is to increase the commit interval. Each time the target database commits, performance slows. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve. When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate the larger number of rows. One of the major reasons that Informatica set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (i.e., you may get a database error such as "unable to extend rollback segments" in Oracle).
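A hedged sketch of the corresponding database-side change on Oracle (the segment name RBS01 and the storage figures are hypothetical, and recent Oracle releases manage undo automatically, so treat this only as an illustration of the rollback-segment sizing point above):

ALTER ROLLBACK SEGMENT RBS01 STORAGE (NEXT 10M MAXEXTENTS 500);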

Disabling High Precision

If a session runs with high precision enabled, disabling high precision may improve session performance. The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, you must configure the session so that the PowerCenter Server recognizes this datatype by selecting Enable High Precision in the session property sheet. However, because reading and manipulating a high-precision datatype (i.e., one with a precision of greater than 28) can slow the PowerCenter Server down, session performance may be improved by disabling decimal arithmetic. When you disable high precision, the PowerCenter Server reverts to using the Double datatype.

Reducing Error Tracking

If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount of data the PowerCenter Server writes to the session log. To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if Terse is not an acceptable level of detail, consider leaving the tracing level at Normal and focus your efforts on reducing the number of transformation errors. Note that the tracing level must be set to Normal in order to use the reject loading utility.

As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to Verbose Initialization or Verbose Data:

● Verbose Initialization logs initialization details in addition to Normal-level logging, the names of index and data files used, and detailed transformation statistics.
● Verbose Data logs each row that passes into the mapping. It also notes where the PowerCenter Server truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to Verbose Data, the PowerCenter Server writes row data for all rows in a block when it processes a transformation.

However, the Verbose Initialization and Verbose Data logging options significantly affect session performance. Do not use the Verbose tracing options except when testing sessions, and always remember to switch tracing back to Normal after testing is complete. The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors. PowerCenter uses the mapping tracing level when the session tracing level is set to None.

Pushdown Optimization

You can push transformation logic to the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration. When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view messages related to pushdown optimization.

Source-Side Pushdown Optimization Sessions

In source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target until it reaches a downstream transformation that cannot be pushed to the database. The Integration Service generates a SELECT statement based on the transformation logic up to the transformation it can push to the database. The Integration Service pushes all transformation logic that is valid to push to the database by executing the generated SQL statement at run time. Then, it reads the results of this SQL statement and continues to run the session. Similarly, if the Source Qualifier contains a SQL override, the Integration Service creates a view for the SQL override, generates a SELECT statement against that view, and runs the SELECT statement. When the session completes, the Integration Service drops the view from the database.
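A hedged sketch of the kind of SELECT that source-side pushdown can generate, assuming a hypothetical mapping that filters a CUSTOMER source and derives an upper-cased name in an Expression transformation (the actual SQL the Integration Service writes depends on the mapping, session, and database, so this is illustrative only):

SELECT CUSTOMER_ID, UPPER(CUSTOMER_NAME), COUNTRY_CODE
FROM CUSTOMER
WHERE COUNTRY_CODE = 'US'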

Target-Side Pushdown Optimization Sessions

When you run a session configured for target-side pushdown optimization, the Integration Service analyzes the mapping from the target to the source, or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database. Then, it executes the generated SQL.

Full Pushdown Optimization Sessions

To use full pushdown optimization, the source and target must be on the same database. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping from source to target and analyzes each transformation in the pipeline until it analyzes the target. It generates and executes the SQL on the sources and targets. When you run a session with full pushdown optimization, the database must run a long transaction if the session contains a large quantity of data. Consider the following database performance issues when you generate a long transaction:

● A long transaction uses more database resources.
● A long transaction locks the database for longer periods of time, thereby reducing database concurrency and increasing the likelihood of deadlock.
● A long transaction increases the likelihood that an unexpected event may occur.

The Rank transformation cannot be pushed to the database. If you configure the session for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source, pushes the Expression transformation and target to the target database, and processes the Rank transformation itself. The Integration Service does not fail the session if it can push only part of the transformation logic to the database and the session is configured for full optimization.
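A hedged sketch of the kind of statement full pushdown optimization can generate when the source and target are in the same database, assuming a hypothetical mapping that aggregates ORDERS into ORDER_SUMMARY (illustrative names only; the generated SQL depends on the mapping and session configuration):

INSERT INTO ORDER_SUMMARY (CUSTOMER_ID, TOTAL_AMT)
SELECT CUSTOMER_ID, SUM(ORDER_AMT)
FROM ORDERS
GROUP BY CUSTOMER_ID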

Using a Grid

You can use a grid to increase session and workflow performance. A grid is an alias assigned to a group of nodes that allows you to automate the distribution of workflows and sessions across nodes.


When you use a grid, the Integration Service distributes workflow tasks and session threads across multiple nodes. Running workflows and sessions on the nodes of a grid provides the following performance gains:

● Balances the Integration Service workload.
● Processes concurrent sessions faster.
● Processes partitions faster.

When you run a session on a grid, you improve scalability and performance by distributing session threads to multiple DTM processes running on nodes in the grid. To run a workflow or session on a grid, you assign resources to nodes, create and configure the grid, and configure the Integration Service to run on a grid.

Running a Session on a Grid

When you run a session on a grid, the master service process runs the workflow and workflow tasks, including the Scheduler. Because it runs on the master service process node, the Scheduler uses the date and time of the master service process node to start scheduled workflows. The Load Balancer distributes Command tasks as it does when you run a workflow on a grid. In addition, when the Load Balancer dispatches a Session task, it distributes the session threads to separate DTM processes.

The master service process starts a temporary preparer DTM process that fetches the session and prepares it to run. After the preparer DTM process prepares the session, it acts as the master DTM process, which monitors the DTM processes running on other nodes. The worker service processes start the worker DTM processes on other nodes. The worker DTM runs the session. Multiple worker DTM processes running on a node might be running multiple sessions or multiple partition groups from a single session, depending on the session configuration.

For example, you run a workflow on a grid that contains one Session task and one Command task. You also configure the session to run on the grid. When the Integration Service process runs the session on the grid, it performs the following tasks:

● On Node 1, the master service process runs workflow tasks. It also starts a temporary preparer DTM process, which becomes the master DTM process. The Load Balancer dispatches the Command task and session threads to nodes in the grid.
● On Node 2, the worker service process runs the Command task and starts the worker DTM processes that run the session threads.
● On Node 3, the worker service process starts the worker DTM processes that run the session threads.

For information about configuring and managing a grid, refer to the PowerCenter Administrator Guide and to the best practice PowerCenter Enterprise Grid Option. For information about how the DTM distributes session threads into partition groups, see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.

Last updated: 06-Dec-07 15:20


Tuning SQL Overrides and Environment for Better Performance

Challenge

Tuning SQL overrides and SQL queries within the Source Qualifier objects can improve performance in selecting data from source database tables, which positively impacts the overall session performance. This Best Practice explores ways to optimize a SQL query within the Source Qualifier object. The tips here can be applied to any PowerCenter mapping. While the SQL discussed here is executed in Oracle 8 and above, the techniques are generally applicable; however, specifics for other RDBMS products (e.g., SQL Server, Sybase, etc.) are not included.

Description

SQL Queries Performing Data Extractions

Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must look at the type of execution being forced by hints, the execution plan, the indexes on the query tables, the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.

DB2 COALESCE and Oracle NVL

When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the COALESCE function is used. Here is an example of the Oracle NVL function:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM exp.exp_bio_result bio, sar.sar_data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log
WHERE load_status = 'P')

Here is the same query in DB2:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log
WHERE load_status = 'P')


Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views

In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around this limitation. You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain: the logic is now in two places, in an Informatica mapping and in a database view.

You can also use in-line views, which are SELECT statements in the FROM or WHERE clause. This can help focus the query on a subset of data in the table and work more efficiently than a traditional join. Here is an example of an in-line view in the FROM clause:

SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,
N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM EXPERIMENT_PARAMETER R, NEW_GROUP_TMP TMP
WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID
) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID

Surmounting the Single SQL Statement Limitation in DB2: Using Common Table Expressions and the WITH Clause

A Common Table Expression (CTE) stores data in a temporary table during the execution of the SQL statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places in the query by specifying the query name. For example:

WITH maxseq AS
(SELECT MAX(seq_no) as seq_no FROM data_load_log WHERE load_status = 'P')
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = maxseq.seq_no

Here is another example using a WITH clause that uses recursive SQL (the join walks from each listed person down to his or her descendants, and an explicit LVL counter is carried along because DB2 has no built-in LEVEL pseudo-column):

WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS
(SELECT PERSON_ID, NAME, PARENT_ID, 1
FROM PARENT_CHILD
WHERE NAME IN ('FRED', 'SALLY', 'JIM')
UNION ALL
SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1
FROM PARENT_CHILD C, PERSON_TEMP RECURS
WHERE C.PARENT_ID = RECURS.PERSON_ID
AND RECURS.LVL < 5)
SELECT * FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of the parent. Pretty stupid since we all have two parents, but you get the idea. The LVL limit prevents infinite recursion.

CASE (DB2) vs. DECODE (Oracle)

The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single case, since it was the only legal way to test a condition in earlier versions. DECODE is not allowed in DB2. Because DECODE tests only for equality, range tests like the one below are typically expressed with the SIGN function. In Oracle:

SELECT EMPLOYEE, FNAME, LNAME,
DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',
DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',
'THE REST OF US')) AS COMMENT
FROM EMPLOYEE

In DB2:


SELECT EMPLOYEE, FNAME, LNAME,
CASE WHEN SALARY < 10000 THEN 'NEED RAISE'
WHEN SALARY > 1000000 THEN 'OVERPAID'
ELSE 'THE REST OF US' END AS COMMENT
FROM EMPLOYEE

Debugging Tip: Obtaining a Sample Subset

It is often useful to get a small sample of the data from a long-running query that returns a large set of data. The sampling logic can be commented out or removed after the query is put into general use. DB2 uses the FETCH FIRST n ROWS ONLY clause to do this, as follows:

SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
FETCH FIRST 12 ROWS ONLY

Oracle does it this way, using the ROWNUM pseudo-column:

SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
AND ROWNUM <= 12

INTERSECT, INTERSECT ALL, UNION, UNION ALL

Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL return all rows.
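A small illustration of the difference, using two hypothetical single-column tables A and B that each contain one row with the value 1:

SELECT COL1 FROM A
UNION
SELECT COL1 FROM B
-- returns a single row (duplicates removed)

SELECT COL1 FROM A
UNION ALL
SELECT COL1 FROM B
-- returns two rows (duplicates kept) and avoids the cost of the duplicate-elimination step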

System Dates in Oracle and DB2

Oracle uses the system variable SYSDATE for the current time and date, and allows you to display the time and/or the date however you want with date functions. Here is an example that returns yesterday's date in Oracle (default format mm/dd/yyyy):

SELECT TRUNC(SYSDATE) - 1 FROM DUAL

DB2 uses system variables, here called special registers: CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP. Here is an example for DB2:

SELECT FNAME, LNAME, CURRENT DATE AS TODAY
FROM EMPLOYEE

Oracle: Using Hints

Hints affect the way a query or sub-query is executed and can therefore provide a significant performance increase in queries. Hints cause the database engine to relinquish control over how a query is executed, thereby giving the developer control over the execution. Hints are always honored unless execution is not possible. Because the database engine does not evaluate whether the hint makes sense, developers must be careful in implementing hints.

Oracle has many types of hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and access method hints are the most common. In the latest versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer possible. It was in rule-based Oracle systems that hints mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using older versions of Oracle, however, the use of the proper INDEX hints should help performance.

The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below provides a partial list of optimizer hints and descriptions.

Optimizer hints: Choosing the best join method

Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts, while the nested loop involves no sorts. The hash join also requires memory to build the hash table. Hash joins are most effective when the amount of data is large and one table is much larger than the other. Here is an example of a select that performs best as a hash join:

SELECT COUNT(*)
FROM CUSTOMERS C, MANAGERS M
WHERE C.CUST_ID = M.MANAGER_ID

Considerations | Join Type
Better throughput | Sort/Merge
Better response time | Nested loop
Large subsets of data | Sort/Merge
Index available to support join | Nested loop
Limited memory and CPU available for sorting | Nested loop
Parallel execution | Sort/Merge or Hash
Joining all or most of the rows of large tables | Sort/Merge or Hash
Joining small sub-sets of data and index available | Nested loop

Hint | Description
ALL_ROWS | The database engine creates an execution plan that optimizes for throughput. Favors full table scans. The optimizer favors sort/merge.
FIRST_ROWS | The database engine creates an execution plan that optimizes for response time. It returns the first row of data as quickly as possible. Favors index lookups. The optimizer favors nested loops.
CHOOSE | The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor.
RULE | The database engine creates an execution plan based on a fixed set of rules.
USE_NL | Use nested loop joins.
USE_MERGE | Use sort merge joins.
HASH | The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.

Access method hints

Access method hints control how data is accessed. These hints are used to force the database engine to use indexes, hash scans, or row id scans. The following table provides a partial list of access method hints.

Hint | Description
ROWID | The database engine performs a scan of the table based on ROWIDs.
INDEX | DO NOT USE in Oracle 9.2 and above. The database engine performs an index scan of a specific table, but in 9.2 and above the optimizer does not use any indexes other than those mentioned.
USE_CONCAT | The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.

The syntax for using a hint in a SQL statement is as follows:

Select /*+ FIRST_ROWS */ empno, ename
From emp;

Select /*+ USE_CONCAT */ empno, ename
From emp;
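As an additional, hedged illustration of a join operation hint (USE_HASH is Oracle's hash join hint; it is not listed in the table above, and the aliases follow the hash join example earlier in this section):

Select /*+ USE_HASH(M) */ COUNT(*)
From CUSTOMERS C, MANAGERS M
Where C.CUST_ID = M.MANAGER_ID;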

SQL Execution and Explain Plan

The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best SQL execution, it relies upon optimization of the Oracle parameters and updated database statistics. If these statistics are not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually provide better execution time. The developer can determine which type of execution is being used by running an explain plan on the SQL query in question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results of that statement are then used as input by the next-level statement. Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible, as full table scans cause degradation in performance.

Information provided by the explain plan can be enhanced using the SQL Trace utility. This utility provides the following additional information:

● The number of executions
● The elapsed time of the statement execution
● The CPU time used to execute the statement

The SQL Trace utility adds value because it definitively shows the statements that are using the most resources, and it can immediately show the change in resource consumption after the statement has been tuned and a new explain plan has been run.
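A minimal, hedged sketch of how this information can be gathered in Oracle (EXPLAIN PLAN and ALTER SESSION SET SQL_TRACE are standard Oracle statements; a PLAN_TABLE is assumed to exist, and the trace file is normally formatted afterwards with a utility such as TKPROF):

EXPLAIN PLAN FOR
SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M
WHERE C.CUST_ID = M.MANAGER_ID;

SELECT OPERATION, OPTIONS, OBJECT_NAME FROM PLAN_TABLE;

ALTER SESSION SET SQL_TRACE = TRUE;
-- run the statement being tuned, then set SQL_TRACE back to FALSE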

Using Indexes

The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being used, it is possible to force the query to use it with an access method hint, as described earlier.

Reviewing SQL Logic

The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for additional filters to select only certain data, as well as the need to restructure the WHERE clause to use indexes. In extreme cases, the entire SQL statement may need to be rewritten to become more efficient.

Reviewing SQL Syntax

SQL syntax can also have a great impact on query performance. Certain operators can slow performance. For example:

● EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent query and cannot take advantage of indexes, while the IN clause is executed once, does use indexes, and may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:

SELECT * FROM DEPARTMENTS
WHERE DEPT_ID IN (SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster

SELECT * FROM DEPARTMENTS D
WHERE EXISTS (SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)





Situation | Exists | In
Index supports sub-query | Yes | Yes
No index to support sub-query | No - table scans per parent row | Yes - table scan once
Sub-query returns many rows | Probably not | Yes
Sub-query returns one or a few rows | Yes | Yes
Most of the sub-query rows are eliminated by the parent query | No | Yes
Index in parent that matches sub-query columns | Possibly not, since EXISTS cannot use the index | Yes - IN uses the index

● Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way can improve performance by more than 100 percent.
● Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup objects within the mapping to fill in the optional information.

Choosing the Best Join Order

Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the incremental ETL load. Always put the small table column on the right side of the join. Use the driving table first in the WHERE clause, and work from it outward; in other words, be consistent and orderly about placing columns in the WHERE clause. Outer joins limit the join order that the optimizer can use, so don't use them needlessly.

Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN

● Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this may not be a problem on small tables, it can become a performance drain on large tables.

SELECT NAME_ID FROM CUSTOMERS
WHERE NAME_ID NOT IN (SELECT NAME_ID FROM EMPLOYEES)




● Avoid use of the NOT EXISTS clause. This clause is better than NOT IN, but still may cause a full table scan.

SELECT C.NAME_ID FROM CUSTOMERS C
WHERE NOT EXISTS (SELECT * FROM EMPLOYEES E WHERE C.NAME_ID = E.NAME_ID)



● In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.

SELECT C.NAME_ID FROM CUSTOMERS C
MINUS
SELECT E.NAME_ID FROM EMPLOYEES E



● Also consider using outer joins with IS NULL conditions for anti-joins.

SELECT C.NAME_ID
FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID (+)
AND E.NAME_ID IS NULL

Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses, as they may change based on the database engine.



● In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders entered into the system since the previous load of the database, then, in the product information lookup, only select the products that match the distinct product IDs in the incremental sales orders.
● Avoid range lookups. This is a SELECT that uses a BETWEEN in the WHERE clause whose limits are themselves retrieved from a table. Here is an example:

SELECT R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.GCW_LOAD_DATE
FROM CDS_SUPPLIER R,
(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
L.LOAD_DATE AS LOAD_DATE
FROM ETL_AUDIT_LOG L
WHERE L.LOAD_DATE_PREV IN
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y)
) Z
WHERE R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

The work-around is to use an in-line view to apply the lower limit in the FROM clause and join it to the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds. Here is the improved SQL:

SELECT R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.LOAD_DATE
FROM
/* In-line view for lower limit */
(SELECT R1.BATCH_TRACKING_NO,
R1.SUPPLIER_DESC,
R1.SUPPLIER_REG_NO,
R1.SUPPLIER_REF_CODE,
R1.LOAD_DATE
FROM CDS_SUPPLIER R1,
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
FROM ETL_AUDIT_LOG Y) Z
WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
ORDER BY R1.LOAD_DATE) R,
/* end in-line view for lower limit */
(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
FROM ETL_AUDIT_LOG D) A /* upper limit */
WHERE R.LOAD_DATE <= A.LOAD_DATE

Tuning System Architecture

Use the following steps to improve the performance of any system:

1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider reducing the number of measurements because performance monitoring itself uses system resources. Otherwise, continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a time. If there are no options left at any level, this indicates that the system has reached its limits and hardware upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine the monitoring and tuning strategy.

System Resources

The PowerCenter Server uses the following system resources:

● CPU
● Load Manager shared memory
● DTM buffer memory
● Cache memory

When tuning the system, evaluate the following considerations during the implementation process:

● Determine whether the network is running at an optimal speed. Recommended best practice is to minimize the number of network hops between the PowerCenter Server and the databases.
● Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
● When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can potentially slow session performance.
● Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, and the PowerCenter Server and repository machines can slow session performance.
● When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system memory when sessions use large cached lookups or have many partitions.
● In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server.
● In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server. For details, see the project system administrator and the Sun Solaris documentation.
● In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For details, see the project system administrator and the HP-UX documentation.
● In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands. The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details, see the project system administrator and the AIX documentation.

Database Performance Features

Nearly everything is a trade-off in the physical database implementation. Work with the DBA to determine which of the many available alternatives is the best implementation choice for the particular database. The project team must have a thorough understanding of the data, the database, and the desired use of the database by the end-user community prior to beginning the physical implementation process. Evaluate the following considerations during the implementation process (a sketch of the pre- and post-session index SQL mentioned below follows this list):

● Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and primary key to foreign key relationships, and also eliminating join tables.
● Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, with post-session scripts rebuilding them after the load.
● Constraints. Avoid constraints if possible and try to enforce integrity by incorporating that additional logic in the mappings.
● Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for queries (temporary) and INSERTs and UPDATEs (rollback). The rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions, particularly on initial loads.
● OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to determine after the fact. DBAs must work with the System Administrator to ensure all the database processes have the same priority.
● Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.
● Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk controllers.
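A minimal sketch of the pre- and post-session index handling recommended above, assuming a hypothetical target table SALES_FACT and index IDX_SALES_FACT_DATE (the object names are illustrative; the statements would typically be issued through pre- and post-session SQL or command tasks):

-- Pre-session: drop the index so the load is not slowed by index maintenance
DROP INDEX IDX_SALES_FACT_DATE;

-- Post-session: rebuild the index after the load completes
CREATE INDEX IDX_SALES_FACT_DATE ON SALES_FACT (SALE_DATE);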

Last updated: 13-Feb-07 17:47


Using Metadata Manager Console to Tune the XConnects

Challenge

Improving the efficiency and reducing the run-time of your XConnects through the parameter settings of the Metadata Manager console.

Description

Remember that the minimum system requirements for a machine hosting the Metadata Manager console are:

● Windows operating system (2000, NT 4.0 SP 6a)
● 400MB disk space
● 128MB RAM (256MB recommended)
● 133 MHz processor

If the system meets or exceeds the minimal requirements, but an XConnect is still taking an inordinately long time to run, use the following steps to try to improve its performance.

To improve performance of your XConnect loads from database catalogs:

● Modify the inclusion/exclusion schema list (if the schemas to be loaded outnumber those to be excluded, use the exclusion list).
● Carefully examine how many old objects the project needs by default. Modify the "sysdate -5000" filter to a smaller value to reduce the result set.

To improve performance of your XConnect loads from the PowerCenter repository:

● Load only the production folders that are needed for a particular project.
● Run the XConnects with just one folder at a time, or select the list of folders for a particular run.


Advanced Client Configuration Options

Challenge

Setting the Registry to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.

Description

Ensuring Consistent Data Source Names

To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information on each client machine.

Solution:

● From Repository Manager, choose Export Registry from the Tools drop-down menu.
● For all subsequent client installs, simply choose Import Registry from the Tools drop-down menu.

Resolving Missing or Invalid License Keys

The "missing or invalid license key" error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than Administrator. This problem also occurs when the client software tools are installed under the Administrator account and a user with a non-administrator ID subsequently attempts to run the tools. The user who attempts to log in using the normal non-administrator userid is unable to start the PowerCenter Client tools; instead, the software displays a message indicating that the license key is missing or invalid.

Solution:

● While logged in as the installation user with administrator authority, use regedt32 to edit the registry.
● Under HKEY_LOCAL_MACHINE, open Software/Informatica/PowerMart Client Tools/.
● From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and PowerMart Client tools.)

Changing the Session Log Editor


In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to WordPad within the Workflow Monitor client tool. To choose a different editor, select Tools > Options in the Workflow Monitor, then browse for the editor that you want on the General tab.

For PowerCenter versions earlier than 6.0, the editor does not default to WordPad unless wordpad.exe can be found in the path statement. Instead, a window appears the first time a session log is viewed from the PowerCenter Server Manager, prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it.

Solution:

● While logged in as the installation user with administrator authority, use regedt32 to go into the registry.
● Move to registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files.
● From the menu bar, select View Tree and Data.
● Select the Log File Editor entry by double-clicking on it.
● Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.exe).
● Select Registry --> Exit from the menu bar to save the entry.

For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor. The following figure shows the Workflow Monitor Options dialog box used for setting the editor for workflow and session logs.

For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor. The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for workflow and session logs.


Adding a New Command Under the Tools Menu

Other tools, in addition to the PowerCenter client tools, are often needed during development and testing. For example, you may need a tool such as Enterprise Manager (SQL Server) or TOAD (Oracle) to query the database. You can add shortcuts to executable programs from any client tool's Tools drop-down menu to provide quick access to these programs.

Solution:

Choose Customize under the Tools menu and add a new item. Once it is added, browse to find the executable it is going to call (as shown below).


Once this is done, you can easily call another program from your PowerCenter client tools. In the following example, TOAD can be called quickly from the Repository Manager tool.

Changing Target Load Type

In PowerCenter versions 6.0 and earlier, each time a session was created it defaulted to the 'bulk' target load type, which was not necessarily what was desired and could cause the session to fail under certain conditions if not changed. In versions 7.0 and above, you can set a property in the Workflow Manager to choose the default load type to be either 'bulk' or 'normal'.


Solution:

● In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.
● Click the button for either 'normal' or 'bulk', as desired.
● Click OK, then close and reopen the Workflow Manager tool.

After this, every time a session is created, the target load type for all relational targets will default to your choice.

Resolving Undocked Explorer Windows

The Repository Navigator window sometimes becomes undocked. Docking it again can be frustrating because double-clicking on the window header does not put it back in place.

Solution:

● To get the window correctly docked, right-click in the white space of the Navigator window and make sure that the Allow Docking option is checked.
● If it is checked, double-click on the title bar of the Navigator window.


Resolving Client Tool Window Display Issues

If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer) disappears, try the following solutions to recover it:

● Clicking View > Navigator
● Toggling the menu bar
● Uninstalling and reinstalling the Client tools

Note: If none of the above solutions resolve the problem, you may want to try the following solution using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause serious problems that may require reinstalling the operating system. Informatica does not guarantee that any problems caused by using the Registry Editor incorrectly can be resolved. Use the Registry Editor at your own risk.

Solution:

Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can often be resolved as follows:

● Close the client tool.
● Go to Start > Run and type "regedit".
● Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z, where x.y.z is the version and maintenance release level of the PowerCenter client as follows:

PowerCenter Version | Folder Name
7.1 | 7.1
7.1.1 | 7.1.1
7.1.2 | 7.1.1
7.1.3 | 7.1.1
7.1.4 | 7.1.1
8.1 | 8.1

● Open the key of the affected tool (for the Repository Manager, open Repository Manager Options).
● Export all of the Toolbars sub-folders and rename them.
● Re-open the client tool.


Enhancing the Look of the Client Tools

The PowerCenter client tools allow you to customize the look and feel of the display. Here are a few examples of what you can do.

Designer

● From the menu bar, select Tools > Options.
● In the dialog box, choose the Format tab.
● Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts).

Changing the background workspace colors can help identify which workspace is currently open. For example, changing the Source Analyzer workspace color to green or the Target Designer workspace to purple to match their respective metadata definitions helps to identify the workspace. Alternatively, click the Select Theme button to choose a color theme, which displays background colors based on predefined themes.


Workflow Manager
You can modify the Workflow Manager using the same approach as the Designer tool. From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or customize each element individually.

Workflow Monitor
You can modify the colors in the Gantt Chart view to represent the various states of a task. You can also select two colors for one task to give it a dimensional appearance; this can be helpful in distinguishing between running tasks, succeeded tasks, etc. To modify the Gantt chart appearance, go to the Menu bar and select Tools > Options and Gantt Chart.

Using Macros in Data Stencil
Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable macros when you start Data Stencil. If the security level for Visio is set to High or Very High, you cannot run the Data Stencil macros. To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium. When you start Data Stencil, Visio displays a security warning about viruses in macros. Click Enable Macros to enable the macros for Data Stencil.

Last updated: 19-Mar-08 19:00


Advanced Server Configuration Options

Challenge
Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings; using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.

Description

Configuring Advanced Integration Service Properties
Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties > Edit. The following Advanced properties are included:

Limit on Resilience Timeouts (Optional) – Maximum amount of time (in seconds) that the service holds on to resources for resilience purposes. This property places a restriction on clients that connect to the service. Any resilience timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Resilience Timeout (Optional) – Period of time (in seconds) that the service tries to establish or reestablish a connection to another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Configuring Integration Service Process Variables
One configuration best practice is to properly configure and leverage the Integration Service (IS) process variables. The benefits include:

● Ease of deployment across environments (DEV > TEST > PRD).
● Ease of switching sessions from one IS to another without manually editing all the sessions to change directory paths.
● All the variables are related to directory paths used by a given Integration Service.

You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption. All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.


By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. You must specify the directory path for each type of file. Each registered server has its own set of variables; the list is fixed, not user-extensible. You specify the following directories using service process variables:

Service Process Variable     Value
$PMRootDir                   (no default – user must insert a path)
$PMSessionLogDir             $PMRootDir/SessLogs
$PMBadFileDir                $PMRootDir/BadFiles
$PMCacheDir                  $PMRootDir/Cache
$PMTargetFileDir             $PMRootDir/TargetFiles
$PMSourceFileDir             $PMRootDir/SourceFiles
$PMExtProcDir                $PMRootDir/ExtProc
$PMTempDir                   $PMRootDir/Temp
$PMSuccessEmailUser          (no default – user must insert a value)
$PMFailureEmailUser          (no default – user must insert a value)
$PMSessionLogCount           0
$PMSessionErrorThreshold     0
$PMWorkflowLogCount          0
$PMWorkflowLogDir            $PMRootDir/WorkflowLogs
$PMLookupFileDir             $PMRootDir/LkpFiles
$PMStorageDir                $PMRootDir/Storage

Writing PowerCenter 8 Service Logs to Files
Starting with PowerCenter 8, all the logging for the services and sessions created use the log service and can only be viewed through the PowerCenter Administration Console. However, it is still possible to get this information logged into a file similar to the previous versions. To write all Integration Service logs (session, workflow, server, etc.) to files:
1. Log in to the Admin Console.
2. Select the Integration Service.
3. Add a Custom property called UseFileLog and set its value to "Yes".
4. Add a Custom property called LogFileName and set its value to the desired file name.
5. Restart the service.

Integration Service Custom Properties (undocumented server parameters) can be entered here as well:
1. At the bottom of the list, enter the Name and Value of the custom property.
2. Click OK.

Adjusting Semaphore Settings on UNIX Platforms
When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server. Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers. The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system (a quick way to inspect the current limits is sketched after the list below):

● HP/UX: Use sam (1M) to change the parameters.
● Solaris: Use admintool or edit /etc/system to change the parameters.
● AIX: Use smit to change the parameters.
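Before changing anything, it can help to confirm what the kernel currently allows. The following is a minimal sketch; the -ls and -lm flags are Linux-specific, and older UNIX flavors may only support ipcs -a or expose the values through their administration tools (sam, admintool, smit).

ipcs -ls                     # current semaphore limits (Linux)
ipcs -lm                     # current shared memory limits (Linux)
cat /proc/sys/kernel/sem     # SEMMSL SEMMNS SEMOPM SEMMNI values on Linux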

Setting Shared Memory and Semaphore Parameters on UNIX Platforms
Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after configuring the UNIX kernel.

HP-UX
For HP-UX release 11i, the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to 500. NCALL is referred to as NCALLOUT. Use the HP System V IPC Shared-Memory Subsystem to update parameters. To change a value, perform the following steps:
1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double click the Kernel Configuration icon.
3. Double click the Configurable Parameters icon.
4. Double click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.

The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.

IBM AIX
None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.

SUN Solaris
Keep the following points in mind when configuring and tuning the SUN Solaris platform:
1. Edit the /etc/system file and add the following variables to increase shared memory segments:
   set shmsys:shminfo_shmmax=value
   set shmsys:shminfo_shmmin=value
   set shmsys:shminfo_shmmni=value
   set shmsys:shminfo_shmseg=value
   set semsys:seminfo_semmap=value
   set semsys:seminfo_semmni=value
   set semsys:seminfo_semmns=value
   set semsys:seminfo_semmsl=value
   set semsys:seminfo_semmnu=value
   set semsys:seminfo_semume=value
2. Verify the shared memory value changes:
   # grep shmsys /etc/system
3. Restart the system:
   # init 6

Red Hat Linux
The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a restart. For example, to allow 128MB, type the following command:
$ echo 134217728 >/proc/sys/kernel/shmmax
You can put this command into a script run at startup. Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar to the following:
kernel.shmmax = 134217728
This file is usually processed at startup, but sysctl can also be called explicitly later. To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.
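As a small sketch of the sysctl alternative just mentioned (the value is the same illustrative 128MB figure used above):

sysctl -w kernel.shmmax=134217728                        # apply the new limit immediately
echo "kernel.shmmax = 134217728" >> /etc/sysctl.conf     # persist the setting across reboots
sysctl -p                                                # re-read /etc/sysctl.conf without rebooting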

SuSE Linux
The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a restart. For example, to allow 512MB, type the following commands:

#sets shmall and shmmax shared memory to 512 MB
echo 536870912 >/proc/sys/kernel/shmall
echo 536870912 >/proc/sys/kernel/shmmax

You can also put these commands into a script run at startup. Also change the settings for the system memory user limits by modifying a file called /etc/profile. Add lines similar to the following:

#sets user limits (ulimit) for system memory resources
ulimit -v 512000     #set virtual (swap) memory to 512 MB
ulimit -m 512000     #set physical memory to 512 MB

Configuring Automatic Memory Settings
With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches.

Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer memory and cache memory settings, consider the overall memory usage for best performance.

Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the Integration Service disables automatic memory settings and uses default values.

Last updated: 01-Feb-07 18:54


Causes and Analysis of UNIX Core Files

Challenge
This Best Practice explains what UNIX core files are and why they are created, and offers some tips on analyzing them.

Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the operating system. Usually, when the operating system terminates a process, a "core dump" file is also created, which can be used to analyze the reason for the abnormal termination.

What is a Core File and What Causes it to be Created?
UNIX operating systems may terminate a process before its normal, expected exit for several reasons. These reasons are typically for bad behavior by the program, and include attempts to execute illegal or incorrect machine instructions, attempts to allocate memory outside the memory space allocated to the program, attempts to write to memory marked read-only by the operating system, and other similar incorrect low-level operations. Most of these bad behaviors are caused by errors in programming logic in the program.

UNIX may also terminate a process for some reasons that are not caused by programming errors. The main examples of this type of termination are when a process exceeds its CPU time limit, and when a process exceeds its memory limit. When UNIX terminates a process in this way, it normally writes an image of the process's memory to disk in a single file. These files are called "core files", and are intended to be used by a programmer to help determine the cause of the failure. Depending on the UNIX version, the name of the file may be "core", or in more recent UNIX versions, "core.nnnn" where nnnn is the UNIX process ID of the process that was terminated.

Core files are not created for "normal" runtime errors such as incorrect file permissions, lack of disk space, inability to open a file or network connection, and other errors that a program is expected to detect and handle. However, under certain error conditions a program may not handle the error conditions correctly and may follow a path of execution that causes the OS to terminate it and cause a core dump.

Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that causes unexpected core dumps. For example, using an odbc driver library from one vendor and an odbc driver manager from another vendor may result in a core dump if the libraries are not compatible. A similar situation can occur if a process is using libraries from different versions of a database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps are often the result.

Core File Locations and Size Limits
A core file is written to the current working directory of the process that was terminated. For PowerCenter, this is always the directory the services were started from. For other applications, this may not be true. UNIX also implements a per-user resource limit on the maximum size of core files. This is controlled by the ulimit command. If the limit is 0, then core files will not be created. If the limit is less than the total memory size of the process, a partial core file will be written. Refer to the Best Practice Understanding and Setting UNIX Resources for PowerCenter Installations.
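A minimal sketch of checking and raising the limit for the shell that starts the PowerCenter services (any hard limit set by the system administrator still applies):

ulimit -c              # show the current core file size limit; 0 means no core files are written
ulimit -c unlimited    # allow full core files for this shell and the processes it starts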

Analyzing Core Files
Core files provide valuable insight into the state and condition the process was in just before it was terminated. It also contains the history or log of routines that the process went through before that fateful function call; this log is known as the stack trace. There is little information in a core file that is relevant to an end user; most of the contents of a core file are only relevant to a developer, or someone who understands the internals of the program that generated the core file. However, there are a few things that an end user can do with a core file in the way of initial analysis.

The most important aspect of analyzing a core file is the task of extracting this stack trace out of the core dump. Debuggers are the tools that help retrieve this stack trace and other vital information out of the core. Informatica recommends using the pmstack utility.

The first step is to save the core file under a new name so that it is not overwritten by a later crash of the same application. One option is to append a timestamp to the core, but it can be renamed to anything:

mv core core.ddmmyyhhmi


The second step is to log in with the same UNIX user id that started up the process that crashed. This sets the debugger's environment to be same as that of the process at startup time.

The third step is to go to the directory where the program is installed. Run the "file" command on the core file. This returns the name of the process that created the core file:

file /core.ddmmyyhhmi

Core files can be generated by the PowerCenter executables (i.e., pmserver, infaservices, and pmdtm) as well as from other UNIX commands executed by the Integration Service, typically from command tasks and pre- or post-session commands. If a PowerCenter process is terminated by the OS and a core is generated, the session or server log typically indicates ‘Process terminating on Signal/Exception’ as its last entry.

Using the pmstack Utility
Informatica provides a ‘pmstack’ utility that can automatically analyze a core file. If the core file is from PowerCenter, it will generate a complete stack trace from the core file, which can be sent to Informatica Customer Support for further analysis. The trace contains everything necessary to further diagnose the problem. Core files themselves are normally not useful on a system other than the one where they were generated.

The pmstack utility can be downloaded from the Informatica Support knowledge base as article 13652, and from the support ftp server at tsftp.informatica.com. Once downloaded, run pmstack with the –c option, followed by the name of the core file:

$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info : -rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'pmdtm'
Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..
Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support

You can then look at the generated trace file or send it to support.

Pmstack also supports a –p option, which can be used to extract a stack trace from a running process. This is sometimes useful when a process appears to be hung, to determine what the process is doing.
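For example (the process id shown is illustrative, and this sketch assumes –p simply takes the id of the running process):

$ pmstack -p 21896     # attach to a running pmdtm/pmserver process and dump its stack trace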

Last updated: 19-Mar-08 19:01


Domain Configuration

Challenge
The domain architecture in PowerCenter simplifies the administration of disparate PowerCenter services across the enterprise as well as the maintenance of security throughout PowerCenter. It allows for the grouping of previously separately administered application services and nodes into logically-grouped folders within the domain, based on administrative ownership. It is vital, when installing or upgrading PowerCenter, that the Application Administrator understand the terminology and architecture surrounding the Domain Configuration in order to effectively administer, upgrade, deploy, and maintain PowerCenter Services throughout the enterprise.

Description
The domain architecture allows PowerCenter to provide a service-oriented architecture where you can specify which services are running on which node or physical machine from one central location. The components in the domain are ‘aware’ of each other’s presence and continually monitor one another via ‘heartbeats’. The various services within the domain can move from one physical machine to another without any interruption to the PowerCenter environment. As long as clients can connect to the domain, the domain can route their needs to the appropriate physical machine.

From a monitoring perspective, the domain provides the ability to monitor all services in the domain as well as control security from a central location. You no longer have to log into and ping multiple machines in a robust PowerCenter environment – instead a single screen displays the current availability state of all services. For more details on the individual components and detailed configuration of a domain, refer to the PowerCenter Administrator Guide.

Key Domain Components
There are several key domain components to consider during installation and setup:

● Master Gateway – The node designated as the master gateway or domain controller is the main ’entry point’ to the domain. This server or set of servers should be your most reliable and available machine in the architecture. It is the first point of entry for all clients wishing to connect to one of the PowerCenter services. If the master gateway is unavailable, the entire domain is unavailable. You may designate more than one node to run the gateway service. One gateway is always the master or primary, but by having the gateway services running on more than one node in a multi-node configuration, your domain can continue to function if the master gateway is no longer available. In a high-availability environment, it is critical to have one or more nodes running the gateway service as a backup to the master gateway.


● Shared File System – The PowerCenter domain architecture provides centralized logging capability and, when high availability is enabled, a highly available environment with automatic fail-over of workflows and sessions. In order to achieve this, the base PowerCenter server file directories must reside on a file system that is accessible by all nodes in the domain. When PowerCenter is initially installed, this directory is called infa_shared and is located under the server directory of the PowerCenter installation. It includes logs and checkpoint information that is shared among nodes of the domain. Ideally, this file system is both high-performance and highly available.

● Domain Metadata – As of PowerCenter 8, a store of metadata exists to hold all of the configuration settings for the domain. This domain repository is separate from the one or more PowerCenter repositories in a domain. Instead, it is a handful of tables that replace the older version 7.x pmserver.cfg, pmrep.cfg and other PowerCenter configuration information. As of PowerCenter 8.5, all PowerCenter security is also maintained here. Upon installation you will be prompted for the RDBMS location for the domain repository. This information should be treated like a PowerCenter repository, with regularly-scheduled backups and a disaster recovery plan. Without this metadata, a domain is unable to function. The RDBMS user provided to PowerCenter requires permissions to create and drop tables, as well as insert, update, and delete records. Ideally, if you are going to be grouping multiple independent nodes within this domain, the domain configuration database should reside on a separate and independent server so as to eliminate the single point of failure if the node hosting the domain configuration database fails.

Domain Architecture
Just as in other PowerCenter architectures, the premise of the architecture is to maintain flexibility and scalability across the environment. There is no single best way to deploy the architecture. Rather, each environment should be assessed for external factors and then PowerCenter should be configured appropriately to function best in that particular environment. The advantage of the service-oriented architecture is that components in the architecture (i.e., repository services, integration services, and others) can be moved among nodes without needing to make changes to the mappings or workflows. Starting in PowerCenter 8.5, all reporting components of PowerCenter (Data Analyzer and Metadata Manager) are configured and administered from the Domain.

Because of this architecture, it is very simple to alter architecture components if you find a suboptimal configuration and want to alter it in your environment. The key here is that you are not tied to any choices you make at installation time and have the flexibility to make changes to your architecture as your business needs change.


Tip
While the architecture is very flexible and provides easy movement of services throughout the environment, an item to carefully consider at installation time is the name of the domain and its subsequent nodes. These are somewhat troublesome to change later because of their criticality to the domain. It is not recommended that you embed server IP addresses and names in the domain name or the node names. You never know when you may need to move to new hardware or move nodes to new locations. For example, instead of naming your domain ‘PowerCenter_11.5.8.20’, consider naming it ‘Enterprise_Dev_Test’. This makes it more intuitive to understand what domain you are attaching to, and if you ever decide to move the main gateway to another server, you don’t need to change the domain or node name. While these names can be changed, the change is not easy and requires using command line programs to alter the domain metadata.

In the next sections, we look at a couple of sample domain configurations.

Single Node Domain
Even in a single server/single node installation, you must still create a domain. In this case, all domain components reside on a single physical machine (i.e., node). You can have any number of PowerCenter services running in this domain. It is important to note that with PowerCenter 8 and beyond, you can run multiple Integration Services at the same time on the same machine – even in an NT/Windows environment. Naturally this configuration exposes a single point of failure for every component in the domain, and high availability is not available in this situation.

Multiple Node Domains

Domains can continue to expand to meet the demands of true enterprise-wide data integration.

Domain Architecture for Production/Development/Quality Assurance Environments
The architecture picture becomes more complex when you consider a typical development environment, which usually includes some level of a Development, Quality Assurance, and Production environment. In most implementations, these are separate PowerCenter repositories and associated servers. It is possible to define a single domain to include one or more of these development environments. However, there are a few points to consider:

● If the domain gateway is unavailable for any reason, the entire domain is inaccessible. Keep in mind that if you place your development, quality assurance and production services in a single domain, you have the possibility of affecting your production environment with development and quality assurance work. If you decide to restart the domain in Development for some reason, you are effectively restarting development, quality assurance and production at the same time. Also, if you experience some sort of failure that affects the domain in production, you have also brought down your development environment and have no place to test a fix for the problem since your entire environment is compromised.

● For the domain you should have a common, shared, high-performance file system to share the centralized logging and checkpoint files. If you have all three environments together on one domain, you are mixing production logs with development logs and other files on the same physical disk. Your production backups and disaster recovery files will have more than just production information in them.

● For a future upgrade, it is very likely that you will need to upgrade all components of the domain at once to the new version of PowerCenter. If you have placed development, quality assurance, and production in the same domain, you may need to upgrade all of it at once. This is an undesirable situation in most data integration environments.

For these reasons, Informatica generally recommends having at least two separate domains in any environment:

● Production Domain
● Development/Quality Assurance Domain

Some architects choose to deploy a separate domain for each environment to further isolate them and to ensure no disruptions occur in the Quality Assurance environment due to any changes in the development environment. The tradeoff is an additional administration console to log into and maintain. One thing to keep in mind is that while you may have separate domains with separate domain metadata repositories, there is no need to migrate any of the metadata from the separate domain repositories between development, Quality Assurance and production. The domain metadata repositories collect information based on the physical location and connectivity of the components and thus, it makes no sense to migrate between environments. You do need to provide separate database locations for each, but there is no need to migrate the data within; each one is specific to the environment it services.

Administration
The domain administrator has access to start and shut down all services within the domain, as well as the ability to create other users and delegate roles and responsibilities to them. Keep in mind that if the domain is shut down, it has to be restarted via the command line or the host operating system GUI.
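For example, on UNIX the services on a node can be stopped and started from the command line with the infaservice script; the path below is only a sketch and varies by release and installation directory.

cd /opt/informatica/pc8x/server/tomcat/bin    # illustrative install location
./infaservice.sh shutdown                     # stop all Informatica services on this node
./infaservice.sh startup                      # start them again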


PowerCenter's High Availability option provides the ability to create multiple gateway nodes to a domain, such that if the Master Gateway Node fails, another can assume its responsibilities, including authentication, logging, and service management.

Security and Folders
Much like traditional repository security, security in the domain interface is set up on a “per-folder” basis, with owners being designated per logical groupings of objects/services in the domain. One of the major differences is that Domain security allows the creation of subfolders to segment nodes and services as desired. There are many considerations when deciding on a folder structure, keeping in mind that this logical administrative interface should be accessible to Informatica Administrators only and not to users and groups associated with a developer role (which are designated at the Repository level).

New legislation in the United States and Europe, such as Basel II and the Public Company Accounting Reform and Investor Protection Act of 2002 (also known as SOX, SarbOx and Sarbanes-Oxley) have been widely interpreted to place many restrictions on the ability of persons in development roles to have direct write access to production systems, and consequently, administration roles should be planned accordingly. An organization may simply need to use different folders to group objects into Development, Quality Assurance and Production roles; each with separate administrators. In some instances, systems may need to be entirely separate, with different domains for the Development, Quality Assurance, and Production systems. Sharing of metadata remains simple between separate domains, with PowerCenter’s ability to “link” domains, and copy data between linked domains.

For Data Migration projects, it is recommended to establish a standardized architecture that includes a set of folders, connections and developer access in accordance with the needs of the project. Typically this includes folders for:

● Acquiring data
● Converting data to match the target system
● The final load to the target application
● Establishing reference data structures

When configuring security in PowerCenter 8.5, there are two interrelated security aspects that should be addressed when planning a PowerCenter security policy:

● Role Differentiation – Groups should be created separately to define roles and privileges typically needed for an Informatica Administrator and for an Informatica Developer. Using this separation at the group level allows for a more efficient administration of PowerCenter user privileges and provides for a more secure PowerCenter environment.

● Maintenance of Privileges – As privileges typically are the same for several users within a PowerCenter environment, care should be taken to define these distinct separations ahead of time, so that privileges can be defined at a group level, rather than at an individual user level. As a best practice, users should not be granted user-specific privileges, unless it is temporary.


Maintenance
As part of a regular backup of metadata, a recurring backup should be scheduled for the PowerCenter domain configuration database metadata. This can be accomplished through PowerCenter by using the infasetup command, further explained in the Command Line Reference. The schema should also be added to the normal RDBMS backup schedule, thus providing two reliable backup methods for disaster recovery purposes.
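A minimal sketch of scheduling such a backup is shown below; the infasetup BackupDomain option names should be verified against the Command Line Reference for your release, and the connection details, paths, and schedule are placeholders.

# backup_domain.sh – nightly domain configuration backup (illustrative values only)
cd /opt/informatica/pc8x/server
./infasetup.sh BackupDomain -da dbhost:1521 -du domain_db_user -dp domain_db_pwd \
    -dt Oracle -ds ORCL -bf /backup/infa_domain_`date +%Y%m%d`.bak

# Example crontab entry running the script at 2:00 AM every day
# 0 2 * * * /opt/informatica/scripts/backup_domain.sh >> /var/log/infa_domain_backup.log 2>&1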

Licensing
As part of PowerCenter 8.5’s new Service-Oriented Architecture (SOA), licensing for PowerCenter services is centralized within the domain. License key file(s) are received from Informatica at the same time the download location for the software is provided. Adding license object(s) and assigning individual PowerCenter Services to the license(s) is the method used to enable a PowerCenter Service. This can be done during installation, or initial/incremental license keys can be added after installation via the Administration Console web-based utility (or the infacmd command line utility).

Last updated: 26-May-08 17:36


Managing Repository Size

Challenge
The PowerCenter repository is expected to grow over time as new development and production runs occur. Over time, the repository can be expected to grow to a size that may start slowing performance of the repository or make backups increasingly difficult. This Best Practice discusses methods to manage the size of the repository.

The release of PowerCenter version 8.x added several features that aid in managing the repository size. Although the repository is slightly larger with version 8.x than it was with the previous versions, the client tools have increased functionality to limit the dependency on the size of the repository. PowerCenter versions earlier than 8.x require more administration to keep the repository sizes manageable.

Description
Why should we manage the size of the repository? Repository size affects the following:

● DB backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.

● Overall query time of the repository, which slows performance of the repository over time. Analyzing tables on a regular basis can aid in repository table performance.

● Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a long period of time. Some options are available to avoid transferring all run statistics when migrating.

A typical repository starts off small (i.e., 50MB to 60MB for an empty repository) and grows to upwards of 1GB for a large repository. The type of information stored in the repository includes:
❍ Versions
❍ Objects
❍ Run statistics
❍ Scheduling information
❍ Variables

Tips for Managing Repository Size

Versions and Objects
Delete old versions or purged objects from the repository. Use repository queries in the client tools to generate reusable queries that can determine out-of-date versions and objects for removal. Use the Query Browser to run object queries on both versioned and non-versioned repositories. Old versions and objects not only increase the size of the repository, but also make it more difficult to manage further into the development cycle. Cleaning up the folders makes it easier to determine what is valid and what is not. One way to keep repository size small is to use shortcuts: create shared folders if you are using the same source/target definitions or reusable transformations in multiple folders.

Folders
Remove folders and objects that are no longer used or referenced. Unnecessary folders increase the size of the repository backups. These folders should not be a part of production but they may exist in development or test repositories.

Run Statistics
Remove old run statistics from the repository if you no longer need them. History is important to determine trending, scaling, and performance tuning needs but you can always generate reports based on the PowerCenter Metadata Reporter and save reports of the data you need. To remove the run statistics, go to Repository Manager and truncate the logs based on the dates.

Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter since the most recent release includes such features as skip workflow and session log, skip deployment group history, skip MX data and so forth. The repository size in version 8.x and above is larger than the previous versions of PowerCenter, but the added size does not significantly affect the performance of the repository. It is still advisable to analyze the tables or run statistics to optimize the tables. Informatica does not recommend directly querying the repository tables or performing deletes on them. Use the client tools unless otherwise advised by Informatica technical support personnel.
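To illustrate the skip options mentioned above when taking a repository backup from the command line, the following is a sketch; the option letters and connection values shown are assumptions that should be confirmed against the pmrep section of the Command Line Reference for your release.

pmrep connect -r DEV_REPO -d Domain_Dev -n repo_admin -x repo_admin_pwd
pmrep backup -o /backup/DEV_REPO_`date +%Y%m%d`.rep -b -j -q
# -b  skip workflow and session logs
# -j  skip deployment group history
# -q  skip MX data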

Last updated: 01-Feb-07 18:54


Organizing and Maintaining Parameter Files & Variables

Challenge
Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.

Description
Parameter files are a means of providing run-time values for parameters and variables defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can have values for multiple workflows, sessions, and mappings, and can be created using text editors such as notepad, vi, shell script, or an Informatica mapping. Variable values are stored in the repository and can be changed within mappings. However, variable values specified in parameter files supersede values stored in the repository. The values stored in the repository can be cleared or reset using Workflow Manager.

Parameter File Contents
A Parameter File contains the values for variables and parameters. Although a parameter file can contain values for more than one workflow (or session), it is advisable to build a parameter file to contain values for a single or logical group of workflows for ease of administration. When using the command line mode to execute workflows, multiple parameter files can also be configured and used for a single workflow if the same workflow needs to be run with different parameters.

Types of Parameters and Variables
A parameter file contains the following types of parameters and variables:

● Service Variable. Defines a service variable for an Integration Service.

● Service Process Variable. Defines a service process variable for an Integration Service that runs on a specific node.

● Workflow Variable. References values and records information in a workflow. For example, use a workflow variable in a decision task to determine whether the previous task ran properly.

● Worklet Variable. References values and records information in a worklet. You can use predefined worklet variables in a parent workflow, but cannot use workflow variables from the parent workflow in a worklet.

● Session Parameter. Defines a value that can change from session to session, such as a database connection or file name.

● Mapping Parameter. Defines a value that remains constant throughout a session, such as a state sales tax rate.

● Mapping Variable. Defines a value that can change during the session. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time the session runs.

Configuring Resources with Parameter File
If a session uses a parameter file, it must run on a node that has access to the file. You create a resource for the parameter file and make it available to one or more nodes. When you configure the session, you assign the parameter file resource as a required resource. The Load Balancer dispatches the Session task to a node that has the parameter file resource. If no node has the parameter file resource available, the session fails.

Configuring Pushdown Optimization with Parameter File
Depending on the database workload, you may want to use source-side, target-side, or full pushdown optimization at different times. For example, you may want to use partial pushdown optimization during the database's peak hours and full pushdown optimization when activity is low. Use the $$PushdownConfig mapping parameter to use different pushdown optimization configurations at different times. The parameter lets you run the same session using the different types of pushdown optimization. When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute. Define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in the parameter file (an illustrative parameter file entry is shown after the list):

● None. The Integration Service processes all transformation logic for the session.

● Source. The Integration Service pushes part of the transformation logic to the source database.

● Source with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database.

● Target. The Integration Service pushes part of the transformation logic to the target database.

● Full. The Integration Service pushes all transformation logic to the database.

● Full with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database. The Integration Service pushes any remaining transformation logic to the target database.
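For illustration (the folder, workflow, and session names are hypothetical), the corresponding entry in the parameter file could look like this:

[FINANCE.WF:wf_LOAD_GL.ST:s_m_LOAD_GL]
$$PushdownConfig=Source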

Parameter File Name
Informatica recommends giving the Parameter File the same name as the workflow with a suffix of ".par". This helps in identifying and linking the parameter file to a workflow.

Parameter File: Order of Precedence
While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a file specified at the workflow level always supersedes files specified at session levels.

Parameter File Location

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

Place the Parameter Files in a directory that can be accessed using the server variable. This helps to move the sessions and workflows to a different server without modifying workflow or session properties. You can override the location and name of the parameter file specified in the session or workflow while executing workflows via the pmcmd command.

The following points apply to both Parameter and Variable files; however, they are more relevant to Parameters and Parameter files, and are therefore detailed accordingly.

Multiple Parameter Files for a Workflow
To run a workflow with different sets of parameter values during every run:
1. Create multiple parameter files with unique names.
2. Change the parameter file name (to match the parameter file name defined in Session or Workflow properties). You can do this manually or by using a pre-session shell (or batch script).
3. Run the workflow.
Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps 2 and 3.
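For example (the service, domain, folder, user, and workflow names below are placeholders), a single workflow can be started with whichever parameter file applies to the current run:

pmcmd startworkflow -sv IS_DEV -d Domain_Dev -u pc_user -p pc_pwd \
    -f FINANCE -paramfile /app/data/paramfiles/wf_LOAD_GL_monthend.par \
    -wait wf_LOAD_GL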

Generating Parameter Files
Based on requirements, you can obtain the values for certain parameters from relational tables or generate them programmatically. In such cases, the parameter files can be generated dynamically using shell (or batch) scripts or using Informatica mappings and sessions. Consider a case where a session has to be executed only on specific dates (e.g., the last working day of every month), which are listed in a table. You can create the parameter file containing the next run date (extracted from the table) in more than one way.

Method 1:
1. The workflow is configured to use a parameter file.
2. The workflow has a decision task before running the session: comparing the Current System date against the date in the parameter file.
3. Use a shell (or batch) script to create a parameter file. Use an SQL query to extract a single date, which is greater than the System Date (today), from the table and write it to a file with the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.
The following figure shows the use of a shell script to generate a parameter file; a minimal sketch of such a script appears after the figures below.


The following figure shows a generated parameter file.
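Complementing the figures referenced above, a minimal sketch of Method 1's script is shown below; it assumes an Oracle sqlplus client, and the control table, connection, folder, workflow, and parameter names are all hypothetical.

#!/bin/ksh
# Sketch of Method 1: build the parameter file from a control table, then run
# the workflow. Table, column, connection, and workflow names are hypothetical.

PARAMFILE=/app/data/paramfiles/wf_MONTH_END.par

NEXT_RUN_DATE=`sqlplus -s ctrl_user/ctrl_pwd@CTRLDB <<EOF
set heading off feedback off pagesize 0
select to_char(min(run_date), 'MM/DD/YYYY')
from   run_date_table
where  run_date > sysdate;
exit
EOF`

echo "[FINANCE.WF:wf_MONTH_END]"        >  $PARAMFILE
echo "\$\$NEXT_RUN_DATE=$NEXT_RUN_DATE" >> $PARAMFILE

pmcmd startworkflow -sv IS_DEV -d Domain_Dev -u pc_user -p pc_pwd \
    -f FINANCE -paramfile $PARAMFILE wf_MONTH_END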

Method 2:
1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a decision task before running the session: comparing the Current System date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow, using either a command task calling a shell script or a session task, which uses a mapping. This task extracts a date that is greater than the system date (today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Scheduler, to run daily (as shown in the following figure).

Parameter File Templates
In some other cases, the parameter values change between runs, but the change can be incorporated into the parameter files programmatically. There is no need to maintain separate parameter files for each run.

Consider, for example, a service provider who gets the source data for each client from flat files located in client-specific directories and writes processed data into a global database. The source data structure, target data structure, and processing logic are all the same. The log file for each client run has to be preserved in a client-specific directory. The directory names have the client id as part of the directory structure (e.g., /app/data/Client_ID/).

You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one parameter file per client. However, the number of parameter files may become cumbersome to manage when the number of clients increases.

In such cases, a parameter file template (i.e., a parameter file containing values for some parameters and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual parameter file (for a specific client), replacing the placeholders with actual values, and then execute the workflow using pmcmd.

[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Using a script, replace “Client_ID” and “curdate” with actual values before executing the workflow; a minimal sketch of such a script follows the excerpt below.

The following text is an excerpt from a parameter file that contains service variables for one Integration Service and parameters for four workflows:

[Service:IntSvs_01]
[email protected]
[email protected]
[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]
$$platform=unix
[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]
$$platform=unix
$DBConnection_ora=qasrvrk2_hp817
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]
$$DT_WL_lvl_1=02/01/2005 01:05:11
$$Double_WL_lvl_1=2.2
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]
$$DT_WL_lvl_2=03/01/2005 01:01:01
$$Int_WL_lvl_2=3
$$String_WL_lvl_2=ccccc
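Returning to the client-specific template above, the following is a minimal sketch of the placeholder-replacement approach; the template path, directory layout, and connection details are assumptions, while the PROJ_DP folder and Client_Data workflow come from the template itself.

#!/bin/ksh
# Sketch: build a client-specific parameter file from the template and run the workflow.
CLIENT_ID=$1
CURDATE=`date +%Y%m%d`
TEMPLATE=/app/data/templates/wf_Client_Data.par.template
PARAMFILE=/app/data/$CLIENT_ID/paramfiles/wf_Client_Data.par

# Substitute the placeholders for this client and run date
sed -e "s/Client_ID/$CLIENT_ID/g" -e "s/curdate/$CURDATE/g" \
    $TEMPLATE > $PARAMFILE

pmcmd startworkflow -sv IS_PROD -d Domain_Prod -u pc_user -p pc_pwd \
    -f PROJ_DP -paramfile $PARAMFILE Client_Data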


Use Case 1: Fiscal Calendar-Based Processing
Some financial and retail industries use a fiscal calendar for accounting purposes. Use mapping parameters to process the correct fiscal period. For example, create a calendar table in the database with the mapping between the Gregorian calendar and the fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates. Create another mapping with the logic to create a parameter file. Run the parameter file creation session before running the main session. The calendar table can be directly joined with the main table, but the performance may not be good in some databases depending upon how the indexes are defined. Using a parameter file avoids this join and can result in better performance.

Use Case 2: Incremental Data Extraction
Mapping parameters and variables can be used to extract inserted/updated data since the previous extract. Use the mapping parameters or variables in the Source Qualifier to determine the beginning timestamp and the end timestamp for extraction. For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the timestamp of the last row the Integration Service read in the previous session. Use this variable for the beginning timestamp and the built-in variable $$$SessStartTime for the end timestamp in the source filter. Use the following filter to incrementally extract data from the database:
LOAN.record_update_timestamp > TO_DATE(‘$$PREVIOUS_RUN_DATE_TIME’) and LOAN.record_update_timestamp <= TO_DATE(‘$$$SessStartTime’)