Abinitio Material

September 18, 2017 | Author: sarvesh_mishra | Category: Parameter (Computer Programming), Databases, Computer Data, Information Retrieval, Data

Short Description

ETL data warehouse...

Description

Question

Answer =================================

Phases vs Checkpoints

Phases - are used to break the graph into pieces. Temp... after its completion. Phases are used to effectively sep... CPU, disk) parts of the application.

Checkpoints - created for recovery purposes. These ar... can recover to the latest saved point - and rerun from i...

You can have phase breaks with or without checkpoin...

xfr

A new sandbox will have many directories: mp, dml, x... with extension .xfr containing your own custom funct... "somepath/xfr/yourfile.xfr"). Usually XFR stores map...

three types of parallelism

1) Data Parallelism - data (copies of the same - or diffe... components.
2) Component Parallelism (execute simultaneously on ...
3) Pipeline (sequential).

Multi-File System (MFS)

m_mkfs - create a multifile (m_mkfs ctrlfile mpfile1 ...
m_ls - list all the multifiles
m_rm - remove the multifile
m_cp - copy a multifile
m_mkdir - to add more directories to existing director...

Memory requirements of a graph

Each partition of a component uses: ~ 8 MB + ...
Add size of lookup files used in phase (if mult... once)
Multiply by degree of parallelism. Add up all c... is used in that phase.
Select the largest-memory phase in the graph

How to calculate a SUM

SCAN
ROLLUP
SCANWITHROLLUP
Scan followed by Dedup sort and select the last

dedup sort with null key

If we don't use any key in the sort component while us... then the output depends on the keep parameter.
first - only the first record
last - only last record
unique_only - there will be no records in the o...
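The keep behaviour described above can be imitated outside Ab Initio on a plain newline-delimited stream. This is only a sketch: with no key, every record falls into one group, so `first` and `last` return a single record and `unique_only` returns a record only when the group has exactly one. The helper name `dedup_null_key` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: mimic the "keep" choices when the whole input
# is a single group (no key). Reads records from stdin, one per line.
dedup_null_key() {
    keep=$1
    case "$keep" in
        first) sed -n '1p' ;;
        last)  sed -n '$p' ;;
        unique_only)
            # Emit the record only if the group holds exactly one record.
            awk '{ rec = $0 } END { if (NR == 1) print rec }' ;;
        *) echo "unknown keep value: $keep" >&2; return 1 ;;
    esac
}

printf 'a\nb\nc\n' | dedup_null_key first        # prints a
printf 'a\nb\nc\n' | dedup_null_key last         # prints c
printf 'a\nb\nc\n' | dedup_null_key unique_only  # prints nothing
```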

join on partitioned flow

file1 (A,B,C) , file2 (A,B,D). We partition both files b should we partition by "A,B" ? Not clear.

checkin, checkout

You can do checkin/checkout using the wizard right fr

how to have different passwords for QA and production

parameterize the .dbc file - or use environmental varia

How to convert a serial file into FFS

project parameters vs. sandbox parameters

Bad-Straight-flow

merging graphs

partitioning, re-partitioning, departitioning

How to get records 50-75 out of 100

use scan and filter
m_dump -start 50 -end 75
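For a plain newline-delimited serial file, the same range selection can be sketched at the unix prompt without any Ab Initio tooling. The helper name `records_between` is hypothetical, and the sketch assumes one record per line.

```shell
#!/bin/sh
# Hypothetical helper: print records FIRST..LAST (1-based, inclusive)
# from a newline-delimited file; stop reading once LAST is reached.
records_between() {
    awk -v first="$1" -v last="$2" \
        'NR >= first && NR <= last { print } NR == last { exit }' "$3"
}

# Usage: build a 100-line sample file and pull records 50-75.
sample=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100; i++) print i }' > "$sample"
records_between 50 75 "$sample" | wc -l   # 26 records
rm -f "$sample"
```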



use next_in_sequence() function and filter by e... && next_in_sequence() ...

operating sy... AB_AIR_ROOT - default location fo... sandboxes standard environment AI_SORT_MAX_CORE, AI_HOME

from unix prompt: env | grep AI
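A slightly tighter form of that lookup is sketched below. Matching on the `AI_` and `AB_` name prefixes is an assumption of this sketch (the text mentions both AI_HOME and AB_AIR_ROOT); the function name `list_abinitio_env` is hypothetical.

```shell
#!/bin/sh
# List environment variables whose names start with AI_ or AB_, sorted.
# The AB_ prefix is included because the text mentions AB_AIR_ROOT;
# narrowing the pattern this way is an assumption of this sketch.
list_abinitio_env() {
    env | grep -E '^(AI|AB)_' | sort
}

list_abinitio_env
```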

wrapper script

unix script to run graphs

multistage component

A multistage component is a component whi... in 5 stages (1.input select, 2.temporary initia... output selection, 5.finalize). So it is a transfo... packages. Examples: scan Normalize and De... normalize and denormalize sorted.

Dynamic DML

Dynamic DML is used if the input metadata ... different time different input files are received ... different dml. in that case we can use flag in ... read in the input file received and according dml is used.

fan in, fan out

fan out - partition component (increa...
fan in - departition component (decrea...

lock

a user can lock the graph for editing so that o... and can not edit the same graph.

join vs lookup

Lookup is good for speed for small files (wil... memory). For large files use join. You may n... limit to handle big joins.

multi update

multi update executes SQL statements - it tre... completely separate piece of work.

scheduler

We can use Autosys, Control-M, or a...

We can take care of dependencies in ... scripts should run sequentially, we ca... Autosys, or we can create a wrapper ... sequential commands (nohup comma... command2.ksh &; etc). We can even ... Initio to execute individual scripts as...
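The wrapper-script approach to sequential dependencies might look like the sketch below: run each step in order and stop at the first failure so later steps never see incomplete data. The function name `run_in_sequence` and the step commands are placeholders, not real graph scripts.

```shell
#!/bin/sh
# Run each command in order; stop at the first non-zero exit code.
# The step names passed in are placeholders, not real graph scripts.
run_in_sequence() {
    for step in "$@"; do
        echo "starting: $step"
        if ! sh -c "$step"; then
            echo "failed: $step" >&2
            return 1
        fi
    done
    echo "all steps finished"
}

# Usage with stand-in commands; in practice each step would be a
# deployed graph script such as ./command1.ksh (hypothetical name).
run_in_sequence "echo step one" "echo step two"
```

A scheduler such as Autosys or Control-M would express the same ordering as job dependencies instead of a loop.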

Api and Utility modes in input table

These are database interfaces (api - uses SQL ... whatever vendor provides)

lookup file

lookup file component. Functions: lo... lookup_next, lookup_match, lookup_...

Lookups are always used with combi... components.

Calling stored proc in DB

You can call stored proc (for example, from you can even write SP in Ab Initio. Make it " good performance.

Frequently used functions

string_ltrim, string_lrtrim, string_substring, now()

data validation

is_valid, is_null, is_blank, is_defined

driving port

When joining inputs (in0, in1, ...) one of the ... default - in0). Driving input is usually the lar... smallest can have "Sorted-Input" parameter b... sorted" because it will be loaded completely ...

Ab Initio vs Informatica for ETL

Ab Initio benefits: parallelism built in, multi... amounts of data, easy to build and run. Gene... easily modified as needed (if something coul... itself). The scripts can be easily scheduled us... and easily integrated with other systems.

Ab Initio doesn't require a dedicated adminis...

Ab Initio doesn't have built-in CDC capabili... Capture).

Ab Initio allows to (attach error / reject files) ... capture and analyze the message and data se... Informatica which has just one huge log). Ab ... metrics for each component.

override key

override key option is used when we need to ... different field names.

control file

control file should be in the multifile directo the serial files)

max-core

max-core parameter (for example, sort 100 M of memory used by a component (like Sort o before spilling to disk. Usually you don't nee default value. Setting it too high may degrad of OS swapping and degrading of the perform

Input Parameters

graph > select parameters tab > click "create ... Usage: $paramname. Edit > parameters. The... substituted during run time. You may need to ... scope as formal.

Error Trapping

Each component has reject, error, and log po... records, Error captures corresponding error, ... execution statistics of the component. You ca... each component by setting reject threshold to ... on first reject, or setting ramp/limit. You can ... function in transform function.

How to see resource usage

In GDE goto options View > Tracking Details - will se and memory usage, etc.

assign keys component

Easy and saves development time. Need to understand and you can't control it easily. 

Join in DB vs join in Ab Initio

Scenario 1 (preferred): we run query which joi... us the result in just 1 DB component.

Scenario 2 (much slower): we use 2 database c... and join them in Ab Initio.

Join with DB

not recommended if number of records is big. It is bet... and then join in Ab Initio.

Data Skew

Parameter showing how data is unevenly distributed b...

skew = (partition size - avg.part.size)* 100 / (size of th...
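A worked instance of the skew formula is sketched below. The source formula is cut off in the denominator; this sketch ASSUMES it is the size of the largest partition, which may not match the original. The function name `partition_skew` is hypothetical.

```shell
#!/bin/sh
# Print "index skew%" for each partition size given as an argument.
# ASSUMPTION: the denominator is the size of the largest partition;
# the source formula is cut off at that point.
partition_skew() {
    printf '%s\n' "$@" | awk '
        { size[NR] = $1; total += $1; if ($1 > max) max = $1 }
        END {
            if (max == 0) exit          # nothing to measure
            avg = total / NR
            for (i = 1; i <= NR; i++)
                printf "%d %.1f\n", i - 1, (size[i] - avg) * 100 / max
        }'
}

# Four partitions, one three times larger than the rest.
partition_skew 100 100 100 300
```

Under that assumption the three small partitions come out negative (below average) and the large one at 50.0, so a value near zero on every partition indicates an even distribution.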

dbc vs cfg

.dbc - database configuration file (dbname, nodes, ver... the db directory

.cfg - any type of config file. for example, remote con... remote server, user/pwd to connect to db, location of O... connection method). .cfg file resides in the config dir.

compilation errors

depth not equal
data format error
etc...

depth error: we get this error when the layouts of two components don't match

types of partitions

broadcast pbyexpression pbyroundrobin pbykey pwith

unused port

when joining, used records go to the output port, unus port

tuning performance

Go parallel using partitioning. Roundrobin pa... balance.
Use Multi-file system (MFS).
Use Ad Hoc MFS to read many serial files in p... component.
Once data is partitioned - do not switch it to s... instead.
Do not access large files via NFS - use FTP ...
use lookup local rather than lookup (especially ...
Use rollup and Filter as soon as possible to red... Ideally do it in the source (database ?) before y...
Remove unnecessary components. For exampl... exp, you can implement the same function in r... Another example - when joining data from 2 fi... instead of adding an additional component for ...
use gather instead of concatenate.
it is faster to do a sort after a partition, than to ...
try to avoid using a join with the "db" compon...
when getting data from database - make sure y... indexes, etc.). If possible, do necessary selectio... the database before getting data into Ab Initio.
tune Max_core for Optimal performance (for s... the input file).
Note - If in-memory join cannot fit its non-driv... MAX-CORE, then it will drop all the inputs to ... not make sense.
Using phase breaks let you allocate more mem... components - thus improving performance.
Use checkpoint after sort to land data on disk
Use Join and rollup in-memory feature
When joining very small dataset to a very larg... to broadcast the small dataset to MFS using br... the small file as lookup. But for large dataset d... partitioner.
Use Ab Initio layout instead of database defau...
Change AB_REPORT parameter to increased m...
Use catalogs for reusability
Components like join/ rollup should have the o... if they are placed after a sort component.
minimize number of sort components. Minimi... component, and if possible replace them by in... only required fields in the sort reformat join co... Groups" instead of just Sort when data was alr...
Use phasing/flow buffers in case of merge sort...
Minimize the use of regular expression functio... transfer functions
Avoid repartitioning of data unnecessarily.
Wh... more than two flows, use Reformat rather than ...
For joining records from 2 flows use Concaten... there is a need to follow some specific order in ... is required then it is preferable to use Gather c...
Instead of putting many Reformat components ... indexes parameter in the first Reformat compo... condition there.



delta table

Delta table maintain the sequencer of each dat...

Master (or base) table - a table on top of which ...

scan vs rollup

rollup - performs aggregate calculations on groups, sc totals
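The difference can be illustrated outside Ab Initio with a small awk sketch: a rollup-style aggregate emits one record per key group, while a scan-style aggregate emits one record per input record carrying the running total. The function names `rollup_sum` and `scan_sum` are hypothetical, and the input is assumed to be "key value" lines already grouped by key.

```shell
#!/bin/sh
# Input: "key value" lines, already grouped by key.
# rollup_sum: one output record per key with the group total.
rollup_sum() {
    awk '
        NR > 1 && $1 != key { print key, total; total = 0 }
        { key = $1; total += $2 }
        END { if (NR > 0) print key, total }'
}

# scan_sum: one output record per input record with the running
# total within its key group.
scan_sum() {
    awk '
        $1 != key { total = 0 }
        { key = $1; total += $2; print key, $2, total }'
}

data='a 1
a 2
b 5
b 7'
echo "$data" | rollup_sum   # two records: a 3, b 12
echo "$data" | scan_sum     # four records, ending with b 7 12
```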

packages

used in multistage components or transform compone...

Reformat vs "Redefine Format"

Reformat - deriving new data by adding/dropp...

Redefine format - rename fields

Conditional DML

DML which is separated based on a condition

SORTWITHINGROUP

The prerequisite for using sortwithingroup is th... by the major key. sortwithingroup outputs the ... reading the major key group. It is like an impli...

passing a condition as a parameter

Define a Formal Keyword Parameter of type string. Fo... FilterCondition, and you want it to do filtering on CO... graph in your "Filter by expression" Component enter $FilterCondition

Now on your command line or in wrapper script give (quote the condition so the shell does not treat > as a redirect):
YourGraphname.ksh -FilterCondition "COUNT > 0"

Passing file name as a parameter

#!/bin/ksh
# Running the set up script on environment
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Assigning the script parameters to the input file variables
if [ $# -eq 2 ]; then
    INPUT_FILE_PARAMETER_1=$1
    INPUT_FILE_PARAMETER_2=$2
    # This graph is using the input file
    cd $AI_RUN
    ./my_graph1.ksh $INPUT_FILE_PARAMETER_1
    # This graph also is using the input file
    ./my_graph2.ksh $INPUT_FILE_PARAMETER_2
    exit 0;
else
    echo Insufficient parameters
    exit 1;
fi
------------------------------------
#!/bin/ksh
# Running the set up script on environment
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Exporting the script parameter1 to INPUT_FILE_NAME
export INPUT_FILE_NAME=$1
# This graph is using the input file
cd $AI_RUN
./my_graph1.ksh
# This graph also is using the input file.
./my_graph2.ksh
exit 0;

How to remove header and trailer lines?

How to create a multi file system on Windows

use conditional dml where you can separate detail from validations use reformat with count :3 (out0:header ou 
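When the input is a plain newline-delimited file, the header and trailer can also be dropped before the graph reads it. The sketch below ASSUMES exactly one header line and one trailer line; the helper name `strip_header_trailer` is hypothetical.

```shell
#!/bin/sh
# Drop the first and last line of a newline-delimited file.
# ASSUMPTION: exactly one header line and one trailer line.
strip_header_trailer() {
    sed '1d;$d' "$1"
}

sample=$(mktemp)
printf 'HDR\nrow1\nrow2\nTRL\n' > "$sample"
strip_header_trailer "$sample"   # prints row1 and row2
rm -f "$sample"
```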

first method: in GDE go to RUN > Execute Co c:control c:dp1 c:dp2 c:dp3 c:dp4



second method: double-click on the file compo double-click on partitions - there you can enter

Vector

A vector is simply an array. It is an ordered set of elem can be any type, including a vector or a record).

Dependency Analysis

Dependency analysis will answer the questions regard... does the data come from what applications produce a...

Surrogate key

There are many ways to create a surrogate key. For ex... next_in_sequence() function in your transform. Or you ... values" component. Or you can write a stored procedu...

Note: if you use partitions, then do something like this

(next_in_sequence()-1)*no_of_partition()+this_partiti...
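The arithmetic behind the partitioned formula can be checked with a small sketch. It ASSUMES the truncated last term is a zero-based partition index and that the sequence starts at 1 in every partition; the function name `surrogate_key` and its arguments are hypothetical stand-ins for the DML functions.

```shell
#!/bin/sh
# surrogate_key SEQ NPARTS PART
#   SEQ    - per-partition sequence value, starting at 1
#   NPARTS - number of partitions
#   PART   - partition index (assumed zero-based here)
surrogate_key() {
    echo $(( ($1 - 1) * $2 + $3 ))
}

# With 3 partitions, show the keys for the first two records of each.
for part in 0 1 2; do
    for seq in 1 2; do
        echo "partition $part seq $seq -> key $(surrogate_key "$seq" 3 "$part")"
    done
done
```

Each partition strides through the key space in steps of the partition count, offset by its own index, so no two partitions can produce the same key.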

.abinitiorc

This is a config file for ab initio - in user's home direc... $AB_HOME/Config. It sets abinitio home path, confi... (AB_WORK_DIR, AB_DATA_DIR, etc.), login info ... login methods for hosts for execution (like EME host, ...

.profile

your ksh init file ( environment, aliases, path variables ... command prompt settings, etc.)

data mapping, data modelling

How to execute the graph

From GDE - whole graph or by phases. From checkpo

Write Multiplefiles

A component which allows to write simultaneously in

Testing

Run the graph - see the results. Use components from

Sandbox vs EME

Sandbox is your private area where you develop and t one version can be in the sandbox at any time. The EM versions of the code that have been checked into it (so

Layout

Where the data-files are and where the components ar data - serial or partitioned (multi-file). The layout is d file (or a control file for the multifile). In the graph the automatically (for multifile you have to provide detail

Latest versions

April 2009: GDE ver.1.15.6, Co-operative system ver

Graph parameters

menu edit > parameters - allows you to specify private They can be of 2 types - local and formal.

Plan>It

You can define pre- and post-processes, triggers. Also run on success or on failure of the graphs.

Frequently used components

input file / output file
input table / output table
lookup / lookup_local
reformat
gather / concatenate
join
runsql
join with db
compression components
filter by expression
sort (single or multiple keys)
rollup
trash
partition by expression / partition by key

running on hosts

co>operating system is layered on top of native OS (u... GDE, GDE generates a script (according to "run" settin... execute the scripts on different machines (using specif... connection methods, like rexec telnet rsh rlogin) - and ... codes back.

conventional loading vs direct loading

This is basically an Oracle question - regarding SQLL...
Conventional load - using insert statements. All trigge... will be checked, all indexes will be updated.

Direct load - data is written directly block by block. C... partition. Some constraints are checked, indexes may ... native options to skip index maintenance.

in abinitio there are 3 types of joins: inner join, outer j... semi-join

for inner join 'record_requiredN' parameter is t...
for outer join it is false for all the "in" ports.
for semi join it is true for the required compon... components.

http://www.geekinterview.com/Interview-Questions/Data-Warehouse/Abinitio/page10
