Ab Initio FAQ's Part1.doc


1) What do you mean by specifying layouts in Ab Initio?
Ans: The layout determines whether a component runs in serial or in parallel mode. If you specify a serial directory as the path, the component runs as a single stream; if you specify a multifile directory, the component runs in parallel. The path you specify also serves as the working directory of the component, where intermediate files are stored. A layout can be specified as: 1) propagate from neighbors, 2) URL, 3) custom, or 4) host. Before you can run an Ab Initio graph, you must specify layouts to describe the following to the Co>Operating System:

- The location of files
- The number and locations of the partitions of multifiles
- The number of partitions of program components, and the locations in which those partitions execute

Layout is one of the following:

- A URL that specifies the location of a serial file
- A URL that specifies the location of the control partition of a multifile
- A list of URLs that specifies the locations of:
  - The partitions of an ad hoc multifile
  - The working directories of a program component
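For illustration, a serial layout URL and a multifile layout URL typically look like the following (the host name and paths here are made up for this example):

  file://sun1/u/jo/data/customers.dat (a serial file on one machine)
  mfile://sun1/u/jo/mfs_8way/customers.dat (the control partition of an 8-way multifile)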

2) What is skew?
Ans: Skew describes unbalanced data partitioning, i.e. how unevenly records are distributed across partitions. You can do performance tuning by controlling skew, max-core, and so on; there are many ways.

3) How to read only 10 records from an input file?
Ans: 1) There is a component called Sample (in the Sort folder of components). If you place it after the input file, you can specify how many records to let through. 2) In the Input Table component, on the Parameters tab, you can specify how many records to read. 3) There is also a Leading Records component (in 2.11, at least) that lets you specify the number of records to read from a serial or MFS file. 4) Another way is the Read Raw component, available in 2.11 or higher, although you will then have to describe and process the record structure yourself, since it works with raw data.

4) How do you go from 4-way to 8-way parallelism in a graph?
Ans: Repartition the data: place a partition component (with a gather/departition first, if you need to collapse the data) between the two layouts, with the 4-way MFS on the upstream side and the 8-way layout downstream. The connecting flow becomes an all-to-all (repartitioning) flow.

5) In Ab Initio, what are upstream and downstream?
Ans: Upstream and downstream are used in conjunction with the EME for dependency and impact analysis of the graphs we have developed and saved into the repository. Basically, they help track the changes between versions and the changes in the individual components and in the variables within the components.

6) To extract files from both Oracle and mainframe DB2: is it possible to extract the data directly from the databases, or do I need to convert them into flat files first and then load?
Ans: You can extract directly from the database in each case. You just have to make sure that you have a config (.dbc) file set up for DB2 and for Oracle, and that all of your login variables are set in your run settings. To my knowledge, you can also unload the data directly from DB2, Oracle, or Informix using the Unload DB Table component.

7) How many columns can a lookup file have, and what is the maximum amount of data it can hold? I am doing a code review for my application and I see 8 to 10 columns in each lookup file, with large amounts of data.
Ans: 1) There is no set limit on the number of columns a lookup file can contain. There is, however, a limit on the size of the data file. If you believe 8-10 columns is too large, you might be correct. If the lookup contains anything over roughly 750,000 to 1 million records, I would highly recommend using a Join instead; the lookup will fail if it grows too large, and you will then have to recode it as a join. 2) Lookups get cached into memory during graph execution, so it is always a good idea to keep the data in a lookup to the bare minimum the requirement needs. Don't keep any columns in the lookup file that you don't need or don't access. If the graph is partitioned, try to use lookup_local wherever possible; for this, your partition key and lookup key must match, or the lookup key must be a leading subset of the partition key. Rule of thumb: trim any fields from the data that you don't use in downstream processing. 3) The limit for a lookup file is 2 GB. Whether it is sensible to use a lookup of that size depends on what it is being used for.

9) How can I stop an executing graph in the middle on some condition, and how do I restart it?
Ans: 1) Doing a kill -9 PID1 PID2 will only kill the Ab Initio processes running on the host node; we may still have Ab Initio processes running on other agent nodes. At runtime, Ab Initio creates a recovery (.rec) file in the host directory specified under Run > Settings in the GDE. If no host directory is specified, the file is created in the default $HOME of the user specified under Run > Settings. This recovery file contains pointers to the various temporary files created dynamically at runtime. To kill an Ab Initio job and all its associated processes across all nodes, execute the following two commands in order: 1. m_kill -9 <recovery-file> 2. m_rollback -d <recovery-file>

If the graph execution has to be stopped depending on certain conditions, use the force_error() function.

10) What are the functions used for system date?
a. today() : Returns the internal representation of the current date on each call.
b. today1() : Returns the internal representation of the current date as of the first call; subsequent calls return the same value.
c. now() : Returns the current local date and time.
d. now1() : The first time a component calls now1, the function returns the value returned by the system function localtime. On the second and subsequent calls, it returns the same value it returned on the first call.
Note: DML represents dates internally as integer values specifying days relative to January 1, 1900.

11) How to convert a string into date format?
Ans: The string needs to be cast to the date format. So if you have an input field containing the string "20031130" and your output field is a date (YYYY-MM-DD), cast through the matching source format first:
out.fieldname :: (date("YYYY-MM-DD"))(date("YYYYMMDD"))in.fieldname;
Note: if the input field contains NULL data this fails, so use the is_valid() and is_defined() functions to check the validity of the input data (a fuller sketch follows after question 13 below).

12) What is the relation between EME, GDE and Co-Operating System?
Ans: EME stands for Enterprise Meta>Environment, GDE for Graphical Development Environment, and the Co>Operating System can be thought of as the Ab Initio server. The relation between them is as follows: the Co>Operating System is the Ab Initio server, installed on a particular OS platform, which is called the native OS. The EME is much like the repository in Informatica; it holds the metadata, transformations, db config files, and source and target information. The GDE is the end-user environment where we develop graphs (mappings, just like in Informatica); the designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The GDE sits on the user side, while the EME is on the server side.

13) What is the use of Aggregate when we have Rollup? As we know, the Rollup component in Ab Initio is used to summarize groups of data records, so where would we use Aggregate?
Ans: Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use and much more explicit about how a particular summarization happens. Rollup also offers extra functionality, such as input and output filtering of records. Both perform the same basic action, but Rollup exposes its intermediate (temporary) results in main memory, while Aggregate does not support intermediate results.
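As flagged in question 11, a fuller rule combining the cast with the validity checks might look like the following: a minimal sketch, where the field names (dt, order_date) and the fallback value are assumptions.

out :: reformat(in) =
begin
  out.order_date ::
    if (is_defined(in.dt) and is_valid((date("YYYYMMDD"))in.dt))
      (date("YYYY-MM-DD"))(date("YYYYMMDD"))in.dt
    else
      (date("YYYY-MM-DD"))"1900-01-01";   /* fallback for NULL or malformed input */
end;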

14) What kinds of layouts does Ab Initio support?
Ans: Basically there are serial and parallel layouts. A graph can have both at the same time. The parallel layout depends on the degree of data parallelism: if the multifile system is 4-way parallel, then a component in the graph can run 4 ways parallel if its layout is defined with the same degree of parallelism.

15) How can you run a graph infinitely?
Ans: To run a graph infinitely, the end script of the graph should call the .ksh file of the graph itself. Thus, if the graph is named abc.mp, the end script should contain a call to abc.ksh. This way the graph runs indefinitely.

16) Do you know what a local lookup is?
Ans: If your lookup file is a multifile partitioned/sorted on a particular key, then the lookup_local function can be used instead of the plain lookup call. It is local to a particular partition, depending on the key. A lookup file consists of data records that can be held in main memory, which lets the transform function retrieve records much faster than retrieving them from disk, and lets the transform component process records from multiple files quickly.

17) What is the difference between a lookup file and a lookup, with a relevant example?
Ans: Generally, a lookup file represents one or more serial files (flat files) whose data is small enough to be held in memory; this allows transform functions to retrieve records much more quickly than they could from disk. A lookup is the mechanism in an Ab Initio graph by which we store data and retrieve it using a key parameter; the lookup file is the physical file where the data for the lookup is stored.

18) How many components were in your most complicated graph?
Ans: It depends on the type of components you use. In general, avoid using very complicated transform functions in a graph.

19) Explain what a lookup is.
Ans: A lookup is basically a keyed dataset. It can be used to map values according to the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup, if one of the inputs to the join has few records with a slim record length. Ab Initio has built-in functions to retrieve values from the lookup using the key.

20) What is a ramp limit?
Ans: The limit parameter is an integer representing an absolute number of reject events, and the ramp parameter is a real number (from 0 to 1) representing a rate of reject events relative to the number of records processed. Together they define the threshold of allowed bad records:
number of bad records allowed = limit + number of records * ramp
For example, with limit 10 and ramp 0.01, a run of 1,000 records tolerates 10 + 1,000 * 0.01 = 20 rejects.

21) Have you worked with packages?
Ans: A multistage transform component uses packages by default. A user can also create his own set of functions in a transform file and include it in other transforms.

22) Have you used the Rollup component? Describe how.
Ans: If you want to group records on particular field values, Rollup is the best way to do it. Rollup is a multi-stage transform and contains the following mandatory functions: 1. initialize 2. rollup 3. finalize. You also need to declare a temporary variable if you want counts per group. For each group, Rollup first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.

23) How do you add default rules in the transformer?
Ans: Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names (generates a set of rules that copies input fields to output fields with the same name) or Use Wildcard (.*) Rule (generates one rule that copies input fields to output fields with the same name). Steps: 1) If it is not already displayed, display the Transform Editor grid. 2) Click the Business Rules tab if it is not already displayed. 3) Select Edit > Add Default Rules. In the case of Reformat, if the destination field names are the same as, or a subset of, the source fields, then you need not write anything in the reformat xfr, unless you want a real transform beyond reducing the set of fields or splitting the flow into a number of flows.

24) What is the difference between partitioning with key and round robin?
Ans: Partition by Key (hash partitioning) is used to partition data when the key values are diverse. If some key values occur in large volume there can be a large data skew, but this method is used most often for parallel data processing because related records land in the same partition. Round-robin partitioning distributes the data uniformly across the destination partitions; the skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt to 4 players in round-robin fashion.

25) How do you improve the performance of a graph?
There are many ways the performance of a graph can be improved:
1) Use a limited number of components in a particular phase
2) Use optimum max-core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted joins and, if possible, replace them with in-memory/hash joins
5) Use only required fields in sort, reformat, and join components
6) Use phasing/flow buffers in the case of merges and sorted joins
7) If the two inputs are huge, use a sorted join; otherwise use a hash join with a proper driving port

8) For large datasets, don't use Broadcast as a partitioner
9) Minimise the use of regular-expression functions such as re_index in transform functions
10) Avoid repartitioning data unnecessarily
Also, try to run the graph in MFS for as long as possible: the input files should be partitioned, and if possible the output file should be partitioned as well.

26) How do you truncate a table?
Ans: From Ab Initio, use the Run SQL component with the DDL statement "truncate table <table-name>", or use the Truncate Table component.

27) Have you ever encountered an error called "depth not equal"?
Ans: When two components are linked together and their layouts don't match, this problem can occur during compilation of the graph. A solution is to use a partitioning component in between wherever the layout changes.

28) What function would you use to convert a string into a decimal?
Ans: No specific function is required if the string and the decimal are the same size; a decimal cast in the transform function suffices. For example, if the source field is defined as string(8) and the destination as decimal(8):
out.field :: (decimal(8))in.field;
If the destination field is smaller than the input, say decimal(5), use string_substring:
out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5)); /* string_lrtrim trims leading and trailing spaces */

29) What is an outer join?
An outer join is used when you want to select all the records from a port, whether or not they satisfy the join criteria.

30) What are Cartesian joins?
A Cartesian join joins two tables without a join key; the key should be {}.

31) What is the difference between a DB config and a CFG file?
A .dbc file holds the information Ab Initio needs to connect to the database to extract or load tables or views, while a .cfg file is the table configuration file created by db_config for use with components like Load DB Table.

32) What is the relation between EME, GDE and Co-Operating System?
Ans: See the answer to question 12 above.

33) Explain the difference between the "truncate" and "delete" commands.
TRUNCATE is a DDL command whereas DELETE is a DML command. A truncate cannot be rolled back, whereas a delete can. A WHERE clause cannot be used with TRUNCATE, whereas it can be used with DELETE.

34) How can we create a job sequencer in Ab Initio, i.e. run a number of graphs one after another?
There is no job sequencer supported by Ab Initio (up to GDE 1.13.3 and Co>Op 2.12.1). But we can sequence the jobs by writing wrapper scripts in UNIX, i.e. a Korn shell script that calls the graphs in sequence. Scheduling of the jobs can also be done with a scheduling tool such as Control-M: the graphs' scripts and wrapper scripts are placed in the required order of execution, and we can monitor the execution of the graphs. For example, suppose you have graphs A, B, C, where A's output is input to B and B's output is input to C; the wrapper script simply calls a.ksh, b.ksh, c.ksh in order. (Separately, the next_in_sequence function returns a sequence of integers within a graph; it is not a job sequencer.)

35) How to take input data from an Excel sheet?
There is a Read Excel component that reads Excel files either from the host or from a local drive; the DML will be a default one. Alternatively, save the sheet as a CSV (delimited) file and read it through an input file/table component.

36) What function would you use to convert a string into a decimal?
Use the reinterpret_as function to convert a string to a decimal, or a decimal to a string. Syntax, converting a decimal into a string:
reinterpret_as(ebcdic string(13), (ebcdic decimal(13))(in.cust_amount))

37) How to run the graph without the GDE?

In the run directory a graph can be deployed as a .ksh file. This .ksh file can then be run at the command prompt as: ksh <graph-name>.ksh

38) How to work with parameterized graphs?
One of the main purposes of parameterized graphs is that if we need to run the same graph many times over different files, we set up graph parameters such as $INPUT_FILE and $OUTPUT_FILE and supply their values under Edit > Parameters. These parameters are substituted at run time. We can define different types of parameters: positional, keyword, local, and so on. The idea is that instead of maintaining different versions of the same graph, we maintain one version that works for different files.

Have you worked with packages?
Packages are nothing but reusable blocks of objects such as transforms, user-defined functions, and DMLs. These packages have to be included in the transform where you use them. For example, consider a user-defined function like:

/* string_trim.xfr */
out :: trim(input_string) =
begin
  let string(35) trimmed_string = string_lrtrim(input_string);
  out :: trimmed_string;
end;

Now the above xfr can be included in the transform where you call the function, with:
include "~/xfr/string_trim.xfr";
But this must be included ABOVE your transform function. For more details see the help file under "packages".

What is an outer join?
If you want to see all the records of one input file regardless of whether there is a matching record in the other file, it is an outer join.

What is a driving port?
In a join, it is sometimes advantageous to set the sorted-input parameter to "In memory: Input need not be sorted". This helps when we are sure that one of the input ports has far fewer records than the other, and the data from that port can be held in memory. In this case, we set the other (larger) port as the driving port.

Say, for example, port in0 has 1,000 records and in1 has 1 million records; in this case we set in1 as the driving port, for which the value would be 1. By default, the driving port value is 0 (in0). Depending on the requirement, it is sometimes more advisable to create a lookup instead, but that depends on the requirement and design.

What is writing of a wrapper? Can anyone explain elaborately?
Writing a wrapper script helps you run graphs in the sequence you want. For example, when you need to run three graphs, where after the first graph runs successfully you must take the feed it generates and use it in the next graph, and so on: you write a Unix script that runs the ksh of the first graph, checks that it finished successfully, then runs the second ksh, and so on.

What is conditional DML? Can anyone please explain with an example?
A DML whose record format depends on a condition is known as conditional DML. Suppose we have data that includes a header, main data, and a trailer, as below:

10 This data contains employee info.
20 emp_id,emp_name,salary
30 count

The DML for the above structure would be:

record
  decimal(",") id;
  if (id == 10)
  begin
    string(",") info;
  end
  else if (id == 20)
  begin
    string(",") emp_id;
    string(",") name;
    string(",") salary;
  end
  else if (id == 30)
  begin
    decimal(",") count;
  end
end;

This is conditional DML.

Could anybody provide the major UNIX commands for the Ab Initio multifile system?
m_mkfs - creates a multifile system
m_ls - lists multifiles
m_rm - removes a multifile
m_cp - copies a multifile

What is meant by a vector field? Explain with an example.
A vector is a sequence of elements of the same type; the element type may be any type, including a vector or record type. A vector field tells us how many times a particular field is repeated. For example, take this input:

Cust_id  purchase_amount  purchase_date
101      1000             29.08.06
101      500              30.08.06
102      1050             31.08.06
103      1140             01.09.06
103      1000             02.09.06
103      500              30.09.06

and this vectorized output:

Cust_id  total_purchase_amount  no_purchase  purchase_date(1)  purchase_date(2)  ...
101      1500                   2            29.08.06          30.08.06
102      1050                   1            31.08.06
103      2640                   3            01.09.06          02.09.06  (and so on)
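A DML record format matching this vectorized output could look like the following: a minimal sketch, with the date format, delimiters, and field sizes assumed.

record
  decimal(",") cust_id;
  decimal(",") total_purchase_amount;
  decimal(",") no_purchase;
  date("DD.MM.YY") purchase_date[no_purchase];  /* vector whose length is taken from no_purchase */
end;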

Here no_purchase is the vector field, which represents the number of purchases a customer has made.

What does dependency analysis mean in Ab Initio?
Dependency analysis answers questions about data lineage: where does the data come from, and what applications produce and depend on this data.

For data parallelism, we can use partition components. For component parallelism, we can use the Replicate component. Which component(s) can we use for pipeline parallelism?
Pipeline parallelism is when a connected sequence of components on the same branch of a graph executes concurrently. Components like Reformat, where we distribute an input flow to multiple output flows using output_index based on some selection criterion and process those output flows simultaneously, create pipeline parallelism. But components like Sort, where the entire input must be read before a single record is written to the output, cannot achieve pipeline parallelism. For example, in the flow

input file ----> reformat ----> rollup ----> filter by expression ----> output file

you might see the 50th record entering the reformat while 25 records have reached the rollup and 10 records have passed the filter. More generally, whenever you run a graph and watch the record counts on the flows, you are seeing pipeline parallelism at work.

What is .abinitiorc and what does it contain?
.abinitiorc is the configuration file for Ab Initio, found in the user's home directory. It generally contains the Ab Initio home path and login information (IDs, encrypted passwords, and login methods) for the hosts the graph connects to during execution. It contains configuration variables such as AB_WORK_DIR and AB_DATA_DIR; a system-wide version of the file can be found in $AB_HOME/Config.

What do you mean by .profile in Ab Initio?
.profile is a file that is executed automatically when a particular user logs in. You can change your .profile to include any commands you want executed whenever you log in, and you can even put commands in it that override settings made in /etc/profile (which is set up by the system administrator). In your .profile you can set environment settings, aliases, path variables, the name and size of your history file, primary and secondary command prompts, and more.

What is a semi-join?
In Ab Initio there are three types of join: 1. inner join, 2. outer join, and 3. semi-join. For an inner join, the record_requiredn parameter is true for all input ports. For an outer join it is false for all input ports. For a semi-join, set record_requiredn to true for the required port and false for the other ports.

What are data mapping and data modeling?
Data mapping deals with the transformation of the extracted data at the FIELD level, i.e. the transformation of a source field to a target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded. For example:

source: string(35) name = "Siva Krishna   ";  (the value is padded with trailing spaces)
target: string("01") nm = NULL("");  /* maximum length is string(35) */

Then we can have a mapping like: straight move, trimming the leading and trailing spaces.

What is a driving port? When do you use it?
The driving port of a join supplies the data that drives the join: every record from the driving port is compared against the data from the non-driving ports. We set the driving port to the larger dataset so that the smaller, non-driving data can be kept in main memory, speeding up the operation.

What is $mpjret? Where is it used in Ab Initio?
$mpjret is the return value of the "mp run" shell command, i.e. of the execution of an Ab Initio graph.

What is data cleaning? How is it done?
Put simply, it is purifying the data. Data cleansing is the act of detecting and removing and/or correcting a database's dirty data, i.e. data that is incorrect, out of date, redundant, incomplete, or formatted incorrectly.

1. What is the difference between the GDE and the Co>Operating System?

The GDE (Graphical Development Environment) is the GUI used to develop graphs in a simple manner. The Co>Operating System is, in effect, a distributed operating system layer that runs as the back-end server. (At the time of writing, the current version of the GDE is 1.15 and of the Co>Operating System 2.14.)

2. Which process do you follow to develop a graph?

- Getting the requirements
- Preparing the mapping document (a mapping document maps each input field to an output field with some functional logic)
- Implementing the graph with the proper components, using the design documents

3. Which components have you worked with?

- Reformat
- Rollup
- Join
- Sort
- Replicate
- Partition by Expression and Partition by Key
- Redefine Format
- Multi Update
- Lookup
- Intermediate files

4. Explain the Reformat component.
Reformat changes record formats by dropping, adding, or combining fields.

Ports:
- Input
- Output
- Reject
- Log
- Error

Specific parameters:
- Select
- Output Index

5. What is the difference between the output_index and select parameters in Reformat?
Both select and output_index are used to filter data, but with the select parameter you cannot get at the deselected records, whereas with output_index you can filter the data and also route the deselected records to another output port.
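A minimal output_index sketch (the field name and the threshold are assumptions): the transform returns the number of the output port each record should go to.

out :: output_index(in) =
begin
  out :: if (in.purchase_amount > 1000) 0 else 1;  /* 0 -> out0 (selected), 1 -> out1 (the rest) */
end;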

6. What is the difference between the Reformat and Redefine Format components?
Reformat can change the record format by dropping, adding, or modifying fields. Redefine Format copies records from input to output without changing the record values; it only reinterprets the format.

7. Explain the Join component.

Join reads data from two or more inputs, combines the records with matching keys, and sends them to the output port.

Specific parameters:

- Dedup: set to true to remove duplicates before joining.

- Driving port: the driving port is the largest input; the remaining inputs are read directly into memory. (Available only when sorted-input is set to "In memory: Input need not be sorted".)

- Join type:
  1. Inner join
  2. Full outer join
  3. Explicit join

- Record-required parameters: available only when the join type is set to Explicit. For a left outer join, set record-required0 (input 0) to true and record-required1 (input 1) to false; for a right outer join, set record-required0 to false and record-required1 to true.

- Key: the matching key fields.

- Override key: sets alternative key-field names for a particular input.

- Max memory: the maximum number of bytes the join holds in memory before writing temporary files to disk. (Available only when sorted-input is set to "Inputs must be sorted".) The default is 8 MB.

- Select: filters the data.

- Max-core: the maximum number of bytes the join holds in memory before writing temporary files to disk. (Available only when sorted-input is set to "In memory: Input need not be sorted".) The default is 64 MB.

- Sorted-input: when set to "Inputs must be sorted", the join accepts only sorted input; when set to "In memory: Input need not be sorted", it accepts unsorted data.

Specific ports:
- unusedn: the unmatched records of input n can be retrieved from its unused port.

9. Can we make an explicit join with more than two inputs?
Yes. For example, with three inputs: for a left outer join, set the record-required parameter to true for input 0 and false for inputs 1 and 2; for a right outer join, set it to false for inputs 0 and 1 and to true for input 2.

10. What is the difference between Merge and Join?
Both components combine data based on keys, but Join combines two (or more) input flows by matching keys, while Merge combines the partitions of sorted, partitioned data into a single ordered flow.

11. Explain the Sort component.
The Sort component sorts and merges data. Parameters:
- Key
- Max-core (the default is 100 MB)

12. How do you determine the maximum memory usage of a component?
The maximum available value of max-core is 2^31 - 1 bytes.

13. Explain Partition by Key and Partition by Expression.
Partition by Key distributes records to the output flow partitions according to their key values. Partition by Expression distributes records to the output flow partitions according to a DML expression (see the sketch below).
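For question 13, Partition by Expression takes a DML expression whose value (taken modulo the number of partitions) selects the destination partition. A minimal sketch, assuming a numeric cust_id field and a 4-way output layout:

in.cust_id % 4  /* records with the same value of cust_id % 4 land in the same partition */

Partition by Key, by contrast, just takes a key specifier such as {cust_id} and hashes it internally.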

14. What are the different types of partition components?

- Partition by Key
- Partition by Expression
- Partition by Round-robin
- Partition by Range
- Broadcast

15. What is the difference between Broadcast and Replicate?
Broadcast combines the records it receives into a single flow and writes a copy of that flow to each of its output flow partitions; Broadcast supports data parallelism. Replicate combines the records it receives into a single flow and writes a copy of that flow to each of its output flows; Replicate supports component parallelism.

16. What is the difference between Concatenate and Merge?
Concatenate appends multiple flow partitions one after another. Merge combines multiple flow partitions that have been sorted on a key, preserving the sort order.

16. What are the different departition components?

- Merge
- Interleave (combines in round-robin fashion)
- Concatenate
- Gather (combines the data arbitrarily)

16. What is the difference between Reformat and Filter by Expression?
In both components we can filter data with a select expression, but in Reformat we cannot get the deselected records on a separate port, whereas Filter by Expression has a separate deselect port.

18. Explain the difference between Aggregate and Rollup.
Both components are used for summarization, but Aggregate does not have the built-in aggregation functions. In Rollup we have built-in functions such as sum(), avg(), count(), min(), max(), first(), last(), and product().

19. Explain the difference between Rollup and Scan.
Rollup gives total control over the summarization and produces one summary record per group. Scan produces intermediate (cumulative) summary records, one per input record.
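To illustrate the built-in functions of question 18, a template-mode Rollup transform might look like this: a sketch, with the field names assumed and the component keyed on cust_id.

out :: rollup(in) =
begin
  out.cust_id               :: in.cust_id;
  out.total_purchase_amount :: sum(in.purchase_amount);   /* built-in aggregation */
  out.no_purchase           :: count(in.purchase_amount); /* built-in count */
end;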

20. What are the aggregator functions in Rollup?

- temporary_type (declares the temporary variable)
- initialize (initializes the needed values)
- rollup (does the summarization)
- finalize (assigns the final values)
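Put together, an expanded-mode Rollup package built from these four parts could look like the following: a minimal sketch that counts records and sums an assumed purchase_amount field per key.

type temporary_type =
record
  integer(4) cnt;
  real(8) total;
end;

temp :: initialize(in) =
begin
  temp.cnt :: 0;
  temp.total :: 0;
end;

temp :: rollup(temp, in) =
begin
  temp.cnt :: temp.cnt + 1;                     /* count records in the group */
  temp.total :: temp.total + in.purchase_amount;
end;

out :: finalize(temp, in) =
begin
  out.cust_id :: in.cust_id;
  out.no_purchase :: temp.cnt;
  out.total_purchase_amount :: temp.total;
end;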

21. What are the different types of sort components?

- Sort
- Sort within Groups
- Checkpointed Sort
- Partition by Key and Sort

23. What is a multifile and how can we create one from the command line?
An Ab Initio multifile is a partitioning of a large serial file into a tree structure (a control file pointing at data partitions) so it can be processed in parallel. From the command line, a multifile system is created with m_mkfs, followed by the URL of the control directory and the URLs of the partitions.

24. What is the difference between a phase and a checkpoint?

- Phases are used to break a graph into blocks, mainly for performance tuning.
- A checkpoint is used for recovery.

25. Explain the types of parallelism supported by Ab Initio.
Ab Initio supports three types of parallelism:

- Component parallelism
- Pipeline parallelism
- Data parallelism

Component parallelism: Component parallelism occurs when program components execute simultaneously on different branches of a graph.

Pipeline parallelism: Pipeline parallelism occurs when a connected sequence of program components on the same branch of a graph execute simultaneously.

Data parallelism: Data parallelism occurs when you separate data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.

26. Explain the types of flows in Ab Initio.

- Straight flow
- Fan-in flow
- Fan-out flow
- All-to-all flow

Straight flow: connects two components with the same depth of parallelism.

Fan-in flow: connects a component with a greater depth of parallelism to one with a lesser depth; in other words, it follows a many-to-one pattern.

Fan-out flow: connects a component with a lesser number of partitions to one with a greater number of partitions; in other words, it follows a one-to-many pattern.

All-to-all flow: an all-to-all flow is used:

- To connect components with different numbers of partitions, when the result of dividing the greater number of partitions by the lesser is not an integer
- For repartitioning data, using components with the same or different numbers of partitions (see Repartitioning)

28. Have you worked with conditional components?
You can make any component or subgraph conditional by specifying a condition expression that the GDE evaluates at runtime to determine whether or not the component runs. If the expression evaluates to true, the GDE runs the subgraph or component. If it evaluates to false, the GDE either disables the component and any flows connected to its ports, or replaces it with a flow, depending on your choice on the Properties dialog's Condition tab.

29. What is a subgraph?
A subgraph is a graph fragment. Just like graphs, subgraphs contain components and flows. A subgraph groups together components that perform a subtask in a graph, creating a reusable component that performs that subtask.

30. What sorts of functions have you worked with?

- Enquiry and error functions
- String functions
- Lookup functions
- Date functions

31. Which enquiry and error functions have you used?

- is_defined (tests whether the expression is not null)
  Syntax: is_defined(expr)
- is_null (tests whether the expression is null)
  Syntax: is_null(expr)
- is_error (tests whether an error would occur while evaluating the expression)
  Syntax: is_error(expr)
- is_valid (tests whether the expression is valid)
  Syntax: is_valid(expr)
- force_error (raises an error and returns a message)
  Syntax: force_error(string msg)
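These functions are typically used to guard transform rules against bad input. A minimal sketch (the salary and status fields and their defaults are assumptions):

out :: reformat(in) =
begin
  out.salary :: if (is_defined(in.salary)) in.salary else 0;        /* default for NULL input */
  out.status :: if (is_valid(in.status)) in.status else "UNKNOWN";  /* default for invalid data */
end;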

32. What string functions have you worked with?

- decimal_lpad
- decimal_lrpad
- string_compare
- string_substring
- string_concat
- string_index
- string_length
- string_lpad
- string_lrpad

(Note: please go through the help documentation for descriptions.)

33. How can we generate a sequence of numbers in Ab Initio?
There is a dedicated function, next_in_sequence, to generate the numbers 1 to n.
Syntax: int next_in_sequence()

34. How can we get log information in Ab Initio?
Using the write_to_log function we can write to the log port of a component.
Syntax: write_to_log(string event_type, string event_text)
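For question 33, a sketch of numbering records with next_in_sequence inside a Reformat transform (the row_num output field is an assumption):

out :: reformat(in) =
begin
  out.* :: in.*;
  out.row_num :: next_in_sequence();  /* assigns 1, 2, 3, ... to successive records (per partition) */
end;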

35. What is the use of the Lookup File component?
Lookup File represents one or more serial files or a multifile whose data is small enough to be held in main memory. This allows a transform function to retrieve records much more quickly than it could if they were stored on disk. Lookup File associates key values with corresponding data values to index and retrieve records. Parameters:

- Key
- RecordFormat

How to use Lookup File: unlike other dataset components, Lookup File is not connected to other components in a graph; in other words, it has no ports. However, its contents are accessible from other components in the same or later phases. You use a Lookup File from other components by calling one of the following DML functions in any transform function or expression parameter: lookup, lookup_count, or lookup_next. The first argument to these functions is the name of the Lookup File; the remaining arguments are values to be matched against the fields named by the key parameter. The lookup functions return a record that matches the key values and has the format given by the RecordFormat parameter. For details, see the Data Manipulation Language Reference. A file you want to use as a Lookup File must fit into memory; if it is too large, use an Input File followed by Match Sorted or Join instead. Information about Lookup Files is stored in a catalog, which allows you to share them with other graphs.

36. Have you worked with lookup functions?
I worked with the following functions:
- lookup
- lookup_count
- lookup_local
- lookup_next
(Note: please go through the help documentation for descriptions.)
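A typical guarded call, as a sketch (the lookup-file name "Rates" and its currency/rate fields are assumptions):

out :: reformat(in) =
begin
  out.* :: in.*;
  out.rate ::
    if (lookup_count("Rates", in.currency) > 0)
      lookup("Rates", in.currency).rate   /* the rate field of the first matching record */
    else 0;
end;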

37. How do you convert an output file or intermediate file to a lookup file?
By selecting the Add to Catalog check box.

39. Explain the performance tuning in your current project.
There are many ways the performance of a graph can be improved:
1) Use a limited number of components in a particular phase
2) Use optimum max-core values for sort and join components
3) Minimize the number of sort components
4) Minimize sorted joins and, if possible, replace them with in-memory/hash joins
5) Use only required fields in sort, reformat, and join components
6) Use phasing/flow buffers in the case of merges and sorted joins
7) If the two inputs are huge, use a sorted join; otherwise use a hash join with a proper driving port
8) For large datasets, don't use Broadcast as a partitioner
9) Minimize the use of regular-expression functions such as re_index in transform functions
10) Avoid repartitioning data unnecessarily

40. What is a DB config file and how do you create one?
The DB config (.dbc) file has the information Ab Initio requires to connect to the database. Creation: in an Input Table or Output Table component, select DBConfig file > New, then give the DB name, DB node, database version, user id, and password, and click Create.

41. How do you migrate your project from one environment to another?
Using the two options Check In and Check Out. (Note: please go through the help documentation for more information.)

42. How do you do version control in Ab Initio?
Once a check-in is done, the graph is automatically updated to a new version. Whenever you check out a graph, you give tag information in the Tag tab (it represents the version). To view all versions, use AIR_OBJECT_VERSION_VERBOSE on the command line.

43. How can we debug an Ab Initio graph?
Ans: Using watchers we can debug the graph; a watcher adds an intermediate file on the flow, so you can view the data that passes through the flow when you run the graph. There are two types of watchers:

- Non-phased: without a phase break
- Phased: with a phase break

44. How do we add a watcher to a flow?
Add watchers on flows by doing the following:
1. Turn on debugging mode if it is not on.
2. Select the flows on which you want to place watchers.
3. Do one of the following:
   - On the menu bar of the GDE, choose Debugger > Add Watcher to Flow.
   - On the GDE Debugging toolbar, click the Add Watcher to Flow button.
   - Right-click the flow and choose Add Watcher from the shortcut menu.

Watchers appear on the selected flows. The actions in step 3 will remove watchers if there are already watchers on all selected flows. When you run the graph, the watchers turn blue, and you can view the data that has passed through the flows.

45. How do you run a graph from the command line?
Ans: We can deploy the graph as a .ksh file and use that file to run the graph from the command line.

46. What is a sandbox?
A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration. A sandbox contains the following subdirectories:

- DML (holds the record format information)
- XFR (holds the transformation logic files)
- DB (holds the database connection information)
- MP (holds the graphs)
- RUN (holds the ksh files)

47. What happens when you create a sandbox in Ab Initio?
When you create a sandbox, the tree structure (the DML, XFR, DB, MP, and RUN folders), the parameters, and the environment variables are created automatically. Along with these, the ABPROJECTSETUP.KSH file is created in the sandbox.

48. What sorts of error messages have you got in your project?

- Bad value found
- Null value assignment
- Depth is not equal
- Too many files open / max-core error

49. When can we get the "depth is not equal" message?
When the depth of parallelism (the number of partitions of a layout) is mismatched between upstream and downstream components.

50. When do we get the "too many files open" error?
When the max-core value is too low for an executing component; set an appropriate max-core value for that component.

51. How does job recovery work in Ab Initio?
Job recovery is done in the following ways:

- If you set checkpointed phases, a .rec file is created automatically. If the graph fails, rerunning it automatically recovers the data up to the last checkpoint.
- If you want to run from the beginning instead, you must perform a manual rollback from the command line; the command is m_rollback <recovery-file>.

53. What is a local variable?
A local variable is a variable declared within a transform function. You can use local variables to simplify the structure of rules or to hold values used by multiple rules.

Declaration. The syntax of a local variable declaration is:

let type_specifier variable_name [not NULL] [ = expression ] ;

NOTE: The declaration of a local variable must occur before the statements and rules in a transform function.

- let: keyword for declaring a variable.
- type_specifier: the type of the variable.
- variable_name: the name of the variable.
- [not NULL]: optional keywords indicating that the variable cannot take on the value NULL. They must appear after the variable name and before the initial value. NOTE: if you create a local variable without the not NULL keywords and do not assign an initial value, the local variable initially takes on the value NULL.
- [= expression]: optional; an expression that provides an initial value for the variable.
- ; : a semicolon must end a variable declaration.

For example, the following local variable definitions define two variables, x and y. The value of x depends on the value of the amount field of the input variable in, and the value of y depends on the value of x:

let int x = in.amount + 5;
let double y = 100.0 / x;

54. What is a global variable?
Within a package, you can create a global variable and use it in all the transform functions present in that package, but you must declare the global variable outside the transform functions. The declaration syntax is the same as for a local variable:

let type_specifier variable_name [not NULL] [ = expression ] ;

The parts have the same meanings as above, and as with local variables, a global variable declared without the not NULL keywords and without an initial value initially takes on the value NULL.

55. Have you ever used any m_ commands?
Yes, I used commands like m_rollback, m_cleanup, and m_dump.

56. What is the difference between m_rollback and m_cleanup?
m_rollback rolls back a partially completed graph to its beginning state. m_cleanup cleans up files left over from unsuccessfully executed graphs and manually recovered graphs.

57. How do you use m_cleanup?
To find temporary files and directories before cleaning them up, you use the m_cleanup command. You can run this utility with or without arguments:

- m_cleanup (with no arguments) prints usage for the command.
- m_cleanup -help prints usage for the command.
- m_cleanup -j job_log_file [job_log_file ...] lists the temporary files and directories recorded in the specified log file(s). To specify multiple files, separate the filenames with spaces.

Log files have either a .hlg or a .nlg suffix. A log file ending in .hlg is on the control (host) machine of a graph; a log file ending in .nlg is on a processing machine. The job_log_file can be an absolute or relative pathname, with the following locations:
- On the control machine: AB_WORK_DIR/host/job_id/job_id.hlg
- On a processing machine: AB_WORK_DIR/vnode/job_id-XXX/job_id.nlg, where XXX is an internal ID assigned to each machine by the Co>Operating System.

58. How can I generate DML for a database table from the command line?
Using the m_db command-line utility. The syntax is:
m_db gendml dbc_file [options] -table tablename

59. Can we do check-in and check-out through the command line?
Yes, we can do check-in and check-out using air commands such as AIR_OBJECT_IMPORT and AIR_OBJECT_EXPORT.

60. What sorts of issues did you solve in production support?

- Data quality issues
- Max-core issues

1) What are the EME and an EME datastore?
Ans) EME is short for Enterprise Meta>Environment. The EME is a high-performance, object-oriented storage system that manages Ab Initio applications (including data formats and business rules) and related information, providing an integrated and consolidated view of your business. It is used for version control, navigation, and migration. An EME datastore is a specific instance of the EME: the term denotes the specific EME storage you are currently connected to through the GDE. There can be many datastore instances in an environment in which the EME has been installed, but you can be connected to only one datastore at a time; this is determined by your GDE's current EME datastore settings.

2) What is a sandbox?
Ans) A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration. A sandbox can be a file-system copy of a datastore project.

3) What is the Co>Operating System?

Ans) The Co>Operating System is the core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system. It is layered on top of the native operating systems of a collection of computers and provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging. The Graphical Development Environment (GDE) provides a graphical user interface to the services of the Co>Operating System.

4) What are the differences between the various GDE connection methods?
Ans) There are a number of communication methods used between the GDE and the Co>Operating System, including:

- Ab Initio Server/REXEC
- Ab Initio Server/TELNET
- DCOM
- REXEC
- RSH
- TELNET
- SSH (Ab Initio)

When the GDE connects to the Co>Operating System, the normal connection process differs depending on which communication method is selected. In broad terms, two things happen: files are transferred between the GDE and the target host, and processes are started/executed on the host. When using telnet, rexec, and rsh, the basic steps are as follows:
A. The GDE transfers the execution script to the server via FTP.
B. The GDE connects to the server by means of the selected method.
C. The GDE executes that script on the server over the connection set up in step B.
The process is different for connection methods that use the Ab Initio Server, which include Ab Initio Server/Telnet and Ab Initio Server/Rexec, as well as SSH and DCOM. The Ab Initio Control Server replaces the need for FTP and adds enhanced server-side services. When the Control Server is involved, the basic steps are as follows:

- The GDE connects to the server by means of the selected method; this connection initiates startup of the Ab Initio Control Server.
- The GDE initiates a connection to the Control Server.
- All file transfer occurs across the same Control Server connection.
- Script execution is accomplished through a new connection using the selected connection method.

5) What is metadata?
Ans) Metadata is data about the data; it gives a description of the data. Metadata is associated with graphs: it includes the information needed to build a graph, such as record formats, key specifiers, and transform functions.

6) What is the configuration file in Ab Initio?
Ans) The Co>Operating System accepts either of two names for the per-user Ab Initio configuration file. In addition to .abinitiorc, the Co>Operating System now also accepts abinitio.abrc, to conform to Windows file-name conventions; other supported platforms also recognize the new name. Only one configuration file is permitted, however: using both .abinitiorc and abinitio.abrc results in an error.

7) What are the different file extensions in Ab Initio?
Ans)
.cfg - database table configuration files for use with 2.1 database components
.dat - data files (either serial files or multifiles)
.dbc - database configuration files
.dml - Data Manipulation Language files or record format definitions
.mdc - dataset or custom dataset components
.mp - stored Ab Initio graphs or graph components
.mpc - program components or custom components
.xfr - transform function definitions or packages
.aih - host settings
.aip - project settings

8) What does the GDE do automatically?
Ans) The GDE provides default settings and behaviors for several features: buffering and deadlock avoidance, record format propagation, and flow layout.

9) What flat-file formats does the Ab Initio Graphical Development Environment (GDE) support?
Ans) The GDE supports these flat-file formats (all of which use the .dat extension):

- Serial files
- Multifiles
- Ad hoc multifiles

Serial files: A serial file is a flat, non-parallel file, also known as one-way parallel. You create serial files using a Universal Resource Locator (URL) on the component's Description tab; the URL starts with file.

Multifiles: A multifile is a parallel file consisting of individual files, called partitions, often stored on different disks or computers. A multifile has a control file that contains URLs pointing to one or more data files. You can divide data across the partition files using these methods: random or round-robin partitioning; partitioning based on ranges or functions; and replication (broadcast), in which each partition is an identical copy of the serial data. You create multifiles using a URL on the component's Description tab.

Ad hoc multifiles: An ad hoc multifile is also a parallel file. Unlike a multifile, however, the content of an ad hoc multifile is not stored in multiple directories; in a custom layout, the partitions are serial files. You create an ad hoc multifile using partitions on the component's Description tab.

10) What does a .dbc file contain?
Ans) A file with a .dbc extension provides the GDE with the information it needs to connect to a database. A configuration file contains the following information:

- The name and version number of the database to which you want to connect
- The name of the computer on which the database instance or server runs, or on which the database remote access software is installed
- The name of the database instance, server, or provider to which you want to connect

11) What are the default parameters in a sandbox?
Ans) The default sandbox parameters in a GDE-created sandbox are these six:

- PROJECT_DIR: absolute path to the sandbox directory
- DML: relative sandbox path to the dml subdirectory
- XFR: relative sandbox path to the xfr subdirectory
- RUN: relative sandbox path to the run subdirectory
- DB: relative sandbox path to the db subdirectory
- MP: relative sandbox path to the mp subdirectory

These six parameters are automatically created (and assigned their correct values) whenever you create a sandbox.

12) What is the difference between sandbox parameters and graph parameters?
Ans) The difference is:

- Graph parameters are visible only to the particular graph to which they belong
- Sandbox parameters are visible to all the graphs stored in that sandbox

13) What is a standalone sandbox?
Ans) A sandbox that is not associated with a project is simply a special directory.

14) What is the difference between the EME and a sandbox?
Ans) The big difference between the contents of a sandbox and its corresponding project in the EME is that the project contains, for each file, every version that has ever been checked in by anybody, while the sandbox contains only the latest version of each file checked out into it. A sandbox can be associated with only one project; however, there is no limit (other than the physical one of disk space) to the number of sandboxes a user can have, and although a given sandbox can be associated with only one project, a given project can have any number of sandboxes.

15) What are formal graph parameters?
Ans) A formal graph parameter is a parameter you substitute for a path and/or filename when you create a graph. This allows you to specify the value of that parameter at runtime.

16) What is the order of evaluation of parameters?
Ans) When you run a graph, parameters are evaluated in the following order:

1. The host setup script is run.
2. Common (that is, included) sandbox parameters are evaluated.
3. Sandbox parameters are evaluated.
4. The project-start.ksh script is run.
5. Formal parameters are evaluated.
6. Graph parameters are evaluated.
7. The graph Start Script is run.

17) What is a transform function?
Ans) A transform function (or transform) is the logic that drives data transformation; most commonly, transform functions express record reformatting logic, but in general you can also use them in data cleansing, record merging, and record aggregation. More specifically, a transform function is a collection of business rules, local variables, and statements. The transform expresses the connections between the rules, variables, and statements, as well as the connections between these elements and the input and output fields.

Transform functions are always associated with transform components, i.e. components that have a transform parameter: Aggregate, Denormalize Sorted, Fuse, Join, Match Sorted, Multi Reformat, Normalize, Reformat, Rollup, and Scan.

18) What is a prioritized rule?
Ans) You can control the order of evaluation of rules in a transform function by assigning priority numbers to the rules. The rules are attempted in order of priority, starting with the lowest-numbered priority, proceeding to higher-numbered priorities, and finally to any assignment for which no priority has been given.

19) What are local variables?
Ans) A local variable is a named storage location in an expression or transform function. You declare a local variable within the transform function in which you want to use it. The local variable is reinitialized each time the transform function is called, and it persists for one single evaluation of the transform function.

20) What is a package?
Ans) A package is a named collection of related DML objects. A package can hold types, transform functions, and variables, as well as other packages. Packages provide a means of gathering in one place DML objects that are needed more than once in a given graph, or needed by multiple developers. They allow developers to avoid redundant code, which makes maintenance of DML objects more efficient. Packages are very useful in these situations:

- The record formats of multiple ports use common record formats and/or type specifiers
- Multiple components use common transforms

21) Explain Multi-Stage Transform Components? Ans) The multi-stage transform components require packages because, unlike other transform components, they are driven by more than a single transform function. These components each take a package as a parameter and, in order to process data, look for particular variables, functions, and types in that package. For example, a multi-stage component might look for a type named temporary_type, a transform function named finalize, or a variable named count_items.

22) What is a Phase? Ans) A phase is a stage of a graph that runs to completion before the start of the next stage. By dividing a graph into phases, you can save resources, avoid deadlock, and safeguard against failures. To protect a graph, all phases are checkpoints by default.

23) What is a Checkpoint? Ans) A checkpoint is a phase that acts as an intermediate stopping point in a graph and saves status information to allow you to recover from failures. By assigning phases with checkpoints to a graph, you can recover completed stages of the graph if a failure occurs.

24) How will you use a subgraph of graph A in graph B? Ans) When you build a subgraph, it becomes a part of the graph in which you build it. If you want to use it in other graphs, or in other places in the original graph, save it in the Component Organizer of the GDE.

25) Is there a way to make my graph conditional, so that certain components may not run? Ans) You can enter a Condition statement on the Condition tab of graph components. This is an expression that evaluates to the string value true or false (see details below). The GDE then evaluates the expression at runtime. If the expression evaluates to true, the component or subgraph is executed. If it is false, the component or subgraph is not executed, and is either removed completely or replaced with a flow between two user-designated ports. The correct syntax for an if statement in the Korn shell is: $(if [[ condition ]]; then statement; else statement; fi)
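As a sketch (LOAD_MODE is a hypothetical sandbox parameter, not from the source), a Condition expression that runs a component only during full loads could be written as:

$(if [[ $LOAD_MODE = full ]]; then echo true; else echo false; fi)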

26) How can you improve the GDE's performance when it is running slowly? Ans) If the GDE is performing slowly, you can improve performance with one or more of these methods:

- Turn off Undo by choosing File > Autosave/Undo on the GDE menu bar and clearing the selection of Undo/Redo Enabled.
- Turn off Propagation by choosing Edit > Propagation on the GDE menu bar and clearing the selection of Record Format and Layout.
- Increase the Tracking Interval by choosing Run > Default Settings on the GDE menu bar, clicking the Code Generation tab, and increasing the Tracking Interval to 60 seconds.

27) What is a lookup file? Ans) Lookup File represents one or more serial files or a multifile. The amount of data is small enough to be held in main memory, which allows a transform function to retrieve records much more quickly than it could if they were stored on disk. Lookup File associates key values with corresponding data values, indexing records so they can be retrieved.

28) What is two-stage routing? Ans) When an all-to-all flow connects components with layouts containing a large number of partitions, the Co>Operating System uses many networking resources. If the number of partitions in the source and destination components is N, an all-to-all flow uses resources proportional to N*N (N squared). To save network resources, you can mark an all-to-all flow as using two-stage routing. With two-stage routing, the all-to-all flow uses resources proportional to only 2*N*√N (2 * N * root N). For example, an all-to-all flow with 25 partitions uses 25*25 = 625 resources, but with two-stage routing uses only 2*25*5 = 250 resources.

29) What kinds of parallelism does Ab Initio support?

Ans) There are three types of parallelism employed by the Co>Operating System:  

- Component parallelism
- Pipeline parallelism
- Data parallelism

30) What is Component Parallelism? Ans) Component parallelism occurs when program components execute simultaneously on different branches of a graph. Component parallelism scales to the number of branches of a graph: the more branches a graph has, the greater the component parallelism. If a graph has only one branch, component parallelism cannot occur.

31) What is Pipeline Parallelism? Ans) Pipeline parallelism occurs when a connected sequence of program components on the same branch of a graph execute simultaneously.

32) What is Data Parallelism? Ans) Data parallelism occurs when you separate data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.

33) What are multifiles, multifile systems & multidirectories? Ans) Ab Initio multifiles are parallel files composed of individual files, typically located on different disks and usually, but not necessarily, on different systems. These individual files are the partitions of the multifile. Ab Initio multifiles reside in parallel directories called multidirectories, which are organized into multifile systems. An Ab Initio multifile system consists of multiple replications of a directory tree structure containing multidirectories and multifiles. Each replication constitutes a partition of the multifile system. Each partition holds a subset of the data contained in the multifile system, and the system has one additional partition that contains control information. The partitions containing data are the data partitions of the system, and the additional partition is the control partition. The control partition contains no user data, only the information the Co>Operating System needs to manage the multifile system.

34) How do you create a multifile system? Ans) To create a multifile system, issue the m_mkfs command, using as arguments the URLs of the partitions of the multifile system you want to create. The first URL creates the control partition, and each subsequent URL creates the next partition of the multifile system. Similarly, use m_mkdir for multidirectories.
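As a hedged illustration of m_mkfs (the host names and paths are hypothetical), the following creates a 2-way multifile system; the first URL becomes the control partition and the rest become the data partitions:

m_mkfs //host1/u/prod/mfs2 //host1/vol1/mfs2_p0 //host2/vol2/mfs2_p1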

35) What is Layout? Ans) A layout is one of the following:

- A URL that specifies the location of a serial file
- A URL that specifies the location of the control partition of a multifile
- A list of URLs that specifies the locations of:
  - The partitions of an ad hoc multifile
  - The working directories of a program component

36) What Is Dependency Analysis? Ans) Using the EME, you can conduct project analyses of the dependencies within and between graphs. The EME examines the project and develops an analytical survey of it in its entirety, tracing how data is transformed and transferred, field by field, from component to component.

37) What are the different kinds of analysis? Ans) The Checkin Wizard offers the following Analysis Level choices:

- None: Turns off all translation and dependency analysis during checkin.
- Translation Only: Translates graphs from GDE format to datastore format, but does not do error checking and does not store results in the datastore. Tip: we recommend that at minimum you do translation only, since it is required for analysis, which you can run at any time.
- Translation with Checking: Translates graphs from GDE to datastore format and checks for errors that will interfere with dependency analysis. See Checked-for Errors.
- Full Dependency Analysis (Default): Performs full dependency analysis on the graph and saves the results in the datastore. Tip: we recommend that you do not do analysis at checkin time, as it can greatly prolong checkin.

What to Analyze: this group of checkboxes allows you to specify which files will be subjected to the level of analysis you specified in Analysis Level. The four choices are:

- All Files: All files in the project.
- All Unanalyzed Files: All files in the project that have changed, or that are dependent on or required by files that have changed since the last time they were analyzed, regardless of whether or not the files were checked in by you.
- Only My Checked In Files: Only the files checked in by you. This group can include files you checked in earlier which are still on the analysis queue and have not yet been analyzed.
- Only the File Specified (Default): Only the specified file(s).

Analysis Scope: this group of checkboxes allows you to specify how far the specified level of analysis will be extended to files dependent on those being analyzed, both in the current project and in other projects. The three choices are:

- Dependent Files from All Projects (Default): Files in other projects common to (included in) the one you are checking, if they are dependent on the files being analyzed.
- Dependent Files from Specified Project: Only the dependent files that are in the same project as the file(s) being analyzed.
- No Dependent Files: No dependent files.

38) What is a switch parameter? Ans) A switch parameter has a fixed set of string values which you define when you create the parameter. The purpose of a switch parameter is to allow you to change your sandbox's context: its value determines the values of the various other parameters that you make dependent on that switch. For each switch value, each of the dependent parameters has a dependent value. Changing the switch's value thus changes the values of all its dependent parameters.

39) What are the types of project parameters? Ans) There are four types of project parameters:

- Standard parameters
- Switch parameters
- Dependent parameters
- Common project parameters

40) What is the max-core parameter?

Ans) The value of the max-core parameter determines the maximum amount of memory, in bytes, that the component can use. If the component is running in parallel, the value of max-core represents the maximum memory usage per partition, not the sum for all partitions. If you set the max-core value too low, the component runs more slowly than expected. If you set the max-core value too high, the component might use too many machine resources, slow the process drastically, and cause hard-to-diagnose failures.

41) What is an ordered flow? Ans) The Ordered attribute is a port attribute. It determines whether the order in which you attach flows to a port, from top to bottom, is significant to the definition and purpose of the component. If a port is ordered, the order in which flows are attached determines the result of the processing the component does: if you change the order in which you attach the flows, you create a different result. Note: the GDE indicates the difference between a port that is ordered and one that is not by drawing them differently. If you inspect the ordered port of Concatenate in a graph, you see a line dividing the port between the two flows; that line is not present in the port of Gather, which is not ordered.

42) What will be the record order in the flows? Ans) Components maintain the ordering of the input data records unless their explicit purpose is to reorder records. For most components, if record x appears before record y in an input flow partition, and if record x and record y are both in the same output flow partition, then record x appears before record y in that output flow partition. For example, if you supply sorted input to a Partition component, it produces sorted output partitions. The exceptions are:

- The components that explicitly reorder records, such as Sort, Sort within Groups, and Partition by Key and Sort.
- The components that have fan-in flows, such as the Departition components. They each define their own record order.

43) What is the logging parameter? Ans) The transform components and some other components have a logging parameter. This parameter specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False. If you set the logging parameter to True, you must also connect the component's log port to a component that collects the log records.

44) Explain multistage transform components? Ans) A multistage transform is a Transform Component that modifies records in up to five stages: input selection, temporary initialization, processing, finalization, and output selection.

Each stage is written as a DML transform function. The multi-stage transform components are Denormalize, Normalize, Rollup, and Scan.

45) Explain the compress components? Ans) There are a number of components that compress and uncompress data:

- Deflate (compress) and Inflate (uncompress) work on all platforms.
- Compress and Uncompress work on UNIX platforms only.
- GZip (compress) is deprecated and will be removed in a future release. GUnzip (uncompress) uncompresses data and works on all platforms.

Components:

46) Difference b/w Replicate & Broadcast? Ans) Broadcast arbitrarily combines all the data records it receives into a single flow and writes a copy of that flow to each of its output flow partitions. Replicate arbitrarily combines all the data records it receives into a single flow and writes a copy of that flow to each of its output flows. Use Replicate to support component parallelism, for example, when you want to perform more than one operation on a flow of data records coming from an active component. Use Broadcast to increase data parallelism when you have connected a single fan-out flow to the out port, or to increase component parallelism when you have connected multiple straight flows to the out port.

47) Explain FUSE? Ans) Fuse applies a transform function to corresponding records of each input flow. The first time the transform function executes, it uses the first record of each flow. The second time the transform function executes, it uses the second record of each flow, and so on. Fuse sends the result of the transform function to the out port. The component works as follows: it tries to read from each of its input flows.

- If all of its input flows are finished, Fuse exits.
- Otherwise, Fuse reads one record from each still-unfinished input port and a NULL from each finished input port.

48) Explain about JOIN? Ans) Join reads data from two or more input ports, combines records with matching keys according to the transform you specify, and sends the transformed records to the output port. Additional ports allow you to collect rejected and unused records. There can be as many as 20 input ports. Types of join:

- Inner join: sets the record-required parameters for all ports to True. Inner join is the default. The GDE does not display the record-required parameters because they all have the same value.
- Outer join: sets the record-required parameters for all ports to False. The GDE does not display the record-required parameters because they all have the same value.
- Explicit: allows you to set the record-required parameter for each port individually.

49) What is the use of the override-key parameter & where is it used? Ans) The override-key parameter is used in the Join component to specify alternative name(s) for the key field(s) of a particular in port.

50) What are the different options available in reject-threshold? Ans) There are three options available:

- Never abort
- Abort on first reject
- Use limit/ramp

51) Explain about limit & ramp? Ans) Limit is a number representing the acceptable total of reject events. Default is 0.  

Ramp is a decimal representing the acceptable rate of reject events relative to the number of records processed. The component stops execution of the graph when the number of reject events exceeds the result of the following formula: limit + (ramp * number_of_records_processed_so_far)
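For example, with limit set to 10 and ramp set to 0.01, after 1,000 records the graph tolerates up to 10 + (0.01 * 1000) = 20 reject events; one more reject aborts the run.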

52) Explain the Join with DB component? Ans) Join with DB joins records from the flow or flows connected to its input port with records read directly from a database (via a SQL statement), and outputs new records containing data based on, or calculated from, the joined records.

execute_on_miss parameter: a statement executed when no rows are returned by the select_sql statement. The statement should be an INSERT (or possibly an UPDATE); after it is executed, the select_sql is executed a second time. If no results are generated on the second attempt, the input record is rejected. A database commit is performed by default after each execution of execute_on_miss, but this can be altered by setting the commit number parameter.

53) What is the difference b/w JOIN & JOIN with DB? Ans) The main difference is that in the Join with DB component we join the incoming feed with a database table by writing a SQL statement, whereas in a normal Join we don't have SQL statements.

Instead of using a statement in SQL, you can now extract the to-be-joined data from the database by calling a stored procedure, specified in the sql_select parameter. The syntax for calling a stored procedure using Oracle or DB2 is as follows:  

{call | exec | execute} [:a = ] [schema.][package.]stored_procedure(:b, :c, ...)

where :a, :b, and :c are input/output arguments.

54) What is the use of the META PIVOT component? Ans) The Meta Pivot component allows you to split records by data fields (columns). The component converts each input record into a series of separate output records: one output record for each field of data in the original input record. Each output record contains the name and value of a single data field from the original input record.

55) Explain REFORMAT? Ans) Reformat changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

output-index: Either the name of a file containing a transform function, or a transform string. The Reformat component calls the specified transform function for each input record. The transform function uses the value of the input record to direct that input record to a particular output port: the expected output of the transform function is the index of an output port (zero-based). The Reformat component directs the input record to the identified output port and executes the transform function, if any, associated with that port.

When you specify a value for output-index, each input record goes to exactly one transform/output port pair. For example, suppose there are 100 input records and two output ports. Each output port receives between 0 and 100 records. According to the transform function you specify for output-index, the split can be 50/50, 60/40, 0/100, 99/1, or any other combination that adds up to 100. If you do not specify a value for output-index, Reformat sends every input record to every transform/output port pair. For example, if Reformat has two output ports and there are no rejects, 100 input records result in 100 output records on each port, for a total of 200 output records.
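A minimal output-index sketch (the integer field region_code and the two-output-port setup are assumptions for illustration, not from the source): records whose region_code is 0 go to output port 0, and all others go to port 1.

out :: output_index(in) =
begin
  out :: if (in.region_code == 0) 0 else 1;
end;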

56) What is the difference between the Reformat and Redefine Format components? Ans) The difference is that Reformat can actually change the bytes in the data, while Redefine Format simply changes the record format on the data as it flows through, leaving the data unchanged. The Reformat component can change the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records. The Redefine Format component copies data records from its input to its output without changing the values in the data records. You use Redefine Format to change or rename fields in a record format without changing the values in the records; in this way, it is similar to the DML built-in function reinterpret_as. Typically this component has different DML on its input and output ports, and allows the unmodified data to be interpreted in a different form.

57) Explain Multi Reformat? Ans) Multi Reformat changes the record format of data records flowing between from one to 20 pairs of in and out ports, by dropping fields or by using DML expressions to add fields, combine fields, or transform the data in the records. A typical use for Multi Reformat is to put it immediately before a custom component that takes multiple inputs. The component operates separately on the data flowing between each pair of its inn-outn ports. The count parameter specifies the total number of port pairs. Each inn-outn port pair has its own associated transformn to reformat the data flowing between those ports.

58) What is ABLOCAL() and how can I use it to resolve failures when unloading in parallel? Ans) Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel. You can use the ABLOCAL() construct in this case to prevent the Input Table component from parsing the SQL (it will get passed through to the database). It also specifies which table to use for the parallel clause.

59) What is the difference b/w Update Table & Multi Update Table? Ans) The main difference is that commit number & commit table are mandatory parameters in Multi Update Table, whereas in Update Table they are optional. Update Table modifies only a single table in the database, whereas Multi Update Table can modify more than one table, which is why it requires the commit table & commit number parameters.

API mode execution (the same in both components): the statements are applied to the incoming records as follows. For each record:

- The statement referenced by updateSqlFile is attempted first. If the statement can be successfully applied to the current record, it is executed, and the statement referenced by insertSqlFile is skipped.
- If the updateSqlFile statement cannot be applied to the current record, the statement referenced by insertSqlFile is attempted.

60) Difference b/w NORMALIZE & DENORMALIZE? Ans) Normalize generates multiple output data records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.

Denormalize consolidates groups of related data records into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group.

Both of these are multi-stage transform components. The multi-stage transform components require packages because, unlike other transform components, they are driven by more than a single transform function. These components each take a package as a parameter and, in order to process data, look for particular variables, functions, and types in that package. For example, a multi-stage component might look for a type named temporary_type, a transform function named finalize, or a variable named count_items.

61) How can I generate DML for a database table from the command line? Ans) The Ab Initio command-line utility m_db, with the gendml argument, generates appropriate metadata for a database table or expression. The syntax for the utility is:

m_db gendml dbc_file [options] -table tablename
m_db gendml dbc_file [options] -select 'sql-select-statement'
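For instance (the .dbc file and table name here are hypothetical, and this assumes the generated DML is written to standard output), you could capture the record format of a table in a file:

m_db gendml mydb.dbc -table customers > customers.dml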

62) What are Departitioning components? Ans) Departition components combine multiple flow partitions of data records into a single flow as follows:  

- Concatenate appends multiple flow partitions of data records one after another.
- Gather combines data records from multiple flow partitions arbitrarily.
- Interleave combines blocks of data records from multiple flow partitions in round-robin fashion.
- Merge combines data records from multiple flow partitions that have been sorted according to the same key specifier, and maintains the sort order.

64) What are Partitioning components? Ans) Partition components distribute data records to multiple flow partitions to support data parallelism, as follows:  

- Broadcast arbitrarily combines all the data records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.
- Partition by Expression distributes data records to its output flow partitions according to a specified DML expression.
- Partition by Key distributes data records to its output flow partitions according to key values.
- Partition by Percentage distributes a specified percentage of the total number of input data records to each output flow.
- Partition by Range distributes data records to its output flow partitions according to the ranges of key values specified for each partition.
- Partition by Round-robin distributes data records evenly to each output flow.
- Partition with Load Balance distributes data records to its output flow partitions, writing more records to the flow partitions that consume records faster.
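As a sketch of Partition by Expression (the integer field account_id is an assumption, and this presumes the expression's value is taken as the target partition number), an expression such as

in.account_id % number_of_partitions()

sends all records with the same account_id to the same partition, giving a simple hash-like partitioning.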

65) What are FTP components? Ans) FTP (file transfer protocol) components transfer data records as follows:  

- FTP From transfers files of data records from a computer that is not running the Co>Operating System to a computer that is running the Co>Operating System.
- FTP To transfers files of data records to a computer that is not running the Co>Operating System from a computer that is running the Co>Operating System.

66) How can I terminate a graph based on a condition? Ans) You can use a Reformat component with a force_error() function to test for a condition and terminate the graph if that condition is met.

67) Explain ROLLUP? Ans) Rollup evaluates a group of input records that have the same key and then generates data records that either summarize each group or select certain information from each group. There are two ways to use a Rollup component. Template mode uses a transform with aggregation functions such as SUM, MAX, MIN, COUNT, and AVG. Expanded mode uses a package transform.

68) Explain the SCAN component? Ans) For every input record, Scan generates an output record that includes a running, cumulative summary for the group of data records that the input record belongs to. For example, the output records might include successive year-to-date totals for groups of data records.

69) Explain SCAN with ROLLUP? Ans) For every input record, Scan with Rollup sends an output record to its out port that includes a running, cumulative summary for the input group that the input record belongs to. In addition, after reading all input records for a particular input group, Scan with Rollup sends a summary record to its rollup port for that input group. For example, suppose transaction records are keyed on the stores in which they occur. Each record sent to the out port might include the year-to-date transaction total for the store in which the transaction occurred. Each record sent to the rollup port would include the year's total for transactions at one store, and there would be one record for each store.
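For question 66 above, a minimal sketch of the force_error() technique (the field names record_count and status are illustrative assumptions, not from the source):

out :: reformat(in) =
begin
  out.status :: if (in.record_count == 0)
                  force_error("No records received - terminating graph")
                else "ok";
  out.* :: in.*;
end;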

70) Explain SORT & SORT within GROUPS? Ans) Sort sorts and merges data records. Sort within Groups assumes input records are sorted according to the major-key parameter.

Sort within Groups reads data records from all the flows connected to the in port until it either reaches the end of a group or reaches the number of bytes specified in the max-core parameter.

When Sort within Groups reaches the end of a group, it does the following:

a. Sorts the records in the group according to the minor-key parameter
b. Writes the results to the out port
c. Repeats this procedure with the next group

NOTE: When connecting a fan-in or all-to-all flow to the in port of a Sort, you do not need to use a Gather, because Sort can gather internally on its in port.

71) How do you validate records? Ans) Validate Records uses the is_valid function to check each field of the input data records to determine if the value in the field is:

- Consistent with the data type specified for the field in the input record format
- Meaningful in the context of the kind of information it represents

72) What is the difference between m_rollback and m_cleanup? When would you use them? Ans) m_rollback rolls back a partially completed graph to its beginning state. m_cleanup cleans up files left over from unsuccessfully executed graphs and manually recovered graphs.

The Co>Operating System automatically creates a recovery (.rec) file and other temporary files and directories in the course of executing a graph. When a graph terminates abnormally, it leaves the temporary files and directories on disk. At this point there are several alternatives:

- Roll back to the last checkpoint. The Co>Operating System rolls back the graph automatically, if possible. You can roll back the graph manually by explicitly using the m_rollback command without the -d option. After a rollback, some temporary files and directories remain on disk; to remove them, follow one of the other three alternatives.
- Rerun the graph. If the graph is not already rolled back, rerunning the graph first rolls it back to the last checkpoint. The graph then starts re-executing. If the re-execution is successful, it removes all temporary files and directories.
- Roll back and clean up using m_rollback -d.
- Clean up using the m_cleanup utilities.

So, given this new feature, for old job files you can use the m_cleanup utility to list the temporary files and directories, and m_cleanup_rm to delete them. You can also use m_cleanup_du to display the amount of space these files use. Because recovery files and temporary files are automatically created in the course of a run, remember not to delete these files for jobs that are still running.

73) What does the error message "straight flows may only connect ports having equal depths" mean? Ans) This error message appears when you connect two components running at different levels of parallelism (or depth) with a straight flow (one that does not have an arrow symbol on it). For example, you get this error if you connect a Join running 10 ways parallel to a serial output file, or if you connect a serial Join to a 4-way multifile.

74) What is AB_WORK_DIR and what do you need to know about it? Ans) AB_WORK_DIR is a configuration variable whose value is a working space for graph execution. You can view its value by using m_env -describe.

75) What does the error message "too many open files" mean, and how do you fix it? Ans) The "too many open files" error message occurs most commonly because the value of the max-core parameter of the Sort component is set too low. In these cases, increasing the value of the max-core parameter solves the problem.

76) What does the error message "Failed to allocate bytes" mean and how do you fix it? Ans) The "failed to allocate" type of error message is generated when an Ab Initio process has exceeded its limit for some type of memory allocation. To resolve it, you can:

- Reduce the value of max-core in order to reduce the amount of memory allocated to a component before temporary files are used. When the amount of memory specified by max-core is used up by a component, the component starts writing temporary files to hold the data being processed. Be aware that while reducing the value of max-core may solve the problem of running out of swap space, it may have an adverse effect on the graph's performance and will increase the number of temporary files.
- Increase available swap space, for example, by waiting until other memory-intensive jobs have completed.

77) What do you need to do to run a graph across two or more machines? Ans) In order to execute a graph across multiple machines, you need to carry out the following steps:

- Make sure that all the machines involved have compatible Co>Operating Systems installed.
- Set up the configuration files (.abinitiorc files) so that the different Co>Operating Systems can communicate with each other.
- Set up the environment variables and make sure that they are propagated properly from one machine to another, when appropriate.
- Set up the graph so that it can run across the machines as desired.

78) What communication ports does the GDE use when communicating with the Co>Operating System? Ans) The communication ports used depend upon the communication protocol selected. In short, the GDE uses:  

- DCOM: 135 & **
- REXEC: 512 & 21/20
- RSH: 514 & 21/20
- TELNET: 23 & 21/20
- SSH(/AI): 22
- AI/REXEC: 512
- AI/TELNET: 23 & **

The ** above refers to the dynamically determined port that the control server sets up for the file transfer.

79) If you use the layout Database: default in your database component, which working directory does the Co>Operating System use? Ans) The $AB_WORK_DIR directory is the working directory for database layouts. $AB_DATA_DIR provides disk storage for the temporary files.

80) What are vectors? Why would you use them? Ans) Vectors are arrays of elements. An element can be a single field or an entire record. They are often used to provide a logical grouping of information. Many programming languages use the concept of an array; in broad terms, an array is a collection of elements that are logically grouped for ease of access.

81) How can you quickly test DML expressions? Ans) You can use the m_eval utility to quickly test the expressions that you intend to use in your graphs.
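A quick way to try m_eval from the command line, using the string_concat function mentioned elsewhere in this document:

m_eval 'string_concat("ab", "cd")'

This evaluates the expression and prints the result (abcd).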

82) What is the layout for watcher files? Ans) The debugger places watcher files in the layout of the component downstream of the watcher.

83) How do you remove watcher files? Ans) To delete all watcher datasets in the host directory (for all graphs), you can either use the GDE menu option Debugger > Clean-out Watcher Datasets, or invoke the following command: m_rm -f -rmdata GDE-WATCHER-xxx

84) How can I determine which version of the GDE and Co>Operating System I am using? Ans) To determine your GDE version, on the GDE menu bar choose Help > About Ab Initio. For the Co>Operating System, use either of the following commands:

m_env -version
m_env -v

85) Should you use a Reformat component with a lookup file, or a Join component, in a graph? Ans) First of all, there are situations in which you cannot use a Reformat with Lookup instead of a Join. For example, you cannot do a Full Outer Join using a Reformat and Lookup. The answer below assumes that in your particular case either Reformat with Lookup or Join can be used in principle, and that the question is about the performance benefits of one over the other. When the lookup file (in the case of lookup) or the nondriving input (in the case of a Join) fits into the available memory, the Join and the lookup offer very similar performance.

86) How can you increase the time-out value for starting an Ab Initio process? Ans) You can increase time-out values with the Ab Initio environment variables AB_STARTUP_TIMEOUT and AB_RTEL_TIMEOUT_SECONDS.

87) Give the file management commands. Ans)

- Create a multifile system: m_mkfs [ options ] control_url partition_url [ partition_url .. ]
- Delete a multifile system: m_rmfs path
- Create a multidirectory: m_mkdir [ -m[ode] mode ] [ -mvfile ] [ -max-segment-size bytes ] path
- Delete a multidirectory: m_rmdir url [url ...]
- Copy: m_cp
- Move: m_mv
- List files: m_ls
- Disk usage: m_du
- Disk free: m_df
- Count: m_wc
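A short usage sketch (the $AI_MFS parameter and the file names are hypothetical): copying a multifile and then listing the multifile system it lives in. The m_* commands apply the operation across the partitions of the multifile, rather than to a single serial file:

m_cp $AI_MFS/in.dat $AI_MFS/in_backup.dat
m_ls $AI_MFS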

88) What are data-sized vectors? How do you work with them? Ans) Data-sized vectors are vectors that have no set number of elements but, rather, are variably sized based upon the number of elements in each data record. For example, if an input dataset has three records, each with a vector, the first record's vector might have 5 elements, the second 1 element, and the third 7.

89) What is the difference b/w today (now) and today1 (now1)? Ans) The today (now) function calls the operating system for the current date on each call.

In contrast, the function today1 (now1) calls the operating system for the current date only on the first call in a job, returning the same value on subsequent calls. The difference between the two functions is particularly noticeable on jobs that start before and end after midnight.
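A sketch of the practical difference (the field name run_date and the record formats are assumptions): stamping every output record of a Reformat with one consistent run date, even if the job spans midnight:

out.run_date :: (date("YYYY-MM-DD")) today1();
out.* :: in.*;

With today() instead of today1(), records processed after midnight would carry the next day's date.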

90) Explain the is_valid function? Ans) is_valid tests whether a value is valid. The is_valid function returns:

- The value 1 if expr is a valid data item.
- The value 0 otherwise.

91) Explain the is_defined function? Ans) is_defined tests whether an expression is NOT NULL. The is_defined function returns:

- The value 1 if expr evaluates to a non-NULL value.
- The value 0 otherwise.

The inverse of the is_defined function is the is_null function (is_failure).

92) Explain the Read Raw component? Ans) The Read Raw component reads a flow of data whose structure requires it to be parsed programmatically rather than with declarative DML type declarations. Typically, the data written to the output port can be readily described with DML types.

93) How will you get a sequence of numbers in Ab Initio? Ans) By using the next_in_sequence() function, which returns a sequence of integers on successive calls, starting with 1.

94) How will you get the degree of parallelism?

Ans) By using the number_of_partitions() function, which returns the number of partitions. The number of partitions is also known as the degree of parallelism. It returns -1 if not called from within a component.

95) Explain the first_defined function? Ans) first_defined returns the first defined (not NULL) of its two arguments. Note that the Oracle NVL function is very similar to this function. Syntax: first_defined(a, b)
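A typical use (the field name discount is an assumption), defaulting a NULL value just as NVL would:

out.discount :: first_defined(in.discount, 0);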

1) What is the difference between a file and a table in Ab Initio? Ans) A table maintains relational data, that is, it has a relational structure. A file maintains data in a non-relational structure.

2) How do you connect the EME to the Ab Initio Server? Ans) There are several ways of connecting to the EME:
1. Set AB_AIR_ROOT.
2. From the GDE you can connect to the EME datastore.
3. Log in to the EME web interface: http://serverhost[:serverport]/abinitio
4. Use the air command.

3) What function would you use to convert a string into a decimal? For converting a string to a decimal we need to typecast it using the following syntax:

out.decimal_field :: ( decimal( size_of_decimal ) ) string_field;

The above statement converts the string to a decimal and populates the decimal field in the output.

6) How do we handle a DML that changes dynamically? There are many ways to handle DMLs which change dynamically within a single file. Some of the suitable methods are to use a conditional DML, or to call the vector functionality while calling the DMLs.

8) What is .abinitiorc and what does it contain? Ans) .abinitiorc is the configuration file for Ab Initio. It is found in the user's home directory. Generally it contains the Ab Initio home path and login information, such as an ID, an encrypted password, and the login method for the hosts the graph connects to at execution time.

What are air-project-parameters and air-sandbox-overrides? What is the relation between them? Answer # 1: .air-project-parameters contains the parameter definitions of all the parameters within a sandbox. This file is maintained by the GDE and the Ab Initio environment scripts. .air-sandbox-overrides:

This file exists only if you are using version 1.11 or a later version of the GDE. It contains the user's private values for any parameters in .air-project-parameters that have the Private Value flag set. It has the same format as the .air-project-parameters file. When you edit a value (in the GDE) for a parameter that has the Private Value flag checked, the value is stored in the .air-sandbox-overrides file rather than the .air-project-parameters file.

What is the ABLOCAL expression, and where do you use it in Ab Initio? Answer # 3: We use ABLOCAL(expression) to increase SQL query performance by supplying the name of a large table in the expression. This way we make it the driving table.

Name the air commands in Ab Initio. Answer # 1: Here are a few of the commands we use:
1) air object ls: lists the objects in a directory inside the project.
2) air object rm: removes an object from the repository. Please be careful with this.
3) air object cat: displays an object which is present in the EME.
4) air object versions -verbose: gives the version history of the object.
5) air project show: gives all the information about the project, such as what types of files can be checked in.
6) air project modify -extension: modifies the project settings. For example, if you need to check *.java files into the EME, you may need to add the extension first.
7) air lock show -project: shows all the files that are locked in the given project.
8) air lock show -user: shows all the files locked by a user in various projects.
9) air sandbox status: shows the status of a file in the sandbox with respect to the EME (Current, Stale, and Modified are a few of the statuses).

What is dedup unique-only? Answer # 1: This refers to the keep parameter of the Dedup Sorted component (choice, required), which specifies which records the component keeps to write to the out port. Choose one of the following options:

- first: keeps the first record of a group
- last: keeps the last record of a group
- unique-only: keeps only records with unique key values

The component writes the remaining records of each group to the dup port. The default is first.

14) What is an ad hoc multifile? How is it used? ANSWER: Ad hoc multifiles treat several serial files having the same record format as a single graph component. Frequently, the input of a graph consists of a set of serial files, all of which have to be processed as a unit. An ad hoc multifile is a multifile created 'on the fly' out of a set of serial files, without needing to define a multifile system to contain it. This enables you to represent the needed set of serial files with a single input file component in the graph. Moreover, the set of files used by the component can be determined at runtime. This lets the user customize which set of files the graph uses as input without having to change the graph itself, even after it goes into production. Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.

The simplest way to define an ad hoc multifile is to list the files explicitly, as follows:
1. Insert an input file component in your graph.
2. Open the properties dialog and select the Description tab.
3. Select Partitions in the Data Location of the Description tab.
4. Click Edit to open the Define Multifile Partitions dialog box.
5. Click New and enter the first file name. Click New again and enter the second file name, and so on.
6. Click OK.

If you have added n files, the input file now acts something like a file in an n-way multifile system, whose data partitions are the n files you listed. It is possible for components to run in the layout of the input file component. However, there is no way to run commands such as m_ls or m_dump on the files, because they do not comprise a real multifile system.

There are other ways than listing the input files explicitly in an ad hoc multifile:

1. Listing files using wildcards. If the input file names have a common pattern, you can use a wildcard for all the files, e.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files found at runtime matching the wildcard pattern will be taken for the ad hoc multifile.
2. Listing files in a variable. You can create a runtime parameter for the graph, and inside the parameter you can list all the files separated by spaces.
3. Listing files using a command, e.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used for the ad hoc multifile. This method gives maximum flexibility in choosing the input files, since you can also use complex commands that involve the owner of the file or a date-time stamp.

15) How can I tune a graph so it does not excessively consume CPU? ANSWER: Options:
1. Reduce the DOP (degree of parallelism) for components. Example: change from a 4-way parallel to a 2-way parallel.
2. Examine each transformation for inefficiencies. Examples: if a transformation uses many local variables, make these variables global; if the same function call is performed more than once, call it once and store its value in a global variable.
3. When reading data, reduce the amount of data that needs to be carried forward to the next component.

16) I'm having trouble finding information about the AB_JOB variable. Where and how can I set this variable? ANSWER: You can change the value of the AB_JOB variable in the start script of a given graph. This enables you to run the same graph multiple times at the same time (thus in parallel). However, make sure you append some unique identifier, such as a timestamp or a sequential number, to the end of each AB_JOB value you assign. You will also need to vary the file names of any outputs to keep the graphs from stepping on each other's outputs. I have used this technique to create a "utility" graph as a container for a start script that runs another graph multiple times, depending on the local variable input to the "utility" graph. Be careful you don't max out the capacity of the server you are running on.

17) I have a job that will do the following: FTP files from a remote server; reformat the data in those files and update the database; delete the temporary files. How do we trap errors generated by Ab Initio when an FTP fails? If I have to re-run or restart a graph, what are the points to be considered? Does the *.rec file have anything to do with it? ANSWER: Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks.

Do the FTP in phase 1, your transformation in the next phase, and then the DB update in another phase. (This is just an example; it may not be the best approach, as the best design depends on various other factors.) If the graph fails during the FTP, then your graph fails in phase 0 and you can simply restart it. If your graph fails in phase 1, then the AB_JOB.rec file exists, and when you restart your graph you will see a message saying a recovery file exists and asking whether you want to start from the last successful checkpoint or restart from the beginning. The same applies if it fails in phase 2. Phases are expensive from a disk I/O perspective, so you have to be careful about doing too much phasing.

Coming back to error trapping: each component has reject, error, and log ports. The reject port captures rejected records, the error port captures the corresponding errors, and the log port captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to "Never Abort" or "Abort on first reject", or by setting a ramp/limit.

Recovery files keep track of crucial information for recovering the graph from a failed status, such as which node each component is executing on. It is a bad idea to just remove the *.rec files; you always want to roll back the recovery files cleanly, so that temporary files created during graph execution don't hang around, occupy disk space, and create issues. Always use m_rollback -d.

18) What is parallelism in Ab Initio? ANSWER:
1) Component parallelism: a graph with multiple processes running simultaneously on separate data uses component parallelism.
2) Data parallelism: a graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio software provides Partition components to segment data, and Departition components to merge segmented data back together.
3) Pipeline parallelism: a graph with multiple components running simultaneously on the same data uses pipeline parallelism. Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel. NOTE: To limit the number of components running simultaneously, set phases in the graph.

20) What is a sandbox? ANSWER: A sandbox is a directory structure in which each directory level is assigned a variable name; it is used to manage check-in and checkout of repository-based objects such as graphs.

fin -------> top level directory ( $AI_PROJECT )

|
|---- dml -------> second level directory ( $AI_DML )
|
|---- xfr -------> second level directory ( $AI_XFR )
|
|---- run -------> second level directory ( $AI_RUN )

You'll require a sandbox when you use the EME (repository software) to maintain release control. Within the EME, an identical structure will exist for the same project. The above-mentioned structure will exist under the OS (e.g. Unix) for the project called fin, which is usually the name of the top-level directory. In the EME, a similar structure will exist for the project fin. When you check out or check in a whole project, or an object belonging to a project, the information is exchanged between these two structures. For instance, if you check out a dml called fin.dml for the project called fin, you need a sandbox with the same structure as the EME project called fin. Once you've created that, as shown above, fin.dml (or a copy of it) will come out from the EME and be placed in the dml directory of your sandbox.

21) How can I read data which contains variable-length records with different record structures and no delimiters? ANSWER:
a) Try using the Read Raw component; it should do exactly what you are looking for.
b) Use the dml format: record string(integer(4)) my_field_name; end

22) How do I create subgraphs in Ab Initio? ANSWER: First, highlight all of the components you would like to have in the subgraph, click on Edit, then click on Subgraph, and finally click on Create.

23) Suppose that you are changing fin.dml and you said checkout. Exactly how do you do it? Can you quote an example of where you use sandbox parameters and how exactly you create them? Do you keep two copies of the sandbox parameters, as we keep for our graphs and other files? ANSWER: Check-in and checkout with the EME:

Checkin: (sandbox) ---------------> EME
Checkout: EME ---------------> (sandbox)

4. You create a sandbox from the GDE via Project -----> Create Sandbox.
5. When creating a sandbox you specify a directory name (try it out; don't be afraid).
6. The EME contains one or many projects (a project is a collection of graphs and related files, plus a parameter file called the project parameter file).
7. The project parameter file, when it resides within the EME, is called the project parameters.
8. The project parameter file, when it resides within one's sandbox, is called the sandbox parameters. Therefore, the sandbox parameters are a copy of the project parameters and are local to the sandbox owner.
9. When project parameters change, the change will be reflected in your sandbox parameters if you check out a graph (and therefore a copy of the latest project parameters) after that change has taken place.
10. You edit sandbox parameters via Project -----> Edit Sandbox.
11. You edit project parameters via Project -----> Administrative -----> Edit Project.
12. When checking out an object, use Project -----> Checkout. Navigate down to the project of your choice, navigate down to the required directory (e.g. mp, dml or xfr), select the object required, then specify a sandbox name (i.e. the top-level directory of the directory structure called a sandbox). You will be prompted to confirm the checkout.
13. Sometimes, when you check out an object, you get a number of other objects checked out for you automatically; this happens due to dependency. Example: check out a graph (.mp file) and, in addition, you might get a .dml or .xfr file. You will also certainly get a .ksh file for the graph.
27) I was trying to use a user-defined function (int_to_date) inside a Rollup, to typecast date and time values originally stored as integers back to date forms and then concatenate them. The code I wrote is as below:

record
  datetime("YYYY-MM-DD HH24:MI:SS")("\001") output_date_format;
end

out :: int_to_date(record big endian integer(4) input_date_part; end in0,
                   record big endian integer(4) input_time_part; end in1)
begin
  let datetime("YYYY-MM-DD HH24:MI:SS")("\001") v_output_format =
    (datetime("YYYY-MM-DD HH24:MI:SS"))string_concat(
      (string("|"))(date("YYYY-MM-DD"))in0.input_date_part,
      (string("|"))(datetime("HH24:MI:SS"))decimal_lpad(
        ((string("|"))(decimal("|"))in1.input_time_part), 6));
  out.output_date_format :: v_output_format;
end;

out :: rollup(in) =
begin
  let datetime("YYYY-MM-DD HH24:MI:SS")("\001") rfmt_dt;
  rfmt_dt = int_to_date(in.reg_date, in.reg_time);
  out.datetime_output :: rfmt_dt;
  out.* :: in.*;
end;

However, I got an error at run time. The error message looked like: While compiling finalize: While compiling the statement: rfmt_dt = int_to_date(in.reg_date, in.reg_time); Error: While compiling transform int_to_date: Output object "out.output_date_format" unknown. What is wrong?

28) I have a small problem understanding Reformat. I could not figure out why this Reformat component runs forever; I believe it is in an endless loop somehow. The Reformat component has the following input and output DML:

record begin string(",") code, code2; intger(2) count ; end("\n")

Note: here the variable "code" is never null nor blank. Sample data is:

string_1,name,location,firstname,lastname,middlename,0
string_2,job,location,firstjob,lastjob,0
string_3,design,color,paint,architect,0

out :: reformat(in) =
begin
  let string(integer(2)) temp_code2 = in.code2;
  let string(integer(2)) temp_code22 = " ";
  let integer(2) i = 0;
  while (string_index(temp_code2, ",") != 0 || temp_code2 != "")
  begin
    temp_code22 = string_concat(in.code, ",",
        string_substring(temp_code2, 1, string_index(temp_code2, ",")));
    temp_code2 = string_substring(temp_code2,
        string_index(temp_code2, ","), string_length(temp_code2));
    i = i + 1;
  end
  out.code :: in.code;
  out.code2 :: string_lrtrim(temp_code22);
  out.count :: i;
end;

My expected output is:

string_1,string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlename,5
string_2,string_2,job,string_2,location,string_2,firstjob,string_2,lastjob,4
string_3,string_3,design,string_3,color,string_3,paint,string_3,architect,4

ANSWER: The record format as posted does not validate in my Ab Initio ("intger" should be "integer"):

record begin string(",") code, code2; integer(2) count; end("\n")

29) In my graph I am creating a file with account data. For a given account there can be multiple rows of data. I have to split this file into (specifically) 4 files which are nearly equal in size. The trick is to keep the accounts confined to one file; in other words, account data should not span across these files. How do I do it? Also, if there are fewer than 4 different accounts, I should still be able to create empty files, because I need at least 4 files. FYI: the requirement to have 4 files is because I need to start 4 parallel processes for load balancing the subsequent processes.

ANSWER:
a) I could not get your requirement very clearly, as you want to split the files into 4 equal parts as well as keep the same account numbers in the same file. Can you explain what you would do in the case of 5 account numbers having 20 records each? As far as splitting is concerned, a very crude solution would be as follows. In the end script, do the following:
1. Find the size of the file and store it in a variable (say v_size).
2. v_qtr_size=`expr $v_size / 4`
3. split -b $v_qtr_size
4. Rename the split files as per your requirement. Note that the split files have a specific pattern in their names.

b) Your requirement essentially depends on the skewness of your data across accounts. If you want to keep the same accounts in the same partition, then partition the data by key (account) with the out port connected to a 4-way parallel layout. But this does not guarantee equal load in all partitions unless the data has little skewness. I can suggest an alternative approach which, though cumbersome, might give you a result close to your requirement.

Replicate your original dataset into two, take one copy, and rollup on account_no to find the record count per account_no. Now sort this result on record count, so that you have the account_no with the minimum count at the top and the one with the maximum count at the bottom. Now apply a partition by round-robin and separate out the four partitions (partitions 0, 1, 2 & 3).

Now take the first partition and join it with your main dataset (the one you replicated earlier) on account_no, and write the matching records (out port) into the first file. Take the unused records of your main flow from the previous join and join them with the second partition (partition 1) on account_no, writing the matching records (out port) to the second file. Similarly, again take the unused records of the previous join and join them with the third partition (partition 2) on account_no. Write the matching records (out port) to the third file, and the unused records of the main flow to the fourth file. This way you get four files, nearly equal in size, with no account spread across files.

30) I have a graph parameter state_cd whose value is set by an if statement. I would like to use this variable in SQL statements in the AI_SQL directory. I have 20 SQL statements for 20 table codes, and I will use the corresponding SQL statement based on the table code passed as a parameter to the graph. E.g., SQLs in the AI_SQL directory:

1. Select a,b from abc where abc.state_cd in ${STATE_CD}
2. Select x,y from xyz where xyz.state_cd in ${STATE_CD}

${STATE_CD} is a graph parameter. In value: "(IL,CO,MI)"

30) I have a graph parameter state_cd whose value is set based on an if statement. I would like to use this variable in SQL statements in the AI_SQL directory. I have 20 SQL statements for 20 table codes, and I will use the corresponding SQL statement based on the table code passed as a parameter to the graph. For example, SQLs in the AI_SQL directory:
1. select a,b from abc where abc.state_cd in ${STATE_CD}
2. select x,y from xyz where xyz.state_cd in ${STATE_CD}
${STATE_CD} is a graph parameter; its value is "(IL,CO,MI)". The problem is that ${STATE_CD} is not getting interpreted when I echo the Select statement.

ans: Anand, use eval or an export in the Input Table components, or define ${STATE_CD} in your start script; that is better.
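A hedged sketch of the start-script approach: export the parameter so the shell can substitute it, and use eval to force interpolation of SQL text read from a file. The file name abc_extract.sql and the use of an AI_SQL sandbox variable are illustrative; depending on your database, the state codes may also need quoting.

# In the graph's start script:
export STATE_CD="(IL,CO,MI)"              # value from the question; quote codes if the DB needs it

# The SQL file contains the literal text:
#   select a,b from abc where abc.state_cd in ${STATE_CD}
select_stmt=`cat $AI_SQL/abc_extract.sql`

# eval makes the shell expand ${STATE_CD} inside the statement:
eval "echo \"$select_stmt\""              # prints: select a,b from abc where abc.state_cd in (IL,CO,MI)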

31) Explain kill() and its possible return values.
Ans: There are four possible results from this call:
- kill() returns 0. This implies that a process exists with the given PID, and the system would allow you to send signals to it. It is system-dependent whether the process could be a zombie.
- kill() returns -1 with errno == ESRCH: either no process exists with the given PID, or security enhancements are causing the system to deny its existence. (On some systems, the process could be a zombie.)
- kill() returns -1 with errno == EPERM: the system would not allow you to kill the specified process. This means that either the process exists (again, it could be a zombie) or draconian security enhancements are present (e.g., your process is not allowed to send signals to *anybody*).
- kill() returns -1 with some other value of errno: you are in trouble!
The most-used technique is to assume that success, or failure with EPERM, implies that the process exists, and that any other error implies it doesn't. An alternative exists if you are writing specifically for a system (or all those systems) that provides a /proc filesystem: checking for the existence of /proc/PID may work.
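A shell analog of the technique above, offered as a hedged sketch: kill -0 sends no signal but performs the same existence and permission checks, and the /proc test covers the EPERM case on systems that provide a /proc filesystem. The PID is illustrative.

pid=12345                                             # illustrative PID
if kill -0 $pid 2>/dev/null; then
  echo "process $pid exists and we may signal it"     # the kill() returned 0 case
elif [ -d /proc/$pid ]; then
  echo "process $pid exists, but we lack permission"  # the EPERM case
else
  echo "no process with PID $pid"                     # the ESRCH case
fi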

32) What is the difference between m_rollback and m_cleanup, and when would I use them?
Ans: m_rollback has the same effect as an automatic rollback: using the jobname.rec file, it rolls back a job to the last completed checkpoint, or to the beginning if the job has not completed any checkpoints. The m_cleanup commands are used when the jobname.rec file doesn't exist and you want to remove temporary files and directories left by failed jobs.
Details: In the course of running a job, the Co>Operating System creates a jobname.rec file in the working directory on the run host.
NOTE: The script takes jobname from the value of the AB_JOB environment variable. If you have not specified a value for AB_JOB, the GDE supplies the filename of the graph as the default value for AB_JOB when it generates the script.
The jobname.rec file contains a set of pointers to the internal job-specific files written by the launcher, some of which the Co>Operating System uses to recover a job after a failure. The Co>Operating System also creates temporary files and directories in various locations. When a job fails, it typically leaves the jobname.rec file, the temporary files and directories, and many of the internal job-specific files on disk. (When a job succeeds, these files are automatically removed, so you don't have to worry about them.)
If your job fails, determine the cause and fix the problem. Then:

- If desired, restart the job. If the job succeeds, the jobname.rec file and all the temporary files and directories are cleaned up.

- Alternatively, run m_rollback -d to clean up the files left behind by the failed job.
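A hedged sketch of the cleanup step; the job name mygraph and the run-directory path are illustrative, and the exact m_cleanup invocation should be taken from your Co>Operating System documentation.

cd /data/sandbox/run            # working directory on the run host (assumed location)
m_rollback -d mygraph.rec       # roll the failed job back and delete its recovery and temporary files
# If mygraph.rec no longer exists, the m_cleanup commands remove the
# temporary files and directories the failed job left behind.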

33) What value should I set for the max-core parameter?
Short answer: The max-core parameter is found in the SORT, JOIN, and ROLLUP components, among others. There is no single optimal value for max-core, because a "good" value depends on your particular graph and the environment where it runs.
Details: The SORT component works in memory, and the ROLLUP and JOIN components have the option to do so. These components have a parameter called max-core, which determines the maximum amount of memory they will consume per partition before they spill to disk. When the value of max-core is exceeded in any of the in-memory components, all inputs are dropped to disk. This can have a dramatic impact on performance, but it does not mean that it is always better to increase the value of max-core. The higher you set the value of max-core, the more memory the component can use. Using more memory generally improves performance, up to a point. Beyond this point, performance will not improve and might even decrease. If the value of max-core is set too high, operating system swapping can occur and the graph might fail if memory on the machine is exhausted.
When setting the value for max-core, you can use the suffixes k, m, and g (uppercase is also supported) to indicate powers of 1024. For max-core, the suffix k (kilobytes) means precisely 1024 bytes, not 1000. Similarly, the suffix m (megabytes) means precisely 1048576 (1024^2) bytes, and g (gigabytes) means precisely 1024^3 bytes. Note that the maximum value for max-core is 2g-1.
SORT component
For the SORT component, 100 MB is the default value for max-core. This default is meant to cover a wide variety of situations and might not be ideal for your particular circumstances. Increasing the value of max-core will not increase performance unless the full dataset can be held in memory, or the data volume is so large that a reduction in the number of temporary files improves performance. You can estimate the number of temporary files by multiplying the data volume being sorted by three and dividing by the value of max-core (because data is written to disk in blocks that are one third the size of the max-core setting). This number should be less than 1000. For example, suppose you are sorting 1 GB of data with the default max-core setting of 100 MB and the process is running in serial. The number of temporary files that will be created is: 3 × 1000 MB / 100 MB = 30 files.
You should decrease the value of a SORT component's max-core if an in-memory ROLLUP or JOIN component in the same phase would benefit from additional memory; the net performance gain will be greater. If you get a "Too many open files" error message, your SORT component's max-core might be set too low. If this is the case, SORT can also fill AB_WORK_DIR (usually set to /var/abinitio at installation), which will cause all graphs to fail with a message about semaphores. This directory is where recovery information and inode information for named pipes are stored, and it is typically mounted on a small filesystem.
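A quick way to check the temporary-file estimate from the rule above; the figures reproduce the 1 GB / 100 MB example and are otherwise illustrative.

data_mb=1000                                   # volume being sorted, in MB
max_core_mb=100                                # SORT max-core, in MB
est_files=`expr 3 \* $data_mb / $max_core_mb`  # 3 x data volume / max-core
echo "estimated temporary files: $est_files"   # 30 here; keep this below 1000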

NOTE: We recommend setting the value of max-core as a $ reference to a parameter (for example, $AI_SORT_MAX_CORE) so you can easily adjust the value at runtime if required.
In-memory ROLLUP or JOIN
It is difficult to be precise about the amount of memory an in-memory ROLLUP or JOIN will consume. An in-memory JOIN tries to hold all its nondriving inputs in memory, so make the largest input by volume the driving one. A ROLLUP component must hold the size of the keys, plus the size of the temporaries, plus the size of any input fields required in finalize to produce the output; in practice, in most ROLLUP components, this is just the size of the output. In addition, some space is needed for the hash table.
You should always set the max-core parameter in in-memory ROLLUP and JOIN components with a parameter, like AI_GRAPH_MAX_CORE. The default can be set to the appropriate value and changed at runtime if required. You can create additional parameters such as AI_GRAPH_MAX_CORE_HALF and AI_GRAPH_MAX_CORE_QUARTER to divide up the available max-core among different in-memory components in a phase. If two in-memory components each need most or all of AI_GRAPH_MAX_CORE, you should put them in separate phases, provided you have the disk space to hold the data at the phase break.
A second use of phasing is to control the allocation of memory among in-memory components. Because there is a limited amount of memory available, you can use phasing to make sure each in-memory component gets a sufficient amount. Typically, only one to four in-memory components should occupy the same phase, depending on memory availability and demands.
To compute a value for AI_GRAPH_MAX_CORE, take the total memory on the machine and subtract the memory used by lookups and competing processes, including other graphs running at the same time on the machine. This is the available memory. Divide it by twice the number of partitions to get AI_GRAPH_MAX_CORE: max-core is measured per partition, and the factor of two gives a contingency safety factor. So:
AI_GRAPH_MAX_CORE = (total memory - memory used elsewhere) / (2 * number of partitions)
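A worked version of this calculation as a hedged shell sketch; the machine figures (16 GB total, 4 GB used elsewhere, 8 partitions) are illustrative.

total_mb=16384                             # total memory on the machine, in MB
used_elsewhere_mb=4096                     # lookups, other graphs, OS, etc., in MB
partitions=8
avail_mb=`expr $total_mb - $used_elsewhere_mb`
max_core_mb=`expr $avail_mb / 2 / $partitions`
echo "AI_GRAPH_MAX_CORE=${max_core_mb}m"   # 768m in this example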
