Ab Initio FAQ 2
ANEESH Technologies
Ab Initio FAQs

1) What is the function you would use to transfer a string into a decimal?
2) How many kinds of parallelism are there in Ab Initio? Please give a definition of each.
3) What is the difference between a DB config and a CFG file?
4) Have you ever encountered an error called "Pipeline Broken"? (This occurs when you create graphs extensively; it is a trick question.)
5) How do you truncate a table? (Each candidate typically names only one of the several ways to do this.)
6) How do you improve the performance of a graph?
7) What is the difference between partitioning with a key and round-robin?
8) Have you worked with packages?
9) How do you add default rules in the transformer?
10) What are the ramp/limit and max-core values for Scan, Rollup, Sort, and Replicate?
11) Have you used the Rollup component? Describe how.
12) How many components were in your most complicated graph?
13) Do you know what a local lookup is?
14) What is an Ad hoc multifile? How is it used?

Here is a description of the Ad hoc multifile: Ad hoc multifiles treat several serial files having the same record format as a single graph component. Frequently, the input of a graph consists of a set of serial files, all of which have to be processed as a unit. An Ad hoc multifile is a multifile created "on the fly" out of a set of serial files, without needing to define a multifile system to contain it. This enables you to represent the needed set of serial files with a single input file component in the graph. Moreover, the set of files used by the component can be determined at runtime. This lets the user customize which set of files the graph uses as input without having to change the graph itself, even after it goes into production.
Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files. The simplest way to define an Ad hoc multifile is to list the files explicitly, as follows:
1. Insert an input file component in your graph.
2. Open the Properties dialog and select the Description tab.
3. Select Partitions in the Data Location of the Description tab.
4. Click Edit to open the Define Multifile Partitions dialog box.
5. Click New and enter the first file name; click New again and enter the second file name, and so on.
6. Click OK.
If you have added n files, the input file now acts something like a file in an n-way multifile system whose data partitions are the n files you listed. It is possible for components to run in the layout of the input file component. However, there is no way to run commands such as m_ls or m_dump on the files, because they do not comprise a real multifile system. There are other ways than listing the input files explicitly in an Ad hoc multifile:
1. Listing files using wildcards - if the input file names have a common pattern, you can use a wildcard, e.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files found at runtime that match the wildcard pattern will be taken into the Ad hoc multifile.
2. Listing files in a variable - you can create a runtime parameter for the graph and, inside the parameter, list all the files separated by spaces.
3. Listing files using a command - e.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used for the Ad hoc multifile. This method gives maximum flexibility in choosing the input files, since you can also use complex commands that involve file ownership or date-time stamps.

15) How can I tune a graph so it does not excessively consume CPU?
ANSWER: Options:
1. Reduce the DOP (degree of parallelism) for components.
Example: change from a 4-way parallel to a 2-way parallel layout.
2. Examine each transformation for inefficiencies.
Examples: if a transformation uses many local variables, make these variables global; if the same function call is performed more than once, call it once and store its value in a global variable.
3. When reading data, reduce the amount of data that needs to be carried forward to the next component.

16) I'm having trouble finding information about the AB_JOB variable. Where and how can I set this variable?
ANSWER: You can change the value of the AB_JOB variable in the start script of a given graph. This will enable you to run the same graph multiple times at the same time (thus in parallel). However, make sure you append a unique identifier, such as a timestamp or sequential number, to the end of each AB_JOB value you assign. You will also need to vary the file names of any outputs to keep the runs from stepping on each other's outputs. I have used this technique to create a "utility" graph as a container for a start script that runs another graph multiple times, depending on the local variable input to the "utility" graph. Be careful you don't max out the capacity of the server you are running on.
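A minimal start-script sketch (ksh) of this technique; the suffix scheme and the output parameter name are illustrative assumptions, not from the original answer:

# start script: make this run's job name unique so several runs can coexist
export AB_JOB="${AB_JOB}_$(date '+%Y%m%d%H%M%S')"   # or append a sequence number
# vary output file names the same way so parallel runs don't collide
# (OUTPUT_FILE is a hypothetical graph parameter)
export OUTPUT_FILE="$AI_SERIAL/result_${AB_JOB}.dat"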
17) I have a job that does the following: FTPs files from a remote server, reformats the data in those files and updates the database, then deletes the temporary files. How do we trap errors generated by Ab Initio when an FTP fails? If I have to re-run / restart a graph, what points should be considered? Does the *.rec file have anything to do with it?
ANSWER: Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks: FTP in one phase, your transformation in the next phase, and the DB update in another phase. (This is just an example; it may not be the best way of doing it, as the best design depends on various other factors.) If the graph fails during FTP, it fails in phase 0 and you can simply restart it. If the graph fails in phase 1, the AB_JOB.rec file exists, and when you restart the graph you will see a message saying a recovery file exists and asking whether you want to start the graph from the last successful checkpoint or restart from the beginning. The same applies if it fails in phase 2. Phases are expensive from a disk-I/O perspective, so be careful not to do too much phasing. Coming back to error trapping: each component has reject, error, and log ports. The reject port captures rejected records, the error port captures the corresponding errors, and the log port captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to "Never Abort", "Abort on first reject", or a ramp/limit setting.
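As a general note on the ramp/limit setting (standard Ab Initio reject-threshold behavior, not spelled out in the original answer): the component aborts once its number of rejects exceeds

    limit + (ramp * number of records processed so far)

so, for example, with limit = 10 and ramp = 0.01, up to 10 + 0.01 * 1000 = 20 rejects are tolerated after 1000 records.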
Recovery files keep track of crucial information needed to recover a graph from failed status, such as which node each component was executing on. It is a bad idea to simply remove the *.rec files; you always want to roll back the recovery files cleanly, so that temporary files created during graph execution do not hang around, occupy disk space, and create issues. Always use m_rollback -d.

18) What is parallelism in Ab Initio?
ANSWER:
1) Component parallelism: a graph with multiple processes running simultaneously on separate data uses component parallelism.
2) Data parallelism: a graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data-processing tasks can use data parallelism. To support this form of parallelism, Ab Initio software provides Partition components to segment data and Departition components to merge segmented data back together.
3) Pipeline parallelism: a graph with multiple components running simultaneously on the same data uses pipeline parallelism. Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel. NOTE: To limit the number of components running simultaneously, set phases in the graph.

19) How can I determine the requirements for a new ETL and/or OLAP tool for our organization?
ANSWER: Here are some things to evaluate when considering a new ETL tool:
# How many sources are users going to extract information from?
# What are the types of sources? I.e. flat files, RDBMS, ERPs, mainframe, legacy systems, Excel files, etc.
# What are the expected sizes of these input files?
# Is there a need to join information from different sources?
# What is the expected growth per month in terms of records, size, etc.?
# What is the current environment for staging areas, loading the data warehouse, and OS?
# What is the budget for purchase of the ETL tool?
Here are some things to evaluate when considering a new OLAP tool:
# How many developers need to access the tool?
# Is there any security to be implemented for access?
# What is the level of end users' knowledge about OLAP, analysis, and data warehousing?
# What is the budget for purchase of the OLAP tool?
# What is the report requirement - static or ad hoc?
# What is the functional domain for analysis? This may help in selecting specific analytic applications.

20) What is a sandbox?
ANSWER: A sandbox is a directory structure in which each directory level is assigned a variable name; it is used to manage check-in and checkout of repository-based objects such as graphs.

fin  ------->  top-level directory ( $AI_PROJECT )
 |
 |---- dml  ------->  second-level directory ( $AI_DML )
 |
 |---- xfr  ------->  second-level directory ( $AI_XFR )
 |
 |---- run  ------->  second-level directory ( $AI_RUN )

You'll require a sandbox when you use the EME (the repository software) to maintain release control. Within the EME an identical structure will exist for the same project. The structure above exists under the OS (e.g. Unix) for a project called fin, which is usually the name of the top-level directory. In the EME a similar structure will exist for the project fin. When you check out or check in a whole project, or an object belonging to a project, the information is exchanged between these two structures. For instance, if you check out a dml called fin.dml for the project fin, you need a sandbox with the same structure as the EME project fin. Once you have created that, as shown above, fin.dml (or a copy of it) will come out of the EME and be placed in the dml directory of your sandbox.
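A minimal sketch (ksh) of the variables behind the structure above; the sandbox path is an illustrative assumption - in practice these assignments live in the sandbox parameter file maintained through the GDE:

export AI_PROJECT=/home/user/sand/fin     # illustrative top-level directory of the sandbox
export AI_DML=$AI_PROJECT/dml             # record formats (.dml)
export AI_XFR=$AI_PROJECT/xfr             # transform files (.xfr)
export AI_RUN=$AI_PROJECT/run             # deployed run scripts (.ksh)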
21) How can I read data which contains variable-length records with different record structures and no delimiters?
ANSWER:
a) Try using the Read Raw component; it should do exactly what you are looking for.
b) Use the dml format:
record
    string(integer(4)) my_field_name;
end

22) How do I create subgraphs in Ab Initio?
ANSWER: First, highlight all of the components you would like to have in the subgraph, click on Edit, then click on Subgraph, and finally click on Create.

23) Suppose you are changing fin.dml. You said "check it out" - exactly how do you do it? And one more thing: can you quote an example of where you use sandbox parameters, and how exactly do you create them? Do you keep two copies of the sandbox parameters as well, as we keep for our graphs and other files?
ANSWER: Check-in and checkout exchange objects between your sandbox and the EME:
Check-in:  sandbox ---------------> EME      ( Project -----> checkin option )
Checkout:  EME ---------------> sandbox      ( Project -----> checkout option )
4. You create a sandbox from the GDE via the Project -----> Create sandbox option.
5. When creating a sandbox you specify a directory name (try it out; don't be afraid).
6. The EME contains one or many projects (a project is a collection of graphs and related files, plus a parameter file called the project parameter file).
7. The project parameter file, when it resides within the EME, is called the project parameter.
8. The project parameter file, when it resides within one's sandbox, is called the sandbox parameter. The sandbox parameter is therefore a copy of the project parameter and is local to the sandbox owner.
9. When the project parameters change, the change will be reflected in your sandbox parameters, provided you check out a graph (and with it a copy of the latest project parameters) after that change has taken place.
10. You edit sandbox parameters via the Project -----> Edit sandbox option.
11. You edit project parameters via the Project -----> Administrative -----> Edit project option.
12. When checking out an object, use the Project -----> Checkout option. Navigate down to the project of your choice, navigate down to the required directory (e.g. mp, dml or xfr), and select the object required. Then specify a sandbox name (i.e. the top-level directory of the directory structure called the sandbox). You will be prompted to confirm the checkout.
13. Sometimes when you check out an object you get a number of other objects checked out for you automatically; this happens due to dependency.
Example: check out a graph (.mp file). In addition you might get a .dml and a .xfr file, and you will certainly get a .ksh file for the graph.

24) I have a small problem understanding Reformat: I could not figure out why this Reformat component runs forever. I believe it is in an endless loop somehow. The Reformat component has the following input and output DML:
record begin
    string(",") code, code2;
    intger(2) count;
end("\n")
Note: here the variable "code" is never null nor blank. Sample data is:
string_1,name,location,firstname,lastname,middlename,0
string_2,job,location,firstjob,lastjob,0
string_3,design,color,paint,architect,0
out::reformat(in) =
begin
    let string(integer(2)) temp_code2 = in.code2;
    let string(integer(2)) temp_code22 = " ";
    let integer(2) i = 0;
    while (string_index(temp_code2, ",") != 0 || temp_code2 != "")
    begin
        temp_code22 = string_concat(in.code, ",", string_substring(temp_code2, 1, string_index(temp_code2, ",")));
        temp_code2 = string_substring(temp_code2, string_index(temp_code2, ","), string_length(temp_code2));
        i = i + 1;
    end
    out.code :: in.code;
    out.code2 :: string_lrtrim(temp_code22);
    out.count :: i;
end;
My expected output is:
string_1,string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlename,5
string_2,string_2,job,string_2,location,string_2,firstjob,string_2,lastjob,4
string_3,string_3,design,string_3,color,string_3,paint,string_3,architect,4
ANSWER:
record begin
    string(",") code, code2;
    integer(2) count;
end("\n")
The DML as posted ("intger") does not validate in my Ab Initio.
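Beyond the DML typo, the loop itself can never terminate: string_substring is handed the comma's own position as the new start, so temp_code2 keeps its leading comma and never shrinks. A minimal sketch of one way to rewrite the loop, assuming string_substring(str, start, length) semantics and the field names from the question; treat it as an illustration, not tested code:

/* illustrative rewrite of the transform above, not tested code */
out :: reformat(in) =
begin
    let string("") remaining = in.code2;
    let string("") result = "";
    let integer(2) i = 0;
    let integer(4) pos = 0;
    while (remaining != "")
    begin
        pos = string_index(remaining, ",");
        if (pos == 0)
        begin
            /* last piece: no comma left */
            result = string_concat(result, in.code, ",", remaining);
            remaining = "";
        end
        else
        begin
            result = string_concat(result, in.code, ",", string_substring(remaining, 1, pos - 1), ",");
            /* advance PAST the comma (pos + 1), not to it */
            remaining = string_substring(remaining, pos + 1, string_length(remaining));
        end
        i = i + 1;
    end
    out.code :: in.code;
    out.code2 :: result;
    out.count :: i;
end;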
25) In my graph I am creating a file with account data. For a given account there can be multiple rows of data. I have to split this file into (specifically) 4 files which are nearly equal in size. The trick is to keep each account confined to one file; in other words, account data should not span across these files. How do I do it? Also, if there are fewer than 4 different accounts, I should still be able to create empty files, because I need at least 4 files. FYI: the requirement for 4 files exists because I need to start 4 parallel processes for load-balancing the subsequent processes.
ANSWER:
a) I could not understand your requirement very clearly, as you want to split the file into 4 equal parts while also keeping the same account numbers in the same file. Can you explain what you would do in the case of 5 account numbers having 20 records each? As far as the splitting itself is concerned, a very crude solution would be as follows, in the end script:
1. Find the size of the file and store it in a variable (say v_size).
2. v_qtr_size=`expr $v_size / 4`
3. split -b $v_qtr_size
4. Rename the split files as per your requirement. Note that the split files have a specific pattern in their names. (See the sketch after this answer.)
b) Your requirement essentially depends on the skewness of your data across accounts. If you want to keep the same accounts in the same partition, partition the data by key (account) with the out port connected to a 4-way parallel layout. But this does not guarantee an equal load in all partitions unless the data has little skewness. I can suggest an alternative approach which, though cumbersome, might give you a result close to your requirement. Replicate your original dataset into two, take one of them, and roll up on account_no to find the record count per account_no. Now sort this result on record count so that the account_no with the minimum count is at the top and the one with the maximum count is at the bottom. Now apply a partition by round-robin and separate out the four partitions (partitions 0, 1, 2 and 3). Take the first partition and join it with your main dataset (the one you replicated earlier) on account_no, and write the matching records (out port) into the first file. Take the unused records of the main flow of the previous join, join them with the second partition (partition 1) on account_no, and write the matching records (out port) to the second file. Similarly, again take the unused records of the previous join and join them with the third partition (partition 2) on account_no; write the matching records (out port) to the third file and the unused records of the main flow to the fourth file. This way you get four files, nearly equal in size, with no account spread across files.
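A minimal sketch (ksh) of the crude byte-split in answer (a); the file name is an illustrative assumption, and note that a byte split ignores record and account boundaries, which is why answer (b) is the safer approach:

v_size=$(wc -c < $AI_SERIAL/accounts.dat)            # file size in bytes (accounts.dat is illustrative)
v_qtr_size=`expr $v_size / 4`                        # a quarter of it
split -b $v_qtr_size $AI_SERIAL/accounts.dat part_   # produces part_aa, part_ab, ...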
26) I was trying to use a user-defined function (int_to_date) inside a Rollup to cast date and time values originally stored as integers back to date form and then concatenate them. The code I wrote is as below:
record
    datetime("YYYY-MM-DD HH24:MI:SS")("\001") output_date_format;
end
out::int_to_date(record big endian integer(4) input_date_part; end in0,
                 record big endian integer(4) input_time_part; end in1) =
begin
    let datetime("YYYY-MM-DD HH24:MI:SS")("\001") v_output_format =
        (datetime("YYYY-MM-DD HH24:MI:SS")) string_concat(
            (string("|"))(date("YYYY-MM-DD")) in0.input_date_part,
            (string("|"))(datetime("HH24:MI:SS")) decimal_lpad(((string("|"))(decimal("|")) in1.input_time_part), 6));
    out.output_date_format :: v_output_format;
end;
out::rollup(in) =
begin
    let datetime("YYYY-MM-DD HH24:MI:SS")("\001") rfmt_dt;
    rfmt_dt = int_to_date(in.reg_date, in.reg_time);
    out.datetime_output :: rfmt_dt;
    out.* :: in.*;
end;
However, I got an error at run time. The error message looked like:
"While compiling finalize: While compiling the statement: rfmt_dt = int_to_date(in.reg_date, in.reg_time);
Error: While compiling transform int_to_date: Output object "out.output_date_format" unknown."
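No answer is given in the original. As a hedged note: the error suggests the record type declared above int_to_date is not actually bound to the function's out object, so out.output_date_format is unknown inside it. Since the function produces a single value, one possible workaround (an untested sketch, keeping the question's casts) is to return the value directly instead of assigning to a named output field:

/* hedged workaround sketch, not a confirmed fix */
out :: int_to_date(record big endian integer(4) input_date_part; end in0,
                   record big endian integer(4) input_time_part; end in1) =
begin
    /* return the datetime value itself; no output record needed */
    out :: (datetime("YYYY-MM-DD HH24:MI:SS")) string_concat(
        (string("|"))(date("YYYY-MM-DD")) in0.input_date_part,
        (string("|"))(datetime("HH24:MI:SS")) decimal_lpad((string("|"))(decimal("|")) in1.input_time_part, 6));
end;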
27) I have a graph parameter state_cd whose value is set by an if statement. I would like to use this variable in the SQL statements kept in the AI_SQL directory. I have 20 SQL statements for 20 table codes, and I use the corresponding SQL statement based on the table code passed as a parameter to the graph. E.g., SQLs in the AI_SQL directory:
1. Select a,b from abc where abc.state_cd in ${STATE_CD}
2. Select x,y from xyz where xyz.state_cd in ${STATE_CD}
${STATE_CD} is a graph parameter with the value "(IL,CO,MI)". The problem is that ${STATE_CD} is not getting interpreted when I echo the Select statement.
ANSWER: Anand, use eval, or export the variable for the Input Table components. Or, better, define ${STATE_CD} in your start script.
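A minimal sketch (ksh) of the eval approach; the SQL file name is an illustrative assumption:

# start script: give the parameter its value for this run (value from the question)
export STATE_CD="(IL,CO,MI)"
# read the statement, then let the shell expand ${STATE_CD} inside it
SQL_TEXT=$(cat $AI_SQL/table_abc.sql)   # table_abc.sql is a hypothetical file name
eval "echo \"$SQL_TEXT\""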