Teradata Interview questions 4
February 13, 2017 | Author: Farooq Bmd | Category: N/A
1. Explain Teradata Architecture

Major Components of Teradata Architecture

NODE: A node is made up of various hardware and software components. The components that make up a node are:
1. Parsing Engine (PE)
2. BYNET
3. Access Module Processor (AMP)
4. Disks

Parsing Engine: The Parsing Engine (PE) is a component that interprets SQL requests, receives input records, and passes data. To do that, it sends messages through the BYNET to the AMPs.

BYNET: The BYNET is the message-passing layer. It determines which AMP(s) (Access Module Processors) should receive a message.

Access Module Processor (AMP): The AMP is a virtual processor designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.

Disks: Disks are the disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.
All applications run under UNIX, Windows NT or Windows 2000, and all Teradata software runs under PDE. All share the resources of CPU and memory on the node. AMPs and PEs are virtual processors running under control of the PDE. Their numbers are software configurable. In addition to user applications, gateway software and channel driver support may also be running.

The Teradata RDBMS has a "shared-nothing" architecture, which means that the vprocs (which are the PEs and AMPs) do not share common components. For example, each AMP manages its own dedicated memory space (taken from the memory pool) and the data on its own vdisk; these are not shared with other AMPs. Each AMP uses system resources independently of the other AMPs, so they can all work in parallel for high overall system performance.

Symmetric Multi-Processor (SMP): A single node is a Symmetric Multi-Processor (SMP).

Massively Parallel Processing (MPP): When multiple SMP nodes are connected to form a larger configuration, we refer to this as a Massively Parallel Processing (MPP) system.
2. Functionality of each component in the Teradata architecture

Parsing Engine: A Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: 1. Session Control, 2. Parser, 3. Optimizer, 4. Dispatcher.
Session Control: The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parser: The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The PE also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizer: The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest alternative is chosen. This alternative is converted to executable steps, to be performed by the AMPs, which are then passed to the Dispatcher.

Dispatcher: The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher makes sure that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP or broadcast to all AMPs.
The BYNET handles the internal communication of the Teradata RDBMS. All communication between PEs and AMPs is done via the BYNET. When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. The messages are routed to the appropriate AMP(s), where result sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET. Depending on the nature of the dispatch request, the communication may be a:
• Broadcast: the message is routed to all nodes in the system.
• Point-to-point: the message is routed to one specific node in the system.

Once the message is on a participating node, PDE handles the multicast (it carries the message to just the AMPs that should get it). So, while a Teradata system does do multicast messaging, the BYNET hardware alone cannot do it; the BYNET can only do point-to-point and broadcast between nodes.

FEATURES OF BYNET: The BYNET has several unique features:
• Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1 (or equally distributed if there are more than two BYNETs present), and vice versa.
• Load balanced: traffic is automatically and dynamically distributed between both BYNETs.
• Scalable: as you add nodes to the system, overall network bandwidth scales linearly, meaning an increase in system size without loss of performance.
• High performance: an MPP system typically has two or more BYNET networks. Because all networks are active, the system benefits from the full aggregate bandwidth of all networks. Since the number of networks can be scaled, performance can also be scaled to meet the needs of demanding applications.

The technology of the BYNET is what makes Teradata's parallelism possible.

The Access Module Processor (AMP)
The Access Module Processor (AMP) is a virtual processor. An AMP will control some portion of each table on the system. AMPs do the physical work associated with generating an answer set, including sorting, aggregating, formatting and converting. An AMP can control up to 64 physical disks. The AMPs perform all database management functions in the system. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.

The Database Manager subsystem resides on each AMP. The Database Manager:
• Receives the steps from the Dispatcher and processes the steps. It has the ability to:
  - Lock databases and tables.
  - Create, modify, or delete definitions of tables.
  - Insert, delete, or modify rows within the tables.
  - Retrieve information from definitions and tables.
• Collects accounting statistics, recording accesses by session so users can be billed appropriately.
• Returns responses to the Dispatcher.

The Database Manager provides a bridge between the logical organization and the physical organization of the data on disks. The Database Manager also performs a space-management function that controls the use and allocation of space.
A disk array is a configuration of disk drives that utilizes specialized controllers to manage and distribute data and parity across the disks while providing fast access and data integrity. Each AMP vproc must have access to an array controller that in turn accesses the physical disks. AMP vprocs are associated with one or more ranks of data. The total disk space associated with an AMP vproc is called a vdisk. A vdisk may have up to three ranks. Teradata supports several protection schemes:
• RAID Level 5: Data and parity protection striped across multiple disks.
• RAID Level 1: Each disk has a physical mirror replicating the data.
• RAID Level S: Data and parity protection similar to RAID 5 but used for EMC disk arrays.
The disk array controllers are referred to as dual active array controllers, which means that both controllers are actively used in addition to serving as backup for each other.

3. How is Teradata parallel?

Teradata is parallel for the following reasons:

Each PE can support up to 120 user sessions in parallel. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets.

The MPL (Message Passing Layer) is implemented differently for different platforms; this means that it will always be well within the needed bandwidth for each particular platform's maximum throughput.

Each AMP can perform up to 80 tasks in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results.

In addition to this, the Optimizer may direct the AMPs to perform certain steps in parallel if there are no dependencies between the steps. This means that an AMP might be concurrently performing more than one step on behalf of the same request.
Query Parallelism: Breaking the request into smaller components, all components being worked on at the same time, with one single answer delivered. Parallel execution can incorporate all or part of the operations within a query, and can significantly reduce the response time of an SQL statement, particularly if the query reads and analyzes a large amount of data. Query parallelism is enabled in Teradata by hash-partitioning the data across all the VPROCs defined in the system. A VPROC provides all the database services on its allocation of data blocks. All relational operations such as table scans, index scans, projections, selections, joins, aggregations, and sorts execute in parallel across all the VPROCs simultaneously and unconditionally. Each operation is performed on a VPROC's data independently of the data associated with the other VPROCs.

4. Explain the mechanism of data distribution and data retrieval

Data Distribution: Teradata uses hash partitioning and distribution to randomly and evenly distribute data across all AMPs. The rows of every table are distributed among all AMPs, ideally evenly, according to their Primary Index value. The Primary Index value goes into the hashing algorithm and the output is a 32-bit Row Hash. The high-order 16 bits are referred to as the "bucket number" and are used to identify a hash map entry. The "hash bucket" is also referred to as the DSW (Destination Selection Word). This entry, in turn, is used to identify the AMP that will be targeted. The remaining 16 bits are not used to locate the AMP. Each hash map is simply an array that associates DSW values (or bucket numbers) with specific AMPs.

To locate a row, the AMP file system searches through a memory-resident structure called the Master Index. An entry in the Master Index will indicate that if a row with this Table ID and row hash exists, then it must be on a specific disk cylinder. The file system will then search through the designated Cylinder Index. There it will find an entry that indicates that if a row with this Table ID and row hash exists, it must be in one specific data block on that cylinder. The file system then searches the data block until it locates the row(s) or returns a No Rows Found condition code.
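The distribution step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for Teradata's proprietary hashing algorithm, the four-AMP system size is arbitrary, and the round-robin `HASH_MAP` is a hypothetical hash map layout. What it does preserve from the text is the shape of the mechanism: a 32-bit row hash, whose high-order 16 bits (the bucket number, or DSW) index an array that names the target AMP.

```python
import zlib

NUM_AMPS = 4            # assumed system size, for illustration only
NUM_BUCKETS = 1 << 16   # a 16-bit bucket number gives 65,536 hash map entries

# Hypothetical hash map: an array that associates each bucket number with an AMP.
HASH_MAP = [bucket % NUM_AMPS for bucket in range(NUM_BUCKETS)]


def row_hash(pi_value: str) -> int:
    """Return a 32-bit hash of the primary index value (CRC32 as a stand-in)."""
    return zlib.crc32(pi_value.encode("utf-8")) & 0xFFFFFFFF


def target_amp(pi_value: str) -> int:
    """Use the high-order 16 bits of the row hash to pick the hash map entry."""
    bucket = row_hash(pi_value) >> 16
    return HASH_MAP[bucket]


def distribute(pi_values) -> list:
    """Count how many rows land on each AMP for the given PI values."""
    counts = [0] * NUM_AMPS
    for value in pi_values:
        counts[target_amp(value)] += 1
    return counts


if __name__ == "__main__":
    # Many distinct PI values spread over the AMPs.
    print(distribute(f"cust-{i}" for i in range(10_000)))
    # One repeated PI value always hashes to the same AMP.
    print(distribute(["same-value"] * 10_000))
```

Because the same primary index value always produces the same row hash, every row with that value lands on the same AMP, which is why a highly non-unique primary index distributes rows unevenly.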
Data retrieval: Retrieving data from the Teradata RDBMS simply reverses the storage model process. A request made for data is passed on to a Parsing Engine (PE). The PE optimizes the request for efficient processing and creates tasks for the AMPs to perform, which results in the request being satisfied. Tasks are then dispatched to the AMPs via the BYNET. Often, all AMPs must participate in creating the answer set, such as when returning all rows of a table to a client application. Other times, only one or a few AMPs need to participate. The PE will ensure that only the AMPs that need to will be assigned tasks. Once the AMPs have been given their assignments, they retrieve the desired rows from their respective disks. The AMPs will sort, aggregate, or format if needed. The rows are then returned to the requesting PE via the BYNET. The PE takes the returned answer set and returns it to the requesting client application.
5. If PI is not defined on a Teradata table, what will happen?

Teradata tables must have a primary index. If none is specified while creating the table, Teradata supplies an automatically created one.

6. What are the types of indexes in Teradata?
• Unique Primary Index (UPI)
• Unique Secondary Index (USI)
• Non-Unique Primary Index (NUPI)
• Non-Unique Secondary Index (NUSI)
• Join Index
7. What is a secondary index? What are its uses?

A secondary index is an alternate path to the data. Secondary indexes are used to improve performance by allowing the user to avoid scanning the entire table during a query. A secondary index is like a primary index in that it allows the user to locate rows. Unlike a primary index, it has no influence on the way rows are distributed among AMPs. Secondary indexes are optional and can be created and dropped dynamically. Secondary indexes require separate subtables, which require extra I/O to maintain. Compared to primary indexes, secondary indexes allow access to information in a table by alternate, less frequently used paths. Teradata automatically creates a Secondary Index Subtable. The subtable will contain:
• Secondary Index Value
• Secondary Index Row ID
• Primary Index Row ID

When a user writes an SQL query that has an SI in the WHERE clause, the Parsing Engine will hash the Secondary Index Value. The output is the Row Hash of the SI. The PE creates a request containing the Row Hash and gives the request to the Message Passing Layer (which includes the BYNET software and network). The Message Passing Layer uses a portion of the Row Hash to point to a bucket in the Hash Map. That bucket contains an AMP number to which the PE's request will be sent. The AMP gets the request and accesses the Secondary Index Subtable pertaining to the requested SI information. The AMP will check to see if the Row Hash exists in the subtable and double-check the subtable row against the actual secondary index value. Then, the AMP will create a request containing the Primary Index Row ID and send it back to the Message Passing Layer. This request is directed to the AMP with the base table row, and the AMP easily retrieves the data row.

Secondary indexes can be useful for:
Satisfying complex condition
Processing aggregates
Value comparison
Matching character combination
Joining tables
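
The two-step lookup through the secondary index subtable described above can be sketched as follows. This is a simplified model, not Teradata's implementation: `amp_for` is a stand-in for the row hash and hash map, the subtable stores the primary index value instead of a real Row ID, and the `Amp` class, the sample rows and the column names are all hypothetical.

```python
import zlib

NUM_AMPS = 4  # assumed system size, for illustration only


def amp_for(value) -> int:
    """Stand-in for row hash plus hash map: map a value to an AMP number."""
    return zlib.crc32(str(value).encode("utf-8")) % NUM_AMPS


class Amp:
    """One AMP with its share of base rows and its share of the SI subtable."""

    def __init__(self):
        self.base_rows = {}    # primary index value -> data row
        self.si_subtable = {}  # secondary index value -> primary index value


def load(rows, pi_col, si_col):
    """Place each base row by its PI value and each subtable row by its SI value."""
    amps = [Amp() for _ in range(NUM_AMPS)]
    for row in rows:
        amps[amp_for(row[pi_col])].base_rows[row[pi_col]] = row
        amps[amp_for(row[si_col])].si_subtable[row[si_col]] = row[pi_col]
    return amps


def usi_lookup(amps, si_value):
    """Return (row, AMPs touched) for a unique secondary index value."""
    si_amp = amp_for(si_value)            # step 1: hash the SI value
    touched = [si_amp]
    subtable = amps[si_amp].si_subtable
    if si_value not in subtable:
        return None, touched
    pi_value = subtable[si_value]         # subtable row points at the base row
    base_amp = amp_for(pi_value)          # step 2: go to the AMP holding the row
    touched.append(base_amp)
    return amps[base_amp].base_rows[pi_value], touched


# Hypothetical sample data: emp_no is the primary index, email the unique SI.
EMPLOYEES = [
    {"emp_no": 1, "email": "ann@example.com"},
    {"emp_no": 2, "email": "bob@example.com"},
    {"emp_no": 3, "email": "cy@example.com"},
]
AMPS = load(EMPLOYEES, "emp_no", "email")

if __name__ == "__main__":
    print(usi_lookup(AMPS, "bob@example.com"))
```

The lookup never scans the table: it touches the AMP that owns the subtable row and then the AMP that owns the base row.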
8. Why is a secondary index needed?

Secondary indexes are used to improve performance by allowing the user to avoid scanning the entire table during a query. Secondary indexes are frequently used in the WHERE clause. The base table data is not redistributed when secondary indexes are defined. Secondary indexes can be useful for:
Satisfying complex condition
Processing aggregates
value comparison
Matching character combination
Joining tables
9. What are the different types of locks in Teradata?

Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. Locks are automatically acquired during the processing of a request and released at the termination of the request. There are four types of locks:

Exclusive Lock: Exclusive locks are only applied to databases or tables, never to rows. They are the most restrictive type of lock; all other users are locked out. Exclusive locks are used rarely, most often when structural changes are being made to the database.

Read Lock: Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted.

Write Lock: Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (Access lock readers). Until a Write lock is released, no new read or write locks are allowed.

Access Lock: Access locks can be specified by users who are not concerned about data consistency. The use of an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small single-row changes. Access locks are sometimes called "stale read" locks, i.e. you may get "stale data" that hasn't been updated.

Locks may be applied at three levels:
• Database: applies to all tables/views in the database
• Table/View: applies to all rows in the table/view
• Row Hash: applies to all rows with the same row hash

Lock types are automatically applied based on the SQL command:
• SELECT: applies a Read lock
• UPDATE: applies a Write lock
• CREATE TABLE: applies an Exclusive lock

10. When is an ACCESS lock used?

Access locks are used for quick access to tables in a multi-user environment, even if other requests are updating the data. They also have minimal effect on locking out others: when you use an access lock, virtually all requests are compatible with yours.
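
The compatibility rules described above can be summarized as a small lookup table. The sketch below is an illustration only: the `COMPATIBLE` mapping is derived from the four lock descriptions in the text, and `can_grant` is a hypothetical helper, not a Teradata API. It also ignores lock levels (database, table, row hash) and request queuing.

```python
# For each requested lock type, the set of already-held lock types it can coexist with.
COMPATIBLE = {
    "ACCESS": {"ACCESS", "READ", "WRITE"},  # blocked only by an Exclusive lock
    "READ": {"ACCESS", "READ"},             # several readers may share the data
    "WRITE": {"ACCESS"},                    # only Access-lock readers get through
    "EXCLUSIVE": set(),                     # all other users are locked out
}


def can_grant(requested: str, held_locks) -> bool:
    """Return True if the requested lock is compatible with every lock already held."""
    allowed = COMPATIBLE[requested]
    return all(held in allowed for held in held_locks)


if __name__ == "__main__":
    print(can_grant("ACCESS", ["WRITE"]))    # reading while an update is in process
    print(can_grant("READ", ["WRITE"]))      # a consistent read must wait
    print(can_grant("EXCLUSIVE", ["READ"]))  # structural changes wait for readers
```

Reading the first row of the table shows why an Access lock has so little effect on others: the only lock type it conflicts with is Exclusive.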
11. How to set the default database?

Setting the default database: The user name you log on with is normally your default database. For example, if you log on as
.logon abc; password:xyz
then abc is normally the default database. Queries that do not specify a database name will be made against your default database.

Changing the default database: The DATABASE command is used to change the default database. For example:
DATABASE birla;
sets your default database to birla, and subsequent queries are made against the birla database.

12. What is a cluster?

A cluster is a group of AMPs that act as a single fallback unit. Clustering has no effect on primary row distribution of the table, but the fallback row copy will always go to another AMP in the same cluster. Should an AMP fail, the primary and fallback row copies stored on that AMP cannot be accessed. However, their alternate copies are available through the other AMPs in the same cluster. The loss of an AMP in one cluster has no effect upon other clusters. It is possible to lose one AMP in each cluster and still have full access to all fallback-protected table data. If there are two AMP failures in the same cluster, the entire Teradata system halts. While an AMP is down, the remaining AMPs in the cluster must do their own work plus the work of the down AMP. The example shows an 8-AMP system set up in two clusters of 4 AMPs each.
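
The cluster failure rule above can be checked with a short sketch. It models the 8-AMP, two-cluster example from the text; the numbering of AMPs into clusters (AMPs 0 to 3 in cluster 0, AMPs 4 to 7 in cluster 1) and the status labels "normal", "degraded" and "halted" are assumptions made for illustration.

```python
from collections import Counter


def cluster_of(amp: int, cluster_size: int) -> int:
    """Assumed layout: consecutive AMP numbers are grouped into clusters."""
    return amp // cluster_size


def system_status(failed_amps, num_amps: int = 8, cluster_size: int = 4) -> str:
    """Classify the system given the set of failed AMPs.

    One failed AMP per cluster is survivable for fallback-protected tables;
    two failures in the same cluster halt the system.
    """
    failed = set(failed_amps)
    if any(amp < 0 or amp >= num_amps for amp in failed):
        raise ValueError("AMP number out of range")
    failures_per_cluster = Counter(cluster_of(amp, cluster_size) for amp in failed)
    if any(count >= 2 for count in failures_per_cluster.values()):
        return "halted"
    return "degraded" if failed else "normal"


if __name__ == "__main__":
    print(system_status([]))      # no failures
    print(system_status([0, 5]))  # one AMP down in each cluster
    print(system_status([0, 1]))  # two AMPs down in the same cluster
```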
13.What are the connections involved in Channel attached system? In channel-attached systems, there are three major software components, which play important roles in getting the requests to and from the Teradata RDBMS.
The client application is either written by a programmer or is one of Teradata's provided utility programs. Many client applications are written as "front ends" for SQL submission, but they are also written for file maintenance and report generation. Any client-supported language may be used provided it can interface to the Call Level Interface (CLI).

The Call Level Interface (CLI) is the lowest-level interface to the Teradata RDBMS. It consists of system calls which create sessions, allocate request and response buffers, create and de-block "parcels" of information, and fetch response information to the requesting client.

The Teradata Director Program (TDP) is a Teradata-supplied program that must run on any client system that will be channel-attached to the Teradata RDBMS. The TDP manages the session traffic between the Call Level Interface and the RDBMS. Its functions include session initiation and termination, logging, verification, recovery, and restart, as well as physical input to and output from the PEs (including session balancing) and the maintenance of queues. The TDP may also handle system security.

The Host Channel Adapter is a mainframe hardware component that allows the mainframe to connect to an ESCON or Bus/Tag channel. The PBSA (PCI Bus ESCON Adapter) is a PCI adapter card that allows a WorldMark server to connect to an ESCON channel. The PBCA (PCI Bus Channel Adapter) is a PCI adapter card that allows a WorldMark server to connect to a Bus/Tag channel.

14. What are the connections involved in a network-attached system?

In network-attached systems, there are four major software components that play important roles in getting the requests to and from the Teradata RDBMS.
The Call Level Interface (CLI) is a library of routines that resides on the client side. Client application programs use these routines to perform operations such as logging on and off, submitting SQL queries and receiving responses which contain the answer set. These routines are 98% the same in a network-attached environment as they are in a channel-attached one.

The Teradata ODBC (Open Database Connectivity) driver uses an open standards-based ODBC interface to provide client applications access to Teradata across LAN-based environments. NCR has ODBC drivers for both UNIX and Windows-based applications.

The Micro Teradata Director Program (MTDP) is a Teradata-supplied program that must be linked to any application that will be network-attached to the Teradata RDBMS. The MTDP performs many of the functions of the channel-based TDP, including session management. The MTDP does not control session balancing across PEs; Connect and Assign Servers that run on the Teradata system handle this activity.

The Micro Operating System Interface (MOSI) is a library of routines providing operating system independence for clients accessing the RDBMS. By using MOSI, only one version of the MTDP is needed to run on all network-attached platforms.
15. How do you replace a null value with a default value while loading?

Use the COALESCE function. Syntax: COALESCE(COL, 'DEFAULT')

16. What is COMPRESS?

By default, COMPRESS compresses NULL values. To compress other values, list those values explicitly in the COMPRESS clause.

17. How many values can we compress in Teradata?

Up to 255 distinct values can be compressed per column. Any column can be compressed except index columns, and compression does not apply to volatile tables.

18. Difference between volatile and global temporary tables?

Global Temporary Tables (GTT)
1. When a GTT is created, its definition goes into the Data Dictionary.
2. When materialized, its data goes into temp space.
3. As a result, the data is active until the session ends, while the definition remains until it is dropped with a DROP TABLE statement. If it is dropped from another session, the DROP TABLE statement must include ALL.
4. You can collect stats on a GTT.

Volatile Temporary Tables (VTT)
1. The table definition is stored in the system cache.
2. Data is stored in spool space.
3. As a result, both the data and the table definition are active only until the session ends.
4. Stats cannot be collected on a VTT.

19. Difference between PK and PI?

Primary Key:
• A relational concept used to determine relationships among entities and to define referential constraints.
• Not required, unless referential integrity checks are to be performed.
• Defined by the CREATE TABLE statement.
• Unique.
• Identifies a row uniquely.
• Value cannot be changed.
• Cannot be null.
• Not related to the access path.

Primary Index:
• Used to store rows on disk.
• Defined by the CREATE TABLE statement.
• Unique or non-unique.
• Used to distribute rows.
• Values can be changed.
• Can be null.
• Related to the access path.

20. What is multiple statement processing?

Multiple statement processing increases performance when loading into large tables. All statements are sent to the parser simultaneously, and all statements are executed in parallel.

21. What is TDPID?

TDPID is the identifier of the Teradata server that a client logs on to; on network-attached clients it resolves to the IP address of the Teradata server machine.

22. What is tenacity?

Tenacity specifies the number of hours that Teradata FastLoad continues trying to log on when the maximum number of load jobs is already running on the Teradata database.

23. What is Sleep?

Sleep specifies the number of minutes that Teradata FastLoad pauses before retrying a logon operation.

24. What is database skewing?

Skew occurs when the column selected as the primary index is not a good candidate. If the PI values are highly non-unique, rows are distributed unevenly across the AMPs and the skew factor rises. With perfectly even distribution the skew factor is zero; a skew factor greater than 25 is not a good sign.

25. What are Soft Referential Integrity and Batch Referential Integrity?

Soft Referential Integrity: Provides a mechanism to allow user-specified Referential Integrity (RI) constraints that are not enforced by the database. It enables optimization techniques such as Join Elimination.

Batch Referential Integrity: Tests an entire insert, delete, or update batch operation for referential integrity. If insertion, deletion, or update of any row in the batch violates referential integrity, then the parsing engine software rolls back the entire batch and returns an abort message.

26. Difference between MLOAD and FLOAD

MLOAD:
MultiLoad does the loading in five phases:
Phase 1 (Preliminary): It parses and checks the script and sets up the work and error tables.
Phase 2 (DML Transaction): The DML statements in the script are sent to the database.
Phase 3 (Acquisition): It reads the records from the import file and stores them in the work tables.
Phase 4 (Application): The target tables are locked and the DML operations are applied to them from the work tables.
Phase 5 (Cleanup): The table locks are released and the work tables are dropped.

MultiLoad allows non-unique secondary indexes and automatically rebuilds them after loading.

MultiLoad can load at most 5 tables at a time and can also update and delete data.
FastLoad: FastLoad loads the data in two phases and does not need a work table, so it is faster. It follows these steps to load the data into the table:
Phase 1: The records are sent in blocks to the AMPs, which hash each record and redistribute it to the appropriate AMP. The rows are not yet sorted.
Phase 2: After the END LOADING command is given, each AMP sorts its rows by row hash and writes them to the table.
FastLoad is used to load empty tables, is very fast, and can load one table at a time.

27. Advantages of PPI

PPI: Partitioned Primary Index. When the column on which a table is partitioned is also the primary index column, then:
• The more partitions there are, the faster the table can be scanned, and a row can still be located with the PI value itself.

28. Disadvantages of PPI

If no partition is declared for the rows being inserted, declaring the partitioned primary index brings no benefit. In that case it is better to use a secondary index on the partitioning column for better performance.

29. Teradata joins?

Join Processing

A join is the combination of two or more tables in the same FROM clause of a single SELECT statement. When writing a join, the key is to locate a column in both tables that is from a common domain. Like the correlated subquery, joins are normally based on an equal comparison between the join columns. The following is the original join syntax for a two-table join:

SELECT [<table-name>.]<column-name>
       [,<table-name>.<column-name> ]
FROM <table-name1> [ AS <alias-name1> ]
     ,<table-name2> [ AS <alias-name2> ]
[ WHERE [<table-name1>.]<column-name> = [<table-name2>.]<column-name> ]
The JOIN keyword is used in an SQL statement to query data from two or more tables, based on a relationship between certain columns in these tables.

Common Join Types in Teradata
1. Self Join
2. Inner Join
3. Outer Join

The three formats of an OUTER JOIN are:
Left_table LEFT OUTER JOIN Right_table (the left table is the outer table)
Left_table RIGHT OUTER JOIN Right_table (the right table is the outer table)
Left_table FULL OUTER JOIN Right_table (both are outer tables)
Self Join

A Self Join is simply a join that uses the same table more than once in a single join operation. The first requirement for this type of join is that the table must contain two different columns of the same domain. This may involve de-normalized tables. For instance, if the Employee table contained a column for the manager's employee number and the manager is an employee, these two columns have the same domain. By joining on these two columns in the Employee table, the managers can be joined to the employees.

Example:

SELECT Mgr.Last_name (Title 'Manager Name', format 'X(10)')
      ,Department_name (Title 'For Department ')
FROM Employee_table AS Emp
INNER JOIN Employee_table AS Mgr
  ON Emp.Manager_Emp_ID = Mgr.Employee_Number
INNER JOIN Department_table AS Dept
  ON Emp.Department_number = Dept.Department_number
ORDER BY 2 ;
INNER JOIN: The INNER JOIN keyword returns rows when there is at least one match in both tables.

INNER JOIN Syntax:
SELECT column_name(s)
FROM table_name1 INNER JOIN table_name2
ON table_name1.column_name=table_name2.column_name

LEFT OUTER JOIN: The LEFT OUTER JOIN keyword returns all rows from the left table (table_name1), even if there are no matches in the right table (table_name2).

LEFT OUTER JOIN Syntax:
SELECT column_name(s)
FROM table_name1 LEFT OUTER JOIN table_name2
ON table_name1.column_name=table_name2.column_name

RIGHT OUTER JOIN: The RIGHT OUTER JOIN keyword returns all rows from the right table (table_name2), even if there are no matches in the left table (table_name1).

RIGHT OUTER JOIN Syntax:
SELECT column_name(s)
FROM table_name1 RIGHT OUTER JOIN table_name2
ON table_name1.column_name=table_name2.column_name

FULL OUTER JOIN: The FULL OUTER JOIN keyword returns rows when there is a match in one of the tables.

FULL OUTER JOIN Syntax:
SELECT column_name(s)
FROM table_name1 FULL OUTER JOIN table_name2
ON table_name1.column_name=table_name2.column_name

A FULL OUTER JOIN uses both of the tables as outer tables. The exceptions are returned from both tables and the missing column values from either table are extended with NULL.
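
The difference between the join types above is easiest to see on a tiny data set. The following Python sketch is an illustration of the join semantics only, not of how Teradata executes joins; the `join` function, the sample rows and the column names are hypothetical, `None` plays the role of NULL, and the two inputs are assumed to share only the join column.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is inner, left, right or full."""
    left_cols = {col for row in left for col in row}
    right_cols = {col for row in right for col in row}
    result = []
    matched_right = set()
    for l_row in left:
        # As in SQL, a NULL join value never matches anything.
        hits = [
            (i, r_row)
            for i, r_row in enumerate(right)
            if l_row[key] is not None and r_row[key] == l_row[key]
        ]
        for i, r_row in hits:
            matched_right.add(i)
            result.append({**l_row, **r_row})
        if not hits and how in ("left", "full"):
            # Unmatched left row: right-hand columns are extended with NULL.
            result.append({**dict.fromkeys(right_cols), **l_row})
    if how in ("right", "full"):
        for i, r_row in enumerate(right):
            if i not in matched_right:
                # Unmatched right row: left-hand columns are extended with NULL.
                result.append({**dict.fromkeys(left_cols), **r_row})
    return result


# Hypothetical sample data.
EMPLOYEES = [
    {"dept_no": 10, "name": "Ann"},
    {"dept_no": 20, "name": "Bob"},
]
DEPARTMENTS = [
    {"dept_no": 10, "dept_name": "Sales"},
    {"dept_no": 30, "dept_name": "HR"},
]

if __name__ == "__main__":
    for how in ("inner", "left", "right", "full"):
        print(how, join(EMPLOYEES, DEPARTMENTS, "dept_no", how))
```

With this data the inner join keeps only the matching department 10, each one-sided outer join adds the unmatched row from its outer table, and the full outer join adds both.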
Product Join

It is very important to use an equality condition in the WHERE clause. Otherwise you get a product join. This means that one row of a table is joined to multiple rows of another table. A mathematical product means that multiplication is used.

30. Difference between primary index and secondary index?

1. A primary index cannot be created after table creation, whereas a secondary index can be created dynamically.
2. Access through a primary index is a 1-AMP operation, access through a unique secondary index is a 2-AMP operation, and access through a non-unique secondary index is an all-AMP operation.

31. What are Journals?

Journaling is a data protection mechanism in Teradata. Journals are generated to maintain before images and after images of a DML transaction, starting from a checkpoint. When a DML transaction fails, the table is restored to the last available checkpoint using the journal images. There are two types of journals: (1) the permanent journal and (2) the transient journal.

1. Permanent Journal

The purpose of the permanent journal is to provide selective or full database recovery to a specified point in time. It permits recovery from unexpected hardware or software disasters. The permanent journal also reduces the need for full table backups, which can be costly in both time and resources.

Permanent journals are explicitly created at database and/or table creation time. This journaling can be implemented depending on the need and the available disk space. Permanent journal processing is a user-selectable option on a database which allows the user to select extra journaling for changes made to a table. There are more options, and the data can be rolled forward or backward (provided the correct options were selected) to points of the customer's choosing. They are permanent because the changes are kept until the customer deletes them or unloads them to a backup tape. They are usually kept in conjunction with backups of the database and allow partial rollback or roll forward after data corruption or an operational error, such as someone deleting a month's worth of data because of a mistake in the WHERE clause.

2. Transient Journal

The transient journal permits the successful rollback of a failed transaction (TXN). Transactions are not committed to the database until the AMPs have received an End Transaction request, either implicitly or explicitly. There is always the possibility that the transaction may fail. If so, the participating table(s) must be restored to their pre-transaction state. The transient journal maintains a copy of the before images of all rows affected by the transaction. In the event of transaction failure, the before images are reapplied to the affected tables, then deleted from the journal, and the rollback operation is completed. In the event of transaction success, the before images for the transaction are discarded from the journal at the point of transaction commit. Transient journal activities are automatic and transparent to the user.
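
The transient journal behaviour described above, saving before images and either reapplying or discarding them, can be sketched with an in-memory table. This is a toy model under stated assumptions: `JournaledTable` and its method names are hypothetical, a Python dict stands in for the table, and a single open transaction is assumed.

```python
_MISSING = object()  # marks "row did not exist before the transaction"


class JournaledTable:
    """A dict-backed table that keeps before images of rows it changes."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})
        self._journal = []  # list of (key, before_image), oldest first

    def write(self, key, value):
        """Insert or update a row, saving its before image first."""
        self._journal.append((key, self.rows.get(key, _MISSING)))
        self.rows[key] = value

    def delete(self, key):
        """Delete a row, saving its before image first."""
        self._journal.append((key, self.rows.get(key, _MISSING)))
        self.rows.pop(key, None)

    def commit(self):
        """Transaction succeeded: discard the before images."""
        self._journal.clear()

    def rollback(self):
        """Transaction failed: reapply before images, newest first."""
        while self._journal:
            key, before = self._journal.pop()
            if before is _MISSING:
                self.rows.pop(key, None)
            else:
                self.rows[key] = before


if __name__ == "__main__":
    table = JournaledTable({1: "a"})
    table.write(1, "b")
    table.write(2, "c")
    table.delete(1)
    table.rollback()
    print(table.rows)  # back to the pre-transaction state
```

Reapplying the images in reverse order matters when one row is changed more than once in a transaction: the oldest before image is applied last, so the row ends up in its pre-transaction state.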
32. Teradata FastExport script?
.LOGTABLE RestartLog1_fxp;
.RUN FILE logon ;
.BEGIN EXPORT SESSIONS 4 ;
.LAYOUT Record_Layout ;
.FIELD in_City 1 CHAR(20) ;
.FIELD in_Zip * CHAR(5);
.IMPORT INFILE city_zip_infile LAYOUT Record_Layout ;
.EXPORT OUTFILE cust_acct_outfile2 ;
SELECT A.Account_Number
, C.Last_Name
, C.First_Name
, A.Balance_Current
FROM Accounts A
INNER JOIN Accounts_Customer AC
INNER JOIN Customer C
ON C.Customer_Number = AC.Customer_Number
ON A.Account_Number = AC.Account_Number
WHERE A.City = :in_City
AND A.Zip_Code = :in_Zip
ORDER BY 1 ;
.END EXPORT ;
.LOGOFF ;

33. Teradata statistics.
Statistics collection is essential for the optimal performance of the Teradata query optimizer. The query optimizer relies on statistics to help it determine the best way to access data. Statistics also help the optimizer ascertain how many rows exist in the tables being queried and predict how many rows will qualify for given conditions. Missing or out-of-date statistics might result in the optimizer choosing a less-than-optimal method for accessing data tables.
Points:
1: Once statistics have been collected on a table (on an index or column), where is this information stored so that the optimizer can refer to it?
Ans: Collected statistics are stored in DBC.TVFields or DBC.Indexes. However, you cannot query these two tables.
2: How often should statistics be collected for a table that is frequently updated?
Ans: You need to refresh statistics when 5 to 10% of the table's rows have changed. Collecting statistics can be quite resource-consuming for large tables, so it is always advisable to schedule the job in an off-peak period, normally after approximately 10% of the data has changed.
3: Once statistics have been collected on a table, how can I be sure that the optimizer considers them before execution? That is, will the optimizer refer to them until the next collection?
Ans: Yes, the optimizer uses the statistics for the query execution plan if they are available. That is why stale statistics are dangerous: they may mislead the optimizer.

What is a HOT AMP?
When the workload is not distributed across all the AMPs, only a few AMPs end up overburdened with the
work. This is a hot AMP condition. It typically occurs when the volume of data you are dealing with is high and
(a) you are trying to retrieve data from a Teradata table that is not well distributed across the AMPs on the system (bad primary index), or
(b) you are trying to join on a column with highly non-unique values, or
(c) you apply the DISTINCT operator to a column with highly non-unique values.

4: How can I know the tables for which statistics have been collected?
Ans: Run the HELP STATISTICS command on the table, e.g. HELP STATISTICS TABLE_NAME ; This gives you the date and time when statistics were last collected. You will also see the statistics for the columns of the table for which they were defined. You can use Teradata Manager too.
5: To what extent will there be performance issues when statistics are not collected? Can a performance issue be due to missing statistics alone? Probably a HOT AMP could be the reason for a lack of spool space that leads to performance degradation!
Ans: 1st part: Teradata uses a cost-based optimizer, and cost estimates are based on statistics. If you do not have statistics collected, the optimizer uses dynamic AMP sampling to estimate them. If your table is big and the data is unevenly distributed, dynamic sampling may not get the right information and your performance will suffer.
2nd part: No, performance problems can also be caused by a bad selection of indexes (most importantly the PI) and by the access path of a particular query.
6: What can lead to a lack of spool space apart from a HOT AMP?
Ans: One reason that comes to mind is that a product join on two big data sets may lead to a lack of spool space.

34. Where will you define error tables in the script?
In FastLoad, the error tables are named in the BEGIN LOADING statement; in MultiLoad, they are named in the .BEGIN MLOAD statement.

35. I have to load data daily. Which load utility will be good?
TPump.

36. What are the different spaces available in Teradata?
Perm space, temp space and spool space.
Perm Space: All databases have a defined upper limit of permanent space. Permanent space is used for storing the data rows of tables. Perm space is not pre-allocated; it represents a maximum limit.
Spool Space: All databases also have an upper limit of spool space. If there is no limit defined for a particular database or user, limits are inherited from the parents. Theoretically, a user could use all unallocated space in the system for a query. Spool space is temporary space used to hold intermediate query results or formatted answer sets to queries. Once the query is complete, the spool space is released.
Example: You have a database with total disk space of 100GB. You have 10GB of user data and an additional 10GB of overhead. What is the maximum amount of spool space available for queries?
Answer: 80GB (100GB - 10GB - 10GB). All of the remaining space in the system is available for spool.
Temp Space: The third type of space is temporary space. Temp space is used for global temporary tables, and their contents remain available to the user until the session is terminated. (Volatile tables are held in the user's spool space.) Tables created in temp space will survive a restart.

37. What different options can we specify in the CREATE TABLE statement?
There are two different table type philosophies, so there are two different types of table: SET and MULTISET. It has been said, "A man with one watch knows the time, but a man with two watches is
never sure". When Teradata was originally designed it did not allow duplicate rows in a table. If any row in the same table had the same values in every column Teradata would throw one of the rows out. They believed a second row was a mistake. Why would someone need two watches and why would someone need two rows exactly the same? This is SET theory and a SET table kicks out duplicate rows. The ANSI standard believed in a different philosophy. If two rows are entered into a table that are exact duplicates then this is acceptable. If a person wants to wear two watches then they probably have a good reason. This is a MULTISET table and duplicate rows are allowed. If you do not specify SET or MULTISET, one is used as a default. Here is the issue: the default in Teradata mode is SET and the default in ANSI mode is MULTISET. Therefore, to eliminate confusion it is important to explicitly define which one is desired. Otherwise, you must know in which mode the CREATE TABLE will execute in so that the correct type is used for each table. The implication of using a SET or MULTISET table is discussed further. SET and MULTISET Tables A SET table does not allow duplicate rows so Teradata checks to ensure that no two rows in a table are exactly the same. This can be a burden. One way around the duplicate row check is to have a column in the table defined as UNIQUE. This could be a Unique Primary Index (UPI), Unique Secondary Index (USI) or even a column with a UNIQUE or PRIMARY KEY constraint. Since all must be unique, a duplicate row may never exist. Therefore, the check on either the index or constraint eliminates the need for the row to be examined for uniqueness. As a result, inserting new rows can be much faster by eliminating the duplicate row check. However, if the table is defined with a NUPI and the table uses SET as the table type, now a duplicate row check must be performed. 
Since SET tables do not allow duplicate rows, a check must be performed every time a NUPI DUP (a duplicate of an existing row's NUPI value) is inserted or updated in the table. Do not be fooled! A duplicate row check can be a very expensive operation in terms of processing time, because every new row inserted must be compared with every existing row that has the same NUPI row hash value. The number of comparisons grows with each row added for the same NUPI value, so the total cost climbs steeply as the duplicates accumulate.
What is the solution? There are two: either make the table a MULTISET table (only if you want duplicate rows to be possible) or define at least one column or composite of columns as UNIQUE. If neither is an option, the SET table with no unique columns will work, but inserts and updates will take more time because of the mandatory duplicate row check.
Below is an example of creating a SET table:
CREATE SET TABLE TomC.employee
( emp INTEGER
,dept INTEGER
,lname CHAR(20)
,fname VARCHAR(20)
,salary DECIMAL(10,2)
,hire_date DATE )
UNIQUE PRIMARY INDEX(emp);
Notice the UNIQUE PRIMARY INDEX on the column emp. Because this is a SET table, it is much more efficient to have at least one unique key so that the duplicate row check is eliminated.
The following is an example of creating the same table as before, but this time as a MULTISET table:
CREATE MULTISET TABLE employee
( emp INTEGER
,dept INTEGER
,lname CHAR(20)
,fname VARCHAR(20)
,salary DECIMAL(10,2)
,hire_date DATE )
PRIMARY INDEX(emp);
Notice also that the PI is now a NUPI because it does not use the word UNIQUE. This is important! As mentioned previously, if a UPI is requested, no duplicate rows can be inserted, so the table acts more like a SET table. This MULTISET example allows duplicate rows, and because a MULTISET table performs no duplicate row check, inserts avoid that overhead.

38. What is a macro? Advantages of it.
A macro is a predefined, stored set of one or more SQL commands and report-formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Macros do not require permanent space.

39. What are the functions of AMPs in Teradata?
Each AMP is designed to hold a portion of the rows of each table. An AMP is responsible for the storage, maintenance and retrieval of the data under its control. Teradata uses hash partitioning to randomly and evenly distribute data across all AMPs for balanced performance.

40. How does Teradata store rows?
• Teradata uses hash partitioning and distribution to randomly and evenly distribute data across all AMPs.
• The rows of every table are distributed among all AMPs, and ideally will be evenly distributed among all AMPs.
• Each AMP is responsible for a subset of the rows of each table.
• Evenly distributed tables result in evenly distributed workloads.

Fallback & Down AMP Recovery Journal!!!
Hi,
When a fallback-protected AMP goes down during a write operation, the update takes place in the fallback AMP in the same cluster, and the original AMP is updated later when it recovers. When an AMP goes down, the updates are also recorded in the Down AMP Recovery Journal so that they can be applied when the AMP recovers. My doubt is: when an AMP goes down, are the updates made in both the fallback AMP and the Down AMP Recovery Journal?
Because if yes, it looks like a redundant recovery measure. Or is the Down AMP Recovery Journal used only for non-fallback-protected AMPs, or for fallback-protected AMPs when both AMPs in the cluster are down?
Regards,
Annal T

Hi Annal,
According to my knowledge:
1. The Down AMP Recovery Journal starts when an AMP goes down, so that the data for the down AMP can be restored.
2. Fallback means there is redundant data: if one AMP in the cluster goes down, your queries are not affected, because they use the fallback rows. The down AMP is not updated; the data from the fallback rows is used instead.
For your doubt: if you run an update while the AMP is down, the fallback rows are updated. If you run a query while the AMP is still down, the query uses the updated fallback rows. When the down AMP becomes active again, it uses the Down AMP Recovery Journal and its data is brought up to date.
Hope this helps.
Regards,
Syam Prasad K
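For reference, fallback protection is requested per table in the DDL. The sketch below assumes a hypothetical table named employee_fb; only the FALLBACK and NO FALLBACK options are the point of the example.

```sql
-- Hypothetical table. FALLBACK asks Teradata to keep a second copy of
-- each row on a different AMP in the same cluster.
CREATE TABLE employee_fb ,FALLBACK
( emp   INTEGER
 ,dept  INTEGER
 ,lname CHAR(20)
)
UNIQUE PRIMARY INDEX (emp);

-- Fallback protection can be removed or added back later.
ALTER TABLE employee_fb ,NO FALLBACK;
ALTER TABLE employee_fb ,FALLBACK;
```

A table created without FALLBACK has no second copy of its rows, so the behaviour described in the reply above applies only to fallback-protected tables.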
41. Which one will take care when an AMP goes down?
1. The Down AMP Recovery Journal starts when an AMP goes down, so that the data for the down AMP can be restored.
2. Fallback provides redundant data: if one AMP in the cluster goes down, queries are not affected, because they use the fallback rows. The down AMP is not updated; the data from the fallback rows is used instead.
If an update runs while the AMP is down, the fallback rows are updated, and queries that run while the AMP is still down use those updated rows. When the down AMP becomes active again, it uses the Down AMP Recovery Journal and its data is brought up to date.

42. Which one will take care when a node goes down?
In the event of node failure, all virtual processors can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays.

43. What is the use of the EXPLAIN plan?
The EXPLAIN facility allows you to preview how Teradata will execute a requested query. It returns a summary of the steps the Teradata RDBMS would perform to execute the request. EXPLAIN also discloses the strategy and access method to be used, how many rows will be involved, and the estimated cost in minutes and seconds. Use EXPLAIN to evaluate the performance of a query and to develop an alternative processing strategy that may be more efficient. EXPLAIN works on any SQL request. The request is fully parsed and optimized, but not run. The complete plan is returned to the user in readable English statements. EXPLAIN provides information about locking, sorting, row selection criteria, join strategy and conditions, access method, and parallel step processing. EXPLAIN is useful for performance tuning, debugging, pre-validation of requests, and technical training.

44. Use of the COALESCE function?
The ANSI standard COALESCE function can convert a NULL to a zero, and it can convert a NULL to any other data value as well.
COALESCE searches a value list, ranging from one to many values, and returns the first non-NULL value it finds. It returns NULL if all values in the list are NULL.
To use COALESCE, the SQL must pass the name of a column to the function. The data in the column is then tested for NULL. Although one column name is all that is required, more than one column is normally passed to it. Additionally, a literal value, which is never NULL, can be supplied to provide a default value if all of the preceding column values are NULL.
The syntax for COALESCE follows (lowercase words are placeholders):
SELECT COALESCE ( column-list [, literal ] )
,aggregate ( COALESCE ( column-list [, literal ] ) )
FROM table-name
GROUP BY 1 ;
In the above syntax, column-list is a list of columns. It is written as a series of column names separated by commas.
SELECT COALESCE(NULL,0) AS Col1
,COALESCE(NULL,NULL,NULL) AS Col2
,COALESCE(3) AS Col3
,COALESCE('A',3) AS Col4 ;

45. Difference between role, privilege and profile?
A role can be assigned a collection of access rights in the same way a user can.
You then grant the role to a set of users, rather than granting each user the same rights. This cuts down on maintenance, adds standardisation (hence reducing erroneous access to sensitive data) and reduces the size of the DBC.AllRights table, which is very important in reducing DBC blocking in a large environment.
Profiles assign different characteristics to a user, such as spool space, temporary space and account strings. Again, this helps with standardisation. Note that spool assigned to a profile overrules spool assigned in a CREATE USER statement. Check the online manuals for the full list of properties.
Data Control Language is used to restrict or permit a user's access. It can selectively limit a user's ability to retrieve, add, or modify data. It is used to grant and revoke access privileges on tables and views.

46. Difference between database and user?
Both may own objects such as tables, views, macros, procedures, and functions. Both users and databases may hold privileges. However, only users may log on, establish a session with the Teradata Database, and submit requests.
A user performs actions, whereas a database is passive. Users have passwords and startup strings; databases do not.
Creator privileges are associated only with a user, because only a user can log on and submit a CREATE statement. Implicit privileges are associated with either a database or a user, because each can hold an object and an object is owned by the named space in which it resides.

47. How many MultiLoad scripts are required for the scenario below?
First I want to load data from the source into a volatile table. After that I want to load data from the volatile table into a permanent table.

48. What are the types of CASE statements available in Teradata?
The CASE function provides an additional level of data testing after a row is accepted by the WHERE clause.
The additional test allows for multiple comparisons on multiple columns with multiple outcomes. It also incorporates logic to handle a situation in which none of the values compares equal. When using CASE, each row retrieved is evaluated once by every CASE function. Therefore, if two CASE operations are in the same SQL statement, each row has a column checked twice, or two different values each checked one time.
The basic syntax of the CASE follows (lowercase words are placeholders):
CASE column-name
WHEN value1 THEN result1
WHEN value2 THEN result2
WHEN valueN THEN resultN
[ ELSE default-result ]
END
Types:
1. Flexible Comparisons within CASE
When it is necessary to compare more than just equal conditions within the CASE, the format is modified slightly to handle the comparison. Many people prefer the following format because it is more flexible and can compare inequalities as well as equalities:
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
[ ELSE default-result ]
END
The above syntax shows that multiple tests can be made within each CASE. The value stored in the column continues to be tested until a true condition is found. At that point, the CASE does the THEN portion and exits the CASE logic by going directly to the END.
2. Comparison Operators within CASE
In this section, we will investigate adding more power to the CASE statement. In the above examples, a literal value was returned. In most cases, it is necessary to return data. The returned value can come from a column name, just like any selected column, or from a mathematical operation.
Additionally, the above examples used a literal ‘=’ as the comparison operator. The CASE comparisons also allow the use of IN, BETWEEN, NULLIF and COALESCE. In reality, BETWEEN is a compound comparison: it checks for values that are greater than or equal to the first number and less than or equal to the second number.
The next example uses both formats of the CASE in a single SELECT, with each one producing a column display. It also uses AS to establish an alias after the END:
SELECT CASE
WHEN Grade_pt IS NULL THEN 'Grade Point Unknown'
WHEN Grade_pt IN (1,2,3) THEN 'Integer GPA'
WHEN Grade_pt BETWEEN 1 AND 2 THEN 'Low Decimal value'
WHEN Grade_pt < 3.99 THEN 'High Decimal value'
ELSE '4.0 GPA'
END AS Grade_Point_Average
,CASE Class_code
WHEN 'FR' THEN 'Freshman'
WHEN 'SO' THEN 'Sophomore'
WHEN 'JR' THEN 'Junior'
WHEN 'SR' THEN 'Senior'
ELSE 'Unknown Class'
END AS Class_Description
FROM Student_table
ORDER BY Class_code ;
3. CASE for Horizontal Reporting
Another interesting usage for the CASE is to perform horizontal reporting. Normally, SQL does vertical reporting. This means that every row returned is shown on the next output line of the report as a separate line. Horizontal reporting shows the output of all information requested on one line as columns instead of vertically as rows.
Previously, we discussed aggregation. It eliminates detail data and outputs only one line, or one line per unique value in the non-aggregate column(s) when utilizing the GROUP BY. That is how vertical reporting works, one output line below the previous. Horizontal reporting shows the next value on the same line as the next column, instead of on the next line.
Using the next SELECT statement, we achieve the same information in a horizontal reporting format by making each value a column:
SELECT AVG(CASE Class_code
WHEN 'FR' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Freshman_GPA
,AVG(CASE Class_code
WHEN 'SO' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Sophomore_GPA
,AVG(CASE Class_code
WHEN 'JR' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Junior_GPA
,AVG(CASE Class_code
WHEN 'SR' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Senior_GPA
FROM Student_Table
WHERE Class_code IS NOT NULL ;

4. Nested CASE Expressions
After becoming comfortable with the previous examples of the CASE, it may become apparent that a single check on a column is not sufficient for more complicated requests. When that is the situation, one CASE can be imbedded within another. This is called nested CASE statements.
The CASE may be nested to check data in a second column in a second CASE before determining what value to return. It is common to have more than one CASE in a single SQL statement. However, it is powerful enough to have a CASE statement within a CASE statement.
Example:
SELECT Last_name
,CASE Class_code
WHEN 'JR' THEN 'Junior ' ||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding'
END)
ELSE 'Senior ' ||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding'
END)
END AS Current_Status
FROM Student_Table
WHERE Class_code IN ('JR','SR')
ORDER BY class_code, last_name;
49. How will you ALTER a table in Teradata?
1. ALTER TABLE (add):
ALTER TABLE employee
ADD Street VARCHAR(30)
,ADD City VARCHAR(20);
2. ALTER TABLE (drop):
ALTER TABLE employee
DROP Phone
,DROP Pref;
3. ALTER TABLE (rename):
ALTER TABLE employee
RENAME Street TO StreetAddr;

50. Mention the order of SQL execution.
SELECT - WHERE - GROUP BY - HAVING - ORDER BY clause. Logically, rows are first filtered by WHERE, then grouped by GROUP BY, the groups are filtered by HAVING, and the result is finally sorted by ORDER BY.

51. What is the SQL to find the base AMP and the number of records stored for a particular table?

52. When a PI is not mentioned on a table, how will Teradata consider the PI for that table?
If you do not specify a PI at table creation time, Teradata must choose one. For instance, if the DDL is ported from another database that uses a primary key instead of a primary index, the CREATE TABLE contains a PRIMARY KEY (PK) constraint. Teradata is smart enough to know that primary keys must be unique and cannot be null. So, the first level of default is to use the PRIMARY KEY column(s) as a UPI.
If the DDL defines no PRIMARY KEY, Teradata looks for a column defined as UNIQUE. As a second level of default, Teradata uses the first column defined with a UNIQUE constraint as a UPI.
If none of the above attributes are found, Teradata uses the first column defined in the table as a non-unique primary index (NUPI).

53. What is a covered query in Teradata?
If all the columns referenced by a SELECT query are defined in a join index, the query can be answered from the index alone; such a query is called a covered query. The columns of a multi-column NUSI can cover a query in the same way.

54. What is NUSI bit mapping?

55. What are data demographics?
Data demographics give us information about the values stored in a column and about frequently updated columns. The data demographics are:
• maximum rows per value
• typical rows per value
• distinct values

56. Difference between logical and physical data modeling?
Logical Versus Physical Database Modeling
After all business requirements have been gathered for a proposed database, they must be modeled. Models are created to visually represent the proposed database so that business requirements can easily be associated with database objects, to ensure that all requirements have been completely and accurately gathered.
Different types of diagrams are typically produced to illustrate the business processes, rules, entities, and organizational units that have been identified. These diagrams often include entity relationship diagrams, process flow diagrams, and server model diagrams. An entity relationship diagram (ERD) represents the entities, or groups of information, and their relationships maintained for a business. Process flow diagrams represent business processes and the flow of data between different processes and entities that have been defined. Server model diagrams represent a detailed picture of the database as being transformed from the business model into a relational database with tables, columns, and constraints. Basically, data modeling serves as a link between business needs and system requirements.
Two types of data modeling are as follows:
• Logical modeling
• Physical modeling
If you are going to be working with databases, then it is important to understand the difference between logical and physical modeling, and how they relate to one another. Logical and physical modeling are
described in more detail in the following subsections.

Logical Modeling
Logical modeling deals with gathering business requirements and converting those requirements into a model. The logical model revolves around the needs of the business, not the database, although the needs of the business are used to establish the needs of the database. Logical modeling involves gathering information about business processes, business entities (categories of data), and organizational units. After this information is gathered, diagrams and reports are produced, including entity relationship diagrams, business process diagrams, and eventually process flow diagrams. The diagrams produced should show the processes and data that exist, as well as the relationships between business processes and data. Logical modeling should accurately render a visual representation of the activities and data relevant to a particular business.
The diagrams and documentation generated during logical modeling are used to determine whether the requirements of the business have been completely gathered. Management, developers, and end users alike review these diagrams and documentation to determine whether more work is required before physical modeling commences.
Typical deliverables of logical modeling include:
• Entity relationship diagrams
An entity relationship diagram is also referred to as an analysis ERD. The point of the initial ERD is to provide the development team with a picture of the different categories of data for the business, as well as how these categories of data are related to one another.
• Business process diagrams
The process model illustrates all the parent and child processes that are performed by individuals within a company. The process model gives the development team an idea of how data moves within the organization. Because process models illustrate the activities of individuals in the company, the process model can be used to determine how a database application interface is designed.
• User feedback documentation
Physical Modeling
Physical modeling involves the actual design of a database according to the requirements that were established during logical modeling. Logical modeling mainly involves gathering the requirements of the business, with the latter part of logical modeling directed toward the goals and requirements of the database. Physical modeling deals with the conversion of the logical, or business, model into a relational database model. When physical modeling occurs, objects are being defined at the schema level. A schema is a group of related objects in a database. A database design effort is normally associated with one schema.
During physical modeling, objects such as tables and columns are created based on entities and attributes that were defined during logical modeling. Constraints are also defined, including primary keys, foreign keys, other unique keys, and check constraints. Views can be created from database tables to summarize data or simply to provide the user with another perspective on certain data. Other objects such as indexes and snapshots can also be defined during physical modeling. Physical modeling is when all the pieces come together to complete the process of defining a database for a business.
Physical modeling is database software specific, meaning that the objects defined during physical modeling can vary depending on the relational database software being used. For example, most relational database systems vary in the way data types are represented and the way data is stored, although basic data types are conceptually the same among different implementations. Additionally, some database systems have objects that are not available in other database systems.

57. What is a derived table?
Derived tables are always local to a single SQL request. They are built dynamically using an additional SELECT within the query. The rows of the derived table are stored in spool and discarded as soon as the query finishes. The data dictionary (DD) has no knowledge of derived tables, so no extra privileges are necessary. The space comes from the user's spool space.
Following is a simple example using a derived table named DT with a column alias called avgsal, whose data value is obtained using the AVG aggregation:
SELECT *
FROM (SELECT AVG(salary) FROM Employee_table) DT(avgsal) ;

58. What is the use of WITH CHECK OPTION in Teradata?
In Teradata, the additional key phrase WITH CHECK OPTION indicates that the WHERE clause conditions of the view should be applied during the execution of an INSERT or UPDATE against the view. This is not a concern if views are not used for maintenance activity due to restricted privileges.

59. What is soft referential integrity and batch referential integrity?
Soft RI is just an indication that there is a PK-FK relation between the columns; it is not enforced on the Teradata side. But having it helps in cases like join processing.
Batch:
- Tests an entire insert, delete, or update batch operation for referential integrity.
- If the insertion, deletion, or update of any row in the batch violates referential integrity, the parsing engine software rolls back the entire batch and returns an abort message.

Let's say that I had a table called X with some number of rows and I wanted to insert these rows into table Y (insert into Y select * from X). However, some of the rows violated an RI constraint that table Y had. From reading the manuals, it seemed to me that with standard RI all of the valid rows would be inserted but the invalid ones would not. With batch RI (which is "all or nothing") I would expect nothing to get inserted, since it would check for problem rows up front and return an error right away. If in fact there is no difference except in how Teradata processes things internally (i.e. where it checks for invalid rows), then why would you want to use one over the other? Wouldn't you always want to use batch, since it does the checking up front and saves processing time?
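To make the three kinds of RI concrete, here is a hedged sketch of how each is declared on a foreign key. It uses the employee and department tables from the soft RI example further down; the three statements are alternatives, so only one of them would be applied to a given column.

```sql
-- Standard ("hard") RI: each row is checked as it is processed.
ALTER TABLE employee
ADD FOREIGN KEY (department_number)
REFERENCES department (department_number);

-- Batch RI: the whole request is tested, and rolled back if any row fails.
ALTER TABLE employee
ADD FOREIGN KEY (department_number)
REFERENCES WITH CHECK OPTION department (department_number);

-- Soft RI: nothing is enforced; the optimizer simply trusts the relationship.
ALTER TABLE employee
ADD FOREIGN KEY (department_number)
REFERENCES WITH NO CHECK OPTION department (department_number);
```

The only difference in the DDL is the WITH CHECK OPTION or WITH NO CHECK OPTION phrase after REFERENCES.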
Points:
Let's suppose that we have 3 dimension tables and 1 fact table (as in the example above), and that a join index (or aggregate join index, AJI) is based on the 3 dimensions and the facts, with all tables inner joined.
1. With or without referential integrity: if you submit a query that joins dim1, dim2, dim3 and facts, the index can be used.
2. With referential integrity: if you submit a query that joins only dim1 and facts, the index can be used, because the optimizer knows that the facts rows reference rows in the other dimensions (so it knows that the inner joins will not throw away those records).
3. Without referential integrity: if you submit a query that joins only dim1 and facts, the index cannot be used, because the optimizer does not know whether the facts rows reference rows in the other dimensions, nor whether the relationship is one-to-many, many-to-one or anything else.
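The three points can be sketched as follows. Every table and column name here (facts, dim1 to dim3, fact_id, amount and so on) is hypothetical and chosen only to mirror the scenario; whether the optimizer actually picks the index should be confirmed with EXPLAIN on the target system.

```sql
-- Hypothetical star schema: facts references dim1, dim2 and dim3.
CREATE JOIN INDEX facts_dims_ji AS
SELECT f.fact_id
      ,f.amount
      ,d1.dim1_id
      ,d1.dim1_name
      ,d2.dim2_id
      ,d2.dim2_name
      ,d3.dim3_id
      ,d3.dim3_name
FROM facts f
INNER JOIN dim1 d1 ON f.dim1_id = d1.dim1_id
INNER JOIN dim2 d2 ON f.dim2_id = d2.dim2_id
INNER JOIN dim3 d3 ON f.dim3_id = d3.dim3_id
PRIMARY INDEX (fact_id);

-- Point 1: all four tables are joined, matching the index definition.
EXPLAIN
SELECT d1.dim1_name, d2.dim2_name, d3.dim3_name, SUM(f.amount)
FROM facts f
INNER JOIN dim1 d1 ON f.dim1_id = d1.dim1_id
INNER JOIN dim2 d2 ON f.dim2_id = d2.dim2_id
INNER JOIN dim3 d3 ON f.dim3_id = d3.dim3_id
GROUP BY 1, 2, 3;

-- Points 2 and 3: only dim1 and facts are joined. Whether the index is
-- usable depends on RI being declared between facts and dim2/dim3.
EXPLAIN
SELECT d1.dim1_name, SUM(f.amount)
FROM facts f
INNER JOIN dim1 d1 ON f.dim1_id = d1.dim1_id
GROUP BY 1;
```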
"Hard" referential integrity is the "normal" referential integrity that enforces any RI constraints and ensures that any data loaded into the tables meets the RI rules. You should keep in mind that neither Multiload or Fastload allow the target table to have foreign key references. Tpump does allow this. "Soft" referential integrity is a feature that is more about accessing the data than about loading it. Soft referential integrity does not enforce any RI constraints. However, when you specify soft RI, you are telling the optimizer that the foreign key references do exist. Therefore, it is your job to make sure that is true.
Soft Referential Integrity (Soft RI) is a mechanism by which you can tell the optimizer that even though no formal RI constraints have been placed on the table(s), the data in the tables conform to the requirements of RI enforced tables.
This means that the user has ensured the following:
• The PK of the parent table has unique, not null values.
• The FK of the child table contains only values which are contained in the PK column of the parent table.
Soft RI:
• Does not create or maintain reference indexes
• Does not validate referencing constraints
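Why these guarantees matter can be sketched in plain Python (a toy model, not Teradata code). The table contents and the two function names are made up for illustration; the second function mimics a plan that skips the parent table and only filters out NULL foreign keys, which is the kind of shortcut soft RI permits.

```python
# Toy model: tables are lists of dicts. All data here is hypothetical.
departments = [{"department_number": 100}, {"department_number": 201}]
employees = [
    {"employee_number": 1, "department_number": 100},
    {"employee_number": 2, "department_number": 201},
    {"employee_number": 3, "department_number": None},
]

def inner_join(emps, depts):
    """Real join: keep employees whose department exists in the parent table."""
    valid = {d["department_number"] for d in depts}
    return sorted(e["employee_number"] for e in emps
                  if e["department_number"] in valid)

def join_eliminated(emps):
    """Shortcut plan: never read the parent table, only drop NULL foreign keys."""
    return sorted(e["employee_number"] for e in emps
                  if e["department_number"] is not None)

# Data obeys RI: both plans return the same employees.
assert inner_join(employees, departments) == join_eliminated(employees)

# Data violates RI (department 999 has no parent row): the plans disagree.
employees_bad = employees + [{"employee_number": 4, "department_number": 999}]
print(inner_join(employees_bad, departments))  # [1, 2]
print(join_eliminated(employees_bad))          # [1, 2, 4]
```

The shortcut is only safe while every non-null foreign key has a parent row, which is exactly what the user promises when declaring soft RI.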
By allowing the optimizer to assume that RI constraints are implicitly in force (even though no formal RI is assigned to the table), you enable the optimizer to eliminate join steps in queries such as the one seen previously.
Implementing Soft RI
Soft RI is implemented using slightly different syntax than standard RI. The REFERENCES clause for the column definition will add the key words 'WITH NO CHECK OPTION'.
Examples
Create the employee table with a soft RI reference to the department table.
CREATE TABLE employee
    ( employee_number INTEGER NOT NULL,
      manager_employee_number INTEGER,
      department_number INTEGER,
      job_code INTEGER,
      last_name CHAR(20) NOT NULL,
      first_name VARCHAR(30) NOT NULL,
      hire_date DATE NOT NULL,
      birthdate DATE NOT NULL,
      salary_amount DECIMAL(10,2) NOT NULL,
      FOREIGN KEY ( department_number ) REFERENCES WITH NO CHECK OPTION department( department_number ))
UNIQUE PRIMARY INDEX (employee_number);
The parent table must be created with a unique, not null referenced column. Either of the examples below may be used.
CREATE TABLE department
    ( department_number INTEGER NOT NULL CONSTRAINT primary_1 PRIMARY KEY
     ,department_name CHAR(30) UPPERCASE NOT NULL UNIQUE
     ,budget_amount DECIMAL(10,2)
     ,manager_employee_number INTEGER);
CREATE TABLE department
    ( department_number INTEGER NOT NULL
     ,department_name CHAR(30) UPPERCASE NOT NULL UNIQUE
     ,budget_amount DECIMAL(10,2)
     ,manager_employee_number INTEGER)
UNIQUE PRIMARY INDEX (department_number);
Executing the same query as before, notice the join elimination step takes place just as it did when standard RI was enforced.
Find all employees in valid departments.
EXPLAIN
SELECT employee_number, department_number
FROM employee e, department d
WHERE e.department_number = d.department_number
ORDER BY 2,1;
An EXPLAIN of this query produces the following partial result:
3) We do an all-AMPs RETRIEVE step from SQL00.e by way of an all-rows scan with a condition of ("NOT (SQL00.e.department_number IS NULL)") into Spool 1 (group_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated with no confidence to be 29 rows. The estimated time for this step is 0.06 seconds.
Again, the department table does not need to participate in the join for the same reason as seen in the previous example.
Soft RI Caution!
Note that the responsibility for this query to produce accurate results lies with the user. If the table data violates the rules of RI, then the join elimination step can have consequences for the accuracy of the results. It is assumed that the validation of the data for referential integrity takes place external to Teradata, or is enforced on Teradata through other application methods.
CHECKSUM:
Problems in a disk drive or disk array can corrupt data. This kind of corruption is not easy to find, but queries against the corrupted data return wrong answers. The corruption can be found by means of the SCANDISK and CheckTable utilities. These errors, called disk I/O errors, reduce the availability of the data warehouse. To guard against them, Teradata provides the disk I/O integrity check. CHECKSUM applies this check at table level: it is a protection technique that lets you select among various levels of corruption checking. The feature detects and logs disk I/O errors. Teradata provides predefined data integrity check levels (DEFAULT, NONE, LOW, MEDIUM, HIGH, ALL). The checksum can be enabled at table level with the CREATE TABLE DDL; at system level, use the DBS Control utility to set the parameter. For a more hands-on check, use the SCANDISK and CheckTable utilities; run CheckTable at level 3 so that it diagnoses entire rows, byte by byte.
60. What is an identity column?
A table in Teradata V2R5.1 with one column (INTEGER data type) that is defined as an identity column. Here's the DDL:
CREATE SET TABLE test_table, NO FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT
    (
    PRIM_REGION_ID INTEGER GENERATED ALWAYS AS IDENTITY
        (START WITH 1 INCREMENT BY 1 MINVALUE -2147483647 MAXVALUE 2147483647 NO CYCLE),
    PRIM_REGION_CD CHAR(6) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL)
PRIMARY INDEX ( PRIM_REGION_ID );
Teradata has had a concept of identity columns on its tables beginning around V2R6.x. These columns differ from Oracle's sequence concept in that the number assigned is not guaranteed to be sequential. The identity column in Teradata is simply used to guarantee row uniqueness. Example:
CREATE MULTISET TABLE MyTable
    ( ColA INTEGER GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 20),
      ColB VARCHAR(20) NOT NULL )
UNIQUE PRIMARY INDEX pidx (ColA);
Granted, ColA may not be the best primary index for data access or joins with other tables in the data model. It just shows that you could use it as the PI on the table.
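One way to picture why identity values are unique but not sequential is a counter that is handed out in blocks to parallel workers. The Python sketch below is a simplified, hypothetical model: the class name, the batch size of 5, and the two workers are assumptions for illustration, not Teradata internals.

```python
class IdentityAllocator:
    """Hands out blocks of identity values (simplified, hypothetical model)."""

    def __init__(self, start=1, increment=1, batch_size=5):
        self.next_start = start
        self.increment = increment
        self.batch_size = batch_size

    def reserve(self):
        """Reserve the next block and return an iterator over its values."""
        first = self.next_start
        self.next_start += self.increment * self.batch_size
        return iter(range(first, self.next_start, self.increment))

allocator = IdentityAllocator(batch_size=5)

# Two hypothetical parallel workers each reserve their own block.
worker_a, worker_b = allocator.reserve(), allocator.reserve()

# Rows arrive alternately on the two workers.
assigned = [next(worker_a), next(worker_b), next(worker_a), next(worker_b)]
print(assigned)                             # [1, 6, 2, 7]
print(len(set(assigned)) == len(assigned))  # True: unique, but with gaps
```

No two rows share a value, yet the order of assignment does not follow insertion order, so the column is a poor substitute for a gap-free sequence.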
61. How to implement UPSERT logic in Teradata using SQL?
Teradata provides the MERGE INTO statement, which works as UPSERT logic.
Example:
MERGE INTO dept_table1 AS Target
USING (SELECT dept_no, dept_name, budget FROM dept_table WHERE dept_no = 20) Source
ON (Target.dept_no = 20)
WHEN MATCHED THEN UPDATE SET dept_name = 'Being Renamed'
WHEN NOT MATCHED THEN INSERT (dept_no, dept_name, budget)
    VALUES (Source.dept_no, Source.dept_name, Source.budget);
UPSERT (UPDATE ... ELSE INSERT form):
UPDATE T1 FROM T2 b SET last_dt = b.last_dt WHERE T1.msisdn = b.msisdn
ELSE INSERT INTO tp_tmp.sa_telenor_mshare_backup VALUES (T2.msisdn, T2.oper_cd, T2.outg, T2.incom, T2.fst_dt, T2.last_dt, T2.second_last_dt, T2.third_last_dt)
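The update-else-insert behaviour of both forms can be sketched in a few lines of Python. This is only a model of the semantics: the table is a dictionary keyed by dept_no, and the function name and sample rows are hypothetical.

```python
def upsert(table, key, values):
    """Update the row stored under `key` if it exists, otherwise insert it.

    Returns the name of the branch that ran."""
    if key in table:
        table[key].update(values)   # WHEN MATCHED THEN UPDATE
        return "updated"
    table[key] = dict(values)       # WHEN NOT MATCHED THEN INSERT
    return "inserted"

# Hypothetical target table keyed by dept_no.
dept_table1 = {10: {"dept_name": "Sales", "budget": 500}}

print(upsert(dept_table1, 20, {"dept_name": "Being Renamed", "budget": 100}))  # inserted
print(upsert(dept_table1, 20, {"dept_name": "Renamed Again"}))                 # updated
print(dept_table1[20])  # {'dept_name': 'Renamed Again', 'budget': 100}
```

The second call touches only the columns it names, which mirrors an UPDATE that sets dept_name and leaves budget alone.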
62. What is SAMPLEID in Teradata?
Multiple sample sets may be generated in a single query if desired. To identify the specific set, a tag called the SAMPLEID is made available for association with each set. The SAMPLEID may be selected, used for ordering, or used as a column in a new table. Since SAMPLEID is a column, it can be used as the sort key.
Get three samples from the department table, one with 25% of the rows, another with 25% and a third with 50%.
SELECT department_number, sampleid
FROM department
SAMPLE .25, .25, .50
ORDER BY sampleid;
Result
department_number  SampleId
-----------------  --------
301                1
403                1
402                2
201                2
100                3
501                3
302                3
401                3
600                3
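The idea of several non-overlapping samples, each tagged with its own id, can be sketched in Python as follows. The function name is hypothetical, and cutting the shuffled rows at truncated cumulative fractions is an assumption chosen for illustration; it is not a statement of how Teradata rounds sample sizes.

```python
import random

def sample_with_ids(rows, fractions, seed=None):
    """Shuffle the rows, then cut them into consecutive samples tagged 1, 2, 3, ..."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n = len(shuffled)
    result, start, cumulative = [], 0, 0.0
    for sample_id, fraction in enumerate(fractions, start=1):
        cumulative += fraction
        end = min(n, int(n * cumulative))   # assumed rounding rule
        result.extend((row, sample_id) for row in shuffled[start:end])
        start = end
    return result

departments = [100, 201, 301, 302, 401, 402, 403, 501, 600]
tagged = sample_with_ids(departments, [0.25, 0.25, 0.50], seed=1)

# Equivalent of ORDER BY sampleid.
for department_number, sample_id in sorted(tagged, key=lambda pair: pair[1]):
    print(department_number, sample_id)
```

With nine rows this rule yields samples of 2, 2 and 5 rows, and no row appears in more than one sample.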
63. What other options are available with the SAMPLE function in Teradata?
The SAMPLE function is used to retrieve a random sample of rows from a table.
Example 1:
SELECT * FROM emp SAMPLE 10;
Example 2:
SELECT * FROM tab
SAMPLE WHEN prod_code = 'AS' THEN 10
       WHEN prod_code = 'CM' THEN 10
       WHEN prod_code = 'DQ' THEN 10
END;
Sample Function (forum thread)
Question (Sam): I have an order table which has order details along with a product code: "AS", "BU", "CM", "DQ", "ER", "FN". I want to select a random 10 records for each of the product codes "AS", "CM" and "DQ". Can I use the Teradata SAMPLE feature to achieve this? If yes, how can that be done in a single query, such that I get 30 records, 10 each for the above 3 product codes? Is there a better way to get the above results?
Answer (Dieter, 08 Jan 2007): use the SAMPLE WHEN ... THEN ... END form shown in example 2.
RANDOM Function
The RANDOM function may be used to generate a random number within a specified range. RANDOM (Lower limit, Upper limit) returns a random number between the lower and upper limits inclusive. Both limits must be specified, otherwise a random number between 0 and approximately 4 billion is generated.
Consider the department table, which consists of nine rows.
SELECT department_number FROM department;
department_number
-----------------
501
301
201
600
100
402
403
302
401
Example
Assign a random number between 1 and 9 to each department.
SELECT department_number, RANDOM(1,9) FROM department;
department_number  Random(1,9)
-----------------  -----------
501                2
301                6
201                3
600                7
100                3
402                2
403                1
302                5
401                1
Note: it is possible for random numbers to repeat. The RANDOM function is activated for each row processed, thus duplicate random values are possible.
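The per-row behaviour is easy to reproduce in Python, where random.randint is likewise inclusive at both ends. The fixed seed below is only there to make the sketch repeatable and is not part of the original example.

```python
import random

departments = [501, 301, 201, 600, 100, 402, 403, 302, 401]
rng = random.Random(42)  # fixed seed, an assumption for repeatability

# One independent draw per row, like RANDOM(1,9) in the select list.
rows = [(dept, rng.randint(1, 9)) for dept in departments]
for dept, value in rows:
    print(dept, value)

# Independent draws can collide, so duplicates are possible.
values = [value for _, value in rows]
print("duplicates present:", len(set(values)) < len(values))
```

Because each row gets its own draw, nothing prevents two departments from receiving the same number.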
64. What are the considerations for choosing a Primary Index?
The Primary Index determines which AMP stores an individual row of a table. The PI data is converted into the Row Hash using a mathematical hashing formula. The result is used as an offset into the Hash Map to determine the AMP number. Since the PI value determines how the data rows are distributed among the AMPs, requesting a row using the PI value is always the most efficient retrieval mechanism for Teradata.
POINTS: It determines how data will be distributed and is also the most efficient access path.
65. What is the maximum number of roles that can be assigned to a user?
66. Consider MLoad or TPump according to the volume of the data. In which situations should TPump and MLoad be used?
In general, the more you tend to accumulate your updates into large batches before applying them to your tables, the more likely it is that you'll want to use MLoad. MLoad is more efficient at applying a large number of updates. However, MLoad has certain limitations: it can't update unique secondary indexes or join indexes, it can't fire triggers, and you can't use it on a table with referential integrity defined. Also, MLoad locks the entire table with a write lock when it is in the APPLY phase (when it's applying the updates).
TPump, on the other hand, is best used if you are applying updates throughout the day in small batches (or using a queue). TPump is not as fast, especially as the update volumes grow. Its advantages are that it doesn't lock the entire table for write, but only locks the specific row-hash values that are being updated, and it only locks them for the duration of the update. Also, since there is no special code inside the DBMS for TPump, it supports all DBMS features (updates unique secondary indexes, join indexes, fires triggers, etc.).
If you are applying updates on a weekly or daily basis, I would tend to use MultiLoad. As you start to apply updates more frequently throughout the day, you may start to find that TPump is the better option.
67. Why 3rd NF in Teradata LDM?
Because the Teradata model is in third normal form, you only have to enter data once. That significantly reduces data redundancy and means you don't have to reorganize the entire model every time you want to ask a new business question or add a new data source.
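The primary index path described in question 64 (PI value, row hash, hash map, AMP) can be sketched in Python. Everything numeric here is an assumption for illustration: four AMPs, a 16-entry hash map, and CRC-32 as a stand-in for Teradata's own hashing formula.

```python
import zlib

NUM_AMPS = 4       # hypothetical system size
NUM_BUCKETS = 16   # hypothetical; a real hash map is far larger

# Hash map: bucket number -> AMP number, assigned round-robin.
HASH_MAP = [bucket % NUM_AMPS for bucket in range(NUM_BUCKETS)]

def row_hash(pi_value):
    """Stand-in hash (CRC-32); Teradata uses its own hashing formula."""
    return zlib.crc32(str(pi_value).encode("utf-8"))

def amp_for(pi_value):
    """Use the row hash as an offset into the hash map to find the AMP."""
    bucket = row_hash(pi_value) % NUM_BUCKETS
    return HASH_MAP[bucket]

# Distribute 1000 rows with a unique PI and count rows per AMP.
rows_per_amp = [0] * NUM_AMPS
for employee_number in range(1, 1001):
    rows_per_amp[amp_for(employee_number)] += 1
print(rows_per_amp)

# The same PI value always lands on the same AMP, which is why a
# lookup by PI value only has to visit one AMP.
print(amp_for(42) == amp_for(42))  # True
```

A column with few distinct values would send many rows to the same buckets, which is the skew a good primary index choice tries to avoid.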
68. How many subject areas are there in FSLDM?
10:
1. Party
2. Asset
3. Product
4. Agreement
5. Event
6. Location
7. Campaign
8. Channel
9. Financial Management
10. Internal Organization
69. Explain about MLoad and SI.
MLoad will not work with unique secondary indexes.
70. PPI and UPI in table creation statement.
UPI:
CREATE SET TABLE TomC.employee
    ( emp        INTEGER
     ,dept       INTEGER
     ,lname      CHAR(20)
     ,fname      VARCHAR(20)
     ,salary     DECIMAL(10,2)
     ,hire_date  DATE )
UNIQUE PRIMARY INDEX (emp);
PPI:
CREATE TABLE test
    ( column1  SMALLINT,
      column2  DATE FORMAT 'YYYYMMDD',
      loaddate DATE FORMAT 'yyyy-mm-dd')
UNIQUE PRIMARY INDEX (column2)
-- the source omits the interval unit; MONTH is assumed here
PARTITION BY RANGE_N(column2 BETWEEN DATE '2000-01-01' AND DATE '2100-01-01' EACH INTERVAL '1' MONTH);
71. What is a value-ordered NUSI?
When we define a value-ordered NUSI on a column, the rows in the secondary index subtable get sorted based on the secondary index value. The column should be of integer or date type. This is used for range queries and to avoid full table scans on large tables.
72. Difference between Oracle and Teradata.
Both databases have their advantages and disadvantages, and there are a lot of factors to take into consideration before deciding which database is better. For OLTP systems Oracle is far better than Teradata. Oracle is also more flexible in terms of programming: you can write packages, procedures, and functions. Teradata is useful if you want to generate reports on a very large database, although recent versions of Oracle such as 10g are quite good and contain a lot of features to support data warehousing.
Teradata is an MPP system which can process complex queries very fast. Another advantage is the uniform distribution of data through unique primary indexes without any overhead. Recently we had an evaluation with experts from both Oracle and Teradata for an OLAP system, and they were really impressed with the performance of Teradata over Oracle.
A counterview: Oracle supports MPP in the form of grid computing. Uniform distribution of data based on the primary key is not much use when a full scan is required to access a huge amount of data. So far we found Teradata almost equal in performance to Oracle 10g. Based on benchmarks and after consulting different people, we found the following problems in Teradata:
- It is too expensive; you need deep pockets to work with Teradata.
- It has only one type of index, while Oracle has many types of indexes, especially the bitmap index.
- Teradata does not have materialized views. Oracle has materialized views, which decrease the I/O bandwidth and make the system more scalable.
- Oracle has a very wide variety of analytic functions for SQL.
- Oracle has 3 types of partitioning, and Oracle 11g adds some new partitioning features.
- Oracle has the ability to use clusters without having to statically partition data.
Further, these are remarks I found on some Oracle discussion forums: the largest databases in the world run on Oracle (http://biz.yahoo.com/prnews/031114/sff029_1.html); they count a) all disk on the computer, not just database disk, and b) the sum of all databases a customer is using, not individual databases.
Still, we saw that the best database is the one for which you have the technical resources to work with and especially to tune.
73. What are the DBQL tables?
Database Query Log tables are the tables present in the DBC database which store the history of all the operations performed on the tables present in the databases. The history can get very large, so these tables should be purged when the data is no longer needed.
74. Different kinds of users in Teradata
75. Explain RAID 1 and RAID 5.
RAID Protection
There are many forms of disk array protection in Teradata. RAID 1 and RAID 5 are commonly used and will be discussed here. The disk array controllers manage both.
RAID 1 is a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the array controllers to write all data to two separate locations, which means data can be read from two locations as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and performance is unchanged. RAID 1 may be configured as RAID 1 + 0, which uses mirrored striping.
RAID 5 is a parity-checking technique. For every three blocks of data (spread over three disks), there is a fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be reconstructed by using the information on the other three. If two of the disks fail, the rank becomes unavailable. The array controller does the recalculation of the information for the missing block. Recalculation has some impact on performance, but RAID 5 costs much less disk space than mirroring.
76. What is the difference between SAMPLE and TOP?
The sampling function (SAMPLE) permits a SELECT to randomly return rows from a Teradata database table. It allows the request to specify either an absolute number of rows or a percentage of rows to return. Additionally, it provides an ability to return rows from multiple samples.
SELECT * FROM student_course_table SAMPLE 5;
TOP Clause
The TOP clause is used to specify the number of records to return. The TOP clause can be very useful on large tables with thousands of records. Returning a large number of records can impact performance.
Note: Not all database systems support the TOP clause.
Example:
1. SELECT TOP 50 PERCENT * FROM EMP
2. SELECT TOP 2 * FROM EMP
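The contrast between the two can be sketched in Python: TOP combined with a sort order is deterministic, while SAMPLE picks rows at random. The table contents and both function names are hypothetical.

```python
import random

# Hypothetical EMP rows.
emp = [{"empno": n, "salary": 1000 + 50 * n} for n in range(1, 11)]

def top_n(rows, n, key):
    """Like TOP n with an ORDER BY: the first n rows of the sorted result."""
    return sorted(rows, key=key)[:n]

def sample_n(rows, n, rng=random):
    """Like SAMPLE n: n distinct rows picked at random."""
    return rng.sample(rows, n)

print([r["empno"] for r in top_n(emp, 2, key=lambda r: r["empno"])])  # [1, 2]
print([r["empno"] for r in sample_n(emp, 2)])  # varies from run to run
```

Running the script twice gives the same TOP result each time, while the sampled rows normally differ.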
There is a TOP option in V2R6, but if you want to try this in V2R5 you need to use an analytical function:
Select * From vinod_1 Qualify Row_number() OVER(Order by empno)
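Returning to the RAID 5 answer in question 75, the parity reconstruction it describes can be sketched in Python using XOR, the usual parity operation for RAID 5. The block contents and the function name are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks on three disks
parity = xor_blocks(data)            # fourth block on a fourth disk

# The disk holding data[1] fails: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'BBBB'
```

With two blocks missing there is no longer enough information to solve for either one, which matches the statement that the rank becomes unavailable when two disks fail.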