Sonali DBMS Notes


1.2 Advantages of Database Systems

As shown in the figure, the DBMS is a central system which provides a common interface between the data and the various front-end programs in the application. It also provides a central location for the whole data in the application to reside. Due to its centralized nature, the database system can overcome the disadvantages of the file-based system, as discussed below.

• Minimal Data Redundancy – Since the whole data resides in one central database, the various programs in the application can access data in different data files, so data present in one file need not be duplicated in another. This reduces data redundancy. However, this does not mean that all redundancy can be eliminated: there could be business or technical reasons for having some amount of redundancy. Any such redundancy should be carefully controlled, and the DBMS should be aware of it.



• Data Consistency – Reduced data redundancy leads to better data consistency.



• Data Integration – Since related data is stored in one single database, enforcing data integrity is much easier. Moreover, the functions of the DBMS can be used to enforce integrity rules with minimal programming in the application programs.



• Data Sharing – Related data can be shared across programs since the data is stored in a centralized manner. Even new applications can be developed to operate against the same data.

• Enforcement of Standards – Enforcing standards in the organization and structure of data files is both necessary and easy in a database system, since it is one single set of programs which always interacts with the data files.



• Application Development Ease – The application programmer need not build functions for handling issues like concurrent access, security and data integrity; the programmer only needs to implement the application business rules. This brings in application development ease. Adding additional functional modules is also easier than in file-based systems.



• Better Controls – Better controls can be achieved due to the centralized nature of the system.



• Data Independence – The architecture of a DBMS can be viewed as a 3-level system comprising: the internal or physical level, where the data resides; the conceptual level, which is the level of the DBMS functions; and the external level, which is the level of the application programs or the end user. Data independence is the isolation of an upper level from changes in the organization or structure of a lower level. For example, if changes in the file organization of a data file do not demand changes in the functions of the DBMS or in the application programs, data independence is achieved. Thus data independence can be defined as the immunity of applications to changes in physical representation and access technique. The provision of data independence is a major objective of database systems.



• Reduced Maintenance – Maintenance is less and easier, again due to the centralized nature of the system.

1.3 Functions of a DBMS

The functions performed by a typical DBMS are the following:

• Data Definition – The DBMS provides functions to define the structure of the data in the application: defining and modifying the record structure, the type and size of fields, and the various constraints/conditions to be satisfied by the data in each field.
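In SQL terms, such a definition might look as follows (a sketch; the table and column names are illustrative, and Oracle-style types are assumed):

    CREATE TABLE CUSTOMER (
        Cust_ID    NUMBER(5)     PRIMARY KEY,   -- field type, size and a key constraint
        Cust_Name  VARCHAR2(30)  NOT NULL,      -- a condition on the field
        Cust_City  VARCHAR2(20)
    );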



• Data Manipulation – Once the data structure is defined, data needs to be inserted, modified or deleted. The functions which perform these operations are also part of the DBMS. These functions can handle planned and unplanned data manipulation needs. Planned queries are those which form part of the application; unplanned queries are ad-hoc queries which are performed on a need basis.
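For example (a sketch against the hypothetical CUSTOMER table above):

    INSERT INTO CUSTOMER (Cust_ID, Cust_Name, Cust_City) VALUES (10001, 'Raj', 'Delhi');
    UPDATE CUSTOMER SET Cust_City = 'Mumbai' WHERE Cust_ID = 10001;
    DELETE FROM CUSTOMER WHERE Cust_ID = 10001;
    SELECT * FROM CUSTOMER WHERE Cust_City = 'Delhi';   -- an unplanned, ad-hoc query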



• Data Security & Integrity – The DBMS contains functions which handle the security and integrity of data in the application. These can be easily invoked by the application, and hence the application programmer need not code these functions in his/her programs.



• Data Recovery & Concurrency – Recovery of data after a system failure and concurrent access of records by multiple users are also handled by the DBMS.



• Data Dictionary Maintenance – Maintaining the data dictionary, which contains the data definitions of the application, is also one of the functions of a DBMS.



• Performance – Optimizing the performance of queries is one of the important functions of a DBMS. Hence the DBMS has a set of programs forming the Query Optimizer, which evaluates the different implementations of a query and chooses the best among them.

Thus the DBMS provides an environment that is both convenient and efficient to use when there is a large volume of data and many transactions to be processed.

1.4 Role of the Database Administrator

Typically there are three types of users of a DBMS:

1. The End User, who uses the application. Ultimately, this is the user who actually puts the data in the system to use in business. This user need not know anything about the organization of data at the physical level, and need not be aware of the complete data in the system; she needs access to and knowledge of only the data she is using.
2. The Application Programmer, who develops the application programs. She has more knowledge about the data and its structure, since she has to manipulate the data using her programs. She also need not have access to and knowledge of the complete data in the system.
3. The Database Administrator (DBA), who is like the super-user of the system.

The role of the DBA is very important and is defined by the following functions:

• Defining the Schema – The DBA defines the schema, which contains the structure of the data in the application. The DBA determines what data needs to be present in the system and how this data has to be represented and organized.



• Liaising with Users – The DBA needs to interact continuously with the users to understand the data in the system and its use.



• Defining Security & Integrity Checks – The DBA identifies the access restrictions to be enforced and defines security checks accordingly. Data integrity checks are also defined by the DBA.



• Defining Backup / Recovery Procedures – The DBA also defines procedures for backup and recovery. Defining backup procedures includes specifying what data is to be backed up, the periodicity of taking backups, and the medium and storage place for the backup data.



• Monitoring Performance – The DBA has to continuously monitor the performance of the queries and take measures to optimize all the queries in the application.

1.5 Types of Database Systems

Database systems can be categorized according to the data structures and operators they present to the user. The oldest systems fall into the inverted list, hierarchic and network categories; these are the pre-relational models.

• In the Hierarchical Model, different records are inter-related through hierarchical or tree-like structures. A parent record can have several children, but a child can have only one parent. In the figure, two hierarchies are shown: the first stores the relations between CUSTOMER, ORDERS, CONTACTS and ORDER_PARTS, and the second shows the relation between PARTS, ORDER_PARTS and SALES_HISTORY. The many-to-many relationship is implemented through the ORDER_PARTS segment, which occurs in both hierarchies. In practice, only one tree stores the ORDER_PARTS segment, while the other has a logical pointer to this segment. IMS (Information Management System) from IBM is an example of a Hierarchical DBMS.



• In the Network Model, a parent can have several children and a child can also have many parent records. Records are physically linked through linked lists. IDMS from Computer Associates International Inc. is an example of a Network DBMS.



• In the Relational Model, unlike the Hierarchical and Network models, there are no physical links. All data is maintained in the form of tables consisting of rows and columns. Data in two tables is related through common columns, not through physical links or pointers. Operators are provided for operating on the rows of tables. Unlike the other two types of DBMS, there is no need to traverse pointers in a Relational DBMS. This makes querying much easier in a Relational DBMS than in a Hierarchical or Network DBMS and, in fact, is a major reason for the relational model becoming more programmer-friendly and much more dominant and popular in both industrial and academic scenarios. Oracle, Sybase, DB2, Ingres, Informix and MS-SQL Server are a few of the popular Relational DBMSs.

[Figure: sample data for the hierarchical case example described above. CUSTOMER (Cust. No. 15371, Nanubhai & Sons, L. J. Road, Mumbai); CONTACTS for customer 15371 (Nanubhai, Owner; Rajesh Munim, Accountant); PARTS (Part No. S3, Amkette 3.5" Floppies, price 400.00); ORDERS (Order No. 3216 dated 24-June-1997 for customer 15371); ORDER_PARTS (order 3216: parts C1 and S3, quantities 300 and 120); SALES_HISTORY for part S3 in 1996 (East 2000, North 5500, South 12000, West 20000 units).]

The recent developments in the area have shown up in the form of certain object and object/relational DBMS products. Examples of such systems are GemStone and Versant ODBMS. Research has also proceeded on a variety of other schemes, including the multi-dimensional approach and the logic-based approach.

3-Level Database System Architecture

• The External Level represents the collection of views available to the different end users.
• The Conceptual Level is the representation of the entire information content of the database.
• The Internal Level is the physical level, which shows how the data is stored, what the representations of the fields are, etc.

2. The Internal Level

This chapter discusses how the data is physically stored on the disk and some of the access mechanisms commonly used for retrieving this data. The Internal Level is the level which deals with the physical storage of data. While designing this layer, the main objective is to optimize performance by minimizing the number of disk accesses during the various database operations.

The figure shows the process of database access in general. The DBMS views the database as a collection of records. The File Manager of the underlying Operating System views it as a set of pages, and the Disk Manager views it as a collection of physical locations on the disk. When the DBMS makes a request for a specific record to the File Manager, the latter maps the record to the page containing it and requests that specific page from the Disk Manager. The Disk Manager determines the physical location on the disk and retrieves the required page.

2.1 Clustering

In the above process, if the page containing the requested record is already in memory, retrieval from the disk is not necessary and the time taken for the whole operation will be less. Thus, if records which are frequently used together are placed physically together, more of them will fall on the same page, fewer pages will need to be retrieved, and the reduced number of disk accesses will give better performance. This method of storing logically related records physically together is called clustering.

Eg: Consider the CUSTOMER table as shown below.

Cust ID   Cust Name   Cust City   ...
10001     Raj         Delhi       ...
10002     ...         ...         ...
10003     ...         ...         ...
10004     ...         ...         ...
...       ...         ...         ...

If queries retrieving customers with consecutive Cust_IDs occur frequently in the application, clustering based on Cust_ID will help improve the performance of these queries. This can be explained as follows. Assume that the Customer record size is 128 bytes and the typical size of a page retrieved by the File Manager is 1 KB (1024 bytes). If there is no clustering, it can be assumed that the Customer records are stored at random physical locations; in the worst-case scenario, each record is placed on a different page. Hence a query to retrieve 100 records with consecutive Cust_IDs (say, 10001 to 10100) will require 100 pages to be accessed, which in turn translates to 100 disk accesses. But if the records are clustered, a page can contain 8 records, and the number of pages to be accessed for retrieving the 100 consecutive records will be ceil(100/8) = 13. That is, only 13 disk accesses will be required to obtain the query results. Thus, in the given example, clustering improves the speed by a factor of about 7.7.

Q: For what record size will clustering be of no benefit in improving performance?
A: When the record size and page size are such that a page can contain only one record.

Q: Can a table have clustering on multiple fields simultaneously?
A: No.

• Intra-file Clustering – Clustered records belong to the same file (table), as in the above example.
• Inter-file Clustering – Clustered records belong to different files (tables). This type of clustering may be required to enhance the speed of queries retrieving related records from more than one table; here, interleaving of records is used. A sketch of how clustering can be requested is given below.
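In Oracle, for instance, clustering can be requested with the CREATE CLUSTER statement (a sketch, not the only way to do this; the names are illustrative):

    CREATE CLUSTER cust_cluster (cust_id NUMBER(5));        -- cluster records by Cust_ID
    CREATE INDEX idx_cust_cluster ON CLUSTER cust_cluster;  -- a cluster index is required before use
    CREATE TABLE CUSTOMER (
        cust_id    NUMBER(5) PRIMARY KEY,
        cust_name  VARCHAR2(30),
        cust_city  VARCHAR2(20)
    ) CLUSTER cust_cluster (cust_id);                       -- rows stored physically by cluster key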

2.2 Indexing

Indexing is another common method for making retrievals faster. Consider the example of the CUSTOMER table used above, and a query based on the customer's city: "Retrieve the records of all customers who reside in Delhi." Here a sequential search of the CUSTOMER table has to be carried out and all records with the value 'Delhi' in the Cust_City field retrieved. The time taken for this operation depends on the number of pages to be accessed: if the records are randomly stored, the number of page accesses depends on the volume of data; if the records are stored physically together, it also depends on the size of each record. If such queries based on the Cust_City field are very frequent in the application, steps can be taken to improve their performance. Creating an index on Cust_City is one such method. This results in the scenario shown below.

A new index file is created. The number of records in the index file is the same as that of the data file. Each index record has two fields: one contains the value of the Cust_City field, and the second contains a pointer to the actual data record in the CUSTOMER table. Whenever a query based on the Cust_City field occurs, a search is carried out on the index file. This search will be much faster than a sequential search in the CUSTOMER table if the records are stored physically together: because of the much smaller size of an index record, each page can hold many more of them. When the records with the value 'Delhi' in the Cust_City field are located in the index file, the pointers in their second fields can be followed to directly retrieve the corresponding CUSTOMER records. Thus the access involves a sequential access on the index file and a direct access on the actual data file.

Retrieval Speed v/s Update Speed: Though indexes help make retrievals faster, they slow down updates on the table, since updates on the base table demand updates on the index as well. It is possible to create an index with multiple fields, i.e., an index on a field combination. Multiple indexes can also exist on the same table simultaneously, though there may be a limit on the maximum number of indexes that can be created on a table.
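In SQL, the index on Cust_City, and an index on a field combination, would be created as follows (standard syntax; the index names are illustrative):

    CREATE INDEX idx_cust_city ON CUSTOMER (Cust_City);              -- speeds up queries on Cust_City
    CREATE INDEX idx_city_name ON CUSTOMER (Cust_City, Cust_Name);   -- an index on a field combination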

Q: In which of the following situations will indexes be ineffective?
a) When the percentage of rows being retrieved is large
b) When the data table is small and the index record is of almost the same size as the actual data record
c) In queries involving NULL / NOT NULL in the indexed field
d) All of the above
A: d) All of the above

Q: Can a clustering based on one field and indexing on another field exist on the same table simultaneously?
A: Yes.

2.3 Hashing

Hashing is yet another method used for making retrievals faster. It provides direct access to a record on the basis of the value of a specific field, called the hash field. When a new record is inserted, it is physically stored at an address which is computed by applying a mathematical function (the hash function) to the value of the hash field. Thus for every new record,

hash_address = f(hash_field), where f is the hash function.

Later, when a record is to be retrieved, the same hash function is used to compute the address where the record is stored. Retrievals are faster since direct access is provided and there is no search involved in the process. An example of a typical hash function is a numeric hash field, say an id, modulo a very large prime number.

Q: Can there be more than one hash field on a file?
A: No. As hashing relates the field value to the address of the record, multiple hash fields would map a record to multiple addresses at the same time. Hence there can be only one hash field per file.

Collisions: Consider the example of the CUSTOMER table given earlier while discussing clustering. Let CUST_ID be the hash field and the hash function be defined as ((CUST_ID mod 10000) * 64 + 1025). The records with CUST_ID 10001, 10002, 10003 etc. will be stored at addresses 1089, 1153, 1217 etc. respectively. It is possible that two records hash to the same address, leading to a collision. In the above example, records with CUST_ID values 20001, 20002, 20003 etc. will also map onto the addresses 1089, 1153, 1217 etc. respectively, and the same is the case with CUST_ID values 30001, 30002, 30003 etc. The methods to resolve a collision are:

1. Linear Search: While inserting a new record, if it is found that the location at the hash address is already occupied by a previously inserted record, search for the next free location available on the disk and store the new record at this location. A pointer from the first record at the original hash address to the new record is also stored. During retrieval, the hash address is computed to locate the record; when it is seen that the record is not available at the hash address, the pointer from the record at that address is followed to locate the required record.

In this method, the overhead incurred is the time taken by the linear search to locate the next free location while inserting a record.

2. Collision Chain: Here, the hash address location contains the head of a list of pointers linking together all records which hash to that address.

In this method, an overflow area needs to be used if the number of records mapping onto the same hash address exceeds the number of locations linked to it.
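The hash function discussed above is an ordinary arithmetic expression; a sketch of the address computation for a few CUST_ID values (Oracle's MOD function assumed):

    SELECT CUST_ID, MOD(CUST_ID, 10000) * 64 + 1025 AS hash_address
    FROM   CUSTOMER
    WHERE  CUST_ID IN (10001, 10002, 20001);   -- 10001 and 20001 collide at address 1089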

3.1 The Relational Model

Relational Databases: Terminology

Databases: Case Example

Ord_Aug
Ord#   OrdDate    Cust#
101    02-08-94   002
102    11-08-94   003
103    21-08-94   003
104    28-08-94   002
105    30-08-94   005

Ord_Items
Ord#   Item#   Qty
101    HW1     100
101    HW3     50
101    SW1     150
102    HW2     10
103    HW3     50
104    HW2     25
104    HW3     100
105    SW1     100

Items
Item#   Descr          Price
HW1     Power Supply   4000
HW2     101-Keyboard   2000
HW3     Mouse          800
SW1     MS-DOS 6.0     5000
SW2     MS-Word 6.0    8000

Customers
Cust#   CustName     City
001     Shah         Bombay
002     Srinivasan   Madras
003     Gupta        Delhi
004     Banerjee     Calcutta
005     Apte         Bombay

The basic terminology, with examples from the given case:

• Relation – A table. Eg: Ord_Aug, Customers, Items etc.
• Tuple – A row or a record in a relation. Eg: A row from the Customers relation is a Customer tuple.
• Attribute – A field or a column in a relation. Eg: Ord_Date, Item#, CustName etc.
• Cardinality of a relation – The number of tuples in a relation. Eg: The cardinality of the Ord_Items relation is 8.
• Degree of a relation – The number of attributes in a relation. Eg: The degree of the Customers relation is 3.
• Domain of an attribute – The set of all values that can be taken by the attribute. Eg: The domain of Qty in Ord_Items is the set of all values which can represent the quantity of an ordered item.
• Primary Key of a relation – An attribute or a combination of attributes that uniquely identifies each tuple in a relation. Eg: The Primary Key of the Customers relation is Cust#; the Ord# and Item# combination forms the Primary Key of Ord_Items.
• Foreign Key – An attribute or a combination of attributes in one relation R1 which indicates the relationship of R1 with another relation R2. The foreign key attributes in R1 must contain values matching those in R2. Eg: Cust# in the Ord_Aug relation is a foreign key creating a reference from Ord_Aug to Customers; it indicates the relationship between orders in Ord_Aug and Customers. Similarly, Ord# and Item# in Ord_Items are foreign keys creating references from Ord_Items to Ord_Aug and Items respectively.

3.2 Properties of Relations

• No Duplicate Tuples – A relation cannot contain two or more tuples which have the same values for all the attributes, i.e., in any relation, every row is unique.
• Tuples are Unordered – The order of rows in a relation is immaterial.
• Attributes are Unordered – The order of columns in a relation is immaterial.
• Attribute Values are Atomic – Each tuple contains exactly one value for each attribute.

It may be noted that many of the properties of relations follow from the fact that the body of a relation is a mathematical set.

3.3 Integrity Rules

The following are the integrity rules to be satisfied by any relation:

• No component of the Primary Key can be null.
• The database must not contain any unmatched Foreign Key values. This is called the referential integrity rule.

Q: Can a Foreign Key accept nulls?
A: Yes, if the application business rule allows it.

How do we explain this? Unlike the case of Primary Keys, there is no integrity rule saying that no component of a foreign key can be null. This can be explained logically with the help of the following example. Consider the relations Employee and Account as given below.

Employee
Emp#   EmpName   EmpCity   EmpAcc#
X101   Shekhar   Bombay    120001
X102   Raj       Pune      120002
X103   Sharma    Nagpur    Null
X104   Vani      Bhopal    120003

Account
Acc#     OpenDate      BalAmt
120001   30-Aug-1998   5000
120002   29-Oct-1998   1200
120003   01-Jan-1999   3000
120004   04-Mar-1999   500

EmpAcc# in the Employee relation is a foreign key creating a reference from Employee to Account. Here, a Null value in the EmpAcc# attribute is logically possible if an employee does not have a bank account. If the business rules allow an employee to exist in the system without opening an account, a Null value can be allowed for EmpAcc# in the Employee relation. In the case example given earlier, Cust# in Ord_Aug cannot accept Null if the business rule insists that the customer number be stored for every order placed.

The next issue related to foreign key references is the handling of deletes/updates of a parent. In the case example, can we delete the record with Cust# value 002, 003 or 005? The default answer is no, as long as there is a foreign key reference to these records from some other table; here, the records are referenced from the order records in the Ord_Aug relation, and hence the deletion of the parent record is restricted. Deletion can still be carried out if we use the Cascade or Nullify strategies.

Cascade: Delete/update all the references successively, in a cascaded fashion, and finally delete/update the parent record. In the case example, the customer record with Cust# 002 can be deleted after deleting the order records with Ord# 101 and 104. But these order records, in turn, can be deleted only after deleting the records with Ord# 101 and 104 from the Ord_Items relation.

Nullify: Update the referencing foreign key values to Null and then delete/update the parent record. In the above example of the Employee and Account relations, an account record may have to be deleted if the account is to be closed. For example, if employee Raj decides to close his account, the account record with Acc# 120002 has to be deleted. But this deletion is not possible as long as the employee record of Raj references it. Hence the strategy can be to update the EmpAcc# field in the employee record of Raj to Null and then delete the Account parent record of 120002. After the deletion, the data in the tables will be as follows:

Employee
Emp#   EmpName   EmpCity   EmpAcc#
X101   Shekhar   Bombay    120001
X102   Raj       Pune      Null
X103   Sharma    Nagpur    Null
X104   Vani      Bhopal    120003

Account
Acc#     OpenDate      BalAmt
120001   30-Aug-1998   5000
120003   01-Jan-1999   3000
120004   04-Mar-1999   500

(The record for Acc# 120002 has been deleted, and Raj's EmpAcc# has been set to Null.)
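These rules can also be declared to the DBMS so that it enforces them automatically. A sketch of the Employee/Account schema in SQL (names as above, Oracle-style types assumed; ON DELETE SET NULL corresponds to the Nullify strategy, while ON DELETE CASCADE would give the Cascade strategy):

    CREATE TABLE Account (
        Acc#      NUMBER(6)  PRIMARY KEY,
        OpenDate  DATE,
        BalAmt    NUMBER(10,2)
    );

    CREATE TABLE Employee (
        Emp#      CHAR(4)  PRIMARY KEY,
        EmpName   VARCHAR2(30),
        EmpCity   VARCHAR2(20),
        EmpAcc#   NUMBER(6)  REFERENCES Account (Acc#)  -- nullable foreign key
                             ON DELETE SET NULL         -- the Nullify strategy
    );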

3.4 Relational Algebra Operators

The eight relational algebra operators are:

1. SELECT – To retrieve specific tuples/rows from a relation.

Eg: Orders placed by customer 002, selected from Ord_Aug:

Ord#   OrdDate    Cust#
101    02-08-94   002
104    28-08-94   002
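In SQL this restriction is written with a WHERE clause (a sketch, assuming the selection condition is Cust# = '002', which matches the rows shown):

    SELECT * FROM Ord_Aug WHERE Cust# = '002';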

2. PROJECT – To retrieve specific attributes/columns from a relation.

Descr          Price
Power Supply   4000
101-Keyboard   2000
Mouse          800
MS-DOS 6.0     5000
MS-Word 6.0    8000
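The corresponding SQL (standard syntax; note that, unlike PROJECT, SQL retains duplicate rows unless DISTINCT is specified):

    SELECT Descr, Price FROM Items;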

3. PRODUCT – To obtain all possible combinations of tuples from two relations.

Ord#   OrdDate    O.Cust#   C.Cust#   CustName     City
101    02-08-94   002       001       Shah         Bombay
101    02-08-94   002       002       Srinivasan   Madras
101    02-08-94   002       003       Gupta        Delhi
101    02-08-94   002       004       Banerjee     Calcutta
101    02-08-94   002       005       Apte         Bombay
102    11-08-94   003       001       Shah         Bombay
102    11-08-94   003       002       Srinivasan   Madras
...    ...        ...       ...       ...          ...

(The full product pairs each of the 5 orders with each of the 5 customers, giving 25 tuples.)
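The corresponding SQL (a sketch; the O and C aliases mirror the column headings above):

    SELECT O.Ord#, O.OrdDate, O.Cust#, C.Cust#, C.CustName, C.City
    FROM   Ord_Aug O, Customers C;   -- Cartesian product: every order paired with every customer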

4. UNION – To retrieve tuples appearing in either or both the relations participating in the UNION.

Eg: Consider the relation Ord_Jul as follows:

Ord_Jul
Ord#   OrdDate    Cust#
101    03-07-94   001
102    27-07-94   003

Ord_Jul UNION Ord_Aug gives:

Ord#   OrdDate    Cust#
101    03-07-94   001
102    27-07-94   003
101    02-08-94   002
102    11-08-94   003
103    21-08-94   003
104    28-08-94   002
105    30-08-94   005

Note: The union operation shown above logically implies retrieval of the records of orders placed in July or in August.
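The corresponding SQL (standard syntax; UNION removes duplicate rows, while UNION ALL would retain them):

    SELECT Ord#, OrdDate, Cust# FROM Ord_Jul
    UNION
    SELECT Ord#, OrdDate, Cust# FROM Ord_Aug;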

5. INTERSECT – To retrieve tuples appearing in both the relations participating in the INTERSECT.

Eg: To retrieve the Cust# of customers who have placed orders in July and in August:

Cust#
003

6. DIFFERENCE – To retrieve tuples appearing in the first relation participating in the DIFFERENCE but not the second.

Eg: To retrieve the Cust# of customers who have placed orders in July but not in August:

Cust#
001
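In SQL, these two queries can be sketched as follows (Oracle spells the difference operator MINUS; standard SQL uses EXCEPT):

    SELECT Cust# FROM Ord_Jul
    INTERSECT
    SELECT Cust# FROM Ord_Aug;   -- ordered in July and in August

    SELECT Cust# FROM Ord_Jul
    MINUS
    SELECT Cust# FROM Ord_Aug;   -- ordered in July but not in August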

7. JOIN – To retrieve combinations of tuples in two relations based on a common field in both the relations.

Eg: ORD_AUG join CUSTOMERS (here, the common column is Cust#):

Ord#   OrdDate    Cust#   CustName     City
101    02-08-94   002     Srinivasan   Madras
102    11-08-94   003     Gupta        Delhi
103    21-08-94   003     Gupta        Delhi
104    28-08-94   002     Srinivasan   Madras
105    30-08-94   005     Apte         Bombay

Note: The above join operation logically implies retrieval of the details of all orders and the details of the corresponding customers who placed the orders. Such a join operation, where only those rows having corresponding rows in both the relations are retrieved, is called the natural join or inner join. This is the most common join operation.

Consider the example of the EMPLOYEE and ACCOUNT relations.

EMPLOYEE
Emp#   EmpName   EmpCity   Acc#
X101   Shekhar   Bombay    120001
X102   Raj       Pune      120002
X103   Sharma    Nagpur    Null
X104   Vani      Bhopal    120003

ACCOUNT
Acc#     OpenDate      BalAmt
120001   30-Aug-1998   5000
120002   29-Oct-1998   1200
120003   01-Jan-1999   3000
120004   04-Mar-1999   500

A join can be formed between the two relations based on the common column Acc#. The result of the (inner) join is:

Emp#   EmpName   EmpCity   Acc#     OpenDate      BalAmt
X101   Shekhar   Bombay    120001   30-Aug-1998   5000
X102   Raj       Pune      120002   29-Oct-1998   1200
X104   Vani      Bhopal    120003   01-Jan-1999   3000

Note that, from each table, only those records which have corresponding records in the other table appear in the result set. This means that result of the inner join shows the details of those employees who hold an account along with the account details. The other type of join is the outer join which has three variations – the left outer join, the right outer join and the full outer join. These three joins are explained as follows: The left outer join retrieves all rows from the left-side (of the join operator) table. If there are corresponding or related rows in the right-side table, the correspondence will be shown. Otherwise, columns of the right-side table will take null values.

EMPLOYEE left outer join ACCOUNT gives:

Emp#   EmpName   EmpCity   Acc#     OpenDate      BalAmt
X101   Shekhar   Bombay    120001   30-Aug-1998   5000
X102   Raj       Pune      120002   29-Oct-1998   1200
X103   Sharma    Nagpur    NULL     NULL          NULL
X104   Vani      Bhopal    120003   01-Jan-1999   3000

The right outer join retrieves all rows from the right-side (of the join operator) table. If there are corresponding or related rows in the left-side table, the correspondence will be shown. Otherwise, columns of the left-side table will take null values.

EMPLOYEE right outer join ACCOUNT gives:

Emp#   EmpName   EmpCity   Acc#     OpenDate      BalAmt
X101   Shekhar   Bombay    120001   30-Aug-1998   5000
X102   Raj       Pune      120002   29-Oct-1998   1200
X104   Vani      Bhopal    120003   01-Jan-1999   3000
NULL   NULL      NULL      120004   04-Mar-1999   500

(Assume that Acc# 120004 belongs to someone who is not an employee and hence the details of the Account holder are not available here)

The full outer join retrieves all rows from both the tables. If there is a correspondence or relation between rows from the tables of either side, the correspondence will be shown. Otherwise, related columns will take null values.

EMPLOYEE full outer join ACCOUNT gives:

Emp#   EmpName   EmpCity   Acc#     OpenDate      BalAmt
X101   Shekhar   Bombay    120001   30-Aug-1998   5000
X102   Raj       Pune      120002   29-Oct-1998   1200
X103   Sharma    Nagpur    NULL     NULL          NULL
X104   Vani      Bhopal    120003   01-Jan-1999   3000
NULL   NULL      NULL      120004   04-Mar-1999   500
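The four joins shown above can be sketched in SQL as follows (ANSI join syntax; only the join clause changes between them):

    SELECT E.Emp#, E.EmpName, E.EmpCity, A.Acc#, A.OpenDate, A.BalAmt
    FROM   EMPLOYEE E INNER JOIN ACCOUNT A ON E.Acc# = A.Acc#;

    -- the same SELECT list with:
    --   EMPLOYEE E LEFT  OUTER JOIN ACCOUNT A ON E.Acc# = A.Acc#   (all employees)
    --   EMPLOYEE E RIGHT OUTER JOIN ACCOUNT A ON E.Acc# = A.Acc#   (all accounts)
    --   EMPLOYEE E FULL  OUTER JOIN ACCOUNT A ON E.Acc# = A.Acc#   (all rows from both)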

Q: What will be the result of a natural join operation between the relations R1 and R2 shown in the figure?
A:

a1   b1   c1
a2   b2   c2
a3   b3   c3

8. DIVIDE – Consider the following three relations:

R1 divided by R2 per R3 gives:

a

Thus the result contains those values from R1 whose corresponding R2 values in R3 include all R2 values.
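Relational division has no direct SQL operator and is usually expressed with nested NOT EXISTS. A sketch, assuming R3 holds pairs (a, b) relating R1 values to R2 values, as the definition above describes (the column names a and b are illustrative):

    SELECT DISTINCT x.a
    FROM   R3 x
    WHERE  NOT EXISTS (
               SELECT * FROM R2 d
               WHERE  NOT EXISTS (
                          SELECT * FROM R3 y
                          WHERE  y.a = x.a AND y.b = d.b));
    -- keeps each R1 value whose R3 pairs cover every value in R2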

4. Structured Query Language (SQL)

4.1 SQL: An Overview

The components of SQL are:

a. Data Manipulation Language – Consists of SQL statements for operating on the data (inserting, modifying, deleting and retrieving data) in tables which already exist.
b. Data Definition Language – Consists of SQL statements for defining the schema (creating, modifying and dropping tables, indexes, views etc.).
c. Data Control Language – Consists of SQL statements for granting and revoking access permissions to users.

Tables used:

Ord_Aug
Ord#   OrdDate     Cust#
101    02-AUG-94   002
102    11-AUG-94   003
103    21-AUG-94   003
104    28-AUG-94   002
105    30-AUG-94   005

Ord_Items
Ord#   Item#   Qty
101    HW1     100
101    HW3     50
101    SW1     150
102    HW2     10
103    HW3     50
104    HW2     25
104    HW3     100
105    SW1     100

Items
Item#   Descr          Price
HW1     Power Supply   4000
HW2     101-Keyboard   2000
HW3     Mouse          800
SW1     MS-DOS 6.0     5000
SW2     MS-Word 6.0    8000

Customers
Cust#   CustName     City
001     Shah         Bombay
002     Srinivasan   Madras
003     Gupta        Delhi
004     Banerjee     Calcutta
005     Apte         Bombay

4.2 DML – SELECT, INSERT, UPDATE and DELETE statements.

The SELECT statement retrieves rows from one or more tables according to given conditions. General form:

SELECT [ALL | DISTINCT] <column-list>
FROM <table-list>
[WHERE <condition>]
[GROUP BY <column-list>]
[HAVING <condition>]
[ORDER BY <column-list> [DESC]]

Query 1: Some SELECT statements on the Case Example.
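A few representative statements against the case tables (a sketch; the conditions are chosen for illustration):

    SELECT * FROM Ord_Aug;                                       -- all orders placed in August
    SELECT DISTINCT Cust# FROM Ord_Aug;                          -- customers who ordered in August
    SELECT * FROM Ord_Aug WHERE OrdDate > '15-AUG-94';           -- orders placed after 15-AUG-94
    SELECT Item#, Descr, Price FROM Items ORDER BY Price DESC;   -- items, costliest first
    SELECT Ord#, COUNT(*) FROM Ord_Items GROUP BY Ord#;          -- number of items in each order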

Consider the query "retrieve the order date, item number and quantity for all orders of item HW3", which involves ORDTBL (100 tuples) and ORD_ITEMS (10000 tuples).

Query Evaluation – Method 1

T1 = ORDTBL × ORD_ITEMS (form the Cartesian product as the first step; 100 × 10000 generates 1000000 tuples as the intermediate result)
- 1000000 tuples written to disk (assuming that the 1000000 tuples in the intermediate result cannot be held in memory: 1000000 tuple writes to a temporary space on the disk)

T2 = σ[ORDTBL.ORD# = ORD_ITEMS.ORD# AND ITEM# = 'HW3'] (T1) (apply the two conditions in the query on the intermediate result obtained after the first step)
- 1000000 tuples read into memory (1000000 tuple reads)
- 50 tuples selected (those satisfying both conditions; the 50 are held in memory)

T3 = π[ORDDATE, ITEM#, QTY] (T2) (projection performed as the final step; no more tuple I/Os)
- 50 tuples (final result)

Total no. of tuple I/Os = 1000000 reads + 1000000 writes + 1000000 reads = 3000000 tuple I/Os

Query Evaluation – Method 2

T1 = σ[ITEM# = 'HW3'] (ORD_ITEMS) (perform the selection on ORD_ITEMS as the first step)
- 10000 tuple reads (from ORD_ITEMS)
- 50 tuples selected; no disk writes (the 50 tuples forming the intermediate result can be held in memory)

T2 = ORDTBL JOIN T1
- 100 tuple reads (from ORDTBL)
- resulting relation with 50 tuples

T3 = π[ORDDATE, ITEM#, QTY] (T2) (projection performed as the final step; no more tuple I/Os)
- 50 tuples (final result)

Total no. of tuple I/Os = 10000 reads + 100 reads = 10100 tuple I/Os

Comparison of the two Query Evaluation Methods

10,100 tuple I/Os (Method 2) v/s 3,000,000 tuple I/Os (Method 1)! Thus, by sequencing the operations differently, a dramatic difference can be made in query performance. Note that in Method 2 the first operation performed was a selection, which filters out 50 tuples from the 10,000 tuples in the ORD_ITEMS table and thus eliminates 9950 tuples. Elimination in the initial steps helps optimization.

Some more examples:

1. select CITY, COUNT(*) from CUSTTBL
   where CITY != 'BOMBAY'
   group by CITY;

v/s

select CITY, COUNT(*) from CUSTTBL
group by CITY
having CITY != 'BOMBAY';

Here the first version is faster: the WHERE clause eliminates the BOMBAY rows before the grouping is done, whereas the HAVING version groups all the rows first and discards the BOMBAY group afterwards.

2. select * from ORDTBL
   where to_char(ORDDATE,'dd-mm-yy') = '11-08-94';

v/s

select * from ORDTBL where ORDDATE = to_date('11-08-94', 'dd-mm-yy');

Here the second version is faster. In the first form of the query, the function to_char is applied on an attribute and hence needs to be evaluated for each tuple in the table; the time for this evaluation will thus be proportional to the cardinality of the relation.

In the second form, the function to_date is applied on a constant and hence needs to be evaluated just once, irrespective of the cardinality of the relation. Moreover, if the attribute ORDDATE is indexed, the index will not be used in the first case, since the attribute appears inside an expression and its value is not used directly.

6.3 The Query Optimization Process

The steps of query optimization are explained below.

a) Cast into some Internal Representation – This step involves translating each SQL query into some internal representation which is more suitable for machine manipulation. The internal form typically chosen is a query tree, as shown below.

Query Tree for the SELECT statement discussed above:
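The tree for the ORDTBL / ORD_ITEMS query of the evaluation example would look roughly as follows (a sketch; results flow from the leaves upward):

              π [ORDDATE, ITEM#, QTY]
                        |
    σ [ORDTBL.ORD# = ORD_ITEMS.ORD# AND ITEM# = 'HW3']
                        |
                        ×
                      /   \
                ORDTBL     ORD_ITEMS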

b) Convert to Canonical Form – In this second step, the optimizer makes use of some transformation laws or rules for sequencing the internal operations involved. Some examples are given below. (Note: in all these examples the second form will be more efficient irrespective of the actual data values and physical access paths that exist in the stored database.)

Rule 1: (A JOIN B) WHERE restriction_A AND restriction_B

→ (A WHERE restriction_A) JOIN (B WHERE restriction_B)

Restrictions, when applied first, cause eliminations and hence better performance.

Rule 2: (A WHERE restriction_1) WHERE restriction_2

→ A WHERE restriction_1 AND restriction_2

The two restrictions are applied as a single compound restriction instead of applying the two individual restrictions separately.

Rule 3: (A[projection_1])[projection_2]

→ A[projection_2]

If there is a sequence of successive projections applied on the same relation, all but the last one can be ignored, i.e., the entire operation is equivalent to applying the last projection alone.

Rule 4: (A[projection]) WHERE restriction

→ (A WHERE restriction)[projection]

Restrictions, when applied first, cause eliminations and hence better performance. Reference [1] gives more such general transformation laws.

c) Choose Candidate Low-level Procedures – In this step, the optimizer decides how to execute the transformed query. At this stage, factors such as the existence of indexes or other access paths, the physical clustering of records, the distribution of data values etc. are considered. The basic strategy is to consider the query expression as a set of low-level implementation procedures predefined for each operation. For example, there will be a set of procedures for implementing the restriction operation: one (say, procedure 'a') for the case where the restriction attribute is indexed, one (say, procedure 'b') where the restriction attribute is hashed, and so on. Each such procedure has an associated cost measure indicating its cost, typically in terms of disk I/Os. The optimizer chooses one or more candidate procedures for each low-level

operation in the query. The information about the current state of the database (existence of indexes, current cardinalities etc.), which is available from the system catalog, is used to make this choice of candidate procedures.

d) Generate Query Plans and Choose the Cheapest – In this last step, query plans are generated by combining sets of candidate implementation procedures. This can be explained with the following example (a trivial one, but illustrative enough). Assume that there is a query expression comprising a restriction, a join and a projection. Example implementation procedures available for each of these operations are given in the table below.

Operation     Condition Existing                                    Implementation Procedure
Restriction   Restriction attribute is indexed                      a
Restriction   Restriction attribute is hashed                       b
Restriction   Restriction attribute is neither indexed nor hashed   c
Join          -                                                     d
Join          -                                                     e
Projection    -                                                     f
Projection    -                                                     g

Now the various query plans for the original query expression can be generated by making permutations of the implementation procedures available for the different operations. Thus the query plans can be: adf, adg, aef, aeg, bdf, ... It has to be noted that, in reality, the number of possible query plans can be very large, and hence generating all such plans and then choosing the cheapest will be expensive in itself. Hence a heuristic reduction of the search space, rather than an exhaustive search, needs to be done. Considering the above example, one such heuristic method can be as follows: if the system knows that the restriction attribute is neither indexed nor hashed, then only the query plans involving implementation procedure 'c' (and not 'a' or 'b')

need to be considered, and the cheapest plan can be chosen from the reduced set of query plans.

6.4 Query Optimization in Oracle

Some of the query optimization measures used in Oracle are the following:

• Indexes are unnecessary for small tables: if the actual data record is not much larger than the index record, the search times in the index table and the data table will be comparable, and indexes will not make much difference to query performance.
• Use indexes/clusters when retrieving less than about 25% of the rows: when a larger fraction of the rows is retrieved, the overhead of searching the index file outweighs its benefit.
• In multiple-column WHERE clauses, the evaluations causing the largest number of eliminations should be performed first.
• JOIN columns should be indexed: JOIN columns and foreign key columns may be indexed, since queries based on these columns can be expected to be very frequent.
• Indexes are not used in queries containing NULL / NOT NULL on the indexed field: index tables do not hold NULL entries, so the index cannot be searched for them.
