DBMS Notes for Anna University
Velammal College of Engineering and Technology
Department of Computer Science and Engineering
Database Management Systems

UNIT-I
Introduction:
Definition: A DBMS contains information about a particular enterprise. A DBMS is a
• Collection of interrelated data and
• Set of programs to access those data
Primary Goal of a DBMS: to provide an Environment that is both convenient and efficient for people to use in retrieving and storing database information.
Advantages of a DBMS: Using a DBMS to manage data has many advantages:
a. Data Independence (the DBMS provides an abstract View of the data to insulate application code from internal details such as data representation and storage)
b. Redundancy Control (the DBMS controls the redundancy of persistent data)
c. Efficient Data Access (the DBMS uses sophisticated techniques to Store and Retrieve data efficiently)
d. Data Integrity and Security (the DBMS can enforce Integrity Constraints on the data and can enforce access controls that govern what data is visible to different classes of users)
e. Concurrent Access (the DBMS provides correct, concurrent access to data by multiple users)
f. Authorized Access (the DBMS provides a Security and Authorization Subsystem, which the DBA uses to create accounts and to specify Account restrictions)
g. Data Administration (when several users share the data, centralizing the Administration of the data offers significant improvements)
h. Reduced Application Development Time (the DBMS, in conjunction with the High-Level Interface to the data, facilitates Quick Development of Applications)
i. Backup and Crash Recovery (the DBMS provides facilities for Recovering from Hardware and Software Failures. The Backup and Recovery subsystem of the DBMS is responsible for Recovery)
j. Database Communication Interfaces (current DBMSs provide Special Communication Routines designed to allow the database to accept End User Requests within a Computer Network. For example, the DBMS provides communication functions to access the database through the Internet, using Internet Browsers such as Netscape or Internet Explorer as the Front End)
k. Flexibility
l. Availability of Up-to-Date Information to all Users
Disadvantages of DBMS:
When not to use a DBMS:
• High Initial Investment
  -- Cost of Hardware
  -- Cost of Software
  -- Cost of Training people to use the DBMS
  -- Cost of maintaining the DBMS
  -- May need additional hardware
  -- Overhead for providing security, recovery, integrity, and concurrency control
• When a DBMS may be unnecessary:
  -- If the database and application are simple, well defined, and not expected to change
  -- If there are stringent real-time requirements that may not be met because of DBMS overhead
  -- If access to data by multiple users is not required
Database Applications:
• Banking: all transactions
• Airlines: reservations, schedules
• Universities: registration, grades
• Sales: customers, products, purchases
• Online retailers: order tracking, customized recommendations
• Manufacturing: production, inventory, orders, supply chain
• Human resources: employee records, salaries, tax deductions
Purpose: In the early days, database applications were built directly on top of file systems.
Drawbacks of using file systems to store data:
• Data redundancy and inconsistency: multiple file formats, duplication of information in different files
• Difficulty in accessing data: need to write a new program to carry out each new task
• Data isolation — multiple files and formats
• Integrity problems: integrity constraints (e.g. account balance > 0) become “buried” in program code rather than being stated explicitly; hard to add new constraints or change existing ones
Earlier System: File Management System (FMS)
The FMS was the first method used to store data in a Computerized Database. The data items are stored SEQUENTIALLY in one large file, and the data items are accessed using Access Programs written in a Third Generation Language (3GL) such as BASIC, COBOL, FORTRAN, C or Visual Basic. A particular relationship cannot be drawn between the stored items other than the sequence in which they are stored. If a particular data item has to be located, the search starts at the beginning and items are checked sequentially till the required item is found. (This is SEQUENTIAL Search.)
Characteristics of FMS:
1. Sequential Record Access Method.
2. Meta-data is embedded into the programs accessing the files.
3. File Systems exhibit Structural Dependence (i.e. access to a file is dependent on its Structure (Fields)).
4. File Systems exhibit Data Dependence (i.e. a change in the data type of a file data item, such as from “int” to “float”, requires changes in all programs that access the file).
Disadvantages of File System: (Purpose/Need for Database Systems)
Drawbacks of using file systems to store data are:
1. Data Redundancy and Inconsistency:
In the File Processing System, the same information may appear in more than one file. For example, the column “LOCATION” appears in two files, so the same information is duplicated in several places (in both files). This occurrence of the same data column in several places (files) is called Data Redundancy, and it leads to higher storage and access cost and, mainly, to Data Inconsistency. Suppose Mr. Abraham is transferred from “Chennai” to “Madurai”; the job of the Data Processing Specialist is to update the “LOCATION” column in both files using the access programs, but unfortunately he updates the data item in the file “Employee’s Detail” only, leaving the file “Employee Location” un-updated. Now, generating a report yields inconsistent results depending on which version of the data is used. This is called Data Inconsistency, and in simple terms it is the lack of Data Integrity.
2. Difficulty in Accessing Data:
Suppose the HR of the company asks the Data Processing Department to generate the list of employees whose Postal Code is less than 66666666. Because this request was not anticipated when the original system was developed, there is no application program on hand to meet it, but there is an application program to generate the list of all employees with their Postal Code. So the HR has two options: either obtain the list of all employees and work through it manually to find the persons, or ask the DP Specialist to write the necessary application program. Both alternatives are obviously unsatisfactory. Suppose such a program is written, and several days later the same person needs to trim that list to include only those whose Salary is > $20000. As expected, a program to generate such a list does not exist. Again, the HR has the same two options, neither of which is satisfactory. There is a need to write a new program to carry out each new task. So conventional file processing environments do not allow needed data to be retrieved in a convenient and efficient manner.
3. Data Isolation:
Because data are scattered in various files, and files may be in different formats, it is difficult to write new application programs to retrieve the appropriate data.
4. Integrity Problems:
The data values stored in the database must satisfy certain types of Consistency Constraints. For example, the minimum salary of any person should never fall below a prescribed amount (say, $7/hr). Developers enforce these constraints in the system by adding appropriate code in the various application programs. However, when new constraints are added, it is difficult to change the programs to enforce them. The problem is compounded when constraints involve several data items from different files.
5. Atomicity of Updates:
A computer system, like any other mechanical or electrical device, is subject to failure. In many applications, it is crucial to ensure that, once a failure has occurred and has been detected, the data are restored to the consistent state that existed prior to the failure. Consider a program to change the salary of “Mr. Sudarshan” from Rs.20000 to Rs.22000. If a system failure occurs during the execution of the program, it is possible that the old value of Rs.20000 was removed from the Salary column of “Mr. Sudarshan” but the new value of Rs.22000 was not added, resulting in an inconsistent database state.
Clearly, it is essential that either both operations (removing and adding) occur, or that neither occurs. That is, the salary change must be Atomic—it must happen in its entirety or not at all. It is difficult to ensure this property in a conventional File Processing System. In simpler words, the change of salary should either complete or not happen at all.
6. Concurrent-Access Anomalies: (Concurrent access by multiple users)
Many systems allow multiple users to update the data simultaneously, so that the overall performance of the system is improved and a faster response time is possible.
Uncontrolled concurrent accesses can lead to inconsistencies. For example, two people reading a balance and updating it at the same time can leave the system inconsistent. /*Consider bank account A, containing $500. If two customers withdraw funds of $50 and $100 respectively from account A at the same time, the result of the concurrent executions may leave the account in an incorrect (or inconsistent) state. Suppose that the programs executing on behalf of each withdrawal read the old balance, reduce that value by the amount being withdrawn, and write the result back. If the two programs run concurrently, they may both read the value $500, and write back $450 and $400, respectively. Depending on which one writes the value last, the account may contain $450 or $400, rather than the correct value of $350. To guard against this possibility, the system must maintain some form of supervision. Because data may be accessed by many different application programs that have not been coordinated previously, supervision is difficult to provide.*/
7. Security Problems:
Not every user of the Database System should be able to access all the data. For example, in a University Database System, the students need to see only that part of the database that has information about the various courses and syllabi available to them for study. They do not need access to information about personal details of faculty members. Since application programs are added to the system in an Ad Hoc (unplanned, unprepared) manner, it is difficult to enforce such security constraints on files.
8. Cost:
Due to the redundancy, inconsistency, concurrent-access problems and low-level security offered by the FMS, higher cost is involved in every area, ranging from low programmer productivity to maintenance.
Database systems offer solutions to all the above problems. These difficulties prompted the development of DBMSs.
History of Database Systems
1950s and early 1960s:
• Data processing using magnetic tapes for storage; tapes provide only sequential access
• Punched cards for input
Late 1960s and 1970s:
• Hard disks allow direct access to data
• Network and hierarchical data models in widespread use
• Ted Codd defines the relational data model; he would win the ACM Turing Award for this work
• IBM Research begins the System R prototype; UC Berkeley begins the Ingres prototype
• High-performance (for the era) transaction processing
1980s:
• Research relational prototypes evolve into commercial systems; SQL becomes the industrial standard
• Parallel and distributed database systems
• Object-oriented database systems
1990s:
• Large decision support and data-mining applications
• Large multi-terabyte data warehouses
• Emergence of Web commerce
2000s:
• XML and XQuery standards
• Automated database administration
Levels of Abstraction
The major purpose of a database system is to provide users with an abstract view of the data.
• Physical level: describes how a record (e.g., customer) is stored.
• Logical level: describes what data are stored in the database, and what relationships exist among the data.
  type customer = record
        customer_id : string;
        customer_name : string;
        customer_street : string;
        customer_city : string;
  end;
• View level: describes only part of the entire database. Application programs hide details of data types. Views can also hide information (such as an employee’s salary) for security purposes.
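As an illustration of how the view level can hide information such as a salary, the sketch below defines a view over a hypothetical employee table; the table and column names are assumed for this example only.

  -- hypothetical employee table, used only for this illustration
  create table employee (
      employee_id   char(10),
      employee_name char(30),
      department    char(20),
      salary        integer,
      primary key (employee_id)
  );

  -- the view exposes everything except the salary column
  create view employee_public as
      select employee_id, employee_name, department
      from employee;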
VIEW OF DATA: (Three Levels of Abstraction) An architecture for a database system
INSTANCES AND SCHEMAS
Instance: The collection of information stored in the database at a particular moment is called an instance of the database.
Schema: The overall design of the database is called the database schema, i.e. the schema is the logical structure of the database.
Example: The database consists of information about a set of customers and accounts and the relationship between them.
A schema is analogous to the type information of a variable in a program.
Physical schema: specifies the database design at the physical level.
Logical schema: specifies the database design at the logical level.
Subschema: the different schemas at the view level are called subschemas.
Instance – the actual content of the database at a particular point in time; analogous to the value of a variable.
o The application programs depend on the logical schema, so the logical schema is the most important.
o Physical Data Independence is the ability to modify the physical schema without changing the logical schema.
o In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
DATA MODELS
Definition: A collection of conceptual tools for describing
• Data
• Data relationships
• Data semantics
• Data constraints
TYPES:
Relational model: The relational model uses a collection of tables to represent both data and the relationships among those data. It is an example of a record-based model. The relational model is at a lower level of abstraction than the E-R model.
Entity-Relationship data model (mainly for database design): it consists of a collection of basic objects called “Entities”, and the relationships among those entities.
Object-based data models (Object-oriented and Object-relational)
Advantages of the OO Data Model:
• Structural Independence and Data Independence
• Addition of semantic content to the data model gives the data greater meaning
• Easier to visualize more complex relationships within and between objects
• Database Integrity is protected by the use of Inheritance
Disadvantages of OO Data Model: • No Standard Data Access method • Implementation requires substantial Hardware and O.S Overhead • Difficult to use properly
Semistructured data model (XML)
Other older models:
o Network model
o Hierarchical model
Hierarchical Model: In this model, data items are represented using Records and the Parent/Child (1:N) Relationship is implemented using Links. Each parent may have many children, but each child has only one parent. Records are organized as a collection of Trees. This model supports only One-to-Many Relationships.
Advantages of Hierarchical Model: • Data Sharing • Data Independence • Parent/Child Relationship promotes Database Integrity • Efficient when dealing with a Large Database • Large Installed (Mainframe) Base • Abundant Business Applications
Disadvantages of Hierarchical Model: • Requirement of Knowledge of physical level of data storage • Complex and Inflexible to manage the database • Lacks Structural Independence • Time Consuming and Complicated Application Programming • Lack Ad Hoc Query capability for End Users • Extensive Programming Activity is required • No DDL or DML in DBMS
Network Model: In this model, data are represented as collections of records, and the Parent/Child Relationship is implemented using Links (Pointers) and Ring Structures. This model supports Many-to-Many Relationships. This model is also called the CODASYL or DBTG model. The records in the database are organized as collections of arbitrary graphs.
Advantages of Network Model:
• Easier M:N relationship Implementation
• Better Data Access compared to the Hierarchical Data Model
• Sufficient Data Independence
• Enforced Data Integrity
Disadvantages of Network Model: • Use of Pointers leads to Complexity in Structure. As a result of the increased Complexity, Mapping of related record becomes very difficult. • Difficult to make change in the database (Lacks Structural Independence) • Difficult to use properly
Database Languages:
Data Definition Language (DDL): Specification notation for defining the database schema.
Example:
  create table account
    (account-number char(10),
     balance integer)
The DDL compiler generates a set of tables stored in a data dictionary or data directory.
The data dictionary contains metadata (i.e., data about data):
• Database schema
• Data storage and definition language
    Specifies the storage structure and access methods used
• Integrity constraints
    Domain constraints
    Referential integrity (references constraint in SQL)
    Assertions
• Authorization
Data Manipulation Language (DML): DML is a language for accessing and manipulating the data organized by the appropriate data model. DML is also known as a query language.
Two classes of DMLs are:
• Procedural DMLs require a user to specify what data is required and how to get those data.
• Declarative (nonprocedural) DMLs require a user to specify what data is required without specifying how to get those data. Declarative DMLs are easier to learn and use.
• A query is a statement requesting the retrieval of information.
• SQL is the most widely used query language.
RELATIONAL MODEL:
Advantages of Relational Model:
• Data Independence and Structural Independence
• Easy to design the Database and to manage its contents
• Less Programming Effort required
• Powerful and flexible Query capability (SQL)
• Ad Hoc query capability
Disadvantages of Relational Model:
• It tends to be slower than other Database Systems
• RDBMS requires substantial Hardware and Operating System Overhead
Example of tabular data in the relational model
Sample Relational Database
SQL
SQL: widely used non-procedural language
Example 1: Find the name of the customer with customer-id 192-83-7465
  select customer.customer_name
  from customer
  where customer.customer_id = ‘192-83-7465’
Example 2: Find the balances of all accounts held by the customer with customer-id 192-83-7465
  select account.balance
  from depositor, account
  where depositor.customer_id = ‘192-83-7465’ and depositor.account_number = account.account_number
Application programs generally access databases through one of:
• Language extensions to allow embedded SQL
• Application program interfaces (e.g., ODBC/JDBC) which allow SQL queries to be sent to a database
Database Design: The process of designing the general structure of the database: Logical Design – Deciding on the database schema. Database design requires that we find a “good” collection of relation schemas. Business decision – What attributes should we record in the database? Computer Science decision – What relation schemas should we have and how should the attributes be distributed among the various relation schemas? Physical Design – Deciding on the physical layout of the database ER Model: It Models an enterprise as a collection of entities and relationships Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects. It is described by a set of attributes. Relationship: an association among several entities Represented diagrammatically by an entity-relationship diagram:
Advantages of ER Model:
o Structural Independence and Data Independence
o The ER model gives Database Designers, Programmers, and end users an easily understood Visual Representation of the data and the data relationships.
o The ER model is well integrated with the Relational database model.
o Visual modeling helps in conceptual simplicity.
Disadvantages of ER Model:
o No DML
o Limited Constraint Representation
o Limited Relationship Representation
o Loss of Information Content because attributes are usually removed to avoid crowded displays
Object-Relational Data Models
Extend the relational data model by including object orientation and constructs to deal with added data types.
Allow attributes of tuples to have complex types, including non-atomic values such as nested relations.
Preserve relational foundations, in particular the declarative access to data, while extending modeling power.
Provide upward compatibility with existing relational languages.
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Originally intended as a document markup language, not a database language
The ability to specify new tags, and to create nested tag structures, made XML a great way to exchange data, not just documents
XML has become the basis for all new-generation data interchange formats.
A wide variety of tools is available for parsing, browsing and querying XML documents/data.
Database Architecture
Database Users and Administrators:
The people who work with the database can be categorized as database users or database administrators.
Database Users:
Users are differentiated by the way they expect to interact with the system.
• Application programmers – interact with the system through DML calls
• Sophisticated users – form requests in a database query language
• Specialized users – write specialized database applications that do not fit into the traditional data processing framework
• Naïve users – invoke one of the permanent application programs that have been written previously
  Examples: people accessing the database over the web, bank tellers, clerical staff
Database Administrator
Coordinates all the activities of the database system; the database administrator has a good understanding of the enterprise’s information resources and needs.
Functions of the DBA:
• Schema Definition: The DBA creates the original database schema by writing a set of definitions that is translated by the DDL Compiler to a set of tables that is stored permanently in the Data Dictionary.
• Storage Structure and Access-Method Definition: The DBA creates appropriate storage structures and access methods by writing a set of definitions, which is translated by the Data-Storage and DDL Compiler.
• Schema and Physical-Organization Modification: Programmers accomplish the relatively rare modifications either to the database Schema or to the description of the Physical Storage Organization by writing a set of definitions that is used by either the DDL Compiler or the Data-Storage and DDL Compiler to generate modifications to the appropriate Internal System Tables (e.g. the Data Dictionary).
• Granting of Authorization for Data Access: The granting of different types of Authorization allows the DBA to regulate which parts of the Database various Users can access. The Authorization information is kept in a special system structure that is consulted by the DB System whenever access to the data is attempted in the system.
• Integrity-Constraint Specification: The data values stored in the database must satisfy certain Consistency Constraints. For example, the Salary of any employee (Programmer) in an organization should never fall below some limit (say, $4000 / month). This constraint must be specified explicitly by the DBA, and these Integrity Constraints are kept in a special system structure that is consulted by the DB System whenever an update takes place in the system.
• Routine Maintenance: Examples of the DBA’s routine maintenance activities are:
  o Periodically backing up the Database, either onto Tapes or onto Remote servers, to prevent Loss of Data in case of Disasters such as Flooding.
  o Ensuring that enough free Disk Space is available for normal operations, and upgrading disk space as required.
  o Monitoring jobs running on the Database and ensuring that performance is not degraded by very expensive tasks submitted by some users.
Transaction management:
A transaction is a collection of operations that performs a single logical function in a database application.
The Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.
The Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.
The architecture of database systems is greatly influenced by the underlying computer system on which the database is running:
• Centralized
• Client-server
• Parallel (multi-processor)
• Distributed
System Structure/Architecture of a Database System (DBMS Components):
Usually, a Database System is partitioned into various functional modules. Some of the functions of the Database System are provided by the Operating System, so the design of the Database System must include consideration of the interface between the Database System and the OS. The functional components of the Database System are:
1. Query Processor
2. Storage Manager
1. Query Processor: This module contains the following components:
a. DML Compiler: This component translates the DML statements in a Query Language into the Low-Level instructions that the Query Evaluation Engine understands. It also attempts to transform a user’s request into an equivalent but more efficient form, thus finding a good strategy for executing the query (Query Optimization).
b. Embedded DML Precompiler: This component converts DML statements embedded in an Application Program into normal Procedure Calls in the Host Language. It interacts with the DML Compiler to generate the appropriate code.
c. DDL Interpreter: This component interprets DDL statements and records them in a set of tables containing Metadata (the Data Dictionary).
d. Query Evaluation Engine: This component executes Low-Level Instructions generated by the DML Compiler.
2. Storage Manager: This is a program module that provides the interface between the Low-Level data stored in the database and the application programs and queries submitted to the system. It is responsible for the interaction with the File Manager. The raw data are stored on the disk using the File System, which is provided by the OS. The Storage Manager translates the various DML statements into Low-Level File-System commands. Thus, the Storage Manager is responsible for storing, retrieving, and updating data in the database. The components present in this module are:
a. Authorization and Integrity Manager: This component tests for the satisfaction of the Integrity Constraints and checks the authority of the users to access the data.
b. Transaction Manager: This component ensures that the database remains in a consistent (correct) state despite System Failures, and that Concurrent Transaction executions proceed without conflicting.
c. File Manager: This component manages the allocation of space on the Disk Storage and the Data Structures used to represent information stored on the Disk.
d. Buffer Manager: This component is responsible for fetching data from the Disk Storage into Main Memory, and deciding what data to cache in Memory. This is a critical part of the database, since it enables the database to handle data sizes that are much larger than the size of Main Memory.
Overall System Structure
Data Structures used by the Storage Manager for the Physical System Implementation of the Database System:
1. Data Files: store the Database itself.
2. Data Dictionary: stores the Metadata about the Structure of the database, in particular the Schema of the Database. Since it is heavily used, great emphasis should be placed on developing a good Design and Implementation of it.
3. Indices: provide fast access to data items that hold particular values.
4. Statistical Data: stores Statistical Information about the data in the Database. This information is used by the Query Processor to select efficient ways to execute a Query.
Data Dictionary: It is a data structure used by the Storage Manager to store Metadata (data about the data), that is, the Structure of the database, in particular the Schema of the Database. Since it is heavily used, great emphasis should be placed on developing a good Design and Implementation of it.
E-R Model: It models an enterprise as a collection of entities and relationships.
Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects. It is described by a set of attributes.
Relationship: an association among several entities.
Entity Sets
• An entity is an object that exists and is distinguishable from other objects.
  Example: specific person, company, event, plant
• Entities have attributes. Example: people have names and addresses.
• An entity set is a set of entities of the same type that share the same properties. Example: set of all persons, companies, trees, holidays.
• An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
• Domain – Each entity has a set of values for each of its attributes. The set of permitted values for each attribute is known as the domain.
Attribute types (in the E-R model):
• Simple and composite attributes:
    Simple attributes: attributes which cannot be divided into subparts. Ex. roll number
    Composite attributes: attributes which can be divided into subparts. Ex. address (door number, street name, city, state, etc.)
• Single-valued and multi-valued attributes:
    Single-valued attributes: attributes having a single value for a particular entity. Ex. loan number
    Multi-valued attributes: attributes having more than one value. Ex. a person may have more than one phone number
• Derived attributes: the value of this type of attribute can be computed from other attributes. E.g. age can be computed from the date of birth.
An attribute may take null values when the values are not known for the entities.
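The sketch below shows, under assumed table and column names, how these attribute types are commonly mapped to a relational schema: a composite attribute is flattened into several columns, a multi-valued attribute moves to a separate table, and a derived attribute is computed on demand rather than stored.

  -- hypothetical person table; all names here are illustrative only
  create table person (
      roll_number   char(10),   -- simple, single-valued attribute
      door_number   char(10),   -- composite attribute "address"
      street_name   char(30),   --   flattened into its component
      city          char(30),   --   columns
      date_of_birth date,
      primary key (roll_number)
  );

  -- a multi-valued attribute (phone numbers) stored in its own table
  create table person_phone (
      roll_number  char(10),
      phone_number char(15),
      primary key (roll_number, phone_number)
  );

  -- a derived attribute (age) computed from date_of_birth when queried
  select roll_number,
         extract(year from current_date) - extract(year from date_of_birth) as approximate_age
  from person;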
Relationship Sets
■ A relationship is an association among several entities.
  Example: Hayes (customer entity), depositor (relationship set), A-102 (account entity)
■ A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
  {(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
  where (e1, e2, …, en) is a relationship.
  ★ Example: (Hayes, A-102) ∈ depositor
Relational Model
Basic Structure:
Formally, given sets D1, D2, …, Dn, a relation r is a subset of D1 x D2 x … x Dn.
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di.
Example: If
  customer_name = {Jones, Smith, Curry, Lindsay}
  customer_street = {Main, North, Park}
  customer_city = {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
           (Smith, North, Rye),
           (Curry, North, Rye),
           (Lindsay, Park, Pittsfield) }
is a relation over customer_name x customer_street x customer_city.
Attribute Types:
• Each attribute of a relation has a name.
• The set of allowed values for each attribute is called the domain of the attribute.
• Attribute values are (normally) required to be atomic; that is, indivisible.
    Note: multivalued attribute values are not atomic
    Note: composite attribute values are not atomic
• The special value null is a member of every domain.
• The null value causes complications in the definition of many operations.
Relation Schema
A1, A2, …, An are attributes
R = (A1, A2, …, An) is a relation schema
  Example: Customer_schema = (customer_name, customer_street, customer_city)
r(R) is a relation on the relation schema R
  Example: customer (Customer_schema)
Relation Instance
The current values (relation instance) of a relation are specified by a table.
An element t of r is a tuple, represented by a row in a table.
Keys
Let K ⊆ R
K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)
• by “possible r” we mean a relation r that could exist in the enterprise we are modeling.
Example: {customer_name, customer_street} and {customer_name} are both superkeys of Customer, if no two customers can possibly have the same name.
K is a candidate key if K is minimal.
Example: {customer_name} is a candidate key for Customer, since it is a superkey (assuming no two customers can possibly have the same name), and no subset of it is a superkey.
Primary Key: a candidate key chosen by the database designer as the principal means of identifying tuples within a relation.
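In SQL, the chosen primary key is recorded in the table definition; a minimal sketch for the Customer_schema above (the column types are assumed for illustration):

  create table customer (
      customer_name   char(20),
      customer_street char(30),
      customer_city   char(30),
      primary key (customer_name)   -- {customer_name} chosen as the primary key
  );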
Query Languages
Language in which a user requests information from the database.
Categories of languages:
  Procedural
  Non-procedural, or declarative
“Pure” languages:
  Relational algebra
  Tuple relational calculus
  Domain relational calculus
Pure languages form the underlying basis of query languages that people use.
Relational Algebra:
Procedural language
Six basic operators
• select: σ
• project: ∏
• union: ∪
• set difference: –
• Cartesian product: x
• rename: ρ
The operators take one or two relations as inputs and produce a new relation as a result.

Select Operation – Example
Relation r:
  A   B   C   D
  α   α   1   7
  α   β   5   7
  β   β   12  3
  β   β   23  10

σ A=B ∧ D>5 (r):
  A   B   C   D
  α   α   1   7
  β   β   23  10
Select Operation:
Notation: σp(r)
p is called the selection predicate
Defined as: σp(r) = {t | t ∈ r and p(t)}
where p is a formula in propositional calculus consisting of terms connected by: ∧ (and), ∨ (or), ¬ (not).
Each term is one of: <attribute> op <attribute> or <attribute> op <constant>
where op is one of: =, ≠, >, ≥, <, ≤
Example queries:
• Find all loans of over $1200:
    σ amount > 1200 (loan)
• Find the names of all customers who have a loan, an account, or both, from the bank:
    ∏customer_name (borrower) ∪ ∏customer_name (depositor)
• Find the names of all customers who have a loan and an account at the bank:
    ∏customer_name (borrower) ∩ ∏customer_name (depositor)
• Find the names of all customers who have a loan at the Perryridge branch.
    ∏customer_name (σbranch_name = “Perryridge” (σborrower.loan_number = loan.loan_number (borrower x loan)))
• Find the names of all customers who have a loan at the Perryridge branch but do not have an account at any branch of the bank.
    ∏customer_name (σbranch_name = “Perryridge” (σborrower.loan_number = loan.loan_number (borrower x loan))) – ∏customer_name (depositor)
• Find the names of all customers who have a loan at the Perryridge branch.
    Query 1: ∏customer_name (σbranch_name = “Perryridge” (σborrower.loan_number = loan.loan_number (borrower x loan)))
    Query 2: ∏customer_name (σloan.loan_number = borrower.loan_number ((σbranch_name = “Perryridge” (loan)) x borrower))
• Find the largest account balance
  Strategy:
    o Find those balances that are not the largest
    o Rename the account relation as d so that we can compare each account balance with all others
    o Use set difference to find those account balances that were not found in the earlier step.
  The query is:
    ∏balance (account) – ∏account.balance (σaccount.balance < d.balance (account x ρd (account)))
Formal Definition:
• A basic expression in the relational algebra consists of either one of the following:
    A relation in the database
    A constant relation
• Let E1 and E2 be relational-algebra expressions; the following are all relational-algebra expressions:
    E1 ∪ E2
    E1 – E2
    E1 x E2
    σp (E1), where p is a predicate on attributes in E1
    ∏s (E1), where S is a list consisting of some of the attributes in E1
    ρx (E1), where x is the new name for the result of E1
Additional Operations:
We define additional operations that do not add any power to the relational algebra, but that simplify common queries.
  Set intersection
  Natural join
  Division
  Assignment
Set-Intersection Operation
Notation: r ∩ s
Defined as: r ∩ s = { t | t ∈ r and t ∈ s }
Assume:
  o r, s have the same arity
  o attributes of r and s are compatible
Note: r ∩ s = r – (r – s)
Example:
Relation r:        Relation s:
  A   B              A   B
  α   1              α   2
  α   2              β   3
  β   1

r ∩ s:
  A   B
  α   2
Natural-Join Operation:
Notation: r ⋈ s
Let r and s be relations on schemas R and S respectively. Then r ⋈ s is a relation on schema R ∪ S obtained as follows:
  o Consider each pair of tuples tr from r and ts from s.
  o If tr and ts have the same value on each of the attributes in R ∩ S, add a tuple t to the result, where
      t has the same value as tr on r
      t has the same value as ts on s
Example:
  R = (A, B, C, D)
  S = (E, B, D)
  Result schema = (A, B, C, D, E)
  r ⋈ s is defined as:
  ∏r.A, r.B, r.C, r.D, s.E (σr.B = s.B ∧ r.D = s.D (r x s))
Relation r:
  A   B   C   D
  α   1   α   a
  β   2   γ   a
  γ   4   β   b
  α   1   γ   a
  δ   2   β   b

Relation s:
  B   D   E
  1   a   α
  3   a   β
  1   a   γ

r ⋈ s:
  A   B   C   D   E
  α   1   α   a   α
  α   1   α   a   γ
  α   1   γ   a   α
  α   1   γ   a   γ
  δ   2   β   b   δ
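In SQL, the same operation is expressed with the natural join construct; a minimal sketch over the bank schema used elsewhere in these notes (borrower and loan, which share the attribute loan_number):

  -- names of all customers together with the branch of each loan they hold;
  -- borrower and loan join naturally on their common attribute loan_number
  select customer_name, branch_name
  from borrower natural join loan;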
Division Operation
• Notation: r ÷ s
• Suited to queries that include the phrase “for all”.
• Let r and s be relations on schemas R and S respectively where
    R = (A1, …, Am, B1, …, Bn)
    S = (B1, …, Bn)
  The result of r ÷ s is a relation on schema R – S = (A1, …, Am)
  r ÷ s = { t | t ∈ ∏R-S (r) ∧ ∀ u ∈ s ( tu ∈ r ) }
  where tu means the concatenation of tuples t and u to produce a single tuple
Example:
Relation r:        Relation s:
  A   B              B
  α   1              1
  α   2              2
  α   3
  β   1
  γ   1
  δ   1
  δ   3
  δ   4
  ε   6
  ε   1
  β   2

r ÷ s:
  A
  α
  β
Properties:
• Property: Let q = r ÷ s. Then q is the largest relation satisfying q x s ⊆ r.
• Definition in terms of the basic algebra operations:
  Let r(R) and s(S) be relations, and let S ⊆ R
  r ÷ s = ∏R-S (r) – ∏R-S ( ( ∏R-S (r) x s ) – ∏R-S,S (r) )
  To see why:
    ∏R-S,S (r) simply reorders attributes of r
    ∏R-S ( ( ∏R-S (r) x s ) – ∏R-S,S (r) ) gives those tuples t in ∏R-S (r) such that for some tuple u ∈ s, tu ∉ r.
Assignment Operation
The assignment operation (←) provides a convenient way to express complex queries. Write query as a sequential program consisting of a series of assignments followed by an expression whose value is displayed as a result of the query. Assignment must always be made to a temporary relation variable.
Example: Write r ÷ s as
  temp1 ← ∏R-S (r)
  temp2 ← ∏R-S ((temp1 x s) – ∏R-S,S (r))
  result = temp1 – temp2
o The result to the right of the ← is assigned to the relation variable on the left of the ←.
o May use variable in subsequent expressions.
EXTENDED RELATIONAL-ALGEBRA OPERATIONS
  Generalized Projection
  Aggregate Functions
  Outer Join
Generalized Projection
Extends the projection operation by allowing arithmetic functions to be used in the projection list.
  ∏ F1, F2, ..., Fn (E)
E is any relational-algebra expression.
Each of F1, F2, …, Fn are arithmetic expressions involving constants and attributes in the schema of E.
Given relation credit_info(customer_name, limit, credit_balance), find how much more each person can spend:
  ∏ customer_name, limit – credit_balance (credit_info)
Aggregate Functions and Operations
• An aggregation function takes a collection of values and returns a single value as a result.
    avg: average value
    min: minimum value
    max: maximum value
    sum: sum of values
    count: number of values
• Aggregate operation in relational algebra:
    G1, G2, …, Gn g F1(A1), F2(A2), …, Fn(An) (E)
    E is any relational-algebra expression
    G1, G2, …, Gn is a list of attributes on which to group (can be empty)
    Each Fi is an aggregate function
    Each Ai is an attribute name
Example: g sum(c) (r)
Relation r:
  A   B   C
  α   α   7
  α   β   7
  β   β   3
  β   β   10

g sum(c) (r):
  sum(c)
  27

• The result of aggregation does not have a name.
  Can use the rename operation to give it a name.
  For convenience, we permit renaming as part of the aggregate operation:
    branch_name g sum(balance) as sum_balance (account)
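The renamed aggregate above corresponds to a SQL group-by query over the account relation; a minimal sketch:

  -- total balance held at each branch
  select branch_name, sum(balance) as sum_balance
  from account
  group by branch_name;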
Outer Join
• An extension of the join operation that avoids loss of information.
• Computes the join and then adds tuples from one relation that do not match tuples in the other relation to the result of the join.
• Uses null values:
    null signifies that the value is unknown or does not exist
    All comparisons involving null are (roughly speaking) false by definition.
    We shall study the precise meaning of comparisons with nulls later.

Relation loan:
  loan_number  branch_name  amount
  L-170        Downtown     3000
  L-230        Redwood      4000
  L-260        Perryridge   1700

Relation borrower:
  customer_name  loan_number
  Jones          L-170
  Smith          L-230
  Hayes          L-155

Inner join: loan ⋈ borrower
  loan_number  branch_name  amount  customer_name
  L-170        Downtown     3000    Jones
  L-230        Redwood      4000    Smith

Left outer join: loan ⟕ borrower
  loan_number  branch_name  amount  customer_name
  L-170        Downtown     3000    Jones
  L-230        Redwood      4000    Smith
  L-260        Perryridge   1700    null
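In SQL the same result is obtained with a left outer join; a sketch over the loan and borrower relations shown above:

  -- every loan appears once; customer_name is null where no borrower matches
  select loan.loan_number, branch_name, amount, customer_name
  from loan left outer join borrower
       on loan.loan_number = borrower.loan_number;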
Null Values
It is possible for tuples to have a null value, denoted by null, for some of their attributes.
null signifies an unknown value or that a value does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values (as in SQL).
For duplicate elimination and grouping, null is treated like any other value, and two nulls are assumed to be the same (as in SQL).
Comparisons with null values return the special truth value: unknown
  o If false was used instead of unknown, then not (A < 5) would not be equivalent to A >= 5
Three-valued logic using the truth value unknown:
  o OR: (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown
  o AND: (true and unknown) = unknown, (false and unknown) = false, (unknown and unknown) = unknown
  o NOT: (not unknown) = unknown
  o In SQL “P is unknown” evaluates to true if predicate P evaluates to unknown
The result of a select predicate is treated as false if it evaluates to unknown.
Modification of the Database
The content of the database may be modified using the following operations:
  Deletion
  Insertion
  Updating
All these operations are expressed using the assignment operator.
Deletion
A delete request is expressed similarly to a query, except instead of displaying tuples to the user, the selected tuples are removed from the database.
Can delete only whole tuples; cannot delete values of only particular attributes.
A deletion is expressed in relational algebra by:
  r ← r – E
where r is a relation and E is a relational algebra query.
Example:
Delete all account records in the Perryridge branch.
  account ← account – σ branch_name = “Perryridge” (account)
Delete all loan records with amount in the range of 0 to 50.
  loan ← loan – σ amount ≥ 0 and amount ≤ 50 (loan)
Delete all accounts at branches located in Needham.
  r1 ← σ branch_city = “Needham” (account ⋈ branch)
  r2 ← ∏ branch_name, account_number, balance (r1)
  r3 ← ∏ customer_name, account_number (r2 ⋈ depositor)
  account ← account – r2
  depositor ← depositor – r3
Insertion
To insert data into a relation, we either:
  o specify a tuple to be inserted, or
  o write a query whose result is a set of tuples to be inserted
In relational algebra, an insertion is expressed by:
  r ← r ∪ E
where r is a relation and E is a relational algebra expression.
The insertion of a single tuple is expressed by letting E be a constant relation containing one tuple.
Example: Insert information in the database specifying that Smith has $1200 in account A-973 at the Perryridge branch.
  account ← account ∪ {(“Perryridge”, A-973, 1200)}
  depositor ← depositor ∪ {(“Smith”, A-973)}
Updating
A mechanism to change a value in a tuple without changing all values in the tuple.
Use the generalized projection operator to do this task.
Each Fi is either
  o the i-th attribute of r, if the i-th attribute is not updated, or,
  o if the attribute is to be updated, Fi is an expression, involving only constants and the attributes of r, which gives the new value for the attribute
Example: Make interest payments by increasing all balances by 5 percent.
  account ← ∏ account_number, branch_name, balance * 1.05 (account)
Relational Calculus: It is a Non-Procedural Query Language associated with the Relational Model It is a formal Query Language where we write declarative expression to specify a retrieval request and hence there is no description of how to evaluate a Query. So it is also called as Declarative Query Language A Calculus Expression specifies what is to be retrieved rather than how to retrieve it. The Relational Calculus had a big influence in the Design of Commercial Query Languages such as SQL and, especially Query-by-Example (QBE) Types of Relational Calculus: o Tuple Relational Calculus (TRC) o Domain Relational Calculus (DRC)
Tuple Relational Calculus (TRC):
It is a Non-Procedural/Declarative Query Language.
Variables in TRC take on tuples as values.
TRC has had more influence on SQL.
A query in TRC is expressed as { t | P(t) }
  where t is a tuple variable and P(t) is a predicate calculus formula/condition.
A tuple variable can range over a particular database relation, i.e. it can take any individual tuple from that relation as its value.
The result of the query is the set of all tuples t that satisfy the condition/formula P(t).
Domain Relational Calculus (DRC):
It is a Non-Procedural/Declarative Query Language.
This method uses domain variables.
Domain variables take values from an attribute’s domain, rather than values for an entire tuple.
DRC has had more influence on QBE.
An expression in DRC has the general form
  { < x1, x2, x3, …, xn > | P(x1, x2, x3, …, xn) }
where x1, x2, x3, …, xn are domain variables ranging over domains of attributes and P is a formula/condition.
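For instance, the earlier request “find all loans of over $1200” over the relation loan(loan_number, branch_name, amount) could be written, roughly, in each calculus as follows (one common notation; variable names are illustrative):

  TRC: { t | t ∈ loan ∧ t[amount] > 1200 }
  DRC: { < l, b, a > | < l, b, a > ∈ loan ∧ a > 1200 }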
UNIT-II
SQL:
Introduction:
SQL is the most widely used commercial and standard relational database language.
It was originally developed at IBM’s San Jose Research Laboratory.
The language was originally called SEQUEL and was implemented as part of the System R project (1974—1977).
Other vendors later introduced DBMS products based on SQL, and it is now a de facto standard.
SEQUEL was later renamed SQL (Structured Query Language).
In 1986, ANSI (American National Standards Institute) and ISO (International Standards Organization) published an SQL standard, called SQL-86.
IBM published its own corporate SQL standard, the Systems Application Architecture Database Interface (SAA-SQL), in 1987.
In 1989, an extended standard for SQL was published, SQL-89, and the database systems available today support at least the features of SQL-89.
In 1992, ANSI/ISO proposed a new SQL standard, namely SQL-92.
The current standard is SQL:1999 (SQL99), in which the “Object Relational” concepts have been added.
The foreseen standard is SQL:200x, which is in draft form.
SQL can either be used through a command-line tool or be embedded into a general-purpose programming language such as COBOL, C, Pascal, etc.
Parts of the SQL Language:
The SQL language has 7 parts:
1. Data Definition Language (DDL): It provides commands for defining relation schemas, deleting relations, creating indices, and modifying relation schemas.
2. Interactive Data-Manipulation Language (DML): It includes a query language based on both the relational algebra and the tuple relational calculus. It includes commands to insert tuples into, delete tuples from, and modify tuples in the database.
3. Embedded DML: This form of SQL is designed for use within general-purpose programming languages, such as PL/I, COBOL, PASCAL, FORTRAN, and C.
4. View Definition: The SQL DDL includes commands for defining views.
5. Authorization: The SQL DDL includes commands for specifying access rights to relations and views.
6. Integrity: The SQL DDL includes commands for specifying integrity constraints that the data stored in the database must satisfy. Updates that violate the integrity constraints are disallowed.
7. Transaction Control: SQL includes commands for specifying the beginning and ending of transactions.
Basic Structure of SQL:
The basic structure of an SQL expression consists of three clauses: select, from and where.
The select clause corresponds to the Projection operation of the Relational Algebra. It is used to list the desired attributes in the result relation.
The from clause corresponds to the Cartesian Product operation of the Relational Algebra. It lists the relations to be scanned in the evaluation of the expression.
The where clause corresponds to the Selection Predicate of the Relational Algebra. It consists of a predicate involving attributes of the relations that appear in the from clause.
A typical SQL query has the form:
  select A1, A2, A3, …, An
  from r1, r2, …, rm
  where P
where
  A1, A2, A3, …, An – attributes
  r1, r2, …, rm – relations
  P – predicate
The Relational Algebra equivalent of the above query is:
  ∏A1, A2, A3, …, An (σP (r1 × r2 × … × rm))
The SELECT Clause:
1. Write the Description
2. Write the Syntax
3. Explain the use of the distinct, *, and all keywords
4. Provide at least 4 example Select Queries
The WHERE Clause:
1. Write the Description
2. Write the Syntax (select A1, A2, …, An from r1, r2, …, rm where P)
3. List the Logical Connectives that can be used in the where clause
4. Provide at least 3 example queries that use Logical Connectives
5. Explain the use of the between and and keywords
6. Provide at least 3 example queries that use the between and and keywords
7. Explain the use of the not between keywords
8. Provide at least 2 example queries that use the not between clause
The FROM Clause:
1. Write the Description
2. Write the Syntax (select A1, A2, …, An from r1, r2, …, rm where P)
3. Provide at least 3 example queries that involve a single relation/table only
4. Provide at least 3 example queries that involve two relations/tables
CREATE TABLE:
An SQL relation is defined using the create table command:
  create table r (A1 D1, A2 D2, ..., An Dn,
                  (integrity-constraint1),
                  ...,
                  (integrity-constraintk))
• r is the name of the relation
• each Ai is an attribute name in the schema of relation r
• Di is the data type of values in the domain of attribute Ai
Example:
  create table branch
    (branch_name char(15) not null,
     branch_city char(30),
     assets integer)
Example: Declare branch_name as the primary key for branch and ensure that the values of assets are non-negative.
  create table branch
    (branch_name char(15),
     branch_city char(30),
     assets integer,
     primary key (branch_name))
DROP AND ALTER TABLE:
The drop table command deletes all information about the dropped relation from the database.
The alter table command is used to add attributes to an existing relation:
  alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain of A.
The alter table command can also be used to drop attributes of a relation:
  alter table r drop A
where A is the name of an attribute of relation r.
Rename Operation:
SQL provides a mechanism for renaming both relations and attributes.
General form: oldname as newname
The as clause can appear in both the select and from clauses.
Provide at least 3 example queries.
String Operations:
Some of the pattern-matching operators available in SQL are:
a. like – a comparison operator
b. Percent (%) – the % character matches any substring
c. Underscore (_) – the _ character matches any character
Provide at least 3 example queries.
SQL allows us to search for mismatches instead of matches by using the not like operator.
SQL also permits a variety of functions on character strings, such as concatenation (||), extracting substrings, finding the length of strings, and converting between uppercase and lowercase.
Provide at least 3 example queries using string functions (which we saw in the CS235 lab).
Tuple Variables:
Tuple variables are defined in the from clause via the use of the as clause.
Example: Find the customer names and their loan numbers for all customers having a loan at some branch.
  select customer_name, T.loan_number, S.amount
  from borrower as T, loan as S
  where T.loan_number = S.loan_number
Ordering the Display of Tuples:
Tuples present in the result relation can be sorted.
SQL provides the order by clause to arrange the tuples of the result relation in sorted order.
Provide at least 3 example queries that use the order by clause for sorting in ascending order.
Explain the use of the desc keyword.
Provide at least 3 example queries which use order by with desc. A couple of illustrative sketches follow.
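As an illustration of the like operator and the order by clause, the following sketches use the customer and loan relations from the earlier examples:

  -- customers whose street name contains the substring 'Main'
  select customer_name
  from customer
  where customer_street like '%Main%';

  -- all loans listed in descending order of amount,
  -- with ties broken by loan_number in ascending order
  select *
  from loan
  order by amount desc, loan_number asc;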
Set Operations:
1. UNION operation
   a. UNION syntax
   b. UNION ALL syntax
   Example: Find all customers who have a loan, an account, or both:
     (select customer_name from depositor)
     union
     (select customer_name from borrower)
2. INTERSECT operation
   a. INTERSECT syntax
   b. INTERSECT ALL syntax
   Example: Find all customers who have both a loan and an account:
     (select customer_name from depositor)
     intersect
     (select customer_name from borrower)
3. EXCEPT operation
   a. EXCEPT syntax
   b. EXCEPT ALL syntax
   Example: Find all customers who have an account but no loan:
     (select customer_name from depositor)
     except
     (select customer_name from borrower)
Aggregate Operations:
The aggregation functions are
1. Average: avg
2. Minimum: min
3. Maximum: max
4. Total: sum
5. Count: count
Views:
Views are “virtual relations” defined by a query expression.
Any relation that is not part of the logical model but is made visible to a user as a virtual relation is called a View.
Views are a useful mechanism for simplifying database queries, but modification of the database through views may have potentially disadvantageous consequences.
Why do we need a View?
• It is not desirable for all users to see the entire logical model; certain data has to be hidden from users for security reasons.
• To create a personalized collection of relations that is matched with a user’s intuition rather than the entire logical model.
View definition:
We can define/create a view using the create view statement.
Syntax: create view <view name> as <query expression>
where <query expression> is any legal relational algebra expression.
The view name can be used anywhere a relation name is allowed.
Example 1: create view vstud as Пrollno,sname (Stud)
Example 2: create view vstudfees as Пrollno,sname,(total—paid) as feebalance (Studfees)
Example 3: create view vstudmarks as Stud ⋈ Mark
Example 4: create view vstudperc as Пrollno,sname (σpcen>85 (StudPercent))
Consider the following view definition:
  create view vstudper as Пrollno,pcen (StudPercent)
Here, if there is any modification (insertion, deletion or update) in the relation StudPercent, then the set of tuples in the view vstudper also changes. So at any given time, the set of tuples in the view relation is defined as the result of evaluating the query expression that defines the view at that time.
Updates through Views and Null Values:
Although views are a useful tool for queries, they present significant problems if updates, insertions, or deletions are expressed with them.
The difficulty is that a modification to the database expressed in terms of a view must be translated to a modification of the actual relations in the logical model of the database.
Consider the relation studfees(rollno, name, feebal) and the following view definition:
  create view vstudfees as Пrollno,name (Studfees)
Suppose we plan to insert the following tuple into the view vstudfees:
  vstudfees ← vstudfees ∪ { (100190, “kumar”) }
This insertion must also take place in the relation studfees, since the view vstudfees is constructed from this relation. But to insert a tuple into the original relation, we need a value for feebal. There are two approaches to deal with the insertion:
  o Reject the insertion, and return an error message to the user.
  o Insert a tuple (100190, “kumar”, null) into the relation studfees.
Due to these problems, modifications are generally not permitted on views, except in limited cases.
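The same situation expressed in SQL (a sketch; the studfees table and its column types are assumed as named above):

  create table studfees (
      rollno  char(10),
      name    char(30),
      feebal  integer
  );

  create view vstudfees as
      select rollno, name
      from studfees;

  -- inserting through the view supplies no value for feebal;
  -- a DBMS that accepts this insert stores null for feebal
  insert into vstudfees values ('100190', 'kumar');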
Null Values:
• It is possible for tuples to have a null value, denoted by null, for some of their attributes.
• null signifies an unknown value or that a value does not exist.
• The predicate is null can be used to check for null values.
  o Example: Find all loan numbers which appear in the loan relation with null values for amount.
      select loan_number
      from loan
      where amount is null
• The result of any arithmetic expression involving null is null.
  o Example: 5 + null returns null
• However, aggregate functions simply ignore nulls.
• Any comparison with null returns unknown.
  o Example: 5 < null, or null <> null, or null = null
• Three-valued logic using the truth value unknown:
  o OR: (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown
  o AND: (true and unknown) = unknown, (false and unknown) = false, (unknown and unknown) = unknown
  o NOT: (not unknown) = unknown
  o “P is unknown” evaluates to true if predicate P evaluates to unknown
• The result of a where clause predicate is treated as false if it evaluates to unknown.
NESTED SUBQUERIES:
• SQL provides a mechanism for the nesting of subqueries.
• A subquery is a select-from-where expression that is nested within another query.
• A common use of subqueries is to perform tests for set membership, set comparisons, and set cardinality.
• Find all customers who have both an account and a loan at the bank.
    select distinct customer_name
    from borrower
    where customer_name in (select customer_name from depositor)
Set Comparison:
Find all branches that have greater assets than some branch located in Brooklyn.
    select distinct T.branch_name
    from branch as T, branch as S
    where T.assets > S.assets and S.branch_city = ‘Brooklyn’
Test for Empty Relations
• The exists construct returns the value true if the argument subquery is nonempty.
• exists r ⇔ r ≠ Ø
• not exists r ⇔ r = Ø
Example: Find all customers who have an account at all branches located in Brooklyn.
    select distinct S.customer_name
    from depositor as S
    where not exists (
        (select branch_name
         from branch
         where branch_city = ‘Brooklyn’)
        except
        (select R.branch_name
         from depositor as T, account as R
         where T.account_number = R.account_number
           and S.customer_name = T.customer_name))
Test for Absence of Duplicate Tuples
• The unique construct tests whether a subquery has any duplicate tuples in its result.
• Find all customers who have at most one account at the Perryridge branch.
    select T.customer_name
    from depositor as T
    where unique (
        select R.customer_name
        from account, depositor as R
        where T.customer_name = R.customer_name
          and R.account_number = account.account_number
          and account.branch_name = ‘Perryridge’)
Derived Relations:
• SQL allows a subquery expression to be used in the from clause.
• Find the average account balance of those branches where the average account balance is greater than $1200.
select branch_name, avg_balance
from (select branch_name, avg (balance)
from account
group by branch_name) as branch_avg (branch_name, avg_balance)
where avg_balance > 1200
Note that we do not need to use the having clause, since we compute the temporary (view) relation branch_avg in the from clause, and the attributes of branch_avg can be used directly in the where clause.
With clause:
• The with clause provides a way of defining a temporary view whose definition is available only to the query in which the with clause occurs.
• Find all accounts with the maximum balance.
with max_balance (value) as
(select max (balance) from account)
select account_number
from account, max_balance
where account.balance = max_balance.value
Complex Query using With Clause: Find all branches where the total account deposit is greater than the average of the total account deposits at all branches.
with branch_total (branch_name, value) as
(select branch_name, sum (balance)
from account
group by branch_name),
branch_total_avg (value) as
(select avg (value)
from branch_total)
select branch_name
from branch_total, branch_total_avg
where branch_total.value >= branch_total_avg.value
Modification of the Database:
Deletion:
i) Delete all account tuples at the Perryridge branch.
delete from account
where branch_name = ‘Perryridge’
ii) Delete all accounts at every branch located in the city ‘Needham’.
delete from account
where branch_name in (select branch_name from branch where branch_city = ‘Needham’)
Insertion:
i) Add a new tuple to account
insert into account values (‘A-9732’, ‘Perryridge’, 1200)
or equivalently
insert into account (branch_name, balance, account_number)
values (‘Perryridge’, 1200, ‘A-9732’)
ii) Add a new tuple to account with balance set to null.
insert into account values (‘A-777’, ‘Perryridge’, null)
Updates: Increase all accounts with balances over $10,000 by 6%; all other accounts receive 5%.
Write two update statements:
update account set balance = balance ∗ 1.06 where balance > 10000
update account set balance = balance ∗ 1.05 where balance ≤ 10000
Case Statement for Conditional Updates
Same query as before: Increase all accounts with balances over $10,000 by 6%; all other accounts receive 5%.
update account
set balance = case
when balance <= 10000 then balance ∗ 1.05
else balance ∗ 1.06
end
Integrity Constraints:
Domain Constraints:
o The check clause permits a domain to be restricted, for example:
create domain hourly_wage numeric(5,2) constraint value_test check (value >= 4.00)
o The domain has a constraint that ensures that the hourly_wage is at least 4.00.
o The clause constraint value_test is optional; it is useful to indicate which constraint an update violated.
Syntax: check (P), where P is a predicate.
Example: Declare branch_name as the primary key for branch and ensure that the values of assets are non-negative.
create table branch (branch_name char(15), branch_city char(30), assets integer,
primary key (branch_name), check (assets >= 0))
Referential Integrity: Referential integrity ensures that a value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another relation.
Example: If “Perryridge” is a branch name appearing in one of the tuples in the account relation, then there exists a tuple in the branch relation for branch “Perryridge”.
Primary keys, candidate keys and foreign keys can be specified as part of the SQL create table statement:
• The primary key clause lists the attributes that comprise the primary key.
• The unique clause lists the attributes that comprise a candidate key.
• The foreign key clause lists the attributes that comprise the foreign key and the name of the relation referenced by the foreign key. By default, a foreign key references the primary key attributes of the referenced table.
Example:
1. create table customer (customer_name char(20), customer_street char(30), customer_city char(30), primary key (customer_name))
2. create table branch (branch_name char(15), branch_city char(30), assets numeric(12,2), primary key (branch_name))
3. create table account (account_number char(10), branch_name char(15), balance integer, primary key (account_number), foreign key (branch_name) references branch)
Assertion:
• An assertion is a predicate expressing a condition that we wish the database always to satisfy.
• An assertion in SQL takes the form create assertion <assertion-name> check <predicate>
• When an assertion is made, the system tests it for validity, and tests it again on every update that may violate the assertion. This testing may introduce a significant amount of overhead; hence assertions should be used with great care.
• Asserting "for all X, P(X)" is achieved in a round-about fashion using "not exists X such that not P(X)".
Example: Every loan has at least one borrower who maintains an account with a minimum balance of $1000.00.
create assertion balance_constraint check
(not exists (
select * from loan
where not exists (
select * from borrower, depositor, account
where loan.loan_number = borrower.loan_number
and borrower.customer_name = depositor.customer_name
and depositor.account_number = account.account_number
and account.balance >= 1000)))
Triggers:
• A trigger is a statement that is executed automatically by the system as a side effect of a modification to the database.
• To design a trigger mechanism, we must:
o Specify the conditions under which the trigger is to be executed.
o Specify the actions to be taken when the trigger executes.
• Triggers were introduced into the SQL standard in SQL:1999, but were supported even earlier using non-standard syntax by most databases.
• Suppose that instead of allowing negative account balances, the bank deals with overdrafts by
o setting the account balance to zero,
o creating a loan in the amount of the overdraft, and
o giving this loan a loan number identical to the account number of the overdrawn account.
• The condition for executing the trigger is an update to the account relation that results in a negative balance value.
create trigger overdraft-trigger after update on account
referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
insert into borrower
(select customer-name, account-number
from depositor
where nrow.account-number = depositor.account-number);
insert into loan values
(nrow.account-number, nrow.branch-name, – nrow.balance);
update account set balance = 0
where account.account-number = nrow.account-number
end
Trigger Events:
• The triggering event can be insert, delete or update.
• Triggers on update can be restricted to specific attributes.
o E.g., create trigger overdraft-trigger after update of balance on account
• Values of attributes before and after an update can be referenced:
o referencing old row as : for deletes and updates
o referencing new row as : for inserts and updates
• Triggers can be activated before an event, which can serve as extra constraints. E.g., convert blanks to null:
create trigger setnull-trigger before update on r
referencing new row as nrow
for each row
when nrow.phone-number = ‘ ‘
set nrow.phone-number = null
Instead of executing a separate action for each affected row, a single action can be executed for all rows affected by a transaction:
• Use for each statement instead of for each row.
• Use referencing old table or referencing new table to refer to temporary tables (called transition tables) containing the affected rows.
• This can be more efficient when dealing with SQL statements that update a large number of rows.
Authorization:
Forms of authorization on parts of the database:
• Read - allows reading, but not modification, of data.
• Insert - allows insertion of new data, but not modification of existing data.
• Update - allows modification, but not deletion, of data.
• Delete - allows deletion of data.
Forms of authorization to modify the database schema (covered in Chapter 8):
• Index - allows creation and deletion of indices.
• Resources - allows creation of new relations.
• Alteration - allows addition or deletion of attributes in a relation.
• Drop - allows deletion of relations.
The grant statement is used to confer authorization:
grant <privilege list> on <relation name or view name> to <user list>
<user list> may be: a user-id; public, which allows all valid users the privilege granted; or a role (more on this in Chapter 8).
Granting a privilege on a view does not imply granting any privileges on the underlying relations. The grantor of the privilege must already hold the privilege on the specified item (or be the database administrator).
Privileges:
Select: allows read access to a relation, or the ability to query using the view.
Example: grant users U1, U2, and U3 select authorization on the branch relation:
grant select on branch to U1, U2, U3
Insert: the ability to insert tuples.
Update: the ability to update using the SQL update statement.
Delete: the ability to delete tuples.
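A small sketch of the point above about granting on views; the view name branch_cust and the user U4 are made up for illustration, and the columns follow the bank schema used in these notes.
-- U4 may query the view, but gains no privilege on the underlying relations
create view branch_cust as
select branch_name, customer_name
from account, depositor
where account.account_number = depositor.account_number;
grant select on branch_cust to U4;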
The revoke statement is used to revoke authorization:
revoke <privilege list> on <relation name or view name> from <user list>
Example: revoke select on branch from U1, U2, U3
• <privilege list> may be all, to revoke all privileges the revokee may hold.
• If <user list> includes public, all users lose the privilege except those granted it explicitly.
• If the same privilege was granted twice to the same user by different grantors, the user may retain the privilege after the revocation.
• All privileges that depend on the privilege being revoked are also revoked.
Embedded SQL:
A language in which SQL queries can be embedded is referred to as a host programming language or host language. The use of SQL commands within a host language program is called embedded SQL.
Some of the programming languages in which SQL statements can be embedded are FORTRAN, PASCAL, PL/I, COBOL, C, C++, JAVA, etc.
Execution of an Embedded SQL Program:
Programs written in the host language can use embedded SQL, like any programming-language statement, to access, manipulate and update data stored in a database. In embedded SQL, all query processing is done by the database system and the result is made available to the program one tuple (record) at a time.
An embedded SQL program is processed by a special preprocessor prior to compilation. This preprocessor replaces the embedded SQL requests with host-language procedure calls that allow run-time execution of the database accesses. The resulting program is then compiled by the host-language compiler.
Circumstances for the use of Embedded SQL:
Not all queries can be expressed in SQL, since SQL does not provide the full expressive power of a general-purpose language. That is, there exist queries that can be expressed in a programming language such as C, COBOL, PASCAL, or FORTRAN that cannot be expressed in SQL.
Non-declarative actions – such as printing a report, interacting with a user, or sending the result of a query to a graphical user interface – cannot be done from within SQL.
Declaring Variables and Exceptions:
SQL statements can refer to variables defined in the host language program. Such host-language variables must be prefixed by a colon (:) in SQL statements and must be declared between the commands EXEC SQL BEGIN DECLARE SECTION and EXEC SQL END DECLARE SECTION.
General form:
EXEC SQL BEGIN DECLARE SECTION
declarations
EXEC SQL END DECLARE SECTION
Example:
EXEC SQL BEGIN DECLARE SECTION
char sname[20];
long rollno;
short percentage;
EXEC SQL END DECLARE SECTION
Embedding SQL Statements:
All SQL statements that are to be embedded within a host language program must be prefixed by EXEC SQL.
General form: EXEC SQL <embedded SQL statement> END-EXEC
In the C programming language, the syntax is EXEC SQL <embedded SQL statement>;
In JAVA, the syntax is #SQL { <embedded SQL statement> };
Example (in C):
o EXEC SQL INSERT INTO student VALUES (:sname, :rollno, :percentage);
Disadvantage in Embedding SQL Statements:
The major disadvantage of embedding SQL statements in a host language like C is that an impedance mismatch occurs, because SQL operates on sets of records, whereas languages like C do not cleanly support a set-of-records abstraction.
Remedy to the Impedance Mismatch:
The solution is a mechanism that allows us to retrieve rows one at a time from a relation. This mechanism is termed a cursor.
Cursors:
A cursor is a mechanism that allows us to retrieve rows one at a time from a relation. We can declare a cursor on any relation or on any SQL query (because every query returns a set of rows). Once a cursor is declared, we can open it (which positions the cursor just before the first row); fetch the next row; move the cursor (to the next row, to the row n rows ahead, to the first row, or to the previous row, etc., by specifying additional parameters for the FETCH command); or close the cursor. Thus, a cursor essentially allows us to retrieve the rows in a table by positioning the cursor at a particular row and reading its contents.
General form of a cursor definition:
DECLARE cursorname CURSOR FOR some query
[ ORDER BY attribute ] [ FOR READ ONLY | FOR UPDATE ]
Dynamic SQL:
• Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program:
char * sqlprog = “update account set balance = balance * 1.05 where account_number = ?”
EXEC SQL prepare dynprog from :sqlprog;
char account [10] = “A-101”;
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a place holder for a value that is provided when the SQL program is executed. RELATIONAL DATABASE DESIGN Pitfalls in the Relational Database Design: Among the undesirable properties that a bad design may have are: • Repetition of information • Inability to represent certain information Repetition of Information:
Consider the Dept relation:
Name     Deptno
Kumar    11
Sherly   22
Jaya     11
Arun     33
If a tuple, say (“Ganesh”, 33), is inserted into the dept relation, then more than one employee comes under department 33, so the deptno 33 occurs more than once. This is one kind of repetition of information.
If a tuple, say (“Kumar”, 11), is inserted into the dept relation, then the tuple (“Kumar”, 11) occurs more than once. This is another kind of repetition of information.
Drawbacks due to Repetition of Information:
o Wastage of space
o Problems during update operations
o If the database is updated based on deptno, then the changes must affect all the employees that come under that deptno
Decomposition:
Definition: The process of dividing or decomposing a relation schema into several schemas because of a bad database design is termed Decomposition.
Concept: Consider the relational database schema Employee = (ename, projno, projname, loc). Let the following be a relation instance for the employee relation:
Ename    Projno   Projname   Loc
Kumar    1000     Banking    Madurai
Jaya     2000     Airline    Bangalore
Arun     1000     Banking    Madurai
Guna     3000     Railways   Chennai
Dinesh   1000     Banking    Madurai
In this employee relation instance, we find that 3 employees work on project “Banking”, 1 employee on project “Airline”, and 1 employee on project “Railways”. We find that the property “repetition of information” holds here: the value (1000, Banking, Madurai) occurs repeatedly.
So the relation schema Employee = (ename, projno, projname, loc) exhibits a bad database design, and we go for decomposition.
We decompose this relation schema into two schemas such that the values stored in the original relation may be obtained using a JOIN operation. The decomposed schemas are Emp1 = (ename, projno) and Emp2 = (projno, projname, loc).
We can reconstruct the employee relation from Emp1 and Emp2 by applying a JOIN operation, that is, Emp1 |x| Emp2.
The resultant relation obtained from Emp1 |x| Emp2 may contain some extra tuples or may lose some tuples.
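The decomposition and its reconstruction can be sketched in SQL as follows; the table names match the schemas above, and the column types are assumed for illustration.
create table emp1 (ename varchar(20), projno integer);
create table emp2 (projno integer, projname varchar(20), loc varchar(20));
-- Reconstructing the original employee relation with a join
select ename, emp1.projno, projname, loc
from emp1, emp2
where emp1.projno = emp2.projno;
-- equivalently: select * from emp1 natural join emp2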
If the relation obtained by joining the decomposed relations contains exactly the original tuples, without losing any tuples, the decomposition is termed a Lossless-Join Decomposition.
If the join of the decomposed relations loses tuples or gains spurious tuples, the decomposition is termed a Lossy-Join Decomposition. Such a decomposition leads to a bad database design.
Desirable Properties of Decomposition:
o Lossless-join decomposition
o Dependency preservation
o No repetition of information
Lossless-Join Decomposition:
• When decomposing a relation due to bad database design, it is important that the decomposition be lossless.
• This decomposition is also termed a non-additive join decomposition.
• Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form the schemas obtained by decomposing R.
• This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
o R1 ∩ R2 → R1
o R1 ∩ R2 → R2
• For example, for Emp1 = (ename, projno) and Emp2 = (projno, projname, loc) above, Emp1 ∩ Emp2 = {projno}, and projno → projname, loc; hence the decomposition is lossless.
Dependency Preservation:
• This property ensures that each FD is represented in some individual relation resulting from the decomposition.
• Algorithm for testing dependency preservation:
Compute F+;
for each schema Ri in D do
begin
Fi := the restriction of F+ to Ri;
end
F′ := ∅;
for each restriction Fi do
begin
F′ := F′ ∪ Fi
end
Compute F′+;
if (F′+ == F+) then return (true)
else return (false);
• Here D = { R1, R2, . . . , Rn }.
• Disadvantage of the algorithm: the computation of F+ takes exponential time.
No Repetition of Information:
o The decomposition of a relation should not suffer from the problem of repetition of information.
o Here the decomposition of Employee = (ename, projno, projname, loc) separates employee and project data into distinct relations and eliminates the redundancy, i.e., the repetition of information.
Normalization:
Definition:
Normalization is the process that takes a relation schema through a series of tests to “certify” whether it satisfies a certain normal form. The normalization process proceeds in a top-down fashion by evaluating each relation against the criteria for normal forms and decomposing relations as necessary. So the normalization process is also termed relational design by analysis.
Some of the normal forms are: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, PJNF (Project-Join Normal Form), DKNF (Domain-Key Normal Form).
Definitions of Key:
A superkey of a relation schema R = { A1, A2, …, An } is a set of attributes S ⊆ R with the property that no two tuples t1 and t2 in any relation state r of R will have t1[S] = t2[S].
A key K is a superkey with the additional property that removal of any attribute from K will cause K not to be a superkey any more.
The difference between a key and a superkey is that a key has to be minimal (if we have a key K = { A1, A2, …, Ak } of R, then K − {Ai} is not a key of R for any i, 1 ≤ i ≤ k).
Multivalued Dependencies:
Let a relation schema R be partitioned into attribute sets Y, Z, and W. The multivalued dependency Y →→ Z holds on R if, whenever < y1, z1, w1 > ∈ r and < y2, z2, w2 > ∈ r with y1 = y2, then < y1, z1, w2 > ∈ r and < y2, z2, w1 > ∈ r.
• Note that since the behavior of Z and W is identical, it follows that Y →→ Z if and only if Y →→ W.
Use: We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies.
2. To specify constraints on the set of legal relations. We shall thus concern ourselves only with relations that satisfy a given set of functional and multivalued dependencies.
Theory:
• If a relation r fails to satisfy a given multivalued dependency, we can construct a relation r′ that does satisfy the multivalued dependency by adding tuples to r.
• From the definition of multivalued dependency, we can derive the following rule: if α → β, then α →→ β. That is, every functional dependency is also a multivalued dependency.
• The closure D+ of D is the set of all functional and multivalued dependencies logically implied by D.
• We can compute D+ from D, using the formal definitions of functional dependencies and multivalued dependencies.
• We can manage with such reasoning for very simple multivalued dependencies, which seem to be most common in practice.
• For complex dependencies, it is better to reason about sets of dependencies using a system of inference rules.
Denormalization for Performance:
• We may want to use a non-normalized schema for performance.
• For example, displaying customer_name along with account_number and balance requires a join of account with depositor.
• Alternative 1: use a denormalized relation containing the attributes of account as well as depositor, with all the above attributes.
o faster lookup
o extra space and extra execution time for updates
o extra coding work for the programmer, and the possibility of errors in the extra code
• Alternative 2: use a materialized view defined as account ⋈ depositor.
o Benefits and drawbacks are the same as above, except there is no extra coding work for the programmer and possible errors are avoided.
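A sketch of Alternative 2 in SQL; materialized-view syntax is not uniform across systems, and the view name cust_account below is made up for illustration.
create materialized view cust_account as
select customer_name, account.account_number, balance
from depositor, account
where depositor.account_number = account.account_number;
-- Lookups read the precomputed join instead of joining at query time
select customer_name, balance
from cust_account
where account_number = 'A-101';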
Overall Database Design Process We have assumed schema R is given • R could have been generated when converting E-R diagram to a set of tables. • R could have been a single relation containing all attributes that are of interest (called universal relation). • Normalization breaks R into smaller relations. • R could have been the result of some ad hoc design of relations, which we then test/convert to normal form.
UNIT - III
Query Processing
Basic Steps in Query Processing:
1. Parsing and translation
2. Optimization
3. Evaluation
1. Parsing and translation: A query expressed in a high-level language such as SQL must first be scanned, parsed, and validated. The scanner identifies the language tokens – such as SQL keywords, attribute names, and relation names – in the text of the query. The parser checks the query syntax to determine whether it is formulated according to the syntax rules of the query language. The query must then be validated, by checking that all attribute and relation names are valid and semantically meaningful. An intermediate representation of the query is then created, usually as a tree data structure called a query tree. It is also possible to represent the query using a graph data structure called a query graph.
2. Optimization: The DBMS must then devise an execution strategy for retrieving the result of the query from the database files. A query has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization.
There are two main techniques for query optimization. The first is based on heuristic rules for ordering the operations in a query execution strategy. The second involves systematically estimating the cost of different execution strategies and choosing the execution plan with the lowest cost estimate.
3. Evaluation: The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
Translating queries into relational algebra:
Typically SQL queries are decomposed into query blocks, which form the basic units that can be translated into algebraic operations and optimized. A query block contains a single SELECT–FROM–WHERE expression, as well as GROUP BY and HAVING clauses if these are present.
Consider the following SQL query:
Select lname, fname
From emp
Where salary > (select max(salary) from emp where dno = 5);
This query includes a nested subquery and hence would be decomposed into two blocks.
The inner block is:
(select max(salary) from emp where dno = 5)
The outer block is:
select lname, fname from emp where salary > c
where c represents the result returned from the inner block.
The inner block could be translated into the extended relational algebra expression
ℱ MAX salary (σ dno=5 (emp))
and the outer block into the expression
Π lname,fname (σ salary>c (emp))
Evaluation of Expressions : Alternatives for evaluating an entire expression tree Materialization: generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat. Pipelining: pass on tuples to parent operations even as an operation is being executed Materialized evaluation: evaluate one operation at a time, starting at the lowest-level. Use intermediate results materialized into temporary relations to evaluate next-level operations. E.g., in figure below, compute and store
σ balance < 2500 ( account )
then compute and store its join with customer, and finally compute the projection on customer-name. Materialized evaluation is always applicable. The cost of writing results to disk and reading them back can be quite high.
Pipelining Pipelined evaluation : Evaluate several operations simultaneously, passing the results of one operation on to the next. Much cheaper than materialization: no need to store a temporary relation to disk. Pipelining may not always be possible – e.g., sort, hash-join. For pipelining to be effective, use evaluation algorithms that generate output tuples even as tuples are received for inputs to the operation.
Storage and File Structure
Classification of Physical Storage Media: Storage media can be classified based on the following characteristics:
i) Speed with which data can be accessed
ii) Cost per unit of data
iii) Reliability
– data loss on power failure or system crash
– physical failure of the storage device
Types of storage:
volatile storage: loses its contents when power is switched off
non-volatile storage: contents persist even when power is switched off. Includes secondary and tertiary storage, as well as battery-backed-up main memory.
Physical Storage Media:
Cache – the fastest and most costly form of storage; volatile; managed by the computer system hardware.
Main memory:
fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^–9 seconds)
generally too small (or too expensive) to store the entire database; capacities of up to a few gigabytes are widely used currently
capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly a factor of 2 every 2 to 3 years)
volatile — contents of main memory are usually lost if a power failure or system crash occurs
Flash memory:
a type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
data survives power failure
data can be written at a location only once, but the location can be erased and written to again; can support only a limited number (10K – 1M) of write/erase cycles; erasing has to be done to an entire bank of memory
reads are roughly as fast as main memory, but writes are slow (a few microseconds) and erase is slower
cost per unit of storage roughly similar to main memory
widely used in embedded devices such as digital cameras
Magnetic disk:
data is stored on a spinning disk and read/written magnetically; this is the primary medium for long-term storage of data and typically stores the entire database
data must be moved from disk to main memory for access, and written back for storage; much slower access than main memory (more on this later)
direct access – possible to read data on disk in any order, unlike magnetic tape
capacities range up to roughly 400 GB currently; much larger capacity and lower cost/byte than main memory or flash memory; growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2 years)
survives power failures and system crashes; disk failure can destroy data, but this is rare
Optical storage:
non-volatile; data is read optically from a spinning disk using a laser. CD-ROM (640 MB) and DVD (4.7 to 17 GB) are the most popular forms. Write-once, read-many (WORM) optical disks are used for archival storage (CD-R, DVD-R, DVD+R). Multiple-write versions are also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM). Reads and writes are slower than with magnetic disk.
Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data.
Tape storage:
non-volatile; used primarily for backup (to recover from disk failure) and for archival data
sequential access – much slower than disk
very high capacity (40 to 300 GB tapes available). Tape can be removed from the drive ⇒ storage costs are much cheaper than disk, but drives are expensive.
Tape jukeboxes are available for storing massive amounts of data: hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)
Storage Hierarchy
primary storage: Fastest media but volatile (cache, main memory). secondary storage: next level in hierarchy, non-volatile, moderately fast access time also called on-line storage E.g. flash memory, magnetic disks tertiary storage: lowest level in hierarchy, non-volatile, slow access time also called off-line storage E.g. magnetic tape, optical storage Magnetic Disks: Read-write head Positioned very close to the platter surface (almost touching it).Reads or writes magnetically encoded information. Surface of platter divided into circular tracks Over 50K-100K tracks per platter on typical hard disks Each track is divided into sectors. A sector is the smallest unit of data that can be read or written. Sector size typically 512 bytes. Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks). To read/write a sector disk arm swings to position head on right track. platter spins continually; data is read/written as sector passes under head Head-disk assemblies : multiple disk platters on a single spindle (1 to 5 usually) one head per platter, mounted on a common arm. Cylinder i consists of ith track of all the platters 62
Disk controller – interfaces between the computer system and the disk drive hardware. accepts high-level commands to read or write a sector initiates actions such as moving the disk arm to the right track and actually reading or writing the data Computes and attaches checksums to each sector to verify that data is read back correctly If data is corrupted, with very high probability stored checksum won’t match recomputed checksum Ensures successful writing by reading back sector after writing it Performs remapping of bad sectors Disk Subsystem
Multiple disks connected to a computer system through a controller Controllers functionality (checksum, bad sector remapping) often carried out by individual disks; reduces load on controller Disk interface standards families ATA (AT adaptor) range of standards SATA (Serial ATA) SCSI (Small Computer System Interconnect) range of standards Several variants of each standard (different speeds and capabilities) RAID RAID: Redundant Arrays of Independent Disks disk organization techniques that manage a large numbers of disks, providing a view of a single disk of high capacity and high speed by using multiple disks in parallel, and high reliability by storing data redundantly, so that data can be recovered even if a disk fails The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail. E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days) Techniques for using redundancy to avoid data loss are critical with large numbers of disks Originally a cost-effective alternative to large, expensive disks I in RAID originally stood for ``inexpensive’’ Today RAIDs are used for their higher reliability and bandwidth. The “I” is interpreted as independent Improvement of Reliability via Redundancy
Redundancy – store extra information that can be used to rebuild information lost in a disk failure.
E.g., Mirroring (or shadowing):
Duplicate every disk. A logical disk consists of two physical disks.
Every write is carried out on both disks; reads can take place from either disk.
If one disk in a pair fails, the data is still available on the other.
Data loss would occur only if a disk fails and its mirror disk also fails before the system is repaired.
– The probability of this combined event is very small, except for dependent failure modes such as fire, building collapse or electrical power surges.
Mean time to data loss depends on mean time to failure and mean time to repair.
E.g., an MTTF of 100,000 hours and a mean time to repair of 10 hours give a mean time to data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes).
Two main goals of parallelism in a disk system:
1. Load-balance multiple small accesses to increase throughput.
2. Parallelize large accesses to reduce response time.
Improve transfer rate by striping data across multiple disks.
Bit-level striping – split the bits of each byte across multiple disks.
In an array of eight disks, write bit i of each byte to disk i. Each access can read data at eight times the rate of a single disk, but seek/access time is worse than for a single disk. Bit-level striping is not used much any more.
Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1.
Requests for different blocks can run in parallel if the blocks reside on different disks. A request for a long sequence of blocks can utilize all disks in parallel.
RAID Levels RAID Level 0: Block striping; non-redundant.Used in high-performance applications where data lose is not critical. RAID Level 1: Mirrored disks with block striping. Offers best write performance. Popular for applications such as storing log files in a database system. RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping. RAID Level 3:
Bit-Interleaved Parity a single parity bit is enough for error correction, not just detection, since we know which disk has failed When writing data, corresponding parity bits must also be computed and written to a parity bit disk To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk) Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O. Subsumes Level 2 (provides all its benefits, at lower cost). RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks. When writing data block, corresponding block of parity bits must also be computed and written to parity disk. To find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block) from other disks. Provides higher I/O rates for independent block reads than Level 3 block read goes to a single disk, so blocks stored on different disks can be read in parallel Provides high transfer rates for reads of multiple blocks than no-striping Before writing a block, parity data must be computed Can be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes) Or by recomputing the parity value using the new values of blocks corresponding to the parity block – More efficient for writing large amounts of data sequentially Parity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk. E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks. Higher I/O rates than Level 4. Block writes occur in parallel if the blocks and their parity blocks are on different disks. Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk. RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not used as widely.
Optical Disks:
Compact disk read-only memory (CD-ROM): removable disks, 640 MB per disk. Seek time about 100 msec (the optical read head is heavier and slower). Higher latency (3000 RPM) and lower data-transfer rates (3–6 MB/s) compared to magnetic disks.
Digital Video Disk (DVD): DVD-5 holds 4.7 GB, and DVD-9 holds 8.5 GB. DVD-10 and DVD-18 are double-sided formats with capacities of 9.4 GB and 17 GB. Slow seek time, for the same reasons as CD-ROM.
Record-once versions (CD-R and DVD-R) are popular: data can only be written once and cannot be erased; high capacity and long lifetime; used for archival storage.
Multi-write versions (CD-RW, DVD-RW, DVD+RW and DVD-RAM) are also available.
Magnetic Tapes:
Hold large volumes of data and provide high transfer rates: a few GB for the DAT (Digital Audio Tape) format, 10–40 GB with the DLT (Digital Linear Tape) format, 100 GB+ with the Ultrium format, and 330 GB with the Ampex helical-scan format. Transfer rates range from a few to tens of MB/s.
Currently the cheapest storage medium: tapes are cheap, but the cost of drives is very high.
Very slow access time in comparison to magnetic disks and optical disks; limited to sequential access. Some formats (Accelis) provide faster seek (tens of seconds) at the cost of lower capacity.
Used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information from one system to another.
Tape jukeboxes are used for very large capacity storage: a terabyte (10^12 bytes) to a petabyte (10^15 bytes).
Storage Access:
A database file is partitioned into fixed-length storage units called blocks. Blocks are the units of both storage allocation and data transfer. The database system seeks to minimize the number of block transfers between the disk and memory. We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.
Buffer – the portion of main memory available to store copies of disk blocks.
Buffer manager – the subsystem responsible for allocating buffer space in main memory. Programs call on the buffer manager when they need a block from disk.
1. If the block is already in the buffer, the buffer manager returns the address of the block in main memory.
2. If the block is not in the buffer, the buffer manager
a. Allocates space in the buffer for the block:
i. Replacing (throwing out) some other block, if required, to make space for the new block.
ii. The replaced block is written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
b. Reads the block from the disk into the buffer, and returns the address of the block in main memory to the requester.
Most operating systems replace the block least recently used (LRU strategy). The idea behind LRU is to use the past pattern of block references as a predictor of future references. Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user’s query to predict future references.
LRU can be a bad strategy for certain access patterns involving repeated scans of data. For example, when computing the join of two relations r and s by a nested-loops join:
for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match …
A mixed strategy, with hints on the replacement strategy provided by the query optimizer, is preferable.
Indexing and Hashing:
Indexes are used to speed up the retrieval of records in response to certain search conditions. In an index, the entries are kept in sorted order, making it easy to find the record being looked for. For example, to retrieve an account record given an account number, the database system would look up an index to find on which disk block the record resides, and then fetch that disk block to get the account record.
An index has the following fields: Search key – an attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form search key and pointer. Index files are typically much smaller than the original file.
Types of indices: There are two basic kinds of indices:
• Ordered indices: search keys are stored in sorted order.
• Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.
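In SQL, an index on an existing table is usually created with a statement like the sketch below. create index is not part of the core SQL standard, so the exact options vary by DBMS, and the index name shown is made up for illustration.
-- Ordered index on the search key branch_name of the account relation
create index account_branch_idx on account (branch_name);
-- Remove the index when it is no longer needed
drop index account_branch_idx;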
Index Evaluation Metrics: Each technique must be evaluated on the basis of the following factors:
1) Access types supported
2) Access time
3) Insertion time
4) Deletion time
5) Space overhead
Ordered indices:
In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library. Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. A primary index is an ordered file whose records are of fixed length with two fields. 1) first field is of the same data type as the ordering key field – called the primary key of the data file. 2) second field is a pointer to a disk block in the data file.
Consider:
• An ordered file with r = 30,000 records and block size B = 1024 bytes.
• File records are of fixed size and are unspanned, with record length R = 100 bytes.
• Blocking factor bfr = B / R = 1024 / 100 = 10 records per block.
• The number of blocks needed for the file is b = r / bfr = 30,000 / 10 = 3000 blocks.
• A binary search on the data file needs log2 b = log2 3000 ≈ 12 block accesses.
For the primary index:
• The ordering key field is V = 9 bytes long and a block pointer is P = 6 bytes long.
• The size of an index entry is Ri = (9 + 6) = 15 bytes.
• So bfri = 1024 / 15 = 68 entries per block.
• The total number of index entries ri is equal to the number of blocks in the data file, which is 3000.
• The number of index blocks is bi = ri / bfri = 3000 / 68 = 45 blocks.
• A binary search on the index needs log2 45 ≈ 6 block accesses.
There are two types of ordered indices that can be used:
1) Dense index: An index record appears for every search-key value in the file. In a dense primary index, the index record contains the search-key value and a pointer to the first data record with that search-key value. The rest of the records with the same search-key value would be stored sequentially after the first record. Dense index implementations may store a list of pointers to all records with the same search-key value.
2) Sparse index: An index record appears for only some of the search-key values.
Advantages:
• Dense index – faster to locate a record than a sparse index.
• Sparse index – requires less space and less maintenance overhead.
Clustering index: To speed up retrieval of records that have the same value for the clustering field, a clustering index can be used.
• A clustering index is also an ordered file with two fields; the first field is of the same type as the clustering field of the data file, and the second field is a block pointer.
• If the records of a file are physically ordered on a non-key field, that field is called the clustering field.
Secondary index: An index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index.
• A secondary index is also an ordered file with two fields. The first field is of the same type as some non-ordering field of the data file, and the second field is either a block pointer or a record pointer.
Example:
• r = 30,000 fixed-length records
• record size R = 100 bytes
• block size B = 1024 bytes
• the file has b = 3000 blocks
• secondary index key field V = 9 bytes long
• block pointer P = 6 bytes long
• bfri = 1024 / 15 = 68 entries per block
• the total number of index entries ri equals the number of records in the data file = 30,000
• number of blocks needed for the index = 30,000 / 68 = 442 blocks
• a binary search on the index needs log2 442 ≈ 9 block accesses
Multilevel Index: If primary index does not fit in memory, access becomes expensive. To reduce number of disk accesses to index records, treat primary index kept on disk as a sequential file and construct a sparse index on it. outer index – a sparse index of primary index inner index – the primary index file If even outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion or deletion from the file.
B+-Tree Index Files: Disadvantage of indexed-sequential files: 1) Performance degrades as file grows, since many overflow blocks get created. 2) Periodic reorganization of entire file is required. Advantage of B+-tree index files: 1) Automatically reorganizes itself with small, local, changes, in the face of insertions and deletions. 2) Reorganization of entire file is not required to maintain performance. Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead. A B+-tree is a rooted tree satisfying the following properties: 1) All paths from root to leaf are of the same length. 2) Each node that is not a root or a leaf has between [n/2] and n children. 3) A leaf node has between [(n–1)/2] and n–1 values Special cases: If the root is not a leaf, it has at least 2 children. If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values. 72
• A B+ tree is a dynamic, multilevel index with maximum and minimum bounds on the number of keys in each node.
• In a B+ tree, all records are stored at the lowest level of the tree.
• Only keys are stored in interior blocks.
• Every node except the root must be at least ½ full.
Structure of a B+ Tree Node:
• A typical node of a B+ tree contains up to (n−1) search-key values K1, K2, …, Kn−1 and n pointers P1, P2, …, Pn.
• Ki are the search-key values; Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search keys in a node are ordered: K1 < K2 < K3 < . . . < Kn−1.
Structure of a leaf node:
• For i = 1, 2, …, n−1, pointer Pi points either to a file record with search-key value Ki or to a bucket of pointers to such records.
Structure of a non-leaf node:
• The non-leaf nodes of the B+ tree form a multilevel (sparse) index on the leaf nodes.
• The structure of a non-leaf node is the same as that of a leaf node, except that all pointers are pointers to tree nodes.
• A non-leaf node may hold up to n pointers and must hold at least [n/2] pointers. The number of pointers in a node is called the fanout of the node.
Example:
• To calculate the order p of a B+ tree, suppose that:
o search-key value V = 9 bytes
o block size B = 512 bytes
o record pointer Pr = 7 bytes
o block pointer P = 6 bytes
• An internal node of the B+ tree can have up to p tree pointers and p−1 search-key values, and these must fit into a single block. Hence, (p ∗ P) + ((p−1) ∗ V) ≤ B, i.e., (p ∗ 6) + ((p−1) ∗ 9) ≤ 512.
View Serializability: Let S and S′ be two schedules with the same set of transactions. S and S′ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S′, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S′ also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S′.
As can be seen, view equivalence is also based purely on reads and writes alone.
• A schedule S is view serializable if it is view equivalent to a serial schedule.
• Every conflict serializable schedule is also view serializable.
• Below is a schedule which is view-serializable but not conflict serializable.
• What serial schedule is the above equivalent to?
• Every view serializable schedule that is not conflict serializable has blind writes.
• The schedule below produces the same outcome as the serial schedule < T1, T5 >, yet is not conflict equivalent or view equivalent to it.
• Determining such equivalence requires analysis of operations other than read and write.
Testing for Serializability:
• Consider some schedule of a set of transactions T1, T2, ..., Tn.
• Precedence graph — a directed graph where the vertices are the transactions (names).
• We draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data item on which the conflict arose earlier.
• We may label the arc by the item that was accessed.
Example 1
Example Schedule (Schedule A) + Precedence Graph
Test for Conflict Serializability:
• A schedule is conflict serializable if and only if its precedence graph is acyclic.
• Cycle-detection algorithms exist which take order n^2 time, where n is the number of vertices in the graph.
o (Better algorithms take order n + e, where e is the number of edges.)
• If the precedence graph is acyclic, the serializability order can be obtained by a topological sort of the graph. This is a linear order consistent with the partial order of the graph.
Test for View Serializability:
• The precedence graph test for conflict serializability cannot be used directly to test for view serializability.
o An extension to test for view serializability has cost exponential in the size of the precedence graph.
• The problem of checking whether a schedule is view serializable falls in the class of NP-complete problems. Thus the existence of an efficient algorithm is extremely unlikely.
• However, practical algorithms that just check some sufficient conditions for view serializability can still be used.
Recoverability: Recoverable Schedules Need to address the effect of transaction failures on concurrently running transactions.
• Recoverable schedule — if a transaction Tj reads a data item previously written by a transaction Ti, then the commit operation of Ti must appear before the commit operation of Tj.
• The following schedule (Schedule 11) is not recoverable if T9 commits immediately after the read:
If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence, database must ensure that schedules are recoverable.
Cascading Rollbacks: Cascading rollback – a single transaction failure leads to a series of transaction rollbacks. Consider the following schedule where none of the transactions has yet committed (so the schedule is recoverable)
• If T10 fails, T11 and T12 must also be rolled back.
• This can lead to the undoing of a significant amount of work.
Cascadeless Schedules:
• Cascadeless schedules — cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
• Every cascadeless schedule is also recoverable.
• It is desirable to restrict the schedules to those that are cascadeless.
Concurrency Control:
• A database must provide a mechanism that will ensure that all possible schedules are
o either conflict or view serializable, and
o recoverable, and preferably cascadeless.
• A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency.
o Are serial schedules recoverable/cascadeless?
• Testing a schedule for serializability after it has executed is a little too late!
• Goal – to develop concurrency control protocols that will assure serializability.
Transaction Definition in SQL:
• The data manipulation language must include a construct for specifying the set of actions that comprise a transaction.
• In SQL, a transaction begins implicitly.
• A transaction in SQL ends by:
o Commit work — commits the current transaction and begins a new one.
o Rollback work — causes the current transaction to abort.
• Levels of consistency specified by SQL-92:
o Serializable — default
o Repeatable read
o Read committed
o Read uncommitted
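In SQL, the consistency level is selected per transaction with set transaction. The statements below use the standard SQL-92 form as a minimal sketch; individual products differ in their defaults and in which levels they actually support.
set transaction isolation level serializable;
-- or, for a weaker level:
set transaction isolation level read committed;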
Implementation of Isolation:
• Schedules must be conflict or view serializable, and recoverable, for the sake of database consistency, and preferably cascadeless.
• A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency.
• Concurrency-control schemes trade off between the amount of concurrency they allow and the amount of overhead that they incur.
• Some schemes allow only conflict-serializable schedules to be generated, while others allow view-serializable schedules that are not conflict-serializable.
Concurrency Control – Lock-Based Protocols:
A lock is a mechanism to control concurrent access to a data item
•
Data items can be locked in two modes : 1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction. 2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction. Lock requests are made to concurrency-control manager. Transaction can proceed only after request is granted. Lock-compatibility matrix
•
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
•
Any number of transactions can hold shared locks on an item, o but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
•
If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)
Locking as above is not sufficient to guarantee serializability — if A and B get updated in-between the read of A and B, the displayed sum would be wrong.
•
A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
Consider the partial schedule:
•
Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
•
Such a situation is called a deadlock. o To handle a deadlock one of T3 or T4 must be rolled back and its locks released
•
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
•
Starvation is also possible if concurrency control manager is badly designed. For example: o A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item. o The same transaction is repeatedly rolled back due to deadlocks.
•
Concurrency control manager can be designed to prevent starvation
The Two-Phase Locking Protocol:
This is a protocol which ensures conflict-serializable schedules.
•
Phase 1: Growing Phase o transaction may obtain locks o transaction may not release locks.
•
Phase 2: Shrinking Phase o transaction may release locks o transaction may not obtain locks
•
The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
•
Two-phase locking does not ensure freedom from deadlocks.
•
Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.
•
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
Lock Table:
Black rectangles indicate granted locks, white ones indicate waiting requests. Lock table also records the type of lock granted or requested New request is added to the end of the queue of requests for the data item, and granted if it is compatible with all earlier locks Unlock requests result in the request being deleted, and later requests are checked to see if they can now be granted If transaction aborts, all waiting or granted requests of the transaction are deleted o lock manager may keep a list of locks held by each transaction, to implement this efficiently
Graph-Based Protocols:
Graph-based protocols are an alternative to two-phase locking Impose a partial ordering → on the set D = {d1, d2 ,..., dh} of all data items. o If di → dj then any transaction accessing both di and dj must access di before accessing dj. o Implies that the set D may now be viewed as a directed acyclic graph, called a database graph. The tree-protocol is a simple kind of graph protocol.
Tree Protocol
• Only exclusive locks are allowed.
• The first lock by Ti may be on any data item.
• Subsequently, a data item Q can be locked by Ti only if the parent of Q is currently locked by Ti.
• Data items may be unlocked at any time.
• The tree protocol ensures conflict serializability as well as freedom from deadlock.
•
Unlocking may occur earlier in the tree-locking protocol than in the two-phase locking protocol. o shorter waiting times, and increase in concurrency o protocol is deadlock-free, no rollbacks are required
•
Drawbacks:
o The protocol does not guarantee recoverability or cascade freedom; commit dependencies need to be introduced to ensure recoverability.
o Transactions may have to lock data items that they do not access, causing increased locking overhead, additional waiting time, and a potential decrease in concurrency.
• Schedules that are not possible under two-phase locking are possible under the tree protocol, and vice versa.
Timestamp-Based Protocols:
• Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has timestamp TS(Ti), a new transaction Tj is assigned a timestamp TS(Tj) such that TS(Ti) < TS(Tj).
Checkpoints:
• Periodically the system performs checkpointing: all log records currently residing in main memory and all modified buffer blocks are output to disk, and a <checkpoint> record is written onto stable storage.
• During recovery we need to consider only the most recent transaction Ti that started before the checkpoint, and transactions that started after Ti:
  o Scan backwards from the end of the log to find the most recent <checkpoint> record.
  o Continue scanning backwards till a record <Ti start> is found.
  o Only the part of the log following the above start record need be considered; the earlier part of the log can be ignored during recovery, and can be erased whenever desired.
  o For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti). (Done only in the case of immediate modification.)
  o Scanning forward in the log, for all transactions starting from Ti or later with a <Ti start>, execute redo(Ti).
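The sketch below (an illustration assumed for these notes, using a simplified log of ("start" | "commit" | "checkpoint", transaction) records) builds the undo and redo lists by scanning exactly as in the steps above.

  # Illustrative checkpoint-based recovery: build the redo/undo lists from a simple log.
  log = [
      ("start", "T1"), ("commit", "T1"),
      ("start", "T2"),
      ("checkpoint", None),
      ("commit", "T2"),
      ("start", "T3"), ("commit", "T3"),
      ("start", "T4"),
  ]

  def recovery_lists(log):
      # 1. Scan backwards for the most recent checkpoint record.
      cp = max(i for i, (kind, _) in enumerate(log) if kind == "checkpoint")
      # 2. Continue backwards until a <Ti start> record is found.
      start = max(i for i, (kind, _) in enumerate(log[:cp]) if kind == "start")
      # 3. Only the log from that start record onwards needs to be considered.
      tail = log[start:]
      started = [t for (kind, t) in tail if kind == "start"]
      committed = {t for (kind, t) in tail if kind == "commit"}
      undo = [t for t in started if t not in committed]   # no <Ti commit>: undo
      redo = [t for t in started if t in committed]       # committed: redo
      return undo, redo

  print(recovery_lists(log))   # (['T4'], ['T2', 'T3']); T1 is ignored entirely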
Example of Checkpoints:
• T1 can be ignored (its updates were already output to disk due to the checkpoint).
• T2 and T3 are redone.
• T4 is undone.
Shadow Paging:
• Shadow paging is an alternative to log-based recovery; this scheme is useful if transactions execute serially.
• Idea: maintain two page tables during the lifetime of a transaction: the current page table and the shadow page table.
• Store the shadow page table in nonvolatile storage, so that the state of the database prior to transaction execution may be recovered.
  o The shadow page table is never modified during execution.
• To start with, both page tables are identical. Only the current page table is used for data item accesses during execution of the transaction.
• Whenever any page is about to be written for the first time:
  o A copy of this page is made onto an unused page.
  o The current page table is then made to point to the copy.
  o The update is performed on the copy.
Sample Page Table
Shadow and current page tables after write to page
To commit a transaction:
1. Flush all modified pages in main memory to disk.
2. Output the current page table to disk.
3. Make the current page table the new shadow page table, as follows:
   o Keep a pointer to the shadow page table at a fixed (known) location on disk.
   o To make the current page table the new shadow page table, simply update the pointer to point to the current page table on disk.
• Once the pointer to the shadow page table has been written, the transaction is committed.
No recovery is needed after a crash — new transactions can start right away, using the shadow page table.
• Pages not pointed to from the current/shadow page table should be freed (garbage collected).
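As a rough Python illustration of the copy-on-write idea and the commit-by-pointer-swap described above (the page and page-table representation here is invented for the sketch, not the notes' own scheme):

  # Illustrative shadow paging: copy-on-write pages plus a pointer swap at commit.
  pages = {0: "A", 1: "B", 2: "C"}            # "disk" pages: page id -> contents
  shadow_table = {"p0": 0, "p1": 1, "p2": 2}  # logical page -> physical page id
  current_table = dict(shadow_table)          # starts out identical to the shadow table
  db_pointer = "shadow"                       # fixed disk location naming the live table

  def write(logical, value):
      # On the first write to a page, copy it to an unused physical page,
      # repoint the current page table, and update only the copy.
      if current_table[logical] == shadow_table[logical]:
          new_id = max(pages) + 1
          pages[new_id] = pages[current_table[logical]]
          current_table[logical] = new_id
      pages[current_table[logical]] = value

  def commit():
      global shadow_table, db_pointer
      # Steps 1-2 (flushing modified pages and the current page table) are implicit here.
      # Step 3: update the fixed pointer so the current table becomes the new shadow table.
      shadow_table = dict(current_table)
      db_pointer = "current"

  write("p1", "B'")
  commit()
  print(shadow_table, pages)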
Advantages of shadow paging over log-based schemes:
1. No overhead of writing log records.
2. Recovery is trivial.
Disadvantages:
a. Copying the entire page table is very expensive.
   - This can be reduced by using a page table structured like a B+-tree: there is no need to copy the entire tree, only the paths in the tree that lead to updated leaf nodes.
b. Commit overhead is high even with the above extension: every updated page and the page table must be flushed.
c. Data gets fragmented (related pages get separated on disk).
d. After every transaction completion, the database pages containing old versions of modified data need to be garbage collected.
e. It is hard to extend the algorithm to allow transactions to run concurrently; log-based schemes are easier to extend.
UNIT V 1. Object Oriented Databases: Need for Complex Data Types, OO Data Model 2. Object Relational Databases: Nested Relations, Complex Types, Inheritance, Reference Types 3. Distributed Databases: Homogeneous and Heterogeneous, Distributed Data Storage 4. XML: Structure of XML Data, XML Document Schema, Querying and Transformation 5. Data Mining 6. Data Warehousing
XML - Extensible Markup Language
• It is defined by the WWW Consortium (W3C).
• It is derived from SGML (Standard Generalized Markup Language), but is simpler to use than SGML.
• It is extensible (i.e., new user-defined tags can be added, which is not supported by HTML).
• Tags make data (relatively) self-documenting. E.g.
    <bank>
      <account>
        <account-number> A-101 </account-number>
        <branch-name> Downtown </branch-name>
        <balance> 500 </balance>
      </account>
      <depositor>
        <account-number> A-101 </account-number>
        <customer-name> Johnson </customer-name>
      </depositor>
    </bank>
Purpose:
• Used to create special-purpose markup languages.
• Facilitates sharing of structured text and information across the Internet.
The Structure of XML:
• Tag: a label for a section of data. Example: <account> … </account>
• Element: a section of data beginning with <tagname> and ending with a matching </tagname>.
• Elements must be properly nested.
  o Proper nesting: <account> … <balance> … </balance> … </account>
  o Improper nesting: <account> … <balance> … </account> … </balance>
• Formally: every start tag must have a unique matching end tag that is in the context of the same parent element.
• Every document must have a single top-level element.
Example of Nested Elements:
    <bank-1>
      <customer>
        <customer-name> Hayes </customer-name>
        <customer-street> Main </customer-street>
        <customer-city> Harrison </customer-city>
        <account>
          <account-number> A-102 </account-number>
          <branch-name> Perryridge </branch-name>
          <balance> 400 </balance>
        </account>
        …
      </customer>
      …
    </bank-1>
• Empty tags: an element with no content can be abbreviated as a single tag ending in "/>", e.g. <account />
• Attributes: variable-value bindings inside the tag.
• Free text: character data (e.g. the text "I'm free!!!") may appear between the tags of an element.
Attributes:
• Elements can have attributes:
    <account account-number="A-102">
      <branch-name> Perryridge </branch-name>
      <balance> 400 </balance>
    </account>
• Attributes are specified by name=value pairs inside the starting tag of an element.
• An element may have several attributes, but each attribute name can occur only once.
• To store string data that may contain tags, without the tags being interpreted as subelements, use a CDATA section as below:
    <![CDATA[ <account> … </account> ]]>
  Here, <account> and </account> are treated as just strings.
Namespaces:
• XML data has to be exchanged between organizations.
• The same tag name may have different meanings in different organizations, causing confusion in exchanged documents.
• Specifying a unique string as an element name avoids confusion.
• Better solution: use unique-name:element-name.
• Avoid using long unique names all over the document by using XML Namespaces:
    <bank xmlns:FB="http://www.FirstBank.com">
      …
      <FB:branch>
        <FB:branchname> Downtown </FB:branchname>
        <FB:branchcity> Brooklyn </FB:branchcity>
      </FB:branch>
      …
    </bank>
XML Document Schema:
• Database schemas constrain what information can be stored and the data types of stored values.
• XML documents are not required to have an associated schema.
• However, schemas are very important for XML data exchange; otherwise a site cannot automatically interpret data received from another site.
• Two mechanisms for specifying XML schema:
  o Document Type Definition (DTD): widely used.
  o XML Schema: newer, not yet widely used.
Document Type Definition (DTD):
• The type of an XML document can be specified using a DTD.
• A DTD constrains the structure of XML data:
  o What elements can occur
  o What attributes an element can/must have
  o What subelements can/must occur inside each element, and how many times
• A DTD does not constrain data types: all values are represented as strings in XML.
• DTD syntax: <!ELEMENT element (subelements-specification)> and <!ATTLIST element (attributes)>
• Subelements can be specified as:
  o names of elements, or
  o #PCDATA (parsed character data), i.e., character strings, or
  o EMPTY (no subelements) or ANY (anything can be a subelement)
  Example: <!ELEMENT depositor (customer-name, account-number)>
• Subelement specifications may use regular expressions. Notation:
  o "|" : alternatives
  o "+" : 1 or more occurrences
  o "*" : 0 or more occurrences
• Example: Bank DTD
    <!DOCTYPE bank [
      <!ELEMENT bank ( (account | customer | depositor)+ )>
      <!ELEMENT account (account-number, branch-name, balance)>
      <!ELEMENT customer (customer-name, customer-street, customer-city)>
      <!ELEMENT depositor (customer-name, account-number)>
      <!ELEMENT account-number (#PCDATA)>
      <!ELEMENT branch-name (#PCDATA)>
      <!ELEMENT balance (#PCDATA)>
      <!ELEMENT customer-name (#PCDATA)>
      <!ELEMENT customer-street (#PCDATA)>
      <!ELEMENT customer-city (#PCDATA)>
    ]>
• Attribute specification: for each attribute,
  o its name,
  o the type of the attribute: CDATA, ID (identifier), IDREF (ID reference), or IDREFS (multiple IDREFs) — more on this later, and
  o whether it is mandatory (#REQUIRED), has a default value (value), or neither (#IMPLIED).
Limitations of DTDs:
• No typing of text elements and attributes: all values are strings; no integers, reals, etc.
• Difficult to specify unordered sets of subelements: order is usually irrelevant in databases; (A | B)* allows specification of an unordered set, but cannot ensure that each of A and B occurs only once.
• IDs and IDREFs are untyped: the owners attribute of an account may contain a reference to another account, which is meaningless; the owners attribute should ideally be constrained to refer to customer elements.
XML Schema:
• XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. It supports:
  o Typing of values (e.g. integer, string, etc.), and constraints on min/max values
  o User-defined types
  o It is itself specified in XML syntax, unlike DTDs (a more standard representation, but verbose)
  o Integration with namespaces
  o Many more features: list types, uniqueness and foreign key constraints, inheritance, ...
• BUT: significantly more complicated than DTDs, and not yet widely used.
Querying and Transforming XML Data:
• The translation of information from one XML schema to another, and querying of XML data, are closely related and are handled by the same tools.
• The standard XML querying/translation languages are:
  o XPath: a simple language consisting of path expressions
  o XSLT: a simple language designed for translation from XML to XML and from XML to HTML
  o XQuery: an XML query language with a rich set of features
• A wide variety of other languages have been proposed, and some served as a basis for the XQuery standard: XML-QL, Quilt, XQL, …
XPath:
• XPath is used to address (select) parts of documents using path expressions.
• A path expression is a sequence of steps separated by "/" (think of file names in a directory hierarchy).
• Result of a path expression: the set of values that, along with their containing elements/attributes, match the specified path.
  o E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns
      <name> Joe </name>
      <name> Mary </name>
  o E.g. /bank-2/customer/name/text() returns the same names, but without the enclosing tags.
• The initial "/" denotes the root of the document (above the top-level tag).
• Path expressions are evaluated left to right; each step operates on the set of instances produced by the previous step.
• Selection predicates may follow any step in a path, in [ ]:
  o E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400.
  o /bank-2/account[balance] returns account elements containing a balance subelement.
• Attributes are accessed using "@":
  o E.g. /bank-2/account[balance > 400]/@account-number returns the account numbers of those accounts with balance > 400.
• IDREF attributes are not dereferenced automatically (more on this later).
Functions in XPath:
• XPath provides several functions:
  o The function count() at the end of a path counts the number of elements in the set generated by the path.
    E.g. /bank-2/account[customer/count() > 2] returns accounts with more than 2 customers.
  o There is also a function for testing the position (1, 2, ...) of a node with respect to its siblings.
  o The Boolean connectives "and" and "or", and the function not(), can be used in predicates.
  o IDREFs can be dereferenced using the function id(); id() can also be applied to sets of references such as IDREFS, and even to strings containing multiple references separated by blanks.
    E.g. /bank-2/account/id(@owner) returns all customers referred to from the owners attribute of account elements.
XSLT:
• A stylesheet stores formatting options for a document, usually separately from the document. E.g. an HTML style sheet may specify font colors and sizes for headings, etc.
• The XML Stylesheet Language (XSL) was originally designed for generating HTML from XML.
• XSLT is a general-purpose transformation language: it can translate XML to XML, and XML to HTML.
• XSLT transformations are expressed using rules called templates; templates combine selection using XPath with construction of results.
XSLT Templates:
• Example of an XSLT template with match and select parts:
    <xsl:template match="/bank-2/customer">
      <xsl:value-of select="customer-name"/>
    </xsl:template>
    <xsl:template match="*"/>
• The match attribute of xsl:template specifies a pattern in XPath.
• Elements in the XML document matching the pattern are processed by the actions within the xsl:template element.
• xsl:value-of selects (outputs) specified values (here, customer-name).
• If an element matches several templates, only one is used, depending on a complex priority scheme/user-defined priorities; we assume only one template matches any element.
• Elements that do not match any template are output as is.
• The template <xsl:template match="*"/> matches all elements that do not match any other template, and is used to ensure that their contents do not get output.
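As a small, self-contained illustration of the path expressions described above (not part of the original notes), Python's standard xml.etree.ElementTree module supports a limited subset of XPath; the bank-2 fragment below is made up in the spirit of the earlier examples.

  # Evaluating simple path expressions with ElementTree (limited XPath subset).
  import xml.etree.ElementTree as ET

  doc = """<bank-2>
    <customer><name>Joe</name></customer>
    <customer><name>Mary</name></customer>
    <account account-number="A-101"><balance>500</balance></account>
    <account account-number="A-102"><balance>400</balance></account>
  </bank-2>"""

  root = ET.fromstring(doc)

  # /bank-2/customer/name : all name elements under customer
  print([n.text for n in root.findall("./customer/name")])        # ['Joe', 'Mary']

  # Selection predicates such as [balance > 400] are not supported by ElementTree,
  # so the comparison is done in Python instead.
  big = [a.get("account-number") for a in root.findall("./account")
         if int(a.find("balance").text) > 400]
  print(big)                                                       # ['A-101']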
XQuery:
• XQuery is a general-purpose query language for XML data.
• It is currently being standardized by the World Wide Web Consortium (W3C); the textbook description is based on a March 2001 draft of the standard. The final version may differ, but the major features are likely to stay unchanged.
• An alpha version of an XQuery engine is available free from Microsoft.
• XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL.
• XQuery uses a for … let … where … result … syntax:
  o for ↔ SQL from
  o where ↔ SQL where
  o result ↔ SQL select
  o let allows temporary variables, and has no equivalent in SQL
FLWR Syntax in XQuery:
• The for clause uses XPath expressions, and the variable in the for clause ranges over values in the set returned by XPath.
• Simple FLWR expression in XQuery: find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag:
    for $x in /bank-2/account
    let $acctno := $x/@account-number
    where $x/balance > 400
    return <account-number> $acctno </account-number>
• The let clause is not really needed in this query, and the selection can be done in XPath. The query can be written as:
    for $x in /bank-2/account[balance > 400]
    return $x/@account-number
Advantages of XML:
• Data reusability: single source, multiple outputs
• Targeted retrieval
• Flexibility
• Accessibility
• Portability
Strengths and Weaknesses:
Strengths:
• Just text: compatible with Web/Internet protocols
• Human and machine readable
• Represents most CS data structures well
• Strict syntax → parsing is fast and efficient
Weaknesses:
• Verbose/redundant
• Trouble modeling overlapping (non-hierarchical) data structures
Object-Based Databases
• Extend the relational data model by including object orientation and constructs to deal with added data types.
• Allow attributes of tuples to have complex types, including non-atomic values such as nested relations.
• Preserve relational foundations, in particular the declarative access to data, while extending modeling power.
• Upward compatibility with existing relational languages.
Complex Data Types:
• Motivation:
  o Permit non-atomic domains (atomic ≡ indivisible).
  o Example of a non-atomic domain: a set of integers, or a set of tuples.
  o Allows more intuitive modeling for applications with complex data.
• Intuitive definition: allow relations wherever we allow atomic (scalar) values, i.e., relations within relations.
• Retains the mathematical foundation of the relational model.
Structured Types and Inheritance in SQL:
• Structured types can be declared and used in SQL:
    create type Name as
      (firstname varchar(20),
       lastname varchar(20))
    final
    create type Address as
      (street varchar(20),
       city varchar(20),
       zipcode varchar(20))
    not final
• Note: final and not final indicate whether subtypes can be created.
Methods:
• A method declaration can be added to a structured type:
    method ageOnDate (onDate date)
      returns interval year
• The method body is given separately:
    create instance method ageOnDate (onDate date)
      returns interval year
      for CustomerType
    begin
      return onDate - self.dateOfBirth;
    end
• We can now find the age of each customer:
    select name.lastname, ageOnDate (current_date)
    from customer
Inheritance:
• Suppose that we have the following type definition for people:
    create type Person
      (name varchar(20),
       address varchar(20))
• Using inheritance to define the student and teacher types:
    create type Student under Person
      (degree varchar(20),
       department varchar(20))
    create type Teacher under Person
      (salary integer,
       department varchar(20))
• Subtypes can redefine methods by using "overriding method" in place of "method" in the method declaration.
Multiple Inheritance:
• SQL:1999 and SQL:2003 do not support multiple inheritance.
• If our type system supports multiple inheritance, we can define a type for teaching assistants as follows:
    create type TeachingAssistant under Student, Teacher
• To avoid a conflict between the two occurrences of department, we can rename them:
    create type TeachingAssistant
      under Student with (department as student_dept),
            Teacher with (department as teacher_dept)
Object-Identity and Reference Types:
• Define a type Department with a field name and a field head which is a reference to the type Person, with the table people as scope:
    create type Department (
      name varchar(20),
      head ref(Person) scope people)
• We can then create a table departments as follows:
    create table departments of Department
• We can omit the declaration scope people from the type declaration and instead make an addition to the create table statement:
    create table departments of Department
      (head with options scope people)
Path Expressions:
• Find the names and addresses of the heads of all departments:
    select head->name, head->address
    from departments
• An expression such as "head->name" is called a path expression.
• Path expressions help avoid explicit joins: if department head were not a reference, a join of departments with people would be required to get at the address. This makes expressing the query much easier for the user.
Object-Relational Databases
• Object-relational systems pursue the same goals listed above for object-based databases: complex data types, object orientation, preservation of the declarative relational foundations, and upward compatibility with existing relational languages.
Nested Relations:
• Motivation:
  o Permit non-atomic domains (NFNF, or NF²).
  o Example of a non-atomic domain: sets of integers, or tuples.
  o Allows more intuitive modeling for applications with complex data.
• Intuitive definition:
  o Allow relations wherever we allow atomic (scalar) values: relations within relations.
  o Retains the mathematical foundation of the relational model.
  o Violates first normal form.
Example: library information system
• Each book has:
  o a title,
  o a set of authors,
  o a publisher, and
  o a set of keywords.
• Non-1NF relation books.
1NF Version of Nested Relation
4NF Decomposition of Nested Relation
• Remove the awkwardness of flat-books by assuming that the following multivalued dependencies hold:
  o title →→ author
  o title →→ keyword
  o title →→ pub-name, pub-branch
• Decompose flat-books into 4NF using the schemas:
  o (title, author)
  o (title, keyword)
  o (title, pub-name, pub-branch)
4NF Decomposition of flat-books
Problems with 4NF Schema:
• A 4NF design may require more joins in queries.
• A 1NF relational view flat-books can be defined by a join of the 4NF relations:
  o This eliminates the need for users to perform joins,
  o but loses the one-to-one correspondence between tuples and documents,
  o and has a large amount of redundancy.
• The NFNF (nested) relational representation is much more natural here.
Nesting:
• Nesting is the opposite of unnesting: it creates a collection-valued attribute, using a set "aggregate function".
• NOTE: SQL:1999 does not support nesting.
• Nesting can be done in a manner similar to aggregation, but using the function set() in place of an aggregation operation, to create a set.
• To nest the flat-books relation on the attribute keyword:
    select title, author, Publisher(pub_name, pub_branch) as publisher,
           set(keyword) as keyword_list
    from flat-books
    group by title, author, publisher
External Language Functions/Procedures:
• SQL:1999 permits the use of functions and procedures written in other languages such as C or C++.
• Declaring external language procedures and functions:
    create procedure author-count-proc(in title varchar(20), out count integer)
    language C
    external name '/usr/avi/bin/author-count-proc'

    create function author-count(title varchar(20)) returns integer
    language C
    external name '/usr/avi/bin/author-count'
Comparison of O-O and O-R Databases:
Summary of strengths of various database systems:
• Relational systems: simple data types, powerful query languages, high protection.
• Persistent-programming-language-based OODBs: complex data types, integration with programming languages, high performance.
• Object-relational systems: complex data types, powerful query languages, high protection.
• Note: many real systems blur these boundaries. E.g. a persistent programming language built as a wrapper on a relational database offers the first two benefits, but may have poor performance.
Distributed Databases
Distributed Database System • A distributed database system consists of loosely coupled sites that share no physical component. • The data reside in several relations. • Database systems that run on each site are independent of each other • Transactions may access data at one or more sites
Homogeneous Distributed Databases:
In a homogeneous distributed database:
• All sites have identical software.
• Sites are aware of each other and agree to cooperate in processing user requests.
• Each site surrenders part of its autonomy in terms of the right to change schemas or software.
• The system appears to the user as a single system.
Heterogeneous Distributed Databases:
• Different sites may use different schemas and software.
• Differences in schema are a major problem for query processing.
• Differences in software are a major problem for transaction processing.
Distributed Data Storage:
• Assume the relational data model.
• Replication: the system maintains multiple copies of data, stored at different sites, for faster retrieval and fault tolerance.
• Fragmentation: a relation is partitioned into several fragments stored at distinct sites.
• Replication and fragmentation can be combined: a relation is partitioned into several fragments, and the system maintains several identical replicas of each such fragment.
Data Replication:
• A relation or fragment of a relation is replicated if it is stored redundantly at two or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication:
• Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
• Parallelism: queries on r may be processed by several nodes in parallel.
• Reduced data transfer: relation r is available locally at each site containing a replica of r.
Disadvantages of Replication:
• Increased cost of updates: each replica of relation r must be updated.
• Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.
  o One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy.
Data Fragmentation:
• Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to one or more fragments.
• Vertical fragmentation: the schema for relation r is split into several smaller schemas.
  o All schemas must contain a common candidate key (or superkey) to ensure a lossless-join property.
  o A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.
• Example: relation account with the following schema:
    Account = (branch_name, account_number, balance)
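A small Python illustration (assumed, not from the notes) of horizontal and vertical fragmentation of the account relation above; the tuples and branch names are made up.

  # Horizontal and vertical fragmentation of account(branch_name, account_number, balance).
  account = [
      ("Hillside", "A-305", 500),
      ("Hillside", "A-226", 336),
      ("Valleyview", "A-177", 205),
  ]

  # Horizontal fragmentation: each tuple goes to the fragment for its branch's site.
  account_hillside   = [t for t in account if t[0] == "Hillside"]
  account_valleyview = [t for t in account if t[0] == "Valleyview"]

  # Vertical fragmentation: split the schema, adding a tuple-id so fragments can be rejoined.
  deposit1 = [(tid, branch, acct) for tid, (branch, acct, _) in enumerate(account)]
  deposit2 = [(tid, bal) for tid, (_, _, bal) in enumerate(account)]

  # Lossless reconstruction: natural join of the vertical fragments on tuple-id.
  rejoined = [(b, a, bal) for (tid1, b, a) in deposit1
                          for (tid2, bal) in deposit2 if tid1 == tid2]
  assert rejoined == account
  print(account_hillside, deposit1)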
Advantages of Fragmentation:
• Horizontal:
  o allows parallel processing on fragments of a relation
  o allows a relation to be split so that tuples are located where they are most frequently accessed
• Vertical:
  o allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
  o the tuple-id attribute allows efficient joining of vertical fragments
  o allows parallel processing on a relation
• Vertical and horizontal fragmentation can be mixed: fragments may be successively fragmented to an arbitrary depth.
Data Transparency:
• Data transparency: the degree to which a system user may remain unaware of the details of how and where data items are stored in a distributed system.
• Consider transparency issues in relation to:
  o Fragmentation transparency
  o Replication transparency
  o Location transparency
Distributed Transactions: refer to the textbook.
Data Warehouse:
• Data warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
• A data warehouse supports both data management and data analysis.
• A data webhouse is a distributed data warehouse that is implemented over the web with no central data repository.
• Goal: to integrate enterprise-wide corporate data into a single repository from which users can easily run queries.
• Subject-oriented: the warehouse is organized around the major subjects of the enterprise rather than the major application areas. This is reflected in the need to store decision-support data rather than application-oriented data.
• Integrated: the source data come together from different enterprise-wide application systems. The source data are often inconsistent; the integrated data source must be made consistent to present a unified view of the data to the users.
• Time-variant: the source data in the warehouse are only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time for which the data are held, the implicit or explicit association of time with all data, and the fact that the data represent a series of snapshots.
• Non-volatile: data are not updated in real time but are refreshed from operational systems on a regular basis. New data are always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
The benefits of data warehousing:
• High returns on investment
• Substantial competitive advantage
• Increased productivity of corporate decision-makers
Comparison of OLTP systems and data warehousing systems:
OLTP systems | Data warehousing systems
Hold current data | Hold historical data
Store detailed data | Store detailed, lightly, and highly summarized data
Data is dynamic | Data is largely static
Repetitive processing | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput | Medium to low level of transaction throughput
Predictable pattern of usage | Unpredictable pattern of usage
Transaction-driven | Analysis-driven
Application-oriented | Subject-oriented
Support day-to-day decisions | Support strategic decisions
Serve large numbers of clerical/operational users | Serve relatively low numbers of managerial users
Problems:
• Underestimation of resources for data loading
• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• Data homogenization
• High demand for resources
• Data ownership
• High maintenance
• Long-duration projects
• Complexity of integration
The architecture of a data warehouse:
[Figure: Typical architecture of a data warehouse. Components shown: operational data sources 1..n, operational data store (ODS), load manager, warehouse manager, query manager, a DBMS holding detailed data, lightly and highly summarized data, meta-data, archive/backup data, and end-user access tools such as OLAP (online analytical processing) tools.]
• Operational data sources: data for the warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
• Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
• Load manager: also called the front-end component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
• Warehouse manager: performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up of data.
• Query manager: also called the back-end component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.
• Detailed, lightly and highly summarized data, and archive/backup data.
• Meta-data.
• End-user access tools: can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
Data flows:
• Inflow: the processes associated with the extraction, cleansing, and loading of data from the source systems into the data warehouse.
• Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
• Downflow: the processes associated with archiving and backing up of data in the warehouse.
• Outflow: the processes associated with making the data available to the end-users.
• Meta-flow: the processes associated with the management of the meta-data.
The critical steps in the construction of a data warehouse:
a. Extraction
b. Cleansing
c. Transformation
• After the critical steps, loading the results into the target system can be carried out either by separate products or by a single integrated solution.
Schema Design:
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
• Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year).
• The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
Schema: the data warehouse follows one of the following schema architectures:
1. Star schema
2. Snowflake schema
3. Galaxy schema (fact constellation)
Star schema: a fact table in the middle connected to a set of dimension tables.
Example of Star Schema
[Figure: Star schema example. A central Sales fact table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales) is connected to four dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, province_or_state, country). units_sold, dollars_sold and avg_sales are the measures.]
Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Example of Snowflake Schema
[Figure: Snowflake schema example. The Sales fact table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales) connects to the time, item, branch, and location dimension tables; here the item dimension (item_key, item_name, brand, type, supplier_key) is further normalized into a supplier table (supplier_key, supplier_type), and the location dimension (location_key, street, city_key) is normalized into a city table (city_key, city, province_or_state, country).]
Fact constellations: in large applications one fact table is not enough to define the application. In these kinds of applications, multiple fact tables share dimension tables. This can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.
Data mart:
• A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function.
• A data mart focuses on only the requirements of users associated with one department or business function.
• Data marts do not normally contain detailed operational data, unlike data warehouses.
• As data marts contain less data compared with data warehouses, data marts are more easily understood and navigated.
[Figure: Typical data warehouse and data mart architecture. First tier: operational data sources 1..n and the operational data store (ODS) feeding the load manager. Second tier: the data warehouse proper (warehouse manager, query manager, DBMS with detailed data, lightly and highly summarized data, meta-data, and archive/backup data), from which data marts are built as summarized data in relational or multi-dimensional databases. Third tier: end-user access tools such as reporting, query, application development and EIS (executive information system) tools, OLAP (online analytical processing) tools, and data mining tools.]
Reasons for creating a data mart:
• To give users access to the data they need to analyze most often.
• To provide data in a form that matches the collective view of the data by a group of users in a department or business function.
• To improve end-user response time due to the reduction in the volume of data to be accessed.
Typical OLAP Operations
• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
• Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
• Slice and dice: project and select.
• Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes.
• Other operations:
  o Drill across: involving (across) more than one fact table.
  o Drill through: through the bottom level of the cube to its back-end relational tables (using SQL).
• A virtual warehouse is a set of views over operational databases. Only some of the possible summary views may be materialized.
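As a tiny worked illustration of roll-up (not in the original notes), the sketch below aggregates a fact table of (item, city, units_sold) rows up the location hierarchy from city to country, using plain Python; the data and mapping are invented.

  # Roll-up along the location hierarchy: city -> country.
  from collections import defaultdict

  facts = [  # (item, city, units_sold)
      ("TV", "Chennai", 30), ("TV", "Madurai", 20),
      ("TV", "Toronto", 25), ("Phone", "Chennai", 50),
  ]
  city_to_country = {"Chennai": "India", "Madurai": "India", "Toronto": "Canada"}

  rolled_up = defaultdict(int)
  for item, city, units in facts:
      rolled_up[(item, city_to_country[city])] += units   # climb the hierarchy, then aggregate

  print(dict(rolled_up))
  # {('TV', 'India'): 50, ('TV', 'Canada'): 25, ('Phone', 'India'): 50}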
Data Mining:
• The process of knowledge discovery, i.e., the retrieval of hidden information from data banks and collections of databases. The initial steps involve selection and pre-processing by an expert in the appropriate knowledge domain.
• Data mining (knowledge discovery in databases) is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Applications of data mining:
• Prediction based on past history:
  o Predict whether a credit card applicant poses a good credit risk, based on attributes such as income, job type, age, …, and past history.
  o Predict whether a pattern of phone calling card usage is likely to be fraudulent.
• Some examples of prediction mechanisms:
  o Classification: given a new item whose class is unknown, predict to which class it belongs.
  o Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.
• Descriptive patterns:
  o Associations: find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  o Associations may be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
  o Clusters: e.g. typhoid cases were clustered in an area surrounding a contaminated well; detection of clusters remains important in detecting epidemics.
Some basic operations:
• Predictive:
  o Regression
  o Classification
• Descriptive:
  o Clustering / similarity matching
  o Association rules and variants
  o Deviation detection
Classification Rules:
• Classification rules help assign new objects to classes.
  o E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
• Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  o ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  o ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be shown compactly as a decision tree.
Decision Tree
Construction of Decision Trees:
• Training set: a data sample in which the classification (class) is already known (normally about 1/3 of the database).
• We use greedy, top-down generation of decision trees.
• Internal node: each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node.
• Leaf node: all (or most) of the items at the node belong to the same class, or all attributes have been considered and no further partitioning is possible.
Best Splits:
This determines on which attribute the split in the decision tree occurs:
• Pick the best attributes and conditions on which to partition.
• The purity of a set S of training instances can be measured quantitatively in several ways. Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = pi.
• Use any one of the purity measures and select the attribute accordingly.
Finding Best Splits:
• Categorical attributes (with no meaningful order):
  o Multi-way split: one child for each value.
  o Binary split: try all possible breakups of the values into two sets, and pick the best.
• Continuous-valued attributes (can be sorted in a meaningful order):
  o Binary split: sort values, try each as a split point. E.g. if the values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15. Pick the value that gives the best split.
  o Multi-way split: a series of binary splits on the same attribute has roughly the same effect.
Decision-Tree Construction Algorithm:
    Procedure GrowTree(S)
      Partition(S);
    Procedure Partition(S)
      if (purity(S) > δp or |S| < δs) then return;
      for each attribute A
        evaluate splits on attribute A;
      use the best split found (across all attributes) to partition S into S1, S2, …, Sr;
      for i = 1, 2, …, r
        Partition(Si);
Other Types of Classifiers:
• Neural net classifiers are studied in artificial intelligence and are not covered here.
• Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
  where
    p(cj | d) = probability of instance d being in class cj,
    p(d | cj) = probability of generating instance d given class cj,
    p(cj) = probability of occurrence of class cj, and
    p(d) = probability of instance d occurring.
Naïve Bayesian Classifiers:
• Bayesian classifiers require:
  o computation of p(d | cj),
  o precomputation of p(cj), and
  o p(d) can be ignored since it is the same for all classes.
• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
• Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
• Histograms on multiple attributes are more expensive to compute and store.
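A minimal Python sketch of the naïve Bayes estimate described above, using per-class histograms of attribute values built from a made-up training set (the attributes and classes are illustrative only):

  # Naive Bayesian classification from per-class attribute histograms.
  from collections import Counter, defaultdict

  training = [  # (degree, income_band) -> class
      (("masters", "high"), "excellent"),
      (("masters", "medium"), "excellent"),
      (("bachelors", "medium"), "good"),
      (("bachelors", "low"), "good"),
  ]

  class_counts = Counter(c for _, c in training)
  hist = defaultdict(Counter)            # (class, attribute index) -> value counts
  for attrs, c in training:
      for i, v in enumerate(attrs):
          hist[(c, i)][v] += 1

  def score(attrs, c):
      # p(d | cj) * p(cj), assuming independent attributes;
      # p(d) is ignored since it is the same for every class.
      p = class_counts[c] / len(training)
      for i, v in enumerate(attrs):
          p *= hist[(c, i)][v] / class_counts[c]
      return p

  d = ("masters", "medium")
  print(max(class_counts, key=lambda c: score(d, c)))   # 'excellent'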
Regression:
• Regression deals with the prediction of a value, rather than a class.
• Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, a2, …, an such that
    Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
• Finding such a linear polynomial is called linear regression.
• In general, the process of finding a curve that fits the data is called curve fitting.
• The fit may only be approximate, because of noise in the data or because the relationship is not exactly a polynomial; regression aims to find coefficients that give the best possible fit.
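For the single-variable case Y = a0 + a1 * X, the least-squares coefficients have a simple closed form; the following Python sketch (illustrative, with made-up data) computes them:

  # Simple linear regression: fit Y = a0 + a1 * X by least squares.
  xs = [1.0, 2.0, 3.0, 4.0]
  ys = [2.1, 3.9, 6.2, 7.8]        # roughly Y = 2X

  n = len(xs)
  mean_x = sum(xs) / n
  mean_y = sum(ys) / n
  a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
       sum((x - mean_x) ** 2 for x in xs)
  a0 = mean_y - a1 * mean_x

  print(a0, a1)                    # a1 is close to 2, a0 close to 0
  print(a0 + a1 * 5.0)             # predicted Y for a new X = 5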
Association Rules:
• Retail shops are often interested in associations between different items that people buy.
  o Someone who buys bread is quite likely also to buy milk.
  o A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Association information can be used in several ways. E.g. when a customer buys a particular book, an online shop may suggest associated books.
• Association rules:
  o bread ⇒ milk
  o DB-Concepts, OS-Concepts ⇒ Networks
• Left hand side: antecedent; right hand side: consequent.
• An association rule must have an associated population; the population consists of a set of instances.
  o E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  o E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
• Confidence is a measure of how often the consequent is true when the antecedent is true.
  o E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
• We are generally interested only in association rules with reasonably high support (e.g. support of 2% or greater).
Naïve algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e. count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules. From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
   Support of rule = support(A). Confidence of rule = support(A) / support(A - {b}).
Finding Support:
• Determine the support of itemsets via a single pass over the set of transactions.
• Large itemsets: sets with a high count at the end of the pass.
• If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
• Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets need be considered.
• The a priori technique to find large itemsets:
  o Pass 1: count the support of all sets with just 1 item. Eliminate those items with low support.
  o Pass i: candidates are every set of i items such that all of its (i-1)-item subsets are large. Count the support of all candidates. Stop if there are no candidates.
Other Types of Associations:
• Basic association rules have several limitations.
• Deviations from the expected probability are more interesting.
  o E.g. if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both.
  o We are interested in positive as well as negative correlations between sets of items:
    - Positive correlation: co-occurrence is higher than predicted.
    - Negative correlation: co-occurrence is lower than predicted.
• Sequence associations / correlations: e.g. whenever bonds go up, stock prices go down in 2 days.
• Deviations from temporal patterns:
  o E.g. deviation from steady growth.
  o E.g. sales of winter wear go down in summer: not surprising, part of a known pattern. Look for deviations from the value predicted using past patterns.
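A short Python sketch (illustrative only, with invented transactions) that computes the support and confidence of the rule bread ⇒ milk exactly as defined above:

  # Support and confidence of the association rule bread => milk.
  transactions = [
      {"bread", "milk", "butter"},
      {"bread", "milk"},
      {"bread", "jam"},
      {"milk", "cereal"},
      {"bread", "milk", "screwdriver"},
  ]

  antecedent, consequent = {"bread"}, {"milk"}
  both = sum(1 for t in transactions if antecedent | consequent <= t)
  ante = sum(1 for t in transactions if antecedent <= t)

  support = both / len(transactions)     # fraction satisfying antecedent and consequent
  confidence = both / ante               # of those buying bread, how many also buy milk
  print(support, confidence)             # 0.6 and 0.75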
Clustering:
• Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
• Clustering can be formalized using distance metrics in several ways:
  o Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized. Centroid: the point defined by taking the average of the coordinates in each dimension.
  o Another metric: minimize the average distance between every pair of points in a cluster.
Other Types of Mining:
• Text mining: the application of data mining to textual documents.
  o Cluster Web pages to find related pages.
  o Cluster pages a user has visited to organize their visit history.
  o Classify Web pages automatically into a Web directory.
• Data visualization systems help users examine large volumes of data and detect patterns visually.
  o They can visually encode large amounts of information on a single screen.
  o Humans are very good at detecting visual patterns.
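A compact Python sketch of the centroid-based clustering idea described above: a basic k-means iteration on made-up one-dimensional points (not part of the original notes).

  # Basic k-means on 1-D points: assign each point to its nearest centroid, then recompute centroids.
  points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
  centroids = [1.0, 9.0]            # initial guesses, k = 2

  for _ in range(10):               # a few fixed iterations are enough here
      clusters = [[] for _ in centroids]
      for p in points:
          nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
          clusters[nearest].append(p)
      centroids = [sum(c) / len(c) for c in clusters]   # move each centroid to its cluster mean

  print(centroids, clusters)        # roughly [1.5, 9.5] and the two groups of points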