Faculty of Computer Science & Engineering Ho Chi Minh City University of Technology
Data modeling
Dr. Vo Thi Ngoc Chau (
[email protected])
Semester 1 – 2014-2015
1
Main references [1] R. Elmasri, S. R. Navathe, Fundamentals of Database
Systems- 4th Edition, Pearson- Addison Wesley, 2003.
[2] H. G. Molina, J. D. Ullman, J. Widom, Database System Implementation, Prentice-Hall, 2000. [3] H. G. Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book, Prentice-Hall, 2002
[4] A. Silberschatz, H. F. Korth, S. Sudarshan, Database System Concepts –3rd Edition, McGraw-Hill, 1999. [5] R. Ramakrishnan, J. Gehrke, Database Management Systems- 2rd Edition, McGraw-Hill. [6] M. P. Papazoglou, S. Spaccapietra, Z. Tari, Advances in ObjectOriented Data Modeling, MIT Press, 2000. [7]. G. Simsion, Data Modeling: Theory and Practice, Technics Publications, LLC, 2007.
2
Content
2.1. Concepts
2.2. Why is data modeling important?
2.3. Data modeling with entity-
relationship and relational models
2.4. Dependencies and normalization
2.5. Conclusion 3
2.1. Concepts Data
modeling
Conceptual
data model
Entity relationship model
Representational
Relational data model
Database
data model
design
Relational database design 4
Data modeling
Modeling - Cambridge dictionary
Model = a representation of something, either as a physical object which is usually smaller than the real object, or as a simple description
of the object which might be used in calculations
Modeling = constructing a representation of something, either as a physical object which is usually smaller than the real object, or as a simple description of the object which might be used in calculations
5
Data modeling
Data modeling – [7], pp. 31-32
“formalizing and representing the data structures of reality”
“ a representation of the things of significance to an enterprise and the relationships among those things”
Shoval and Frumermann, 1994, p28
Hay, 1996a
“an attempt to capture the essence of things both concrete and abstract”
Keuffel, 1996 6
Data modeling
Data modeling – [7], pp. 31-32
“an abstract representation of the data about entities, events, activities, and their associations within an organization”
McFadden, Hoffer et al., 1999
“The core idea underlying all the definitions is the same: a data model is used for describing entities and their relationships within a core domain.”
Topi and Ramesh, 2002 7
Data modeling
Data modeling – [7], pp. 31-32
“Data modeling is generally viewed as a design activity”
“an activity that involves the creation of abstractions”
Davydov, 1994
“the art and science of arranging the structure and relationship of data”
Srinivasan and Te‟eni, 1990
McComb, 2004, p293
“data modeling is a design discipline”
Simsion and Witt, 2005, p7
8
Conceptual data model
A data model (aka semantic data model)
Provides concepts close to the way many users perceive data
Provides the concepts essential for supporting the application environment at a very high nonsystem-specific level
Used for a conceptual schema of a database from data requirements
Example: entity-relationship model 9
Conceptual data model
Characteristics of a conceptual data model
Expressiveness
Simplicity
A small number of basic concepts that are distinct and orthogonal in their meaning
Formality
Simple enough for an end user to use and understand an easy diagrammatic notation
Minimality
Distinctions between data, relationships, constraints
Concepts must be formally defined state criteria for the validity of a schema in the model
Unique interpretation
Complete and unambiguous semantics for each modeling construct
S. Navathe. Evolution of data modeling for databases. Communications of the ACM 35(9)(1992) 112-123.
10
Representational data model
A data model (aka implementation/logical data model)
Provides concepts understood by end-users but not too far removed from the way data is organized within the computer
Hides some details of data storage but able to be implemented on a computer system in a direct way (in some DBMS)
Example: relational data model 11
Database design
Design the logical and physical structure of one or more databases to accommodate the information
needs of the users in an organization for a defined set of applications
The overall database design activity has to undergo a systematic process called the design methodology,
whether the target database is managed by an RDBMS, ODBMS, or ORDBMS, ...
The result of the design activity is a rigidly defined database schema that cannot easily be modified once the database is implemented.
12
Database design
Goals:
Satisfy the information content requirements of
the specified users and applications
Provide a natural and easy-to-understand
structuring of the information
Support processing requirements and any
performance objectives, such as response time, processing time, and storage space 13
Database design
Figure 2.1 Phases of database design and implementation for
large databases [1], pp. 368
14
Database design
Six main phases of the overall database design and implementation process
Requirements collection and analysis
Conceptual database design
Choice of a DBMS
Data model mapping (also called logical
database design)
Physical database design
Database system implementation and tuning
15
Database design
Phase 1: Requirements Collection and Analysis
Identify the major application areas and user groups that will use the database or whole work will be affected by the database
Study and analyze existing documentation concerning the applications
Study the current operating environment and planned use of the information
Analyze the types of transactions and their frequencies as well as of the flow of information within the system Study geographic characteristics regarding users, origin of transactions, destination of reports, … Specify the input and output data for the transactions
Collect written responses to sets of questions from the potential database users or user groups
Users‟ priorities and the importance users place on various applications
Requirements from users and applications of the information system that will interact with the database system
Time-consuming but crucial to the success of the information system
Know and analyze the expectations of the users and the intended uses of 16 the database in as much detail as possible
Database design
Phase 2: Conceptual Database Design
Phase 2a: Conceptual Schema Design
Examine the data requirements resulting from Phase 1 Produce a conceptual database schema in a DBMSindependent high-level data model
Phase 2b: Transaction Design
Design the characteristics of known database transactions (applications) in a DBMS-independent way to ensure that the database schema will include all the information required by these transactions Identify transactions‟ input/output and functional behavior
Group transactions into three categories: retrieval transactions, update transactions, mixed transactions
17
Database design
Phase 2: Conceptual Database Design
Phase 2a: Conceptual Schema Design
Goal = a complete understanding of the database structure, meaning (semantics), interrelationships, and constraints
Identify: entity types, relationship types, attributes, key attributes, cardinality and participation constraints on relationships, weak entity types, specialization/generalization hierarchies
Approaches
The centralized (or one-shot) schema design approach The view integration approach
18
Database design
Phase 2a: Conceptual Schema Design
The centralized (or one-shot) schema design approach
The requirements of the different applications and user groups from Phase 1 are merged into a single set of requirements before schema design begins. A single schema corresponding to the merged set of requirements is then designed. The DBA is responsible for deciding how to merge the requirements and for designing the conceptual schema for the whole database
Once the conceptual schema is designed and finalized, external schemas for the various user groups and applications can be specified by the DBA.
19
Database design
Phase 2a: Conceptual Schema Design
The view integration approach
The requirements are not merged. A schema (or view) is designed for each user group or application based only on its own requirements. Each user group or application
These schemas are merged or integrated into a global conceptual schema for the entire database.
The DBA
The individual views can be reconstructed as external schemas after view integration. 20
Database design
Phase 3: Choice of a DBMS
Technical factors
Storage structures and access paths, the user and programmer interfaces, high-level query languages, development tools, architectural options, … supported by DBMS DBMS portability among different types of hardware
Non-technical factors
Suitability & type of the DBMS for the task
…
Cost: software acquisition cost, maintenance cost, hardware acquisition cost, database creation and conversion cost, personnel cost, training cost, operating cost Availability of vendor services 21
Database design
Phase 4: Data Model Mapping (Logical Database Design)
Create a conceptual schema and external schemas in the data model of the selected DBMS
The result of this phase should be DDL statements in the language of the chosen DBMS that specify the conceptual and external level schemas of the database system. 22
Database design
Phase 5: Physical Database Design
Restricted to choosing the most appropriate structures for the database files from among the options offered by that DBMS
Choose specific storage structures and access paths for the database files
Response time
Space utilization
Transaction throughput
Estimate record size and number of records in each database file
Estimate the update and retrieval patterns for the file cumulatively from all the transactions
Estimate the file growth, either in the record size because of new attributes or in the number of records
23
Database design
Phase 6: Database System Implementation and Tuning
Create the database schemas and (empty) database files
Responsibility of the DBA in conjunction with the database designers
Reformat the data for loading into the new database if needed
Load/populate with the data if needed
Implement database transactions referring to the conceptual specifications of transaction, then write and test program code with embedded DML commands
Responsibility of the application programmers
Database tuning continues as long as the database is in existence, as long as performance problems are discovered, and while the requirements keep changing. 24
Relational database design
Two main approaches
A top-down design technique
Designing a conceptual schema in a high-level data model Mapping the conceptual schema into a relational database schema Analyzing each of the resulting relations based on the functional dependencies and assigned primary keys
A bottom-up design technique (relational synthesis)
View relational database schema design strictly in terms of functional and other types of dependencies specified on the database attributes Synthesize the relation schemas applying a normalization algorithm
25
2.2. Why is data modeling important?
Leverage
Make programming simpler and cheaper
Poor data organization can be expensive to fix.
Conciseness
Take more directly to the heart of the business requirements
Data quality
Problems with data quality can be traced back to a lack of consistency in:
Defining and interpreting data
Implementing mechanisms to enforce the definitions
A key role in achieving good data quality by establishing a common understanding of what is to be held in each table and column and how it is to be interpreted
G. C. Simsion, G. C. Witt, Data modeling essentials – 3rd edition, Elsevier Inc, 2005, pp. 8-10.
26
2.3. Data modeling with entity-relationship and relational models Entity-relationship Relational Data
model
data model
model mapping
27
Database design Entity Relationship Model P. P-S. Chen. The EntityRelationship Model – Toward
a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Figure 2.1 Phases of database design and implementation for
large databases [1], pp. 368
28
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
The entity-relationship model adopts the more natural view that the real world consists of entities and relationships.
The model can achieve a high degree of data independence and is based on set theory and relation theory.
The entity-relationship model can be used as a basis for a unified view of data.
A special diagrammatric technique, the entity-relationship diagram, is introduced as a tool for database design.
29
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Fig. 1. Analysis of data models using multiple levels of logical views
30
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
The entity-relationship (ER) modeling concepts
Entity types
Relationship types
Attributes
Keys
Structural constraints 31
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Symbol
Meaning
Example
Employee
Entity type
Employees
Dependent
Weak entity type
Dependents of each employee
works_on
Relationship type
Employee works_on Project
dependents_of
Identifying relationship of the weak entity type
Dependents dependents_of Employees 32
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Street
Symbol
Meaning
Example
Name
Attribute
Name of an employee
EmployeeID
Key Attribute
Distinct identifier of an employee
PhoneNumber
Multivalued Attribute
Phone numbers of an employee
Composite Attribute
Address (Street, District, City) of an employee
Derived Attribute
Age of an employee (derived from date of birth)
District
Address
Age
City
33
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Constraints
34
Figure 3.2. An ER schema diagram for the COMPANY database [1], pp. 54.
35
a. List the entity types b. Is there a weak entity type? If so, give its name, partial key, and identifying relationship c.
What constraints do the partial key and the identifying relationship of the weak entity type specify in this diagram?
d. List the names of all relationship types, and specify the (min, max) constraint on each participation of an entity type in a relationship type.
Figure 3.18. An ER diagram for a BANK database schema [1], pp. 81.
36
Enhanced Entity Relationship Modeling
Class/subclass relationships and type inheritance
The relationship between a superclass and any one of its subclasses
Specialization
Define a set of subclasses of an entity type
Establish additional specific attributes with each subclass
Establish additional specific relationship types between each subclass and other entity types or other subclasses
Generalization
A reverse process of abstraction in which we suppress the differences among several entity types
Identify their common features
Generalize them into a single superclass of which the original entity types are special subclasses 37
partial specialization
total specialization
Figure 4.1. EER diagram notation to represent subclasses and specialization. [1], pp. 87.
38
Figure 4.3. Generalization [1], pp. 91.
39
- Disjoint (d): the subclasses of the specialization must be disjoint.
- Overlapping (o): the subclasses are not constrained to be disjoint, their sets of entities may overlap. Figure 4.5. EER diagram notation for overlapping (nondisjoint) specialization. [1], pp. 93.
40
Figure 4.7. A specialization lattice with multiple inheritance for a UNIVERSITY database [1], pp. 96. a. List superclasses, subclasses of each superclass
b. List class/subclass relationships c. Describe constraints on each specialization
41
Database design
A representational data
model supported by the chosen DBMS Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Figure 2.1 Phases of database design and implementation for
large databases [1], pp. 368
42
Data model
E. F. Codd. Data models in database management, ACM, 1980.
A combination of three following components
(1). A collection of data structure types (the building
blocks of any database that conforms to the model);
(2). A collection of operators or inferencing rules, which can be applied to any valid instances of the data types listed in (1), to retrieve or derive data from any parts
of those structures in any combinations desired;
(3). A collection of general integrity rules, which implicitly or explicitly define the set of consistent
database states or changes of state or both –-- these rules may sometimes be expressed as insert-updatedelete rules
43
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
The term relation is used here in its accepted mathematical sense.
A relational database based on the relational data model is a collection of relations.
Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of ntuples each of which has its first element from S1, its second element from S2, and so on.
Sj is the jth domain of R.
R is said to have degree n.
More concisely, R is a subset of the Cartesian product S1xS2x…xSn.
44
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Fig. 1. A relation of degree 4 A relation of degree 4, called supply, which reflects the shipments-in-progress of parts from specified suppliers to specified projects in specified quantities. 45
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
An n-ary relation R has the following properties:
Each row represents an n-tuple of R.
The ordering of rows is immaterial.
All rows are distinct.
The ordering of columns is significant – it corresponds to the ordering S1, S2, …, Sn of the domains on which R is defined.
The significance of each column is partially conveyed by labeling it with the name of the corresponding domain.
46
Data model
E. F. Codd. Data models in database management, ACM, 1980.
A combination of three following components
(1). A collection of data structure types (the building
blocks of any database that conforms to the model);
(2). A collection of operators or inferencing rules, which can be applied to any valid instances of the data types listed in (1), to retrieve or derive data from any parts
of those structures in any combinations desired;
(3). A collection of general integrity rules, which implicitly or explicitly define the set of consistent
database states or changes of state or both –-- these rules may sometimes be expressed as insert-updatedelete rules
47
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Operations on relations
Usual set operations: union, intersection, minus
Projection ()
Select a subset of the tuples from a relation that satisfy a selection condition
Cartesian product (cross product) (x)
Select certain columns of a relation (striking out the others)
Selection ()
Compatible relations
Combine tuples from two relations in a combinatorial fashion
Join (inner/outer join, equijoin, natural join)
Combine related tuples from two relations into single tuples
49
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
50
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
T ←R :S
51
Data model
E. F. Codd. Data models in database management, ACM, 1980.
A combination of three following components
(1). A collection of data structure types (the building
blocks of any database that conforms to the model);
(2). A collection of operators or inferencing rules, which can be applied to any valid instances of the data types listed in (1), to retrieve or derive data from any parts
of those structures in any combinations desired;
(3). A collection of general integrity rules, which implicitly or explicitly define the set of consistent
database states or changes of state or both –-- these rules may sometimes be expressed as insert-updatedelete rules
52
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Constraints inherent in the data model
Domain constraints
Key constraints
Constraints on nulls
Entity integrity constraints
Referential integrity constraints
53
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Domain constraints
Within each tuple, the value of each attribute A must be an atomic value from the domain dom(A).
Key constraints 1. Two distinct tuples in any state of the relation cannot have identical values for (all) the attributes in the key. 2. It is a minimal superkey – that is, a superkey from which we cannot remove any attributes and still have the uniqueness constraint in condition 1 hold.
Superkey of the relation schema R specifies a uniqueness constraint that no two distinct tuples in any state r of R can have the same value for the superkey.
54
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Key constraints
A relation schema may have more than one key.
Constraints on nulls
Specify whether null values are or are not permitted on attributes
Entity integrity constraint
Candidate keys: primary key and secondary keys
No primary key value can be null.
Referential integrity constraint
Specify a referential integrity constraint between the two relation schemas R1 and R2 55
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Referential integrity constraint
A set of attributes FK (foreign key) in relation schema R1 is a foreign key of R1 that references relation R2 if it satisfies the following two rules:
The attributes in FK have the same domain(s) as the primary key attributes PK of R2; the attributes FK are said to reference or refer to the relation R2.
A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value of PK for some tuple t2 in the
current state r2(R2) or is null. In the former case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or refers to the tuple t2.
56
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.
Relation Employee Relation schema: EMPLOYEE (FNAME, MINIT, LNAME, SSN,
BDATE, ADDRESS, SEX, SALARY, SUPERSSN, DNO) Attributes = FNAME, MINIT, LNAME, SSN, … Domain of Attribute SEX = {F, M}
Domain of SALARY = a set of positive integer numbers Primary key = SSN
Foreign key = SUPERSSN
57
The SQL standard has gone through a number of revisions. http://en.wikipedia.org/wiki/SQL (Accessed on 09/05/2012)
High-level database language on relational databases: SQL-86, SQL-89, SQL-92, SQL-99, …
58
SQL (structured query language)
Data Definition Language (DDL): Create, Alter, Drop
Data Manipulation Language (DML): Select, Insert, Update, Delete
Data Control Language (DCL): Commit, Rollback, Grant, Revoke 59
SQL (structured query language) CREATE TABLE TableName {(colName dataType [NOT NULL] [UNIQUE]
[DEFAULT defaultOption] [CHECK searchCondition] [,...]} [PRIMARY KEY (listOfColumns),] {[UNIQUE (listOfColumns),] […,]}
{[FOREIGN KEY (listOfFKColumns) REFERENCES ParentTableName [(listOfCKColumns)], [ON UPDATE referentialAction] [ON DELETE referentialAction ]] [,…]}
{[CHECK (searchCondition)] [,…] })
60
SQL (structured query language) SELECT [DISTINCT | ALL] {* | [columnExpression [AS newName]] [,...] } FROM TableName [alias] [, ...] [WHERE condition] [GROUP BY columnList] [HAVING condition] [ORDER BY columnList] 61
Database design
Data model mapping from a conceptual schema based on the ER model to a relational database schema based on the relational data
model
Figure 2.1 Phases of database design and implementation for
large databases [1], pp. 368
62
Data model mapping
Table 7.1. Correspondence between ER and Relational Models [1], pp. 198.
63
Data model mapping
Mapping of specialization or generalization
Convert each specialization with m subclasses {S1, S2, …, Sm} and (generalized) superclass C, where the attributes of C are {k, a1, …, an} and k is the (primary) key, into relation schemas using one of the four following options:
Multiple relations – Superclass and subclasses
Multiple relations – Subclass relations only (total)
Single relation with one type attribute (disjoint)
Single relation with multiple type attributes (overlapping)
64
Using Multiple relations – Subclass relations only (total) Figure 7.4, pp. 200.
Figure 4.3. Generalization [1], pp. 91.
65
Figure 7.4. Using Single relation with multiple type attributes with Boolean fields MFlag and PFlag. [1], pp. 200.
66
Data model mapping
Mapping algorithms
Automatically create a relational schema from a conceptual schema design
Notes on data model mapping
Slightly different for other post-relational data models as compared to the relational data model
Constraints on databases
Inherent model-based constraints
Schema-based constraints
Application-based constraints
67
68
69
70
71
Query
Example: For every project located in „Stafford‟, retrieve the project number, the controlling department number and the department manager‟s last name, address and birthdate.
SQL query: SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION=„STAFFORD‟;
Relational algebra expression: PNUMBER, DNUM, LNAME, ADDRESS, BDATE(((σPLOCATION=‘STAFFORD’(PROJECT))
DNUM=DNUMBER (DEPARTMENT))
MGRSSN=SSN
(EMPLOYEE))72
Query: For each project on which more than two employees work, retrieve the project number, project name, and the number of employees who work on that project. SELECT
PNUMBER, PNAME, COUNT (*) AS PROJ_NUMBER
FROM
PROJECT, WORKS_ON
WHERE
PNUMBER=PNO
GROUP BY
PNUMBER, PNAME
HAVING
COUNT (*) > 2
Temp (PNUMBER, PNAME, PROJ_NUMBER) ← PNUMBER, PNAME
(
PNUMBER=PNO(PROJECT
Result ←
COUNT *
X WORKS_ON))
PROJ_NUMBER > 2
(Temp) 73
Query: retrieve the last name and first name of each employee
whose salary is greater than the max salary in department 5.
LNAME, FNAME σSALARY > C EMPLOYEE
MAX SALARY σDNO = 5 EMPLOYEE
74
Data model mapping
Informal measures of quality for relation schema design
Semantics of the attributes
Reducing the redundant values in tuples
Storage space & update anomalies
Reducing the null values in tuples
How to interpret the attribute values stored in a tuple of the relation – how the attribute values in a tuple relate to one another
Multiple interpretations of nulls
Disallowing the possibility of generating spurious tuples
Join relations with equality conditions on attributes that are either primary keys or foreign keys in a way that guarantees that no spurious tuples are generated 75
2.4. Dependencies and normalization
Functional dependency (FD)
The single most important concept in relational schema design theory
A constraint between two sets of attributes from the database
A property of the semantics or meaning of the attributes.
Examples
SSN ENAME The value of an employee‟s social security number (SSN) uniquely determines the employee name (ENAME)
PNUMBER {PNAME, PLOCATION}
{SSN, PNUMBER} HOURS
76
2.4. Dependencies and normalization
Functional dependency (FD)
A constraint between two sets of attributes from the database
A constraint on the possible tuples that can form a relation state r of a relation schema R = {A1, A2, …, An} Denoted by XY
X, Y: two sets of attributes that are subsets of R For any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y]. The values of Y of a tuple in r depend on (are determined by) the values of X. The values of X of a tuple uniquely (functionally) determine the values of Y. If XY in R, this does not say whether or not YX in R. 77
2.4. Dependencies and normalization
Armstrong‟s inference rules (IR1-3) and others (IR4-6)
78
2.4. Dependencies and normalization
Closure F+ of a set of functional dependencies F specified on relation schema R F+, the closure of F, is formally the set of all dependencies that include F as well as all dependencies that can be inferred from F.
79
2.4. Dependencies and normalization
Closure F+ of a set of functional dependencies F
F = {SSN→{ENAME, BDATE, ADDRESS, DNUMBER}, DNUMBER→{DNAME, DMGRSSN}}
F+ = {SSN→{ENAME, BDATE, ADDRESS, DNUMBER}, DNUMBER→{DNAME, DMGRSSN}, SSN→SSN, …,
DNUMBER→ DNAME, SSN→{DNAME, DMGRSSN}}
80
2.4. Dependencies and normalization
Closure F+ of a set of functional dependencies F
F = {{SSN, PNUMBER}→HOURS, SSN→ENAME, PNUMBER→{PNAME, PLOCATION}}
F+ = ???
81
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F where X is a set of attributes that appears as a left-hand side of a functional dependency in F
X+, the closure of X under F, is the set of attributes that are functionally determined by X based on F. 82
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F where X is a set of attributes that appears as a left-hand side of a functional dependency in F
83
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F where X is a set of attributes that appears as a left-hand side of a functional dependency in F
F = {{SSN, PNUMBER}→HOURS, SSN→ENAME,
PNUMBER→{PNAME, PLOCATION}}
{SSN, PNUMBER}+ = {SSN, PNUMBER, HOURS, ENAME, PNAME, PLOCATION}
{SSN}+ = {SSN, ENAME}
{PNUMBER}+ = {PNUMBER, PNAME, PLOCATION}
84
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F where X is a set of attributes that appears as a left-hand side of a functional dependency in F
F = {{A, B}→{C}, {A}→{D, E}, {B}→{M}, {M}→{G, H},
{D}→{I, J}}
{A, B}+ = ???
{A}+ = ???
{B}+ = ???
{M}+ = ???, and {D}+ = ???
85
2.4. Dependencies and normalization
Identifying a key K of R based on a set of given functional dependencies F
86
2.4. Dependencies and normalization
Identifying a key K of R based on a set of given functional dependencies F
R = (A, B, C, D, E, G)
F = {CE, AEG, EAD}
K = ???
87
2.4. Dependencies and normalization
Identifying a key K of R based on a set of given functional dependencies F
R = (A, B, C, D, E, G)
F = {CE, AEG, EAD}
K = {A, B, C, D, E, G}
(K-{A})+ = {B, C, D, E, G, A} K = {B, C, D, E, G}
(K-{B})+ = {C, D, E, G, A} K = {B, C, D, E, G}
(K-{C})+ = {B, D, E, G, A} K = {B, C, D, E, G}
(K-{D})+ = {B, C, E, G, A, D} K = {B, C, E, G}
(K-{E})+ = {B, C, G, E, A, D} K = {B, C, G}
(K-{G})+ = {B, C, E, A, D, G} K = {B, C}
88
2.4. Dependencies and normalization
Identifying a key K of R based on a set of given functional dependencies F
R = (A, B, C, E, G)
F = {AB, BCE, EGA}
K = ???
89
2.4. Dependencies and normalization
Identifying all keys K of R based on a set of given functional dependencies F
Find N = U - fFright(f) N is a set of the attributes not in the right side of any functional dependency in F. U is a set of all the attributes of R.
Find N+, the closure of N under F
If N+ = U then there is only one key K = N.
Otherwise, find D = fFright(f) - fFleft(f) D is a set of the attributes only in the right side of the functional dependencies in F.
Find L = U – (N+D) L is a set of the attributes neither in the right side nor functionally determined by N.
Let Li be any subset of L. Find (NLi)+ under F.
If (NLi)+ = U then K = NLi.
Ki, Kj such that Ki Kj, remove Kj.
90
2.4. Dependencies and normalization
Identifying all keys K of R based on a set of given functional dependencies F
R = (A, B, C, E, G)
F = {AB, BCE, EGA}
All keys K ???
91
2.4. Dependencies and normalization
R = (A, B, C, E, G)
F = {AB, BCE, EGA}
N = U - fFright(f) = {C, G}
N+ = {C, G}
D = fFright(f) - fFleft(f) =
L = U – (N+D) = {A, B, E}
Subsets of L: L1 = {A}, L2 = {B}, L3 = {E}, L4 = {A, B}, L5 = {A, E}, L6 = {B, E}, L7 = {A, B, E}
(NL1)+ = {C, G, A, B, E} K1 = {C, G, A}
(NL2)+ = {C, G, B, E, A} K2 = {C, G, B}
(NL3)+ = {C, G, E, A, B} K3 = {C, G, E}
92
2.4. Dependencies and normalization
Identifying all keys K of R based on a set of given functional dependencies F
R = (A, B, C, E, G)
F = {ABC, CGE, BG, EA}
All keys K ???
93
2.4. Dependencies and normalization
Membership of a functional dependency X→Y with respect to a set of functional dependencies F, i.e. X→Y is inferred from F, denoted by F╞ X→Y
A functional dependency X→Y is inferred from a set of dependencies F specified on R if X→Y holds in every legal relation state r of R; that is, whenever r satisfies all the dependencies in F, X→Y also holds in r.
94
2.4. Dependencies and normalization
Membership of a functional dependency X→Y with respect to a set of functional dependencies F, i.e. X→Y is inferred from F, denoted by F╞ X→Y
Check if F╞ X→Y
Find X+, the closure of X under F
If Y X+ then F╞ X→Y 95
2.4. Dependencies and normalization
Membership of a functional dependency X→Y with respect to a set of functional dependencies F, i.e. X→Y is inferred from F, denoted by F╞ X→Y
F = {A → C, AC → D, E → AD, E → H}
F ╞ A → CD
A+ under F = {A, C, D} {C, D} F ╞ A → CD
F ╞ E → AH ???
96
2.4. Dependencies and normalization
A set of functional dependencies F is said to cover another set of functional dependencies E if every FD in E is also in F+; that is, if every dependency in E can be inferred from F. Alternatively, E is covered by F.
Two sets of functional dependencies E and F are equivalent if E+ = F+; that is, every dependency in E can be inferred from F and every dependency in F can be inferred from E; that is also, E covers F and F covers E.
97
2.4. Dependencies and normalization
F = {A → C, AC → D, E → AD, E → H}
G = {A → CD, E → AH}
Are F and G equivalent? * Check if F covers G. * Check if G covers F.
98
2.4. Dependencies and normalization
The minimal cover F of a set of functional dependencies E is a set of functional dependencies that satisfies the property that every dependency in E is in the closure F+ of F. In addition, this property is lost if any dependency from the set F is removed; F must have no redundancies in it, and the dependencies in F are in a standard form. 99
2.4. Dependencies and normalization
The minimal cover F of a set of functional dependencies E 1.
Every dependency in F has a single attribute for its righthand side.
2.
We cannot replace any dependency X → A in F with a dependency Y → A, where Y is a proper subset of X, and still
have a set of dependencies that is equivalent to F. 3.
We cannot remove any dependency from F and still have a set of dependencies that is equivalent to F. 100
2.4. Dependencies and normalization
The minimal cover F of a set of functional dependencies E
101
2.4. Dependencies and normalization
The minimal cover F of a set of functional dependencies E
E = {A → B, ABCD → I, IJ → G, IJ → H, ACDJ → GI}
The minimal cover F of E ???
102
2.4. Dependencies and normalization
Normalization
A process of analyzing the given relation schemas based on their FDs and primary keys to achieve the desirable properties of:
(1) minimizing redundancy (2) minimizing the insertion, deletion, and update anomalies
Normal form of a relation
The highest normal form condition that the relation meets the degree to which it has been normalized
103
2.4. Dependencies and normalization
The process of normalization through decomposition produces the relational schemas that have:
The lossless join or non-additive join property
No spurious tuple generation problem occurs.
The dependency preservation property
Each FD is represented in some individual relation. 104
2.4. Dependencies and normalization Table 10.1. Summary of normal forms based on primary keys and corresponding normalization ([1], pp. 321)
105
2.4. Dependencies and normalization
Pure relations based on the relational data model are all in 1NF.
Relations in 2NF have non-key attributes
functionally dependent on the whole primary key.
Relations in 3NF have non-key attributes
only functionally dependent on the whole primary key.
106
Primary key = {SSN, PNUMBER} Non-key attributes = HOURS, ENAME, PNAME, PLOCATION
FD1: {SSN, PNUMBER} HOURS FD2: SSN ENAME FD3: PNUMBER {PNAME, PLOCATION}
Functionally dependent on a part of the primary key 107
Primary key = SSN
Non-key attributes = ENAME, BDATE, ADDRESS, DNUMBER, DNAME, DMGRSSN FD1: SSN {ENAME, BDATE, ADDRESS, DNUMBER} FD2: DNUMBER {DNAME, DMGRSSN}
Non-key attributes are functionally dependent on another non-key attribute. 108
2.4. Dependencies and normalization
Relation: LOTS (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE)
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE} FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE, TAX_RATE}
FD3: COUNTY_NAME TAX_RATE
FD4: AREA PRICE
109
Is LOTS really in 1NF? Normalization is carried out as follows. Determine LOTS1, LOTS2, LOTS1A, and LOTS1B.
Figure 10.11 (d). Summary of the progressive normalization of LOTS. [1], pp. 322. 110
2.4. Dependencies and normalization
Relation: LOTS (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE)
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE} FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE, TAX_RATE}
FD3: COUNTY_NAME TAX_RATE
FD4: AREA PRICE
2NF 111
2.4. Dependencies and normalization
Relation: LOTS1 (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE) -- in 2NF
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE}
FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE}
FD4: AREA PRICE
3NF
Relation: LOTS2 (COUNTY_NAME, TAX_RATE) -- in 3NF
Primary key: COUNTY_NAME
FD: COUNTY_NAME TAX_RATE 112
2.4. Dependencies and normalization
Relation: LOTS1A (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA) -- in 3NF
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA}
FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA}
Relation: LOTS1B (AREA, PRICE) -- in 3NF
Primary key: AREA
FD: AREA PRICE
Relation: LOTS2 (COUNTY_NAME, TAX_RATE) -- in 3NF
Primary key: COUNTY_NAME
FD: COUNTY_NAME TAX_RATE
113
2.4. Dependencies and normalization
Boyce-Codd normal form (BCNF)
A simpler form of 3NF, but stricter than 3NF
Every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.
For a relation in 3NF, each non-key attribute must:
Fully functionally dependent on every key Nontransitively dependent on every key
Not functionally dependent on other nonkey attributes
For a relation in BCNF, all key/non-key attributes must be functionally dependent on superkeys. XA holds in R => X is a superkey of R. XA holds in R and X is not a superkey of R => R is not 114 in BCNF.
2.4. Dependencies and normalization
Figure 10.12. (a) BCNF normalization of LOTS1A [1], pp. 325.
FD5: AREA COUNTY_NAME
AREA is not a superkey of LOTS1A. LOTS1A is in 3NF; but not in BCNF.
115
2.4. Dependencies and normalization
Candidate key = {STUDENT, COURSE} FD1: {STUDENT, COURSE} INSTRUCTOR FD2: INSTRUCTOR COURSE
Normalization??? 116
2.4. Dependencies and normalization
Multivalued dependencies
A multivalued dependency X
Y specified on
relation schema R, where X and Y are both
subsets of R, specifies the following constraint on any relation state r of R: if two tuple t1 and t2 exist in r such that t1[X] = t2[X], then two tuples
t3 and t4 should also exist in r with the following properties, where Z is used to denote (R-(XUY)):
t3[X] = t4[X] = t1[X] = t2[X]
t3[Y] = t1[Y] and t4[Y] = t2[Y]
t3[Z] = t2[Z] and t4[Z] = t1[Z]
117
2.4. Dependencies and normalization
Multivalued dependencies
ENAME
PNAME
ENAME
DNAME
No association between PNAME and DNAME 118
2.4. Dependencies and normalization
Multivalued dependencies (MVDs)
A multivalued dependency X
Y is a trivial MVD
if:
(a) Y is a subset of X
(b) X U Y = R
A trivial MVD does not specify any significant or meaningful constraint on R.
A nontrivial MVD satisfies neither (a) nor (b). 119
2.4. Dependencies and normalization
Fourth normal form
A relation schema R is in 4NF with respect to a set of dependencies (that includes functional dependencies and multivalued dependencies) if,
for every nontrivial multivalued dependency X
Y
in F+, X is a superkey for R.
120
2.5. Conclusion
Data modeling
Database design process: 6 phases
Conceptual database design
Choice of a DBMS a representational data model
Entity-relationship model
Relational data model
Data model mapping
Functional/multivalued dependencies
Normalization: 1NF, 2NF, 3NF, BCNF, 4NF 121
Questions ???
122
Review - Chapter 12 in [1] 12.1. What are the six phases of database design? Discuss each phase.
12.2. Which of the six phases are considered the main activities of the database design process itself? Why? 12.3. Why is it important to design the schemas and applications in parallel? 12.4. Why is it important to use an implementation-independent data model during conceptual schema design? What models are used in current design tools? Why?
12.5. Discuss the importance of Requirements Collection and Analysis. 12.7. Discuss the characteristics that a data model for conceptual schema design should possess. 12.8. Compare and contrast the two main approaches to conceptual schema design.
12.9. Discuss the strategies for designing a single conceptual schema from its requirements. 12.10. What are the steps of the view integration approach to conceptual schema design? What are the difficulties during each step? 12.13. Discuss the factors that influence the choice of a DBMS package for the information system of an organization. 12.14. What is system-independent data model mapping? How is it different from system-dependent data model mapping? 12.15. What are the important factors that influence physical database design?
123
Review - Chapters 3, 4 in [1] 3.1. Discuss the role of a high-level data model in the database design process.
3.2. List the various cases where use of a null value would be appropriate. 3.3. Define the following terms: entity, attribute, attribute value, relationship instance, composite attribute, multivalued attribute, derived attribute, complex attribute, key attribute, value set (domain). 3.4. What is an entity type? What is an entity set? Explain the differences among an entity, an entity type, and an entity set. 3.6. What is a relationship type? Explain the differences among a relationship instance, a relationship type, and a relationship set. 3.7. What is a participation role? When is it necessary to use role names in the description of relationship types? 3.12. When is the concept of a weak entity used in data modeling? Define the terms owner entity type, weak entity type, identifying relationship type, and partial key. 4.1. What is a subclass? When is a subclass needed in data modeling? Define the following terms: superclass of a subclass, superclass/subclass relationship, is-a relationship, specialization, generalization, category, specific (local) attributes) specific relationships. 4.6. Discuss the two main types of constraints on specializations and generalizations.
124
Review - Chapter 5 in [1] 5.1. Define the following terms: domain, attribute, n-tuple, relation schema, relation state, degree of a relation, relational database schema, relational database state.
5.2. Why are tuples in a relation not ordered? 5.3. Why are duplicate tuples not allowed in a relation? 5.4. What is the difference between a key and a superkey? 5.8. Discuss the entity integrity and referential integrity constraints. Why is each considered important? 5.9. Define foreign key. What is this concept used for? 125
Review - Chapter 10 in [1] 10.1. Discuss attribute semantics as an informal measure of goodness for a relation schema. 10.2. Discuss insertion, deletion, and modification anomalies. Why are they considered bad? Illustrate with examples. 10.3. Why should nulls in a relation be avoided as far as possible? Discuss the problem of spurious tuples and how we may prevent it. 10.4. State the informal guidelines for relation schema design that we discussed. Illustrate how violation of these guidelines may be harmful. 10.5. What is a functional dependency? What are the possible sources of the information that defines the functional dependencies that hold among the attributes of a relation schema? 10.6. Why can we not infer a functional dependency automatically from a particular relation state? 10.12. What does the term unnormalized relation refer to? How did the normal forms develop historically from first normal form up to Boyce-Codd normal form? 10.13. Define first, second, and third normal forms when only primary keys are considered. How do the general definitions of 2NF and 3NF, which consider all keys of a relation, differ from those that consider only primary keys? 10.14. What undesirable dependencies are avoided when a relation is in 2NF? 10.15. What undesirable dependencies are avoided when a relation is in 3NF? 10.16. Define Boyce-Codd normal form. How does it differ from 3NF? Why is it 126 considered a stronger form of 3NF?