Database Design With UML and SQL

October 15, 2017 | Author: Romi Dellarose | Category: Relational Model, Relational Database, Sql, Databases, Table (Database)

Share Embed Donate

Report this link

Short Description

Download Database Design With UML and SQL...

Description

Database design with UML and SQL, 3rd edition Welcome to students in Dr. Klusener's class at Vreije Universitiet Amsterdam! And many thanks to readers who have found this site useful. Yes, I'm still considering how to make exercise answers available, and possibly how to produce an easily-printable version. Watch for updates here! Also available on tomjewett.com: web accessibility resources and consulting.

Introduction This third edition of dbDesign is a general update, both to meet legal requirements for U.S. “Section 508” accessibility and to bring the code into compliance with the latest World Wide Web Consortium standards. In the process, I've tried to make the SQL examples as generic as possible, although you will still have to consult the documentation for your own database system. Graphics no longer require the SVG plugin; large-image and text-only views of each graphic are provided for all readers; the menu is now arranged by topic areas; and the print version (minus left-side navigation) is done automatically by a style sheet. The second edition was largely motivated by the very helpful comments of Prof. Alvaro Monge, as well as by my own observations in two semesters of using its predecessor in class. Major changes included the clear separation of UML from its implementation in the relational model, the introduction of relational algebra terminology as an aid to understanding SQL, and an increased emphasis on natural-language understanding of the design. The original site was the outgrowth of a previous book project, Practical Relational Database Design, by Wayne Dick and Tom Jewett. The move online featured condensed discussions, an integrated view of database concepts and skills, and use of the Unified Modeling Language in the design process. I’m grateful for the positive response that the site has received so far, both from my own students and from online readers worldwide. In every edition of this site, I owe a huge debt of gratitude to Prof. Wayne Dick, lead author of the earlier PRDD book and internationally-known accessibility expert. We’ve worked together for so long that it’s hard to identify separate authorship of the material here—I hope that this general acknowledgement will suffice. The students in my Fall 2002 class were especially helpful in “test driving” each page and asking lots of questions! As always with teaching materials, my students are the main source of inspiration and motivation to develop the site. Tom Jewett Department of Computer Engineering and Computer Science, Emeritus California State University, Long Beach www.cecs.csulb.edu/~jewett/ email: [email protected]

Copyright © 2002–2006, by Tom Jewett. Links to this site are welcome and encouraged. Individual copies may be printed for non-commercial classroom or personal use; however, this material may not be reposted to other web sites or newsgroups, or included in any printed or electronic publication, whether modified or not, without specific permission from the author. Database design is a process of modeling an enterprise in the real world. In fact, a database itself is a model of the real world that contains selected information needed by the enterprise. Many models and languages—some formally and mathematically defined, some informal and intuitive—are used by designers. Here are the ones that we present in this tutorial: • The Unified Modeling Language (UML) was designed for software engineering of large systems using object-oriented (OO) programming languages. UML is a very large language; we will use only a small portion of it here, to model those portions of an enterprise that will be represented in the database. It is our tool for communicating with the client in terms that are used in the enterprise. • The Entity-Relationship (ER) model is used in many database development systems. There are many different graphic standards that can represent the ER model. Some of the most modern of these look very similar to the UML class diagram, but may also include elements of the relational model. • The Relational Model (RM) is the formal model of a database that was developed for IBM in the early 1970s by Dr. E.F. Codd. It is largely based on set theory, which makes it both powerful and easy to implement in computers. All modern relational databases are based on this model. We will use it to represent information that does not (and should not) appear in the UML model but is needed for us to build functioning databases. • Relational Algebra (RA) is a formal language used to symbolically manipulate objects of the relational model. • The table model is an informal set of terms for relational model objects. These are the terms used most often by database developers. • The Structured Query Language (SQL, pronounced “sequel” or “ess-que-ell”) is used to build and manipulate relational databases. It is based on relational algebra, but provides additional capabilities that are needed in commercial systems. It is a declarative, rather than a procedural, programming language. There is a standard for this language, but products vary in how closely they implement it.

Basic structures: classes and schemes The UML class A UML class (ER term: entity) is any “thing” in the enterprise that is to be represented in our database. It could be a physical “thing” or simply a fact about the enterprise or an event that happens in the real world.

Example: We’ll build a sales database—it could be for any kind of business. To sell anything, we need customers, so a Customer will be our first class (entity) type. • The first step in modeling a class is to describe it in natural language. This helps us to know exactly what this class (“thing”) means in the enterprise. We can describe a customer like this: “A customer is any person who has done business with us or who we think might do business with us in the future. We need to know this person’s name, phone number and address in order to contact him or her.” • Each class is uniquely defined by its set of attributes (UML and ER), also called properties in some OO languages. Each attribute is one piece of information that characterizes each member of this class in the database. Together, they provide the structure for our database tables or code objects. • In UML, we will only identify descriptive attributes—those which actually provide realworld information (relevant to the enterprise) about the class that we are modeling. (These are sometimes called natural attributes.) We will not add “ID numbers” or similar attributes that we make up to use only inside the database.

Class diagram The class diagram shows the class name (always a singular noun) and its list of attributes.

Other views of this diagram: Large image - Data dictionary (text)

Relation scheme In an OO programming language, each class is instantiated with objects of that class. In building a relational database, each class is first translated into a relational model scheme. The scheme is identified by the plural form of the class name, and starts with all of the attributes from the class diagram.

Other views of this diagram: Large image - Data dictionary (text)

• In the relational model, a scheme is defined as a set of attributes, together with an assignment rule that associates each attribute with a set of legal values that may be assigned to it. These values are called the domain of the attribute. We’ve chosen to show the scheme graphically, but we could also have written it in set notation: Customers Scheme = {cFirstname, cLastname, cPhone, cStreet, cZipCode}. • There is no convenient graphical way to represent domains; we’ll discuss this issue in a later page. For the moment, our Customers relation scheme looks exactly like the Customer class diagram, only drawn sideways. It won’t stay that way for long. • It’s important to recognize that defining schemes or domains as sets of something automatically tells us a lot more about them: - They cannot contain duplicate elements. Our Customers scheme, for example, cannot have two cPhone attributes (even if they are called cPhone1 and cPhone2). - The elements in them are unordered. It doesn't matter if a customer's name is listed in order “Last, First” or “First Last”—they mean the same thing. - We can develop rules for what can be included in them and what is excluded from them. For example, zip codes don’t belong in the domain (set) of phone numbers, and vice-versa. - We can define subsets of them—for example, we can display only a selected set of attributes from a scheme, or we can limit the domain of an attribute to a specific range of values. - They may be manipulated with the usual set operators. In a later page, we will show how both the union and the intersection of schemes are used to join (combine) the information from two or more tables based on different schemes.

Table structure When we actually build the database, each relation scheme becomes the structure for one table. The SQL syntax for creating the table includes a data type for each attribute, which is needed for the database but is not the same as the domain of the attribute. CREATE TABLE customers ( cfirstname VARCHAR(20) NOT NULL, clastname VARCHAR(20) NOT NULL, cphone VARCHAR(20) NOT NULL, cstreet VARCHAR(50), czipcode VARCHAR(5));

In this example, VARCHAR is simply a variable-length character string of no more than the number of characters in parentheses. Consult your own system documentation for supported data types. We will explain the extra keyword NOT NULL when we look at rows and tables.

Exercise: designing classes Design classes to represent the following “things” in the given enterprises. For each one, describe the class in English, then draw the class diagram. • A student at a university. • A faculty member at a university. • A work of art that is displayed in a gallery or museum. • An automobile that is registered with the Motor Vehicle Department. • A pizza that is on the menu at a restaurant. The solution to this exercise will be discussed in class or posted online at a later date.

Basic structures: rows and tables Representing data in rows Each real-world individual of a class (for example, each customer who does business with our enterprise) is represented by a row of information in a database table. The row is defined in the relational model as a tuple that is constructed over a given scheme. Mathematically, the tuple is a function that assigns a constant value from the attribute domain to each attribute of the scheme. Notice that because the scheme is a set of attributes, we could show them in any order without changing the meaning of the data in the row (tuple).

Other views of this diagram: Large image - Description (text) In formal notation, we could show the assignments explicitly, where t represents a tuple: tTJ = ‹cfirstname := 'Tom', clastname := 'Jewett', cphone := '714-555-1212', cstreet := '10200 Slater', czipcode := '92708'› In practice, when we create a table row in SQL, we are actually making the assignment of domain values to attributes, just as in the tuple definition. INSERT INTO customers (cfirstname, clastname, cphone, cstreet, czipcode) VALUES ('Tom', 'Jewett', '714-555-1212', '10200 Slater', '92708');

In SQL, you can omit the attribute names from the INSERT INTO statement, as long as you keep the comma-delimited list of values in exactly the same order that was used to create the table. When we change the data in a table row using SQL, we are also following the tuple definition of assigning domain values to attributes. UPDATE customers SET cphone = '714-555-2323' WHERE cphone = '714-555-1212';

Tables A database table is simply a collection of zero or more rows. This follows from the relational model definition of a relation as a set of tuples over the same scheme. (The name “relational model” comes from the relation being the central object in this model.)

Other views of this diagram: Large image - Description (text) • Knowing that the relation (table) is a set of tuples (rows) tells us more about this structure, as we saw with schemes and domains. - Each tuple/row is unique; there are no duplicates - Tuples/rows are unordered; we can display them in any way we like and the meaning doesn’t change. (SQL gives us the capability to control the display order.) - Tuples/rows may be included in a relation/table set if they are constructed on the scheme of that relation; they are excluded otherwise. (It would make no sense to have an Order row in the Customers table.) - We can define subsets of the rows by specifying criteria for inclusion in the subset. (Again, this is part of a SQL query.) - We can find the union, intersection, or difference of the rows in two or more tables, as long as they are constructed over the same scheme.

Insuring unique rows Since each row in a table must be unique, no two rows can have exactly the same values for every one of their attributes. Therefore, there must be some set of attributes (it might be the set of all attributes) in each relation whose values, taken together, guarantee uniqueness of

each row. Any set of attributes that can do this is called a super key (SK). Super keys are a property of the relation (table), filled in with any reasonable set of real-world data, even though we show them in the relation scheme drawing for convenience. The database designer picks one of the possible super key attribute sets to serve as the primary key (PK) of the relation. (Notice that the PK is an SK, but not all SKs are PKs!) The PK is sometimes also called a unique identifier for each row of the table. This is not an arbitrary choice—we’ll discuss it in detail on a later page. For our customers table, we’ll pick the customer’s first name, last name, and phone number. We are likely to have at least two customers with the same first and last name, but it is very unlikely that they will both have the same phone number. In SQL, we specify the primary key with a constraint on the table that lists the PK attributes. We also give the constraint a name that is easy for us to remember later (like “customers_pk” here). ALTER TABLE customers ADD CONSTRAINT customers_pk PRIMARY KEY (cfirstname, clastname, cphone);

We also can specify the primary key when we create the table. The NOT NULL constraint prevents the PK attributes from being left empty, since ,ULL is a special constant in database systems that means “this field doesn’t have any value assigned to it.” It’s not the same as a zero length string or the number zero. CREATE TABLE customers ( cfirstname VARCHAR(20) NOT NULL, clastname VARCHAR(20) NOT NULL, cphone VARCHAR(20) NOT NULL, cstreet VARCHAR(50), czipcode VARCHAR(5)), CONSTRAINT customers_pk PRIMARY KEY (cfirstname, clastname, cphone);

Basic structure: associations The UML association The UML association (ER term: relationship) is the way that two classes are functionally connected to each other. Example: We want our customers to be able to place orders for the products that we sell, so we need to model the Order class and its association with the Customer. Notice that while the Customer class represents a physical “thing,” the Order class represents an event that happens in the enterprise. Both are equally valid class types. We will first describe the Order: “An order is created when a customer decides to buy one or more of our products. We need to know when the order was placed (date and time), and which customer representative sold the order.”

The association between the customer and the order will tell us which customer placed the order. We will describe the association in natural language just as we described the classes, but we will also include information about how few (at minimum) and how many (at maximum) individuals of one class may be connected to a single individual of the other class. This is called the multiplicity of the association (ER term: cardinality), and we describe it in both directions. “Each customer places zero or more orders.” (* in the diagram below means “many”, and any quantity more than one is the same as “many” in a database.) “Each order is placed by one and only one customer.” (Bad English—passive voice—but makes sense!)

Class diagram

Other views of this diagram: Large image - Data dictionary (text) • In the diagram, the association is simply shown by a line connecting the two class types. It is named with a verb that describes the action; an arrow shows which way to read the verb. Symbols at each end of the line represent the multiplicity of the association, as we described it above. • Looking at the maximum multiplicity at each end of the line (1 and * here), we call this a one-to-many association. • The UML representation of the Order class contains only its own descriptive attributes. The UML association tells which customer placed an order. In the database, we will need a different way to identify the customer; that will be part of the relation scheme (below).

Relation scheme diagram The relation scheme for the new Orders table contains all of the attributes from the class diagram, as before. But we also need to represent the association in the database; that is, we need to record which customer placed each order. We do this by copying the PK attributes of the Customer into the Orders scheme. The copied attributes are called a foreign key (FK), which is simply an image of the linked relation’s primary key.

Other views of this diagram: Large image - Data dictionary (text) • Since we can’t have an order without a customer, we call Customers the parent and Orders the child scheme in this association. The “one” side of an association is always the parent, and provides the PK attributes to be copied. The “many” side of an association is always the child, into which the FK attributes are copied. Memorize it: one, parent, PK; many, child, FK. • An FK might or might not become part of the PK of the child relation into which it is copied. In this case, it does, since we need to know both who placed an order and when the order was placed in order to identify it uniquely.

The child table The Orders table is created in exactly the same way as the Customers, including all of the attributes from the Orders scheme: CREATE TABLE orders ( cfirstname VARCHAR(20), clastname VARCHAR(20), cphone VARCHAR(20), orderdate DATE, soldby VARCHAR(20));

• Notice that the FK attributes must be exactly the same data type and size as they were defined in the PK table. • The DATE data type includes the time in some database systems, but not in others (which would need an additional ordertime attribute to permit more than one order per customer in a single day). For simplicity, we have omitted the time from our illustrations. • To insure that every row of the Orders table is unique, we need to know both who the customer is and what day (and time) the order was placed. We specify all of these attributes as the pk: ALTER TABLE orders ADD CONSTRAINT orders_pk PRIMARY KEY (cfirstname, clastname, cphone, orderdate);

• In addition, we need to identify which attributes make up the FK, and where they are found as a PK. The FK constraint will insure that every order contains a valid customer name and phone number—this is called maintaining the referential integrity of the database.

ALTER TABLE orders ADD CONSTRAINT orders_customers_fk FOREIGN KEY (cfirstname, clastname, cphone) REFERENCES customers (cfirstname, clastname, cphone);

• When you look at some typical data in the Orders table, you will see that some customers have placed more than one order. For each of these, the same customer information is copied in the FK columns—but the dates will be different. Of course, we hope to see many orders that were placed on the same date—but the customers will be different. You will also see that some customers haven’t placed any orders at all; their PK information is simply not found in the orders table. Orders cphone orderdate soldby cfirstname clastname Alvaro Monge 562-333-4141 2003-07-14 Patrick Wayne Dick 562-777-3030 2003-07-14 Patrick Alvaro Monge 562-333-4141 2003-07-18 Kathleen Alvaro Monge 562-333-4141 2003-07-20 Kathleen .ote: The date format shown in our examples ('yyyy-mm-dd') is used by many but not all systems. Consult the reference for your own software to be sure.

Exercise: patients and blood samples (Wayne Dick) The one-to-many association is perhaps the most common one that you will encounter in database modeling. As an example, we will look at the enterprise of a medical clinic. We wish to track the level of various substances (for example, cholesterol or alcohol) in the blood of patients. For each blood sample that is taken from the patient, one test will be performed and the date of the sample, the substance tested, and the measured level of that substance will be recorded in a database. • Describe each class in English. • Draw the class diagram. • Describe each association in English (both directions). • Draw the relation scheme. The solution to this exercise will be discussed in class or posted online at a later date.

Exercise: cities and states

Part of a database that you are developing will contain information about cities and states in the United States. Each city is located in only one state. (Texarkana, Texas is a different city than Texarkana, Arkansas.) • Describe each class in English. • Draw the class diagram. • Describe each association in English (both directions). • Draw the relation scheme. The solution to this exercise will be discussed in class or posted online at a later date.

Exercise: library books You are building a very simplified beginning of the database for a library. The library, of course, owns (physical) books that are stored on shelves and checked out by customers. Each of these books is represented by a catalog entry (now in the computer, but think of an oldfashioned card file as a model of this). Assume that there is only one “title” card for each book in the catalog, but there can be many physical copies of that book on the shelves. Call the title card class a “CatalogEntry” and the physical book class a “BookOnShelf.” • You might think of the book’s publisher as a simple attribute of the catalog entry—but in fact, the library will probably want to know more than just the publisher’s name (for example, the phone number where they can contact a sales representative). • Describe each class in English. • Draw the class diagram. • Describe each association in English (both directions). • Draw the relation scheme. The solution to this exercise will be discussed in class or online at a later date.

Discussion: more about keys Let’s look again at the relation scheme diagram for Customers and Orders.

Other views of this diagram: Large image - Data dictionary (text) Remember that a super key is any set of attributes whose values, taken together, uniquely identify each row of a table—and that a primary key is the specific super key set of attributes that we picked to serve as the unique identifier for rows of this table. We are showing these attributes in the scheme diagram for convenience, understanding that keys are a property of the table (relation), not of the scheme.

Candidate keys Before picking the pk, we need to identify any candidate key or keys that we can find for this table. A ck is a minimal super key; “minimal” means that if you take away any one attribute from the set, it is no longer a super key. If you add one attribute to the set, it is no longer minimal (but it’s still a super key). The word “candidate” simply means that this set of attributes could be used as the primary key, since we don’t want any more attributes in the pk than are necessary. Whether a set of attributes constitutes a ck or not depends entirely on the data in the table— not just on whatever data happens to be in the table at the moment, but on any set of data that could realistically be in this table over the life of the database. Does the Customers pk {cFirstName, cLastName, cPhone} meet this test? • A father and son with the same first and last names might be living together and have the same phone number. Adding the street address or the zip code wouldn’t help to distinguish them. On the other hand, we could assume that there is some other way by which the three attribute values can be made unique, perhaps with a middle name or initial in the cFirstName field. (U.S. father-son Presidents George Bush and George W. Bush are distinguished in this way.) With this assumption, our ck (and pk) works correctly. • The attribute set {cFirstName, cLastName, cStreet} might also be a candidate key for the Customers table, if we make the same assumption about first and last names. With more than one ck, we try to pick the one that is most descriptive of the individual and/or least likely to change over time. There’s not much difference in this example. Every table must have at least one candidate key; if you can’t find one, your design isn’t finished. Former heavyweight boxing champion George Foreman is said to have named every one of his sons “George Foreman.” If they were all customers of ours, living at the same address with the same phone number, we would need at least one more descriptive attribute—probably the birth date—to distinguish between them.

PK size might matter You’re probably thinking that it’s a real nuisance and waste of space to copy all three of the Customer pk attributes to make the fk in Orders. If so, you’re right. Remember that we purposely designed the Customers table without considering its association with Orders. Now that it’s the parent in a one-to-many association, we have to ask if the pk is small enough to be copied into the child table. In some tables, it will be— so we’re done. In this one, it isn’t— so we’ll have to make up a pk that is small enough. There are two types of “made up” primary keys: • A surrogate PK is a single, small attribute (such as a number) that has no descriptive value—it doesn’t tell us anything about the real-world individual. Most ID numbers are like this. Surrogate keys are created for the convenience of the database designer (only). They are a nuisance for database users, and should normally be hidden by the user interface of a database system. • A substitute PK is a single, small attribute (such as an abbreviation) that has at least some descriptive value. Examples of substitute keys include the two-letter postal codes for the states of the United States and the three-letter codes for worldwide airports. Substitute keys are also created for the convenience of the database designer. They are frequently still a nuisance for database users, although perhaps less so than surrogate PKs. Do not automatically add “ID numbers” (surrogate keys) or substitute keys to a table until you are sure that: • There is at least one candidate key (before the surrogate is added), • the table is a parent in at least one association, and • there is no candidate key small enough for its values to be copied many times into the child table. These rules apply to surrogate and substitute keys that you (and your co-workers) add to your own tables. However, you might find that a class type already has an attribute that appears to be a surrogate or substitute key, but has been defined by someone else—usually a standardssetting organization or a government agency. We call this attribute an external key. In the external organization’s database, there is a candidate key for it, whether or not you have access to it or include its value(s) in your own database. You may use it as a descriptive attribute in both UML and the relation scheme diagram. Like other descriptive attributes, it might or might not become part of a candidate key in your database. We will encounter many of these (such as the zip code, UPC, and ISBN) in our examples. • There is one special case of an external key that requires careful handling in your database design: the United States social security number (SSN). Originally intended for use only to identify social security participants, it has now become so over-used as an identifier that access to it poses risks of serious damage to individuals, even including identity theft. Please do not ever use the SSN in your database unless you are required to do so by law (for example, to file tax information). Even then, do not use it as a primary key that would be viewable to everyone who can access your database.

Revising the relation scheme All of what we have done here applies to the relation scheme and tables only—the UML class diagram doesn’t change (unless we have to add descriptive attributes). Our revised scheme, with surrogate pk customer ID, looks like this:

Other views of this diagram: Large image - Data dictionary (text) In the Orders scheme, the custID fk still represents the ck attributes from Customers—it tells us which customer placed the order. In the child table, it can be used as if it were a descriptive attribute in a candidate key or primary key, as shown here. The Customers and Orders tables can be joined exactly as before, only this time the sole join attribute (intersection of the schemes) is the custID: Customers custid cfirstname clastname cphone cstreet czipcode 1234 Tom Jewett 714-555-1212 10200 Slater 92708 90840 5678 Alvaro Monge 562-333-4141 2145 Main 9012 Wayne Dick 562-777-3030 1250 Bellflower 90840 Orders custid orderdate soldby 5678 2003-07-14 Patrick 9012 2003-07-14 Patrick 5678 2003-07-18 Kathleen 5678 2003-07-20 Kathleen Customers joined to Orders custid cfirstname clastname cphone cstreet czipcode orderdate soldby 5678 Alvaro Monge 562-333-4141 2145 Main 90840 2003-07-14 Patrick 9012 Wayne Dick 562-777-3030 1250 Bellflower 90840 2003-07-14 Patrick 5678 Alvaro Monge 562-333-4141 2145 Main 90840 2003-07-18 Kathleen 5678 Alvaro Monge 562-333-4141 2145 Main 90840 2003-07-20 Kathleen

Design pattern: many-to-many (order entry)

There are some modeling situations that you will find over and over again as you design real databases. We refer to these as design patterns. If you understand the concepts behind each one, you will in effect be adding new “tools” to your design toolbox that you can use in building the model of an enterprise. Our sales database represents one of these patterns. So far, we have customers and orders. To finish the pattern, we need products to sell. We’ll first describe what the Product class means, and how it is associated with the Order class: “A product is a specific type of item that we have for sale. Each product has a descriptive name; we distinguish similar products by the manufacturer's name and model number. For each product, we need to know its unit list price and how many units of this product we have in stock.” • It is important to understand exactly what this class means: an example product might be named “Blender, Commercial, 1.25 Qt.”, manufactured by Hamilton Beach, model number 908. This is a type of product, not an individual boxed blender that is sitting on our shelves. The same manufacturer probably has different blender models (909, 918, 919), and there are probably blenders that we stock that are made by other companies. Each would be a different instance of this class. “Each Order contains one or more Products.” (At least one because it doesn't make sense to place an order for no products.) “Each Product is contained in zero or more Orders.” (Zero because we might not have sold any of this product yet.) Since the maximum multiplicity in each direction is “many,” this is called a many-to-many association between Orders and Products. Each time an order is placed for a product, we need to know how many units of that product are being ordered and what price we are actually selling the product for. (The sale price might vary from the list price by customer discount, special sale, etc.) These attributes are a result of the association between the Order and the Product. We show them in an association class that is connected to the association by a dotted line. If there are no attributes that result from a many-to-many association, there is no association class.

Class diagram

Other views of this diagram: Large image - Data dictionary (text)

There are two additional attributes shown in the class diagram that we haven’t talked about yet. We need to know the subtotal for each order line (that is, the quantity times the unit sale price) and the total dollar value of each order (the sum of the subtotals for each line in that order). Since these values can be computed, they don’t need to be stored in the database, and they are not included in the relation scheme. Their names in the class diagram are preceeded by a “/” to show that they are derived attributes.

Relation scheme diagram We can’t represent a many-to-many association directly in a relation scheme, because two tables can’t be children of each other—there’s no place to put the foreign keys. So for every many-to-many, we will need a junction table in the database, and we need to show the scheme of this table in our diagram. If there is an association class (like OrderLines), its attributes will go into the junction table scheme. If there is no association class, the junction table (sometimes also called a join table or linking table) will contain only the FK attributes from each side of the association.

Other views of this diagram: Large image - Data dictionary (text) The many-to-many association between Orders and Products has turned into a one-to-many relationship between Orders and Order Lines, plus a many-to-one relationship between Order Lines and Products. You should also describe these in English, to be sure that you have the fk's in the right place: “Each Order is associated with one or more OrderLines.” “Each OrderLine is associated with one and only one Order.” “Each OrderLine is associated with one and only one Product.” “Each Product is associated with zero or more OrderLines.”

With Orders now a parent of OrderLines, we might have decided that it needs a surrogate key (order number) to be copied into the OrderLines. In fact, most sales systems do this, as you know if you’ve ever tried to check on the status of something you’ve ordered from a company. For this example, it seems to be just as easy to stick with the existing pk of Orders, since it already has a surrogate key from Customers, and the order date doesn’t add much size. The UPC (Universal Product Code) is an external key. UPCs are defined for virtually all grocery and manufactured products by a commercial organization called the Uniform Code Council, Inc.® We will use it as the primary key of our Products table, which also has a candidate key here: {mfgr, model}. To uniquely identify each order line, we need to know both which order this line is contained in, and which product is being ordered on this line. The two fk's, from Orders and Products, together form the only candidate key of this relation and therefore the primary key. There is no need to look for a smaller pk, since OrderLines has no children.

Data representation The key to understanding how a many-to-many association is represented in the database is to realize that each line of the junction table (in this case, OrderLines) connects oneline from the left table (Orders) with one line from the right table (Products). Each pk of Orders can be copied many times to OrderLines; each pk of Products can also be copied many times to OrderLines. But the same pair of fk's in OrderLines can only occur once. In this graphic example, we show only the pk and fk columns for sake of space.

Other views of this diagram: Large image - Description (text)

Exercise: building a database with notecards This exercise is designed to increase your understanding of how relationships are represented by data, before you start actually entering data into tables. It is best done in groups of two or three people. • We’re using the order entry design pattern here, but you might also use a model from one of our other exercises.

• Divide a stack of blank notecards into three groups. On one group, write data that represents orders (one order per card). On the second group, write data that represents products (one product per card). On the third group, write data that represents order lines (one order line per card). Remember that you can’t make an order line without an order and a product (the parents). You might also make a stack of Customer cards, and show how one customer can place many orders. • Use the format shown below. This is actually the UML notation for an object instance of a class; however, we are adding the primary and foreign keys from the relation scheme (just as you will do in the database). This may be a tedious bit of writing, but it will emphasize the fact that object instances and database rows both are actually functions that assign domain constants to attribute names. • Use realistic data, but don’t just copy what we already have in the tables. Make enough cards to show at least a few orders that include multiple order lines, and a few items that were purchased on more than one order. Your cards should look like this:

Other views of this diagram: Large image - Description (text) When you are finished, arrange the cards so that they show how order lines and products are matched to the orders (or how orders and order lines are matched to products).

Design pattern: many-to-many with history (the library loan) Remember that the UML association class represents the attributes of a many-to-many association, but can only be used if there is at most one pairing of any two individuals in the relationship. This means, for example in order entry, that there can be only one order line for each item ordered. This constraint is consistent with the enterprise being modeled. • There are times when we need to allow the same two individuals in a many-to-many association to be paired more than once. This frequently happens when we need to keep a history of events over time.

Example: In a library, customers can borrow many books and each book can be borrowed by many customers, so this seems to be a simple many-to-many association between customers and books. But any one customer may borrow a book, return it, and then borrow the same book again at a later time. The library records each book loan separately. There is no invoice for each set of borrowed books and therefore no equivalent here of the Order in the order entry example. (You have already seen other parts of the library model in exercises.) • The loan is an event that happens in the real world; we need a regular class to model it correctly. We’ll call this the “library loan” design pattern. First, we need to understand what the classes and associations mean: “A customer is any person who has registered with the library and is elegible to check out books.” “A catalog entry is essentially the same as an old-fashioned index card that represents the title and other information about books in the library, and allows the customers to quickly find a book on the shelves.” “A book-on-the-shelf is the physical volume that is either sitting on the library shelves or is checked out by a customer. There can be many physical books represented by any one catalog entry.” “A loan event happens when one customer takes one book to the checkout counter, has the book and her library card scanned, and then takes the book home to read.” “Each Customer makes zero or more Loans.” “Each Loan is made by one and only one Customer.” “Each Loan checks out one and only one BookOnShelf.” “Each BookOnShelf is checked out by zero or more Loans.” “Each BookOnShelf is represented by one and only one CatalogEntry (catalog card).” “Each CatalogEntry can represent one or more physical copies of the same book-on-theshelf.”

Class diagram

Other views of this diagram: Large image - Data dictionary (text)

Relation scheme diagram As in the order entry example, the Customers table will need a surrogate key (added by us) to save space when it is copied in the Loans. The CatalogEntries scheme already has two external keys: the call number and the ISBN (International Standard Book Number). The first of these is defined by the Library of Congress Classification system, and contains codes that represent the subject, author, and year published. The second of these is defined by an ISO (International Standards Organization) standard, number 2108. We’ll use the callNmbr as the primary key, since it has more descriptive value than the ISBN and is smaller than the descriptive CK {title, pubDate}.

Other views of this diagram: Large image - Data dictionary (text) • As we would do in a junction table scheme, we’ll copy the primary key attributes from both the Customers and the BooksOnShelf into the Loans scheme. This tells us which customer borrowed which book, but it doesn't tell when it was borrowed; we have to know the dateTimeOut in order to pair a customer with the same book more than once. We can call this a discriminator attribute, since it allows us to discriminate between the multiple pairings of customer and book. If you refer back to the UML class diagram, you’ll see that the loan, which would have been a many-to-many association class between customers and books, has become a “real” class because of the discriminator attribute. • In most cases like this, we would use both FKs plus the discriminator attribute dateTimeOut as PK of the Loans; here we need only the FK from the BooksOnShelf and the dateTimeOut (since it is physically impossible to run the same book through the scanner more than once at a time). Notice that there is actually another CK for loans: {dateTimeOut, scannerID}, since it is also physically impossible for the same scanner to read two different books at exactly the

same time. We chose {callNmbr, copyNmbr, dateTimeOut} because it has just a bit more descriptive value and because we don’t care about size here (since the Loan has no children).

Exercise: employee timecards In many businesses, employees may work on a number of different projects. Each week, they will submit a time card that lists each project on a separate line, along with the number of hours that they have worked that week on that project. • Describe each class in English. • Draw the class diagram, including association classes if required. • Describe each association in English (both directions). • Draw the relation scheme. The solution to this exercise will be discussed in class or posted online at a later date. [

Design pattern: subkeys (the zip code) (Wayne Dick and Tom Jewett) One of the major goals of relational database design is to prevent unnecessary duplication of data. In fact, this is one of the main reasons for using a relational database instead of a “flat file” that stores all information in one table. Sometimes we will design a class that seems to be correct, only to find out in the relation scheme or in the table itself that we have a problem. Example: Almost every personal productivity program today includes some sort of contact manager. A “contact” is a person who could be a business associate or simply a friend or family member. Many of these programs have a very simplistic one-table model for the contact information, which probably looks something like this (ignoring phone numbers for the moment):

Other views of this diagram: Large image - Data dictionary (text) • It may not be obvious that this model has a problem, until you look at the Contacts table with some typical data filled in:

Contacts street zipCode city state first,ame last,ame George Barnes 1254 Bellflower 90840 Long Beach CA Susan Noble 1515 Palo Verde 90840 Long Beach CA Erwin Star 17022 Brookhurst 92708 Fountain Valley CA Alice Buck 3884 Atherton 90836 Long Beach CA Frank Borders 10200 Slater 92708 Fountian Valley CA Hanna Diedrich 1699 Studebaker 90840 Long Beach CA • Notice the repeated information in the city and state attributes. This is not only redundant data; it might also be inconsistent data. (Can you spot the “typo” above?)

Functional dependencies, subkeys, and lossless join decomposition To understand why we have a problem, we first have to understand the concept of a functional dependency (FD), which is simply a more formal term for the super key property. If X and Y are sets of attributes, then the notation X→Y is read “X functionally determines Y” or “Y is functionally dependent on X.” This means that if I’m given a table filled with data plus the value of the attributes in X, then I can uniquely determine the value of the attributes in Y. • A super key always functionally determines all of the other attributes in a relation (as well as itself). This is a “good” FD. A “bad” FD happens when we have an attribute or set of attributes that are a super key for some of the other attributes in the relation, but not a super key for the entire relation. We call this set of attributes a subkey of the relation. • In our example above, the zipCode is a subkey of the Contacts table. It is not a super key for the entire table, but it functionally determines the city and state. (If you know the zip code, you can always find the city and state, although you might need all nine digits instead of the five we show here.) The opposite is not true, because many cities have more than one zip code, like Long Beach in this example. We can show this in the relation scheme:

Other views of this diagram: Large image - Data dictionary (text) • There is a very simple 3-step way to fix the problem with the relation scheme. 1. Remove all of the attributes that are dependent on the subkey. Put them into a new scheme. In this example, the dependent attributes are the city and state.

2. Duplicate the subkey attribute set in the new scheme, where it becomes the primary key of the new scheme. In this example, the sole subkey attribute is the zipCode. 3. Leave a copy of the subkey attribute set in the original scheme, where it is now a foreign key. It is no longer a subkey, because you’ve gotten rid of the attributes that were functionally dependent on it, and you’ve made it the primary key of its own table. The revised model will have a many-to-one relationship between the original scheme and the new one:

Other views of this diagram: Large image - Data dictionary (text) • The new Contacts table will look like the old one, minus the city and state fields. The new ZipLocations table, shown below, contains only one row per zip code. Joining this table to the Contacts (on matching zipCode pk-fk pairs) will produce the same information that was in the original table. What we have done is formally called lossless join decomposition of the original table. Zip locations city state zipCode 90840 Long Beach CA 90836 Long Beach CA 92708 Fountain Valley CA

Subkeys and normalization ,ormalization means following a procedure or set of rules to insure that a database is well designed. Most normalization rules are meant to eliminate redundant data (that is, unnecessary duplicate data) in the database. Subkeys always result in redundant data, so we need to eliminate them using the procedure outlined above. • If there are no subkeys in any of the tables in your database, you have a well-designed model according to what is usually called third normal form, or 3NF. Actually, 3NF permits subkeys in some very exceptional circumstances that we won’t discuss here; the strict nosubkey form is formally known as Boyce-Codd normal form, or BC,F. • Some textbooks use the terms partial FDs and transitive FDs. Both of these are subkeys— the first where the subkey is part of a primary key, the second where is isn’t. Both can be eliminated by the procedure that we’ve shown here.

Correcting the UML class diagram When we find a subkey in a relation scheme or table, we also know that the original UML class was badly designed. The problem, always, is that we have actually placed two conceptually different classes in a single class definition. • In this example, a zipCode is not just an attribute of the Contact class. It is part of a ZipLocation class, which we can describe as “a geographical location whose boundaries have been uniquely identified by the U.S Postal Service for mail delivery.” • The zipCode is an external key, created by the USPS for the convenience of its sorting machinery (not the postal customers). The ZipLocation class has the additional attributes of the city and state where it is located; in fact, it also has the attributes needed to precisely describe its boundaries, although we certainly do not need to represent these in our database. The geographical boundaries would form the “real” descriptive CK if they were included. As always, we need to describe the association between ZipLocations and Contacts: “Each Contact lives in one and only one ZipLocation” “Each ZipLocation is home to zero or more Contacts” • As with all one-to-many associations, the association itself identifies which Contact lives in which ZipLocation. If we had started with this class diagram, we would have produced exactly the same relation scheme that we developed with the normalization process above!

Other views of this diagram: Large image - Data dictionary (text)

Exercise: plant species You are working on a database for a company that grows and sells plants. One important table contains a list of the plant species that they grow, which are identified botanically by their genus and specie name, family, and common name. Even if you have never heard of these terms, you can analyze the table by looking at the data given below: Plant species genus specie family Ardesia japonica Myrsinaceae Beaucarnea recurvata Agavaceae Centaurea cineraria Asteraceae Centaurea gymnocarpa Asteraceae Centaurea montana Asteraceae

commonname Marlberry Ponytail Dusty Miller Dusty Miller

genus Dracaena Dracaena Echeveria Kalanchoe Kalanchoe Pseudosasa Senecio

Plant species specie family commonname draco Agavaceae Dragon Tree marginata Agavaceae elegans Crassulaceae Hen and Chicks beharensis Crassulaceae Felt Plant pinnata Crassulaceae Air Plant japonica Poaceae Arrow Bamboo cineraria Asteraceae Dusty Miller

• Draw the relation scheme for this table as it is shown above. Identify the primary key. • Draw the relation scheme for a lossless join decomposition of this table.

Design pattern: repeated attributes (the phone book) The contact manager example from our preceeding discussion of subkeys is also an excellent illustration of another problem that is found in many database designs. • Obviously, the contacts database will need to store phone numbers in addition to addresses. A typical simplistic model, even after fixing the zip code problem, might look something like this:

• Again, this might seem like a reasonable design until you look at the data (omitting the street and zip to reduce table width): Contact phones first,ame last,ame homePhone workPhone cellPhone 562-874310-999George Barnes 1234 3628 714-847562-975Susan Noble 3366 3388 Erwin

Star

fax

714-9975885

pager

714-9972428

Contact phones first,ame last,ame homePhone workPhone cellPhone 562-577562-561Alice Buck 1200 1921 714-968Frank Borders 8201 562-786Hanna Diedrich 7727

fax

pager

• There are at least two very large problems here: - None of our contacts are going to have all of the phone numbers that we've provided for. In fact, most contacts will have only one or two numbers—leaving most of these fields blank, or NULL. Remember that ,ULL is a special constant value in database systems that means “this field doesn’t have any value assigned to it.” It’s not the same as a zero length string or the number zero. In general, we want to eliminate unnecessary NULL values that might occur as a result of our design—and the NULL values in this table are definitely unnecessary. - No matter how many phone number fields we provide (five of them here), sooner or later someone will think of another kind of phone number that we need. With the model above, we would have to actually change the table structure to add another field. Any information like this that might change should be represented by data in a table rather than by the table structure. • In fact, we don’t have five single attributes here. We have a repeated attribute, phone, that also has an attribute of its own that tells us what type of number it is (home, work, cell, and so on). In effect, it’s a class within a class. Some database textbooks call this structure a weak entity, since it can’t exist without the parent entity type. • In UML, we can show multiplicity of attributes the same way we show multiplicity of an association (for example [0..*]). We can also show the data type of an attribute, which in this case is a structure (PhoneNumber). We have listed the structure attributes below in parentheses.

If we try to represent information this way in the Contacts table, we’ll end up with a subkey that we obviously don’t want. We have to create a new table in much the same way as we did for the zip locations. 1. Remove all of the phone number fields from the Contacts relation. Create a new scheme that has the attributes of the PhoneNumber structure (phoneType and number).

2. The Contacts relation has now become a parent, so we should add a surrogate key, contactID. Copy this into the new scheme, so that we can associate the phone number with the person it belongs to. There is now a one-to-many relationship between Contacts and PhoneNumbers. Notice that this is the opposite of the relationship between Contacts and ZipLocations. 3. To identify each phone number, we need to know at least who it belongs to and what type it is. However, to allow for any combination of contacts, phone types, and numbers, we will use all three attributes of the new scheme together as the primary key. It’s not a parent, so we aren’t concerned about size of the PK.

• The Contacts table now looks like it did before we added the phone numbers (with the addition of the contactID). The new PhoneNumbers table can be joined to the Contacts on matching contactID pk-fk pairs to provide all of the information that we had before. Contacts contactid firstname lastname street zipcode 1639 George Barnes 1254 Bellflower 90840 5629 Susan Noble 1515 Palo Verde 90840 3388 Erwin Star 17022 Brookhurst 92708 5772 Alice Buck 3884 Atherton 90836 1911 Frank Borders 10200 Slater 92708 4848 Hanna Diedrich 1699 Studebaker 90840 Phone numbers contactid phonetype number 1639 Home 562-874-1234 1639 Cell 310-999-3628 5629 Home 562-975-3388 5629 Work 714-847-3366 3388 Fax 714-997-5885 3388 Pager 714-997-2428 5772 Work 562-577-1200 5772 Cell 562-561-1921 1911 Home 714-968-8201 4848 Cell 562-786-7727

Employee dependents The modeling technique shown above is useful where the parent class has relatively few attributes and the repeated attribute has only one or a very few attributes of its own. However, you can also model the repeated attribute as a separate class in the UML diagram. One classic textbook example is an employee database. The employee class represents “a person who works for our company”; each employee has zero or more dependents. The dependent is conceptually a repeated attribute of the employee, but can be described separately as “a person who is related to the employee and may receive health care or other benefits based on this relationship.” We can represent this fact in the class diagram:

• The relation scheme is a standard one-to-many; the PK of the many-side relation will have to include both the fk from the parent and one or more local attributes to guarantee uniqueness.

Exercise: monthly publication usage (source: Mark Olson) A large task for computer manufacturers is to keep track of their publications: product specification sheets, marketing brochures, user manuals, and so on. One major manufacturer had developed a spreadsheet to record the monthly usage of each publication (that is, how many copies of the publication were distributed each month). Then they decided to export the spreadsheet into a database. The export program naturally converted each column of the spreadsheet into a database table attribute, so the result looked something like this:

• Revise the class diagram to correct any problems that you find in this design. Then draw the relation scheme for your corrected model.

Design pattern: multivalued attributes (hobbies) Attributes (like phone numbers) that are explicitly repeated in a class definition aren’t the only design problem that we might have to correct. Suppose that we want to know what hobbies each person on our contact list is interested in (perhaps to help us pick birthday or holiday presents). We might add an attribute to hold these. More likely, someone else has already built the database, and added this attribute without thinking about it.

• We’ve made this example obvious by using a plural name for the attribute, but this won’t always be the case. We can only be sure that there’s a design problem when we see data in a table that looks like this: Contact hobbies hobbies contactid firstname lastname 1639 George Barnes reading 5629 Susan Noble hiking, movies 3388 Erwin Star hockey, skiing 5772 Alice Buck 1911 Frank Borders photography, travel, art 4848 Hanna Diedrich gourmet cooking • In this case, the hobby attribute wasn’t repeated in the scheme, but there are many distinct values entered for it in the same column of the table. This is called a multivalued attribute. The problem with doing it is that it is now difficult (but possible) to search the table for any particular hobby that a person might have, and it is impossible to create a query that will

individually list the hobbies that are shown in the table. Unlike the phone book example, NULL values are probably not part of the problem here, even if we don’t know the hobbies for everyone in the database. • In UML, we can again use the multiplicity notation to show that a contact may have more than one hobby:

As you should expect by now, we can’t represent the multivalued attribute directly in the Contacts relation scheme. Instead, we will remove the old hobbies attribute and create a new scheme, very similar to the one that we created for the phone numbers.

• The relationship between Contacts and Hobbies is one-to-many, so we create the usual pkfk pair. The new scheme has only one descriptive attribute, the hobby name. To uniquely identify each row of the table, we need to know both which contact this hobby belongs to and which hobby it is—so both attributes form the pk of the scheme. • With data entered, the new table looks similar to the PhoneNumbers. It can also be joined to Contacts on matching pk-fk contactID pairs, re-creating the original data in a form that we can now conveniently use for queries. Hobbies contactid hobby 1639 reading 5629 hiking 5629 movies 3388 hockey

Hobbies hobby contactid 3388 skiing 1911 photography 1911 travel 1911 art 4848 gourmet cooking

Exercise: software list Sometimes it takes more than just a glance at the class diagram to spot problems with a design. Consider the following class type that might be used by a software vendor to list software titles that are available.

• There is nothing obviously wrong with this design. However, the users of this database might enter data that would cause problems, as shown in this table: Software TITLE VE,DOR PLATFORM VERSIO, Wordy Macrosoft Win, Mac 9.4, 6.7 Visual B-- Macrosoft Win 6.0 10.4.5.2, 10.3.1.7 Cherokee Open Source Linux, Solaris Inlook Hinkysoft Win, Linux, Palm OS 0.5 Mac 22.1 Corral Draw Corral • Revise the class diagram to correct any problems that you find in this design. Then draw the relation scheme for your corrected model.

Discussion: more about domains You learned earlier that a domain is the set of legal values that can be assigned to an attribute. Each attribute in a database must have a well-defined domain; you can’t mix values from different domains in the same attribute. (See below for some examples.) One goal of database developers is to provide data integrity, part of which means insuring that the value entered in each field of a table is consistent with its attribute domain. Sometimes we can devise a validation rule to separate good from bad data; sometimes we can’t. Before you design the data type and input format for an attribute, you have to understand the characteristics of its domain.

• Some domains can only be described with a general statement of what they contain. These are difficult or impossible to analyze precisely; the best we can do is to make them VARCHAR strings that are long enough to hold any expected value. Examples include: - Names of people. We typically show these broken into First (which might include a middle name or initial) and Last (which is really the family name—some languages write this first). Depending on your application, you might have to add attributes for a courtesy title (Mr., Ms., Dr., etc.), a suffix (Jr., III, etc.) or a nickname ('Tom' for 'Thomas' and so on). - Names of businesses or organizations. These typically fit into a single character field, and are not in the same domain as people’s names. Different domains require different attributes (which sometimes can even mean different class/entity types). - Street addresses. Even the format of these can vary widely: '3201 Main St., Apt. 3', 'Sohnmattstraße 14', and 'P.O. Box 8259' are all valid address strings. • Some domains have at least some pattern in their permitted values. These might be recognizable in code, for example with a regular expression, although it is still impossible to insure that every value that passes a validity check is actually correct. Examples include: - Email addresses. To be valid, an email address must contain a single @ sign that separates the user name from the server and domain names. It cannot contain any spaces. Unfortunately, that’s about all we can check. - Web addresses (URLs). These form a different domain from email addresses or telephone numbers, no matter how tempting it might be to permit any of them to be entered in the same attribute field. Other than checking for spaces and other illegal characters, there’s again not much way to be sure a URL is valid. - North American telephone numbers. Many databases, and most wireless phones, require numbers to be exactly ten digits—perhaps providing formatting such as (800)-555-1212. Problem: you can't store extension numbers, access codes, or other data that has to be transmitted along with the number itself. Solution: make this an unformatted character string long enough to hold all the information that is needed. • A very few domains conform to a precise pattern that can be analyzed or specified exactly. Examples include: - United States Social Security numbers. These are always of the form 999-99-9999, where 9 represents any digit. - United States Zip codes, which are always of the form 99999 or 99999-9999. United States state abbreviations are always of the form AA, where A is an upper-case letter; however, just checking the format won’t insure a valid abbreviation. There is a much better way to model this kind of domain, which we will explain separately. - Definitely NOT international postal codes or phone numbers. Don’t ever over-specify a domain or data entry field in any way that would prevent users from entering a valid real-life domain value.

• Easy domains to handle are those which can be specified by a well-defined, built-in system data type. These include integers, real numbers, and dates/times. You might have to rangecheck these data types to insure that realistic values are entered. In most systems, a boolean data type is also available; oddly, Oracle® doesn’t provide this. (Oracle developers typically use a CHAR(1) data type, and assign it values of 'T' or 'F'). • Finally, there are many domains that may be specified by a well-defined, reasonably-sized set of constant values. We’ll look at these in a separate page. In general, your user interface should provide any necessary format or range checking. If done well, this can help the user with data entry, increase data integrity, and prevent the user from having to deal with cryptic and frustrating error messages from the database itself.

Design pattern: enumerated domains Attribute domains that may be specified by a well-defined, reasonably-sized set of constant values are called enumerated domains. You might know all of the values of the domain at design time, or you might not. In either case, you should keep the entire list of values in a separate table. Tables that are created for this purpose might be called enumeration tables, dictionary tables, lookup tables, or domain-control tables or entities in some textbooks and database software systems. Once in the database, they are no different from any other table; PKs and FKs link them to other tables as always. There are a number of ways to design these tables, from which the designer can choose the most appropriate for a particular attribute in a particular class. • Many students ask if this technique will create too many tables and query joins. The answer is: “No.” Design the database as well as you can—if you have to “break the rules” later for faster performance, you can always do so. In most designs, the drawbacks of any additional table or tables are overwhelmed by their advantages: - You can read the values from the table into a combo box, list box, or similar input control on either a Web page or a GUI form. This allows the user to easily select only values that are valid in this domain at this time. - You can always update the table if new values are added to the domain, or if existing values are changed. This is much easier than modifying your user-interface code or your table structure. 1. In our earlier ZipLocations example, the state attribute clearly fits the definition of an enumerated domain. In UML, we can simply use a data type specification to show this, without adding a new class type.

• The relation scheme will show the table that contains the enumerated domain values. This table might have a single attribute, or it might have two attributes: one for the true values and one for a substitute key. Notice that the true values always form a candidate key of the table. • For states, the second approach provides both the full name of the state and the U.S. Postal Service abbreviation—a real benefit when it is time to design the user interface. Any time that you can find an existing enumeration (external key), you should use it instead of making up your own values. Besides the USPS state codes, examples include international airport designators (like LAX, defined by the International Civil Aviation Organization) and web top-level domains for countries (like CH or DE, called ccTLDs and defined by the Internet Assigned Numbers Authority).

2. Multivalued attributes (for example, hobbies) might also have enumerated domains. We can show this in the class diagram exactly as we did with a single-valued attribute:

• In the scheme, the relationship between Contacts and Hobbies has become many-to-many, instead of one-to-many. This is shown in the scheme by linking an enumeration table to the previous Hobbies table (which now functions like an association class).

Exercise: a pizza shop .ote:This exercise might be started now and finished after you have covered subclasses, or simply delayed until after that topic. You are designing a database for a pizza shop that wants to get into Web-based sales. Your client has given you the transcript of a typical phone order conversation: Pizza shop associate (Lori): “Thank you for calling the Pizza Shop; this is Lori. How may I help you?” Caller (Rick): “What toppings do you put on your all-meat special?” Lori: “Italian sausage, pepperoni, ground beef, salami, and bacon.” Rick: “OK, I’d like a large one, but without the bacon.” Lori: “Do you want regular crust, extra-thin, or whole wheat?” Rick: “Regular is fine. And a medium wheat crust with just cheese.” Lori: “We have mozzarella, parmesan, romano, smoked cheddar, and jalapeño jack.” Rick: “Uhhh…just mozzarella and romano. What kind of sauce is on that?” Lori: “Marinara, spicy southwestern, tandoori masala, or pesto—your choice.” Rick: “Pesto sounds good.” Lori: “You can add a large order of breadsticks for just 99 cents.” Rick: “Sure, why not? And I'd like three small salads…” (muffled) “…make two of’em with Italian dressing and one with ranch.” Lori: “The dressing comes on the side; we’ll give you an extra one of each flavor. What would you like to drink?” Rick: “Keg’a beer, maybe?” Lori: “Sorry, we just have soft drinks.” Rick: (laughs) “Just kidding—how about two medium diet colas and a large iced tea.” Lori: “That’s one large regular crust all-meat special, no bacon, one medium wheat crust with pesto sauce, mozzarella and romano, one large order of breadsticks, three small salads, three Italian dressing, two ranch, two medium diet colas and one large iced tea. Just a minute, please…(cash register clicks several times)…your total with tax is 27 dollars and 39 cents. Is this for pickup or delivery?”

Rick: “I’ll pick it up.” Lori: “And your name?” Rick: “Rick.” Lori: “Thank you for your order, Rick. It’ll be ready in about 20 minutes.” Rick: “See’ya then.” • Develop a class diagram which will accommodate at least the information contained in this conversation, then draw the relation scheme. Remember to describe each class, and the associations between them, in English. Your system will have to let the sales associate (or Web customer) select from the available current menu choices, and also let the sales associate (or Web script) record the order.

Design pattern: subclasses Top down design As you are developing a class diagram, you might discover that one or more attributes of a class are characteristics of only some individuals of that class, but not of others. This probably indicates that you need to develop a subclass of the basic class type. We call the process of designing subclasses from “top down” specialization; a class that represents a subset of another class type can also be called a specialization of its parent class. Example: we will model the graduate students at a university. Some are employed by the university as teaching associates (TAs); some are employed as research associates (RAs); some are not employed by the university at all. For the TAs, we need to know which course they are assigned to teach; for the RAs, we need to know the grant number of the research project to which they are assigned. A first listing of the student attributes might look like this:

• At least one of the two unique attributes will always be null; frequently they both will be null. We can fix the problem by showing two specialized classes of students: TAs and RAs. The UML symbol for subclass association is an open arrowhead that points to the parent class.

• Unique attributes are now contained in the subclass types. Attributes that are common to all students remain in the superclass (parent). • The verbs to describe a subclass association are implied by the diagram. In this case, we would say that each grad student may be either a TA, an RA, or neither; each TA or RA is a grad student.

Specialization constraints Rather than the usual cardinality/multiplicity symbols, the subclass association line is labeled with specialization constraints. Constraints are described along two dimensions: incomplete versus complete, and disjoint versus overlapping. • In an incomplete specialization, also called a partial specialization, only some individuals of the parent class are specialized (that is, have unique attributes). Other individuals of the parent class have only the common attributes. • In a complete specialization, all individuals of the parent class have one or more unique attributes that are not common to the generalized (parent) class. • In a disjoint specialization, also called an exclusive specialization, an individual of the parent class may be a member of only one specialized subclass. • In an overlapping specialization, an individual of of the parent class may be a member of more than one of the specialized subclasses.

Relation scheme diagram We create a table for each of the subclasses, linked to the parent class with a pk-fk pair as always. Since the relationships are one-to-one, only the fk is needed to form the pk of the subclass table. There is no way to enforce the specialization constraints in the table structure—this has to be done by the data entry system. Notice that there is no attribute in the parent table to tell us if a student is a TA, an RA, or neither—the union of two outer join queries will produce a table with all of the information that we need.

Bottom up design Sometimes, instead of finding unique attributes in a single class type, you might find two or more classes that have many of the same attributes. This probably indicates that you need to develop a superclass of the classes with common attributes. We call the process of designing subclasses from “bottom up” generalization; a class or entity that represents a superset of other class types can also be called a generalization of the child types. .ote: if you have two or more class types with exactly the same set of attributes, you probably have only one class type instead of many! Example (thanks to Martin Malolepszy): A student of mine had a summer job with a brushclearing service. This is a fairly specialized business but an essential one in southern California, where dried plant growth (brush) can present a severe fire hazard if it is not cleared from around houses and other structures. In addition to his exhausting physical work, Martin built a small database to help the owner manage this business. • One important class type was the lot (or property) to be cleared. Some lots were in the city, with a standard street-and-number address. Other lots were not on a city street, but were described by the county surveyor's section and tract number. It seemed as if there were two class types:

• Actually, a few of the lots were identified by both address schemes. A closer look at the city and county lot classes also shows two common descriptive attributes (the owner and the lot size). The common attributes should go in a generalization or superclass that is simply called a “lot.” The relation scheme is identical in structure to the previous example.

Exercise: sales commission The training material for a well-known database system includes an example employee class with standard attributes such as employee ID, name, and so on. The same entity also includes both a salary attribute and a commission attribute, even though only sales representatives earn a commission. (Everyone earns a salary.) In their lesson plans, this vendor emphasizes various ways that their software can handle the inevitable null values of the commission attribute. I believe that their model is simply wrong.

• Design a better way to preserve the salary and commission information but preclude null values in any of the attributes. Draw the class diagram and the relation scheme.

Design pattern: aggregation Sometimes a class type really represents a collection of individual components. Although this pattern can be modeled by an ordinary association, its meaning becomes much clearer if we use the UML notation for an aggregation. Example: a small business needs to keep track of its computer systems. They want to record information such as model and serial number for each system and its components.

• A very naïve way to do this would be to put all of the information into a single class type. You should recognize that this class contains a set of repeated attributes with all of the problems of the “phone book” pattern. You could fix it as we show here:

Incorrect model (with improvement)

• The improved model will accommodate the addition of more types of components (a scanner, perhaps), a system with more than one monitor or printer, or a replacement component on the shelf that don’t belong to any system right now. But UML allows us to show the association in a more semantically correct way.

Better model (with UML aggregation)

• The system is an aggregation of components. In UML, aggregation is shown by an open diamond on the end of the association line that points to the parent (aggregated) class. There is an implied multiplicity on this end of 0..1, with multiplicity of the other end shown in the diagram as usual. To describe this association, we would say that each system is composed of one or more components and each component is part of zero or one system. • Since the component can exist by itself in this model (without being part of a system), the system name can’t be part of its PK. We'll use the only candidate key {type, mfgr, model, SN} as PK, since this class is not a parent. The system name, just an FK here, will be filled in if this component is installed as part of a system; it will be null otherwise.

In UML, there is a stronger form of aggregation that is called composition. The notation is similar, using a filled-in diamond instead of an open one. In composition, component instances cannot exist on their own without a parent; they are created with (or after) the parent and they are deleted if the parent is deleted. The implied multiplicity on the “diamond” end of the association is therefore 1..1.

Design pattern: recursive associations A recursive association connects a single class type (serving in one role) to itself (serving in another role). Example: In most companies, each employee (except the CEO) is supervised by one manager. Of course, not all employees are managers. This example is used in almost every database textbook, since the association between employees and managers is relatively easy to understand. Perhaps the best way to visualize it is to start with two class types:

Incorrect model

• The problem with this model is that each manager is also an employee. Building the second (manager) table not only duplicates information from the employee table, it virtually guarantees that there will be mistakes and conflicts in the data. We can fix the problem by eliminating the redundant class and re-drawing the association line.

Correct model

• Normally, we wouldn’t show an fk in the class diagram; however, including the manager as an attribute of the employee here (in addition to the association line) can help in understanding the model. In the relation scheme, we can explicitly show the connection between the surrogate pk (employeeID) and the managerID (which is an fk, even though it is in the same scheme).

In some project-oriented companies, an employee might work for more than one manager at a time. We also might want to keep a history of the employees’ supervision assignments over time. We can model either case by revising the class diagram to a many-to-many pattern:

• The relation scheme for this model looks exactly like other many-to-many applications, with the exception that both foreign keys come from the same pk table.

Retrieving data To produce a list of employees and their managers, we have to join the employees table to itself, using two different aliases for the table name. An outer join is needed to include any employee who is not managed by anyone. SELECT E.lastName AS "Employee", M.lastName AS "Manager" FROM Employees E LEFT OUTER JOIN Employees M ON E.managerID = M.employeeID ORDER BY E.lastName

In effect, the SQL statement works as if there were two copies of the employees table, as in the first (incorrect) UML diagram. You can visualize rows being joined this way:

The many-to-many structure is handled similarly in SQL. (Note the explicit ordering of the joins specified by parentheses.): SELECT E.lastName AS "Employee", M.lastName AS "Manager" FROM Employees E LEFT OUTER JOIN (Supervisions S INNER JOIN Employees M ON S.managerID = M.employeeID) ON E.employeeID = S.employeeID ORDER BY E.lastName

Exercise: team games You are modeling a sports league—it could be any team sport. Certainly one important class is the team, which will have a name (unique within the league), a coach, a home field or venue, and probably more attributes. • Obviously, teams play games. In each game, there is a designated “home team” and “away team” (even if the game is played at a neutral venue). Teams may play each other more than once during a season, possibly even with the same home and away roles (for example, during playoffs). • Design a class diagram that shows the teams, the games that they play, and the score of each game. Then draw the relation scheme for your model.

Appendix: traditional normalization Normalization is usually thought of as a process of applying a set of rules to your database design, mostly to achieve minimum redundancy in the data. Most textbooks present this as a three-step process, with correspondingly labeled “normal forms,” which could be done in an almost algorthmic sequence. • In theory, you could start with a single relation scheme (sometimes called the universal scheme, or U) that contains all of the attributes in the database—then apply these rules recursively to develop a set of increasingly-normalized sub-relation schemes. When all of the schemes are in third normal form, then the whole database is properly normalized. In practice, you will more likely apply the rules gradually, refining each relation scheme as you develop it from the UML class diagram or ER model diagram. The final table structures should be the same no matter which method (or combination of methods) you’ve used. • Since most developers will use traditional terms, you should know how the design patterns that you have learned will lead to the same (normalized) results, as shown in the following table: Normal forms ,ormal form Traditional definition As presented here • All attributes must be First normal form • Eliminate multi-valued attributes, and atomic, and • Eliminate repeated attributes (1NF) • No repeating groups

Normal forms ,ormal form Traditional definition As presented here • First normal form, and Second normal • Eliminate subkeys (where the subkey is part • No partial functional of a composite primary key) form (2NF) dependencies • Second normal form, and Third normal form • Eliminate subkeys (where the subkey is not • No transitive functional (3NF) part of the primary key) dependencies • Some textbooks discuss “higher” normal forms, such as BCNF (Boyce-Codd), 4NF, 5NF, and DKNF (domain-key). These topics are properly covered in a more advanced course or tutorial.

Denormalization One major premise of this tutorial is that you should learn to develop the “best” possible design—which really focuses on the database structure itself. By doing this, you should be able to avoid many of the problems, bugs, inconsistencies, and maintenance nightmares that frequently plague actual systems in use today. • However, your database will always be part of a larger system, which will include at least a user interface and reporting structure, perhaps with a large amount of application code written in a language such as Java or C++. Your database could also be the back-end of a Web site, with both middle-tier business logic and front-end presentation code dependent on it. It is not uncommon for developers to “break the rules” of database design in order to accommodate other parts of a system. • An example of denormalization, using our “phone book” problem, would be to store the city and state attributes in the basic contacts table, rather than making a separate zip codes table. At the cost of extra storage, this would save one join in a SELECT statement. Although this would certainly not be needed in such a simple system, imagine a Web site that supports thousands of “hits” per second, with much more complicated queries needed to produce the output. With today’s terabyte disk systems, it might be worth using extra storage space to keep Web viewers from waiting excessively while a page is being generated. On the other hand, similarly-increasing processor power makes it less likely that this tradeoff will actually have to be made in practice. • The key to successful denormalization is to make sure that end users of the system never have to manually duplicate or maintain the redundant data. Possible techniques for doing this include using materialized views, writing triggers (code executed by the database itself—not available on all systems), or writing application code that takes care of it at data-entry time.

Basic queries: SQL and RA Retrieving data with SQL SELECT To look at the data in our tables, we use the select (SQL) statement. The result of this statement is always a new table that we can view with our database client software or use with programming languages to build dynamic web pages or desktop applications. Although the result table is not stored in the database like the named tables are, we can also use it as part of other select statements. The basic syntax (which is not case sensitive) consists of four clauses: SELECT FROM WHERE ORDER BY ;

• Of the four clauses, only the first two are required. When you are learning to build queries, it is helpful to follow a specific step-by-step sequence, look at the data after each modification to the query, and be sure that you understand the results at each step. We’ll build a list of customers who live in a specific zip code area, showing their first and last names and phone numbers and listing them in alphabetical order by last name. 1. Look at all of the relevant data—this is called the result set of the query, and it is specified in the FROM clause. We have only one table, so the result set should consist of all the columns (* means all attributes) and rows of this table. SELECT * FROM customers;

Customers cfirstname clastname cphone cstreet czipcode Tom Jewett 714-555-1212 10200 Slater 92708 Alvaro Monge 562-333-4141 2145 Main 90840 Wayne Dick 562-777-3030 1250 Bellflower 90840 2. Pick the specific rows you want from the result set (for example here, all customers who live in zip code 90840). Notice the single quotes around the string you’re looking for—search strings are case sensitive! SELECT * FROM customers WHERE cZipCode = '90840';

Customers in zip code 90840 cfirstname clastname cphone cstreet czipcode Alvaro Monge 562-333-4141 2145 Main 90840 Wayne Dick 562-777-3030 1250 Bellflower 90840 3. Pick the attributes (columns) you want. Notice that changing the order of the columns (like showing the last name first) does not change the meaning of the data.

SELECT cLastName, cFirstName, cPhone FROM customers WHERE cZipCode = '90840';

Columns from SELECT cLast,ame cFirst,ame cPhone Monge Alvaro 562-333-4141 Dick Wayne 562-777-3030 4. In SQL, you can also specify the order in which to list the results. Once again, the order in which rows are listed does not change the meaning of the data in them. SELECT cLastName, cFirstName, cPhone FROM customers WHERE cZipCode = '90840' ORDER BY cLastName, cFirstName;

Rows in order cLast,ame cFirst,ame cPhone Dick Wayne 562-777-3030 Monge Alvaro 562-333-4141

Why SQL works: the RA select and project Like all algebras, RA applies operators to operands to produce results. RA operands are relations; results are new relations that can be used as operands in building more complex expressions. We’ll introduce two of the RA operators following the example and sequence above. 1. To represent a single relation in RA, we only need to use its name. We can also represent relations and schemes symbolically with small and capital letters, for example relation r over scheme R. In this case, r = customers and R = the Customers scheme. 2. The select (RA) operator (written σ) picks tuples, like the SQL WHERE clause picks rows. It is a unary operator that takes a single relation or expression as its operand. It also takes a predicate, θ, to specify which tuples are required. Its syntax is σθr, or in our example: σcZipCode='90840'customers. The scheme of the result of σθr is R—the same scheme we started with—since we haven’t done anything to change the attribute list. The result of this operation includes all tuples of r for which the predicate θ evaluates to true. 3. The project (RA) operator (written π) picks attributes, confusingly like the SQL SELECT clause. It is also a unary operator that takes a single relation or expression as its operand. Instead of a predicate, it takes a subscheme, X (of R), to specify which attributes are required. Its syntax is πXr, or in our example: πcLastName, cFirstName, cPhonecustomers.

The scheme of the result of πXr is X. The tuples resulting from this operation are tuples of the original relation, r, cut down to the attributes contained in X. • For X to be a subscheme of R, it must be a subset of the attributes in R, and preserve the assignment rule from R (that is, each attribute of X must have the same domain as its corresponding attribute in R). • If X is a super key of r, then there will be the same number of tuples in the result as there were to begin with in r. If X is not a super key of r, then any duplicate (non-distinct) tuples are eliminated from the result. Just as in the SQL statement, we can apply the project operator to the output of the select operation to produce the results that we want: πXσθr or πcLastName, cFirstName, cPhone σcZipCode='90840'customers. Since RA considers relations strictly as sets of tuples, there is no way to specify the order of tuples in a result relation.

Basic SQL statements: DDL and DML In the first part of this tutorial, you’ve seen some of the SQL statements that you need to start building a database. This page gives you a review of those and adds several more that you haven’t seen yet. • SQL statements are divided into two major categories: data definition language (DDL) and data manipulation language (DML). Both of these categories contain far more statements than we can present here, and each of the statements is far more complex than we show in this introduction. If you want to master this material, we strongly recommend that you find a SQL reference for your own database software as a supplement to these pages.

Data definition language DDL statements are used to build and modify the structure of your tables and other objects in the database. When you execute a DDL statement, it takes effect immediately. • The create table statement does exactly that: CREATE TABLE ( , ... );

The data types that you will use most frequently are character strings, which might be called VARCHAR or CHAR for variable or fixed length strings; numeric types such as NUMBER or INTEGER, which will usually specify a precision; and DATE or related types. Data type syntax is variable from system to system; the only way to be sure is to consult the documentation for your own software.

• The alter table statement may be used as you have seen to specify primary and foreign key constraints, as well as to make other modifications to the table structure. Key constraints may also be specified in the CREATE TABLE statement. ALTER TABLE ADD CONSTRAINT PRIMARY KEY ();

You get to specify the constraint name. Get used to following a convention of tablename_pk (for example, Customers_pk), so you can remember what you did later. The attribute list contains the one or more attributes that form this PK; if more than one, the names are separated by commas. • The foreign key constraint is a bit more complicated, since we have to specify both the FK attributes in this (child) table, and the PK attributes that they link to in the parent table. ALTER TABLE ADD CONSTRAINT FOREIGN KEY () REFERENCES ();

Name the constraint in the form childtable_parenttable_fk (for example, Orders_Customers_fk). If there is more than one attribute in the FK, all of them must be included (with commas between) in both the FK attribute list and the REFERENCES (parent table) attribute list. You need a separate foreign key definition for each relationship in which this table is the child. • If you totally mess things up and want to start over, you can always get rid of any object you’ve created with a drop statement. The syntax is different for tables and constraints. DROP TABLE ; ALTER TABLE DROP CONSTRAINT ;

This is where consistent constraint naming comes in handy, so you can just remember the PK or FK name rather than remembering the syntax for looking up the names in another table. The DROP TABLE statement gets rid of its own PK constraint, but won’t work until you separately drop any FK constraints (or child tables) that refer to this one. It also gets rid of all data that was contained in the table—and it doesn't even ask you if you really want to do this! • All of the information about objects in your schema is contained, not surprisingly, in a set of tables that is called the data dictionary. There are hundreds of these tables most database systems, but all of them will allow you to see information about your own tables, in many cases with a graphical interface. How you do this is entirely system-dependent.

Data manipulation language DML statements are used to work with the data in tables. When you are connected to most multi-user databases (whether in a client program or by a connection from a Web page script), you are in effect working with a private copy of your tables that can’t be seen by

anyone else until you are finished (or tell the system that you are finished). You have already seen the SELECT statement; it is considered to be part of DML even though it just retreives data rather than modifying it. • The insert statement is used, obviously, to add new rows to a table. INSERT INTO VALUES (, ... );

The comma-delimited list of values must match the table structure exactly in the number of attributes and the data type of each attribute. Character type values are always enclosed in single quotes; number values are never in quotes; date values are often (but not always) in the format 'yyyy-mm-dd' (for example, '2006-11-30'). Yes, you will need a separate INSERT statement for every row. • The update statement is used to change values that are already in a table. UPDATE SET = WHERE ;

The update expression can be a constant, any computed value, or even the result of a SELECT statement that returns a single row and a single column. If the WHERE clause is omitted, then the specified attribute is set to the same value in every row of the table (which is usually not what you want to do). You can also set multiple attribute values at the same time with a comma-delimited list of attribute=expression pairs. • The delete statement does just that, for rows in a table. DELETE FROM WHERE ;

If the WHERE clause is omitted, then every row of the table is deleted (which again is usually not what you want to do)—and again, you will not get a “do you really want to do this?” message. • If you are using a large multi-user system, you may need to make your DML changes visible to the rest of the users of the database. Although this might be done automatically when you log out, you could also just type: COMMIT;

• If you’ve messed up your changes in this type of system, and want to restore your private copy of the database to the way it was before you started (this only works if you haven’t already typed COMMIT), just type: ROLLBACK;

Although single-user systems don’t support commit and rollback statements, they are used in large systems to control transactions, which are sequences of changes to the database. Transactions are frequently covered in more advanced courses.

Privileges If you want anyone else to be able to view or manipulate the data in your tables, and if your system permits this, you will have to explicitly grant the appropriate privilege or privileges (select, insert, update, or delete) to them. This has to be done for each table. The most common case where you would use grants is for tables that you want to make available to scripts running on a Web server, for example: GRANT select, insert ON customers TO webuser;

Basic query operation: the join SQL syntax In most queries, we will want to see data from two or more tables. To do this, we need to join the tables in a way that matches up the right information from each one to the other—in our example, listing all of the customer data along with all of the order data for the orders that each customer has placed. In the latest SQL standard, the join is specified in the FROM clause: SELECT * FROM customers NATURAL JOIN orders;

Customers joined to Orders cfirstname clastname cphone cstreet czipcode orderdate soldby Alvaro Monge 562-333-4141 2145 Main 90840 2003-07-14 Patrick Wayne Dick 562-777-3030 1250 Bellflower 90840 2003-07-14 Patrick 90840 2003-07-18 Kathleen Alvaro Monge 562-333-4141 2145 Main Alvaro Monge 562-333-4141 2145 Main 90840 2003-07-20 Kathleen • The NATURAL JOIN keyword specifies that the attributes whose values will be matched between the two tables are those with matching names; with very rare exceptions, these will be the pk/fk attributes, and they have to have matching data types (and domains) as well. The join attributes that are matched—here {cfirstname, clastname, cphone}—are shown only once in the result, along with all other attributes of both tables. • Notice that all of the customer information is repeated for each order that the customer has placed; this is expected because of the one-to-many relationship between Customers and Orders. Notice also that any customer who has not placed an order is missing from the results; this is also expected because there is no fk in the Orders table to match that customer's pk in the Customer's table.

• The NATURAL JOIN shown here matches the RA join definition precisely. Unfortunately, not all database systems support the NATURAL JOIN keyword, so you may have to use a different syntax and may see slightly different results. We will discuss other join types in a later page.

How it works The easiest way to understand the join is to think of the database software looking one-byone at each pair of rows from the two tables. Customers cfirstname clastname cphone cstreet czipcode Tom Jewett 714-555-1212 10200 Slater 92708 90840 Alvaro Monge 562-333-4141 2145 Main Wayne Dick 562-777-3030 1250 Bellflower 90840 Orders cfirstname clastname cphone orderdate soldby Alvaro Monge 562-333-4141 2003-07-14 Patrick Wayne Dick 562-777-3030 2003-07-14 Patrick Alvaro Monge 562-333-4141 2003-07-18 Kathleen Alvaro Monge 562-333-4141 2003-07-20 Kathleen • Look at the first row of Orders and the first row of Customers. Do the join attributes (pk/fk pair) match? No, so nothing happens; these rows are not part of the join results. Now look at the first row of Orders and the second row of Customers. Do the join attributes (pk/fk pair) match? Yes, they do. All of the attributes of each row are pasted together (with one copy of the matching pk/fk pair). You can see this in the first row of the SQL join example above. First row of Orders and third row of Customers? No match, so no resulting row. • Next, look at the second row of Orders and the first row of Customers. No match, no resulting row. Same with second row of Orders to second row of Customers. Second row of Orders to third row of Customers? Yes; these are pasted together and included in the result. • Repeat this for the third and fourth rows of Customers and all of the rows of Orders. Two matches are found, which results in two pasted rows in the result. Notice that none of the Orders rows have matched the first row of Customers. • If you have taken even the most basic course in algorithms, you will realize that SQL doesn’t really implement the join in the way that we’ve just described it. However, this mental model will help you to follow the RA definition (below), with the inside workings of the database software left to a more advanced course or tutorial.

RA syntax The RA join of two relations, r over scheme R and s over scheme S, is written r s, or in our example, customers orders. The scheme of the result, exactly as you have seen in the SQL

syntax, is the union of the two relation schemes, R∪S. The join attributes are found in the intersection of the two schemes, R∩S. Clearly, the intersection attributes must inherit the same assignment rule from R and S; this makes the two schemes compatible. • The result of the RA join consists of the pairwise paste of all tuples from the two relations, written paste(t,u) for any tuple t from relation r over scheme R and u from relation s over scheme S. The result of the paste operation is exactly as explained in the preceeding section. • In the very, very rare case where there is no intersection between schemes R and S (that is, R∩S = {null}), the schemes are still compatible and every tuple from relation r is pasted to every tuple from relation s, with all of the attributes from both schemes contained in the resulting tuples. In set theory, this is a Cartesian product of the two relations; in practice, it is almost always nonsense and not what you want. The Cartesian product can also be written in RA as r × s, or intentionally specified in SQL with the CROSS JOIN keyword. • You might want to visualize the three possible results of the paste operation using the graphic representation of schemes and tuples that we presented in an earlier page.

Other views of this diagram: Large image - Description (text)

SQL technique: multiple joins and the distinct keyword It is important to realize that if you have a properly designed and linked database, you can retrieve information from as many tables as you want, specify retrieval conditions based on any data in the tables, and show the results in any order that you like.

Example: We’ll use the order entry model. We’d like a list of all the products that have been purchased by a specific customer. • Unless you can memorize the exact spelling of every attribute in the database, along with all the PKs and FKs (I can’t), you should keep a copy of the relation scheme diagram handy when you build the query. We’ll show it again here, circling the attributes that we need for this query.

Other views of this diagram: Large image - Description (text) • Our query is still a simple one that could be expressed in RA using only the select, project, and join operators: πcLastName, cFirstName, prodName σcFirstName='Alvaro' and cLastName='Monge' (customers orders orderlines products) • The information that we need is found in the Products table. You could think of this as “output” from the query that will be part of the SELECT clause attribute list. The retrieval condition, or “input” to the query, is based on the Customers table—these attributes will be needed for the WHERE clause. You should include these attributes in the SELECT list, so that you are sure that your query is showing the data that you want. • We will also need to include the intervening tables in the FROM clause of the query, since this is the only way to correctly associate the Customers data with the Products data. (You can’t join Customers to Products, because they don’t have any common attributes.) • Another way to think about this is to simply “follow the PK-FK pairs” from table to table until you have completely linked all of the information you need. If you look carefully at the relation scheme for this query, you will realize that we could have bypassed the Orders table (since the custID is also part of the OrderLines scheme). If there were an orderID, or if we needed data from the Orders table, it would have to be included, so we’ll keep it here for

illustration. In SQL, we’ll build the query using the same step-by-step procedure that you have seen before. 1. Look at the result set (all of the linked data). We won’t show the result here because of space limitations on the web page. SELECT * FROM customers NATURAL JOIN orders NATURAL JOIN orderlines NATURAL JOIN products;

• Your database system might not support the NATURAL JOIN syntax that we show here. We’ll discuss this issue further when we look at join types. The multiple natural joins in our example work correctly because there are no non-pk/fk attributes in any of our tables that have the same name. In larger, more complicated databases, this might not be true. 2. Pick the rows you want, and be sure that all of the information makes sense and is really what you are looking for. (This one is still too wide for the page.) SELECT * FROM customers NATURAL JOIN orders NATURAL JOIN orderlines NATURAL JOIN products WHERE cFirstName = 'Alvaro' AND cLastName = 'Monge';

3. Now pick the columns that you want, and again check the results. Notice that we are including the retreival condition attributes in the SELECT clause, to be sure that this really is the right answer. SELECT cFirstName, cLastName, prodName FROM customers NATURAL JOIN orders NATURAL JOIN orderlines NATURAL JOIN products WHERE cFirstName = 'Alvaro' AND cLastName = 'Monge';

Products purchased cFirst,ame cLast,ame prod,ame Alvaro Monge Hammer, framing, 20 oz. Alvaro Monge Hammer, framing, 20 oz. Alvaro Monge Screwdriver, Phillips #2, 6 inch Alvaro Monge Screwdriver, Phillips #2, 6 inch Alvaro Monge Pliers, needle-nose, 4 inch

The distinct keyword Oops! We only wanted a list of the individual product names that this customer has purchased, but some of them are listed more than once. What went wrong?

• If the RA version of our query could have actually been executed, each row of the result table above would be distinct—remember that a relation is a set of tuples, and sets can’t have duplicates—and there would of course be fewer rows than you see here. • SQL doesn’t work the same way. The reason for the duplicates is that the SELECT clause simply eliminated the unwanted columns from the result set; it left all of the rows that were picked by the WHERE clause. • The real problem in SQL is that the SELECT attribute list is not a super key for the result set. Look again very carefully at the relation scheme to understand why this is true. Any time that this happens, we can eliminate the duplicate rows by including the DISTI,CT keyword in the SELECT clause. While making this revision, we’ll also list the product names in alphabetical order. SELECT DISTINCT cFirstName, cLastName, prodName FROM customers NATURAL JOIN orders NATURAL JOIN orderlines NATURAL JOIN products WHERE cFirstName = 'Alvaro' AND cLastName = 'Monge' ORDER BY prodName;

Distinct products prod,ame cFirst,ame cLast,ame Alvaro Monge Hammer, framing, 20 oz. Alvaro Monge Pliers, needle-nose, 4 inch Alvaro Monge Screwdriver, Phillips #2, 6 inch • A sloppy way to be sure that you never have duplicate rows would be to always use the DISTINCT keyword. Please don’t do this—it just keeps you from understanding what is really going on in the query. If the SELECT attribute list does form a super key of the FROM clause (result set), the DISTINCT keyword is not needed, and should not be used.

SQL technique: join types Inner join All of the joins that you have seen so far have used the natural join syntax—for example, to produce a list of customers and dates on which they placed orders. Remember that if this syntax is available, it will automatically pick the join attributes as those with the same name in both tables (intersection of the schemes). It will also produce only one copy of those attributes in the result table. SELECT cFirstName, cLastName, orderDate FROM customers NATURAL JOIN orders;

• The join does not consider the pk and fk attributes you have specified. If there are any nonpk/fk attributes that have the same names in the tables to be joined, they will also be included in the intersection of the schemes, and used as join attributes in the natural join. The results will certainly not be correct! This problem might be especially difficult to detect in cases

where many natural joins are performed in the same query. Fortunately, you can always specify the join attributes yourself, as we describe next. • Another keyword that produces the same results (without the potential attribute name problem) is the inner join. With this syntax, you may specify the join attributes in a USING clause. (Multiple join attributes in the USING clause are separated by commas.) This also produces only one copy of the join attributes in the result table. Like the NATURAL JOIN syntax, the USING clause is not supported by all systems. SELECT cFirstName, cLastName, orderDate FROM customers INNER JOIN orders USING (custID);

• The most widely-used (and most portable) syntax for the inner join substitutes an ON clause for the USING clause. This requires you to explicitly specify not only the join attribute names, but the join condition (normally equality). It also requires you to preface (qualify) the join attribute names with their table name, since both columns will be included in the result table. This is the only syntax that will let you use join attributes that have different names in the tables to be joined. Unfortunately, it also allows you to join tables on attributes other than the pk/fk pairs, which was a pre-1992 way to answer queries that can be written in better ways today. SELECT cFirstName, cLastName, orderDate FROM customers INNER JOIN orders ON customers.custID = orders.custID;

• You can save a bit of typing by specifying an alias for each table name (such as c and o in this example), then using the alias instead of the full name when you refer to the attributes. This is the only syntax that will let you join a table to itself, as we will see when we discuss recursive relationships. SELECT cFirstName, cLastName, orderDate FROM customers c INNER JOIN orders o ON c.custID = o.custID;

• All of the join statements above are specified as part of the 1992 SQL standard, which was not widely supported for several years after that. In earlier systems, joins were done with the 1986 standard SQL syntax. Although you shouldn’t use this unless you absolutely have to, you just might get stuck working on an older database. If so, you should recognize that the join condition is placed confusingly in the WHERE clause, along with all of the tests to pick the right rows: SELECT cFirstName, cLastName, orderDate FROM customers c, orders o WHERE c.custID = o.custID;

Outer join One important effect of all natural and inner joins is that any unmatched PK value simply drops out of the result. In our example, this means that any customer who didn’t place an order isn’t shown. Suppose that we want a list of all customers, along with order date(s) for

those who did place orders. To include the customers who did not place orders, we will use an outer join, which may take either the USING or the ON clause syntax. SELECT cFirstName, cLastName, orderDate FROM customers c LEFT OUTER JOIN orders o ON c.custID = o.custID;

All customers and order dates cfirstname clastname orderdate Tom Jewett Alvaro Monge 2003-07-14 Alvaro Monge 2003-07-18 Alvaro Monge 2003-07-20 Wayne Dick 2003-07-14 • Notice that for customers who placed no orders, any attributes from the Orders table are simply filled with NULL values. • The word “left” refers to the order of the tables in the FROM clause (customers on the left, orders on the right). The left table here is the one that might have unmatched join attributes— the one from which we want all rows. We could have gotten exactly the same results if the table names and outer join direction were reversed: SELECT cFirstName, cLastName, orderDate FROM orders o RIGHT OUTER JOIN customers c ON o.custID = c.custID;

• An outer join makes sense only if one side of the relationship has a minimum cardinality of zero (as Orders does in this example). Otherwise, the outer join will produce exactly the same result as an inner join (for example, between Orders and OrderLines). • The SQL standard also allows a FULL outer join, in which unmatched join attributes from either side are paired with null values on the other side. You will probably not have to use this with most well-designed databases.

Evaluation order Multiple joins in a query are evaluated left-to-right in the order that you write them, unless you use parentheses to force a different evaluation order. (Some database systems require parentheses in any case.) The schemes of the joins are also cumulative in the order that they are evaluated; in RA, this means that r1 r2 r3 = (r1 r2) r3. • It is especially important to remember this rule when outer joins are mixed with other joins in a query. For example, if you write: SELECT cFirstName, cLastName, orderDate, UPC, quantity FROM customers LEFT OUTER JOIN orders USING (custID) NATURAL JOIN orderlines;

you will lose the customers who haven’t placed orders. They will be retained if you force the second join to be executed first: SELECT cFirstName, cLastName, orderDate, UPC, quantity FROM customers LEFT OUTER JOIN (orders NATURAL JOIN orderlines) USING (custID);

Other join types For sake of completeness, you should also know that if you try to join two tables with no join condition, the result will be that every row from one side is paired with every row from the other side. Mathematically, this is a Cartesian product of the two tables, as you have seen before. It is almost never what you want. In pre-1992 syntax, it is easy to do this accidently, by forgetting to put the join condition in the WHERE clause: SELECT cFirstName, cLastName, orderDate FROM customers, orders;

• If your system is backward-compatible (most are), you might actually try this just to prove to yourself that the result is pure nonsense. However, if you ever have an occasion to really need a Cartesian product of two tables, use the new cross join syntax to prove that you really mean it. Notice that this example still produces nonsense. SELECT cFirstName, cLastName, orderDate FROM customers CROSS JOIN orders;

• It is possible, but confusing, to specify a join condition other than equality of two attributes; this is called a non-equi-join. If you see such a thing in older code, it probably represents a WHERE clause or subquery in disguise. • You may also hear the term self join, which is nothing but an inner or outer join between two attributes in the same table. We’ll look at these when we discuss recursive relationships.

SQL technique: functions Sometimes, the information that we need is not actually stored in the database, but has to be computed in some way from the stored data. In our order entry example, there are two derived attributes (/subtotal in OrderLines and /total in Orders) that are part of the class diagram but not part of the relation scheme. We can compute these by using SQL functions in the SELECT statement. There are many, many functions in any implementation of SQL—far more than we can show here. Unfortunately, many of the functions are defined quite differently in different database packages, so you should always consult a reference manual for your specific software.

Computed columns

We can compute values from information that is in a table simply by showing the computation in the SELECT clause. Each computation creates a new column in the output table, just as if it were a named attribute. Example: We want to find the subtotal for each line of the OrderLines table, just as shown in the UML class diagram. Obviously, the total of each line is simply the unit sale price times the quantity ordered, so we don’t even need a function yet—just the computation. We have included all three of the OrderLines PK attributes in the SELECT clause attribute list, to be sure that we show the subtotal for each distinct line. SELECT custID, orderDate, UPC, unitSalePrice * quantity FROM orderlines;

Order line subtotals custid orderdate upc unitsaleprice * quantity 5678 2003-07-14 51820 33622 11.95 9012 2003-07-14 51820 33622 23.90 9012 2003-07-14 11373 24793 21.25 5678 2003-07-18 81809 73555 18.00 5678 2003-07-20 51820 33622 23.90 5678 2003-07-20 81809 73555 9.00 5678 2003-07-20 81810 63591 24.75 Computations are not limited just to column names; they may also include constants. For example, unitsaleprice * 1.06 might be used to find the sale price plus sales tax. Notice that the computation itself is shown as the heading for the computed column. This is awkward to read, and doesn’t really tell us what the column means. We can create our own column heading or alias using the AS keyword as shown below. (In fact, we could simply write the new name of the column without saying AS. Please don’t do this—it hurts readability of your code.) If your want your column alias to have spaces in it, you will have to enclose it in double quote marks. SELECT custID, orderDate, UPC, unitSalePrice * quantity AS subtotal FROM orderlines;

Order line subtotals custid orderdate upc subtotal 5678 2003-07-14 51820 33622 11.95 9012 2003-07-14 51820 33622 23.90 9012 2003-07-14 11373 24793 21.25 5678 2003-07-18 81809 73555 18.00 5678 2003-07-20 51820 33622 23.90 5678 2003-07-20 81809 73555 9.00 5678 2003-07-20 81810 63591 24.75

Aggregate functions

SQL aggregate functions let us compute values based on multiple rows in our tables. They are also used as part of the SELECT clause, and also create new columns in the output. Example: First, let’s just find the total amount of all our sales. To compute this, all we need is to do is to add up all of the price-times-quantity computations from every line of the OrderLines. We will use the SUM function to do the calculation. The output table, as you should expect, will contain only one row. SELECT SUM(unitSalePrice * quantity) AS totalsales FROM orderlines;

Sales totalsales 132.75 Next, we’ll compute the total for each order (the derived attribute shown in the UML Order class). We still need to add up order lines, but we need to group the totals for each order. We can do this with the GROUP BY clause. This time, the output will contain one row for every order, since the customerID and orderDate form the PK for Orders, not OrderLines. Notice that the SELECT clause and the GROUP BY clause contain exactly the same list of attributes, except for the calculation. In most cases, you will get an error message if you forget to do this. SELECT custID, orderDate, SUM(unitSalePrice * quantity) AS total FROM orderlines GROUP BY custID, orderDate;

Order totals custid orderdate total 5678 2003-07-14 11.95 5678 2003-07-18 18.00 5678 2003-07-20 57.65 9012 2003-07-14 45.15 Other frequently-used functions that work the same way as SUM include MIN (minimum value of those in the grouping), MAX (maximum value of those in the grouping, and AVG (average value of those in the grouping). The COUNT function is slightly different, since it returns the number of rows in a grouping. To count all rows, we can use the * (for example, to find out how many orders were placed). SELECT COUNT(*) FROM orders;

Orders COU,T(*) 4 We can also count groups of rows with identical values in a column. In this case, COUNT will ignore NULL values in the column. Here, we’ll find out how many times each product has been ordered.

SELECT prodname AS "product name", COUNT(prodname) AS "times ordered" FROM products NATURAL JOIN orderlines GROUP BY prodname;

Product orders product name times ordered Hammer, framing, 20 oz. 3 Pliers, needle-nose, 4 inch 1 Saw, crosscut, 10 tpi 1 Screwdriver, Phillips #2, 6 inch 2 A WHERE clause can be used as usual before the GROUP BY, to eliminate rows before the group function is executed. However, if we want to select output rows based on the results of the group function, the HAVI,G clause is used instead. For example, we could ask for only those products that have been sold more than once: SELECT prodname AS "product name", COUNT(prodname) AS "times ordered" FROM products NATURAL JOIN orderlines GROUP BY prodname HAVING COUNT(prodname) > 1;

Other functions Most database systems offer a wide variety of functions that deal with formatting and other miscellaneous tasks. These functions tend to be proprietary, differing widely from system to system in both availability and syntax. Most are used in the SELECT clause, although some might appear in a WHERE clause expression or an INSERT or UPDATE statement. Typical functions include: • Functions for rounding, truncating, converting, and formatting numeric data types. • Functions for concatenating, altering case, and manipulating character data types. • Functions for formatting dates and times, or retrieving the date and time from the operating system. • Functions for converting data types such as date or numeric to character string, and viceversa. • Functions for supplying visible values to null attributes, allowing conditional output, and other miscellaneous tasks.

SQL technique: subqueries Sometimes you don’t have enough information available when you design a query to determine which rows you want. In this case, you’ll have to find the required information with a subquery.

Example: Find the name of customers who live in the same zip code area as Wayne Dick. We might start writing this query as we would any of the ones that we have already done: SELECT cFirstName, cLastName FROM customers WHERE cZipCode = ???

• Oops! We don’t know what zip code to put in the WHERE clause. No, we can’t look it up manually and type it into the query—we have to find the answer based only on the information that we have been given! • Fortunately, we also know how to find the right zip code by writing another query: SELECT cZipCode FROM Customers WHERE cFirstName = 'Wayne' AND cLastName = 'Dick';

Zip code czipcode 90840 • Notice that this query returns a single column and a single row. We can use the result as the condition value for cZipCode in our original query. In effect, the output of the second query becomes input to the first one, which we can illustrate with their relation schemes:

• Syntactically, all we have to do is to enclose the subquery in parenthses, in the same place where we would normally use a constant in the WHERE clause. We’ll include the zip code in the SELECT line to verify that the answer is what we want: SELECT cFirstName, cLastName, cZipCode FROM customers WHERE cZipCode = (SELECT cZipCode FROM customers WHERE cFirstName = 'Wayne' AND cLastName = 'Dick');

Customers cfirstname clastname czipcode Alvaro Monge 90840 Wayne Dick 90840 A subquery that returns only one column and one row can be used any time that we need a single value. Another example would be to find the product name and sale price of all products whose unit sale price is greater than the average of all products. We can see that the

DISTINCT keyword is needed, since the SELECT attributes are not a super key of the result set: SELECT DISTINCT prodName, unitSalePrice FROM Products NATURAL JOIN OrderLines WHERE unitSalePrice > the average unit sale price

• Again, we already know how to write another query that finds the average: SELECT AVG(unitSalePrice) FROM OrderLines;

Average AVG(unitsaleprice) 10.621428 • We can visualize the combined queries with their relation schemes, and write the full syntax as before:

SELECT DISTINCT prodName, unitSalePrice FROM Products NATURAL JOIN OrderLines WHERE unitSalePrice > (SELECT AVG(unitSalePrice) FROM OrderLines);

Above average prodname unitsaleprice Hammer, framing, 20 oz. 11.95 Saw, crosscut, 10 tpi 21.25 Subqueries can also be used when we need more than a single value as part of a larger query. We’ll see examples of these in later pages.

SQL technique: union and minus Set operations on tables

Some students initially think of the join as being a sort of union between two tables. It’s not (except for the schemes). The join pairs up data from two very different tables. In RA and SQL, union can operate only on two identical tables. Remember the Venn-diagram representation of the union and minus operations on sets. Union includes members of either or both sets (with no duplicates). Minus includes only those members of the set on the left side of the expression that are not contained in the set on the right side of the expression.

• Both sets, R and S, have to contain objects of the same type. You can’t union or minus sets of integers with sets of characters, for example. All sets, by definition, are unordered and cannot contain duplicate elements. • SQL and RA set operations treat tables as sets of rows. Therefore, the tables on both sides of the union or minus operator have to have at least the same number of attributes, with corresponding attributes being of the same data type. It’s usually cleaner and more readable if you just go ahead and give them the same name using the AS keyword.

Union For this example, we will add a Suppliers table to our sales data entry model. “A supplier is a company from which we purchase products that we will re-sell.” Each supplier suppliers zero to many products; each product is supplied by one and only one supplier. The supplier class attributes include the company name and address, plus the name and phone number of the representative from whom we make our purchases. • We would like to produce a listing that shows the names and phone numbers of all people we deal with, whether they are suppliers or customers. We need rows from both tables, but they have to have the same attribute list. Looking at the relation scheme, we find corresponding first name, last name, and phone number attributes, but we still need to show what company each of the supplier representatives works for.

• We can create an extra column in the query output for the Customers table by simply giving it a name and filling it with a constant value. Here, we’ll use the value 'Customer' to distinguish these rows from supplier representatives. SQL uses the column names of the first part of the union query as the column names for the output, so we will give each of them aliases that are appropriate for the entire set of data.

• Build and test each component of the union query individually, then put them together. The ORDER BY clause has to come at the end. SELECT cLastName AS "Last Name", cFirstName AS "First Name", cPhone as "Phone", 'Customer' AS "Company" FROM customers UNION SELECT repLName, repFName, repPhone, sCompanyName FROM suppliers ORDER BY "Last Name";

Phone list Phone Company Last ,ame First ,ame Bradley Jerry 888-736-8000 Industrial Tool Supply Dick Wayne 562-777-3030 Customer Jewett Tom 714-555-1212 Customer Monge Alvaro 562-333-4141 Customer O'Brien Tom 949-567-2312 Bosch Machine Tools

Minus Sometimes you have to think about both what you do want and what you don’t want in the results of a query. If there is a WHERE clause predicate that completely partitions all rows of interest (the result set) into those you want and those you don’t want, then you have a simple query with a test for inequality. • The multiplicity of an association can help you determine how to build the query. Since each product has one and only one supplier, we can partition the Products table into those that are supplied by a given company and those that are not. SELECT prodName, sCompanyName FROM Products NATURAL JOIN Suppliers WHERE sCompanyName 'Industrial Tool Supply';

• Contrast this to finding customers who did not make purchases in 2002. Because of the optional one-to-many association between Customers and Orders, there are actually four possibilities: 1. A customer made purchases in 2002 (only). 2. A customer made purchases in other years, but not in 2002. 3. A customer made purchases both in other years and in 2002. 4. A customer made no purchases in any year. • If you try to write this as a simple test for inequality, SELECT DISTINCT cLastName, cFirstName, cStreet, cZipCode FROM Customers NATURAL JOIN Orders WHERE TO_CHAR(orderDate, 'YYYY') '2002';

you will correctly exclude group 1 and include group 2, but falsely include group 3 and falsely exclude group 4. Please take time to re-read this statement and convince yourself why it is true! • We can show in set notation what we need to do: {customers who did not make purchases in 2002} = {all customers} − {those who did} There are two ways to write this in SQL. • The easiest syntax in this case is to compare only the customer IDs. We’ll use the NOT IN set operator in the WHERE clause, along with a subquery to find the customer ID of those who did made purchases in 2002. SELECT cLastName, cFirstName, cStreet, cZipCode FROM Customers WHERE custID NOT IN (SELECT custID FROM Orders WHERE TO_CHAR(orderDate, 'YYYY') = '2002');

• We can also use the MINUS operator to subtract rows we don’t want from all rows in Customers. (Some versions of SQL use the keyword EXCEPT instead of MINUS.) Like the UNION, this requires the schemes of the two tables to match exactly in number and type of attributes. SELECT cLastName, cFirstName, cStreet, cZipCode FROM Customers MINUS SELECT cLastName, cFirstName, cStreet, cZipCode FROM Customers NATURAL JOIN Orders WHERE TO_CHAR(orderDate, 'YYYY') = '2002';

Other set operations SQL has two additional set operators. UNION ALL works like UNION, except it keeps duplicate rows in the result. INTERSECT operates just like you would expect from set theory; again, the schemes of the two tables must match exactly.

SQL technique: views and indexes A view is simply any SELECT query that has been given a name and saved in the database. For this reason, a view is sometimes called a named query or a stored query. To create a view, you use the SQL syntax: CREATE OR REPLACE VIEW AS SELECT ;

• The view query itself is saved in the database, but it is not actually run until it is called with another SELECT statement. For this reason, the view does not take up any disk space for data

storage, and it does not create any redundant copies of data that is already stored in the tables that it references (which are sometimes called the base tables of the view). • Although it is not required, many database developers identify views with names such as v_Customers or Customers_view. This not only avoids name conflicts with base tables, it helps in reading any query that uses a view. • The keywords OR REPLACE in the syntax shown above are optional. Although you don’t need to use them the first time that you create a view, including them will overwrite an older version of the view with your latest one, without giving you an error message. • The syntax to remove a view from your schema is exactly what you would expect: DROP VIEW ;

Using views A view name may be used in exactly the same way as a table name in any SELECT query. Once stored, the view can be used again and again, rather than re-writing the same query many times. • The most basic use of a view would be to simply SELECT * from it, but it also might represent a pre-written subquery or a simplified way to write part of a FROM clause. • In many systems, views are stored in a pre-compiled form. This might save some execution time for the query, but usually not enough for a human user to notice. • One of the most important uses of views is in large multi-user systems, where they make it easy to control access to data for different types of users. As a very simple example, suppose that you have a table of employee information on the scheme Employees = {employeeID, empFName, empLName, empPhone, jobTitle, payRate, managerID}. Obviously, you can’t let everyone in the company look at all of this information, let alone make changes to it. • Your database administrator (DBA) can define roles to represent different groups of users, and then grant membership in one or more roles to any specific user account (schema). In turn, you can grant table-level or view-level permissions to a role as well as to a specific user. Suppose that the DBA has created the roles managers and payroll for people who occupy those positions. In Oracle®, there is also a pre-defined role named public, which means every user of the database. • You could create separate views even on just the Employees table, and control access to them like this: CREATE VIEW phone_view AS SELECT empFName, empLName, empPhone FROM Employees; GRANT SELECT ON phone_view TO public; CREATE VIEW job_view AS SELECT employeeID, empFName, empLName, jobTitle, managerID FROM Employees; GRANT SELECT, UPDATE ON job_view TO managers;

CREATE VIEW pay_view AS SELECT employeeID, empFName, empLName, payRate FROM Employees; GRANT SELECT, UPDATE ON pay_view TO payroll;

• Only a very few trusted people would have SELECT, UPDATE, INSERT, and DELETE privileges on the entire Employees base table; everyone else would now have exactly the access that they need, but no more. • When a view is the target of an UPDATE statement, the base table value is changed. You can’t change a computed value in a view, or any value in a view that is based on a UNION query. You may also use a view as the target of an INSERT or DELETE statement, subject to any integrity constraints that have been placed on the base tables.

Materialized views Sometimes, the execution speed of a query is so important that a developer is willing to trade increased disk space use for faster response, by creating a materialized view. Unlike the view discussed above, a materialized view does create and store the result table in advance, filled with data. The scheme of this table is given by the SELECT clause of the view definition. • This technique is most useful when the query involves many joins of large tables, or any other SQL feature that could contribute to long execution times. You might encounter this in a Web project, where the site visitor simply can’t be kept waiting while the query runs. • Since the view would be useless if it is out of date, it must be re-run, at the minimum, when there is a change to any of the tables that it is based on. The SQL syntax to create a materialized view includes many options for when it is first to be run, how often it is to be rerun, and so on. This requires an advanced reference manual for your specific system, and is beyond the scope of this tutorial.

Indexes An index, as you would expect, is a data structure that the database uses to find records within a table more quickly. Indexes are built on one or more columns of a table; each index maintains a list of values within that field that are sorted in ascending or descending order. Rather than sorting records on the field or fields during query execution, the system can simply access the rows in order of the index. Unique and non-unique indexes: When you create an index, you may allow the indexed columns to contain duplicate values; the index will still list all of the rows with duplicates. You may also specify that values in the indexed columns must be unique, just as they must be with a primary key. In fact, when you create a primary key constraint on a table, Oracle and most other systems will automatically create a unique index on the primary key columns, as well as not allowing null values in those columns. One good reason for you to create a unique index on non-primary key fields is to enforce the integrity of a candidate key, which otherwise might end up having (nonsense) duplicate values in different rows.

Queries versus insertion/update: It might seem as if you should create an index on every column or group of columns that will ever by used in an ORDER BY clause (for example: lastName, firstName). However, each index will have to be updated every time that a row is inserted or a value in that column is updated. Although index structures such as B or B+ trees allow this to happen very quickly, there still might be circumstances where too many indexes would detract from overall system performance. This and similar issues are often covered in more advanced courses. Syntax: As you would expect by now, the SQL to create an index is: CREATE INDEX ON (, ...);

To enforce unique values, add the UNIQUE keyword: CREATE UNIQUE INDEX ON (, ...);

To specify sort order, add the keyword ASC or DESC after each column name, just as you would do in an ORDER BY clause. To remove an index, simply enter: DROP INDEX ;

Glossary [ Skip navigation ] Show: [ all keywords ] [ A–C ] [ D–H ] [ I–N ] [ O–R ] [ S–Z ]

keyword aggregate function

aggregation

alias ALTER TABLE AS assignment rule association association class

Glossary definition (SQL) Function that operates on a group of rows, for example, SUM. (UML) An association in which one class represents an assembly of components from one or more other class types. Components may also exist without being part of the assembly. (SQL) An alternate, short name for a table in the FROM clause of a SELECT statement. (SQL) Statement to change structure, constraints, or other properties of a table. (SQL) Keyword to specify a column alias in the SELECT clause. (RM) The association of each attribute in a scheme with its domain. (UML) The way that two classes are functionally connected to each other. (UML) A class that contains attributes which are the

page_link functions.php

aggregate.php

jointypes.php tables.php functions.php class.php association.php manymany.php

keyword

Glossary definition properties of an association rather than a regular class.

association, manySee many-to-many association. to-many association, one-toSee one-to-many association. many association, See recursive association. recursive (UML,ER,RM) One piece of information that attribute characterizes each member of a class. attribute, derived See derived attribute. attribute, descriptive See descriptive attribute. attribute, See discriminator attribute. discriminator attribute, See multivalued attribute. multivalued attribute, repeated See repeated attribute. base table (SQL) A table that is referenced by a view definition. (RM) Boyce-Codd normal form: a database with no BCNF subkey in any relation (with no exceptions). (RM) A minimal super key, candidate to become candidate key (CK) primary key. cardinality (ER) See multiplicity. (RA) The result of the join of two relations with no join Cartesian product attributes specified, as defined in set theory. See also cross join. (RM,TM) The relation on the "many" (FK) side of a child one-to-many association. (UML) Any "thing" in the enterprise that is to be class represented in the database. column (TM) See attribute. COMMIT (SQL) Statement to make changes to data permanent. (RA) Two schemes are compatible if their intersection compatible is null or if the intersection attributes inherit the same assignment rule from their respective schemes. complete (UML) All members of a superclass must also be specialization members of at least one subclass. (UML) Stronger form of aggregation in which composition components cannot exist without being part of the assembly. (RM,TM) Any restriction on the values that can be constraint entered in a table. CREATE TABLE (SQL) Statement to do just what the name implies.

page_link manymany.php association.php recursive.php class.php manymany.php class.php loan.php hobbies.php phone.php views.php subkeys.php keys.php association.php join.php association.php class.php tables.php ddldml.php join.php subclass.php aggregate.php tables.php class.php

keyword cross join data definition language (DDL) data dictionary data integrity data manipulation language (DML)

Glossary definition (SQL) Paste of every pair of tuples from each relation, disregarding join attributes. See also Cartesian product. (SQL) Statements to create and modify tables and other database objects. (SQL) System tables that hold information about the structure of the database. (TM) In part, the value entered in each field of a table is consistent with its attribute domain. (SQL) Statements to work with data in a table.

(SQL) As in programming languages, the data type that data types an attribute can hold in a table. This is not the same as the attribute's domain. (SQL) Statement to remove some or all rows from a DELETE table. (RM) Intentionally "breaking the rules" of normal denormalization forms. (UML) An attribute that can be computed from data derived attribute stored elsewhere in the database. (UML) An attribute that provides real-world descriptive attribute information, relevant to the enterprise, about the class that we are modeling. (UML) Modeling situations that you will encounter design pattern frequently in database design. (RM,TM) An attribute that allows us to discriminate discriminator between multiple pairings of the same two individuals attribute from each side of a many-to-many association. disjoint (UML) Each member of a superclass may be a member specialization of no more than one subclass (SQL) Optional clause of the SELECT statement. Use DISTINCT when the SELECT attributes do not form a super key of the FROM clause. (RM) The set of legal values that may be assigned to an domain attribute. domain, enumerated See enumerated domain. DROP (SQL) Optional clause of the ALTER TABLE CONSTRAINT statement. DROP TABLE (SQL) Statement to delete a table and all of its contents. entity (ER) See class. entity-relationship An enterprise modeling tool used in database design. (ER) model (RM) A domain that may be specified by a wellenumerated domain defined, reasonably-sized set of constant values.

page_link jointypes.php ddldml.php ddldml.php domains.php ddldml.php ddldml.php ddldml.php normalize.php manymany.php class.php manymany.php loan.php subclass.php multijoin.php class.php enum.php ddldml.php ddldml.php class.php models.html enum.php

keyword exclusive specialization external key FD, partial FD, transitive foreign key (FK) functional dependency (FD) generalization grant HAVING incomplete specialization index inner join INSERT INTO join (RA)

join (SQL) join attributes join, cross join, natural join, non-equijoin, outer join, self junction table key key, candidate key, external

Glossary definition (UML) See disjoint specialization. (UML,RM) A surrogate or substitute key that has been defined by an external organization. May be treated as a descriptive attribute in your model. See partial FD. See transitive FD. (RM,TM) A set of attributes that is copied from the PK of a parent table into the scheme of a child table. (RM) Formal definition of the super key property. (UML) (noun) A superclass. (verb) The process of designing superclasses from "bottom up." (SQL) Statement to assign privileges to a user. (SQL) Optional clause that selects aggregated group information in a SELECT statement. (UML) Some members of a superclass might not be members of any subclass. (SQL) A data structure that the database uses to find records within a table more quickly. (SQL) Join of two tables with join attributes specified by the programmer. (SQL) Statement to add a row of data to a table. (RA) Formal definition on which the SQL join is based, the pairwise paste of tuples from the relations being joined. Please see full definition on the page. (SQL) Operation that links two tables, specified in the FROM clause of a SELECT statement. Please see full definition on the page. Attributes whose values are matched during the join operation. See cross join. See natural join. See non-equi-join. See outer join. See self join. (RM,TM) The table created to hold the linking attributes (FKs) from both sides of a many-to-many association. See super key. See candidate key. See external key.

page_link subclass.php keys.php subkeys.php subkeys.php association.php subkeys.php subclass.php ddldml.php functions.php subclass.php views.php jointypes.php tables.php join.php

join.php join.php jointypes.php jointypes.php jointypes.php jointypes.php jointypes.php manymany.php tables.php keys.php keys.php

keyword key, foreign key, primary key, sub key, substitute key, super key, surrogate

Glossary definition

See foreign key. See primary key. See subkey. See substitute key. See super key. See surrogate key. Decomposition (breaking apart) of a relation scheme to lossless join eliminate a subkey. The original scheme’s information decomposition can be recreated with the join in a query. many-to-many (UML) An association with maximum multiplicity of association *..* (SQL) A view that creates and stores a result table in materialized view advance of being used. (RA,SQL) Operation as defined in set theory, returning MINUS the difference of the tuples/rows of two relations/tables over the same scheme. (UML) The minimum and maximum number of multiplicity individuals of one class that may be associated with a single member of another class. (UML) An attribute that contains more than one value multivalued attribute from its domain. This is a design error. named query (SQL) A view. (SQL) Join of two tables with the intersection of their natural join schemes used as join attributes. (SQL) Join based on some condition other than equality non-equi-join of the join attributes. normal form, third See third normal form. (RM) A progression of rules for well-structured normal forms database design. (RM) Following a set of rules to insure that a database normalization is well designed. See also normal forms. (SQL) A special constant value, compatible with any NULL data type, that means "this field doesn't have any value assigned to it." (UML) The instantiation of a class in OO programming object languages. one-to-many (UML) An association with maximum multiplicity of association 1..* (SQL) Join of two tables that retains unmatched join outer join attributes from one or both sides. overlapping (UML) Any member of a superclass may be a member specialization of more than one subclass.

page_link association.php tables.php subkeys.php keys.php tables.php keys.php subkeys.php manymany.php views.php setops.php

association.php hobbies.php views.php jointypes.php jointypes.php subkeys.php normalize.php subkeys.php phone.php class.php association.php jointypes.php subclass.php

Glossary keyword definition (RM,TM) The relation on the "one" (PK) side of a oneparent to-many association. (RM) A subkey that is part of the primary key of a partial FD relation. partial specialization (UML) See incomplete specialization. (RA) Tuple operation on which the RA join is based. paste Please see full definition on the page. (RA) Conditional statement used in the RA select predicate operstion. (RM,TM) The SK set of attributes that the designer primary key (PK) picks as the unique identifier for a database table. (RA) Unary operator that picks attributes from a project (in RA) relation. property (UML) See attribute. A request to retrieve data from the database. See query SELECT (SQL). query, stored See stored query. query, sub See subquery. (UML) An association between a single class type (in recursive association one role) and itself (in another role). (RM,TM) A constraint on a table that does not allow an referential integrity FK value to be entered without a matching PK value. relation (RM) A set of tuples over the same scheme. relational algebra The formal language used to sympolically manipulate (RA) objects of the RM. relational model The formal mathematical model of a relational (RM) database. relationship (ER) See association. (UML) An attribute that occurs more than once in the repeated attribute same class definition; it may also have attributes of its own. This is a design error. (SQL) The intermediate table that results from result set execution of the FROM clause of a SELECT statement. (SQL) A name for a group of database users who have role been granted the same privileges. (SQL) Statement to discard changes that have not been ROLLBACK made permanent. (TM) The information about one individual in a row database table. See also tuple. scheme (RM) A set of attributes, with an assignment rule. select (in RA) (RA) Unary operator that picks tuples from a relation. SELECT (in SQL) (SQL) Statement used to retrieve data from a table.

page_link association.php subkeys.php subclass.php join.php queries.php tables.php queries.php class.php queries.php views.php subqueries.php recursive.php association.php tables.php models.html models.html association.php phone.php queries.php views.php ddldml.php tables.php class.php queries.php queries.php

keyword self join specialization specialization constraints specialization, complete specialization, disjoint specialization, exclusive specialization, incomplete specialization, overlapping specialization, partial stereotype stored query structured query language (SQL) subclass subkey subquery subscheme substitute PK SUM super key (SK) superclass surrogate PK

Glossary definition (SQL) Join of one table to itself. (UML) (noun) A subclass. (verb) The process of designing subclasses from "top down." (UML) Description of subclass membership (incomplete versus complete, disjoint versus overlapping).

page_link jointypes.php subclass.php subclass.php

See complete specialization.

subclass.php

See disjoint specialization.

subclass.php

See disjoint specialization.

subclass.php

See incomplete specialization.

subclass.php

See overlapping specialization.

subclass.php

See incomplete specialization.

subclass.php

(UML) A designator before a class name that specifies the type of the class. (SQL) A view. The language used to build and manipulate relational databases. (UML) A class that inherits common attributes from a parent class, but contains unique attributes of its own. See also superclass and specialization. (RM) A set of attributes that is a super key for some, but not all, of the attributes in a relation. (SQL) A query (SELECT statement) that is embedded in another query. (RA) A subset of the attributes in a scheme, preserving the assignment rule. Used in the RA project operation. (RM) An artificial, somewhat meaningful, primary key made up by the database designer under certain limited conditions. See also surrogate PK. (SQL) Aggregate function to sum the values in a column. (RM,TM) Any set of attributes whose values, taken together, uniquely identify each row of a table. (UML) A class that contains attributes common to one or more child classes. See also subclass and generalization. (RM) An artificial, meaningless, primary key made up

enum.php views.php models.html subclass.php subkeys.php subqueries.php queries.php keys.php functions.php tables.php subclass.php keys.php

keyword

table table model (TM) third normal form (3NF) transitive FD tuple unified modeling language (UML) UNION UPDATE validation rule view view, materialized weak entity

Glossary definition by the database designer under certain limited conditions. See also substitute PK. (TM) A set of rows. See also relation. An informal set of terms for RM objects. (RM) A database with no subkey in any relation (with rare exceptions). See also normal forms. (RM) A subkey that is not part of the primary key of a relation. (RM) A function that assigns a constant value from the attribute domain to each attribute in a relation scheme. An enterprise modeling tool used in database design and software engineering. (RA,SQL) Operation as defined in set theory, returning the union of the tuples/rows of two relations/tables over the same scheme. (SQL) Statement to change existing data in a table. (TM) An algorithm or procedure to separate good from bad data. (SQL) Any SELECT query that has been given a name and saved in the database. See materialized view. (ER) An entity (UML class) type that can't exist without its parent entity type.

page_link

tables.php models.html subkeys.php subkeys.php tables.php models.html setops.php tables.php domains.php views.php views.php phone.php

Database Design With UML and SQL

Short Description

Description

Comments

We need your help!