DBMS Cheatsheet
CS631 Quick Reference. Author: Ramdas Rao
Data Storage
Overview of Physical Storage Media
In decreasing order of cost and performance / speed: • Cache: volatile; L1 cache operates at processor speed (e.g., if the processor runs at 3 GHz, the access time is about 1/3 ns) • Main Memory: access time is about 10 to 100 ns (about 300 times slower than cache); volatile • Flash Memory: read access time about 100 ns (comparable to main memory); writing is slower (4 to 10 µs) and only a limited number of erase cycles is supported; NOR flash; NAND flash uses page-at-a-time read/write; nonvolatile • Magnetic disk storage: non-volatile; about 10 ms access time (orders of magnitude slower than main memory) • Optical Storage: CD, DVD (Digital Versatile Disk); capacities of 700 MB to 17 GB; write-once, read-many (WORM); optical disk jukeboxes • Tape Storage: sequential access; mostly used for backup and archival; high capacity (40 GB to 300 GB); tape jukeboxes (libraries) hold hundreds of TB up to PB
Magnetic Disks • Platter - Tracks - Sectors - Blocks • A cylinder is the set of tracks one below the other on each platter • Concept of zones: the number of sectors in the outer zones is greater than in the inner zones (e.g., 1000 vs. 500 sectors) • Disk Controller Interfaces: ATA (PATA), IDE, SATA (Serial ATA), SCSI, Fibre Channel, FireWire, SAN (Storage Area Network - storage on the network made to appear as one large disk), NAS (Network Attached Storage - NFS or CIFS) • Performance measures of disks: access time, capacity, data-transfer rate, reliability • Access time: time from when a read or write is issued to the time when the data transfer begins • Access time = Seek time (arm positioning) + Latency (waiting for the sector to rotate under the head) • Average seek time = 1/2 of worst-case seek time = 4 to 10 ms • Average latency = 1/2 of the time for a full rotation = 4 to 10 ms • Average access time = 8 to 20 ms • Data-transfer rate = rate at which data can be transferred = 25 MB/s to 100 MB/s; the transfer rate on the inner tracks is significantly lower (about 30 MB/s) than on the outer tracks (since the number of sectors on an inner track is smaller than on an outer track) • Mean Time To Failure (MTTF): – For a single disk, it is about 57 to 136 years – If multiple disks are used, the effective MTTF drops significantly: with 1000 new disks, the MTTF of some disk in the set is 1200 hours = 50 days – If 100 disks are in an array and each has an MTTF of 100,000 hours, then the MTTF of the array is 100000/100 = 1000 hours – For 2 mirrored disks each with an MTTF of 100,000 hours and an MTTR of 10 hours, the mean time to data loss = 100000^2 / (2 * 10) = 500 x 10^6 hours • Mean time to data loss for a mirrored pair = MTTF^2 / (2 * MTTR), where MTTR is the Mean Time to Repair • Optimization of disk-block access: – Scheduling of block accesses (e.g., the elevator algorithm) – File organization (file systems, fragmentation, sequential blocks, etc.) – Non-volatile write buffers: ∗ NVRAM to speed up writes ∗ Log disk (access to a log disk is sequential) ∗ Log file (no separate disk - as in journaling file systems)
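The figures above can be tied together with a little arithmetic. Below is a small Python sketch; the 6 ms seek time and 7200 rpm in the example call are assumed sample values, not figures from the cheat sheet.

```python
# Sketch: recompute the kind of figures quoted above.

def avg_access_time_ms(avg_seek_ms: float, rpm: int) -> float:
    """Average access time = average seek time + average rotational latency."""
    full_rotation_ms = 60_000 / rpm          # one full rotation, in ms
    return avg_seek_ms + full_rotation_ms / 2

def mirrored_mttdl_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time to data loss for a mirrored pair: MTTF^2 / (2 * MTTR)."""
    return mttf_hours ** 2 / (2 * mttr_hours)

print(avg_access_time_ms(avg_seek_ms=6, rpm=7200))    # ~10.2 ms
print(mirrored_mttdl_hours(100_000, 10))              # 500,000,000 hours
```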
RAID
Redundant Arrays of Independent (originally Inexpensive) Disks • RAID improves reliability via redundancy • RAID improves performance via parallelism: it increases the number of I/O requests handled per second, the transfer rate, or both – Bit-level striping: ∗ The bits of each byte are split across several disks ∗ For an 8-disk configuration, the transfer rate is 8 times that of a single disk, and the number of I/Os per second is the same as for a single disk; bit i of each byte goes to disk i ∗ For a 4-disk configuration, bits i and 4+i of each byte go to disk i – Block-level striping (most commonly used; see the sketch after this section): ∗ Stripes blocks across multiple disks (one block on each disk) ∗ Logical block i goes to disk (i mod n) + 1 and uses the floor(i/n)-th physical block of that disk (the formulae assume disks are numbered from 1 and blocks from 0) ∗ For large reads (multiple blocks), the data-transfer rate is n times that of a single disk (n is the number of disks) ∗ For a single-block read, the transfer rate is the same as for a single disk, but the other disks are free to process other requests – Other forms of striping: bytes of a sector, sectors of a block • The 2 main goals of parallelism in a disk system are: – Load-balance small disk requests so that throughput is increased – Parallelize large accesses so that the response time of large accesses is reduced • RAID Levels: – RAID Level 0: no redundancy, block striping; used when a backup is easily restorable – RAID Level 1: mirroring with block striping (aka level 1+0 or 10); mirroring without block striping is plain level 1; 2M disks are required for M disks of data; used when writes dominate (e.g., a log disk) – RAID Level 2: memory-style ECC (with parity bits); fewer disks required than level 1 (e.g., 3 parity disks for 4 disks of data); subsumed by level 3 – RAID Level 3: bit-interleaved parity; a single parity bit suffices for error detection as well as correction (e.g., 1 parity disk for 3 disks of data) – RAID Level 4: block-interleaved parity; a separate disk holds parity (at the block level); the parity disk is involved in every write; a single write requires 4 disk accesses: 2 to read the 2 old blocks and 2 to write the new blocks (parity and data); subsumed by level 5 – RAID Level 5: block-interleaved distributed parity; every disk stores parity for some of the other disks; subsumes level 4 – RAID Level 6: P+Q redundancy (like RAID level 5, but stores extra redundant information to guard against multiple disk failures); an ECC such as Reed-Solomon is used; 2 bits of redundant data are stored for every 4 bits of data, so up to 2 disk failures can be tolerated • Choice of RAID: – RAID Level 0: use where data safety is not critical (and a backup is easily restorable) – RAID Level 1: offers the best write performance; use for high I/O requirements with moderate storage (e.g., log files in a database system) – RAID Level 5: storage-intensive apps such as video data storage; frequent reads and rare writes – RAID Level 6: use when data safety is very important • Hardware RAID vs. software RAID: hot-swapping, etc.
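The block-striping mapping quoted above is easy to check mechanically. A minimal Python sketch of that mapping (disks numbered from 1, logical blocks from 0, as the formula assumes):

```python
# Logical block i on an n-disk array goes to disk (i mod n) + 1 and
# occupies physical block floor(i / n) of that disk.

def stripe(logical_block: int, n_disks: int) -> tuple[int, int]:
    disk = (logical_block % n_disks) + 1
    physical_block = logical_block // n_disks
    return disk, physical_block

# With 4 disks, logical blocks 0..7 map as:
for i in range(8):
    print(i, stripe(i, 4))    # e.g. block 5 -> (disk 2, physical block 1)
```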
Buffer Manager • Buffer replacement strategy: LRU (not good for nested-loop join), MRU (depends on the join strategy), toss-immediate • Pinning: pinning a block in memory (a pinned block is not allowed to be written back to disk) • Forced output of blocks: a block is written out even if its space is not required (used when xact log records need to go to stable storage)
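As a rough illustration of the ideas above (LRU replacement, pinning, forced output), here is a hedged Python sketch of a buffer pool. The read_block/write_block callables are hypothetical I/O hooks, not part of any real DBMS API.

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, read_block, write_block):
        self.capacity = capacity
        self.read_block = read_block      # fetch a block from disk
        self.write_block = write_block    # force a block to disk
        self.frames = OrderedDict()       # block_id -> (data, pinned, dirty)

    def pin(self, block_id):
        """Fetch a block into the pool and pin it (it cannot be evicted)."""
        if block_id in self.frames:
            self.frames.move_to_end(block_id)           # mark most recently used
            data, _, dirty = self.frames[block_id]
            self.frames[block_id] = (data, True, dirty)
            return data
        if len(self.frames) >= self.capacity:
            self._evict_lru()
        data = self.read_block(block_id)
        self.frames[block_id] = (data, True, False)
        return data

    def unpin(self, block_id, dirty=False):
        data, _, was_dirty = self.frames[block_id]
        self.frames[block_id] = (data, False, was_dirty or dirty)

    def force(self, block_id):
        """Forced output: write the block even though no space is needed."""
        data, pinned, _ = self.frames[block_id]
        self.write_block(block_id, data)
        self.frames[block_id] = (data, pinned, False)

    def _evict_lru(self):
        # evict the least recently used block that is not pinned
        for block_id, (data, pinned, dirty) in self.frames.items():
            if not pinned:
                if dirty:
                    self.write_block(block_id, data)
                del self.frames[block_id]
                return
        raise RuntimeError("all blocks pinned")
```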
File Organization • Fixed-length records: – On deletion, move the records (expensive) – Or, on deletion, move the final record into the freed space (requires an additional disk access) – Or, store a header and pointers to link (chain) the free record slots • Variable-length records: – Use a slotted-page structure for each block (see the sketch below) – Each block has a header that stores: ∗ The number of record entries ∗ The end of free space in the block ∗ An array whose entries contain the location and the size of each record in the block • Organization of the records within a file: – Heap file organization – Sequential file organization (may require overflow blocks or periodic reorganization) – Hashing file organization – Multi-table clustering file organization: for example, for a join of depositor and customer, after each depositor record store the customer records for that depositor
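A minimal Python sketch of the slotted-page idea described above. The 8-byte header and slot sizes are illustrative assumptions; real systems use packed binary layouts.

```python
class SlottedPage:
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.free_end = block_size        # end of free space
        self.slots = []                   # (offset, size) for each record
        self.data = bytearray(block_size)

    def header_size(self):
        # 4 bytes each for entry count and free-space pointer, 8 bytes per slot
        return 8 + 8 * len(self.slots)

    def insert(self, record: bytes) -> int:
        # records grow from the end of the block toward the header
        if self.free_end - len(record) < self.header_size() + 8:
            raise ValueError("block full")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1        # slot number identifies the record

    def read(self, slot: int) -> bytes:
        offset, size = self.slots[slot]
        return bytes(self.data[offset:offset + size])

page = SlottedPage()
rid = page.insert(b"(A-101, Downtown, 500)")
print(page.read(rid))
```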
Data Dictionary Storage • Store like a miniature database • Types of information stored: – Names of the relations – Names and attributes of each relation – Domains and lengths of each attribute – Names and definitions of views – Integrity constraints – User, Auth and Accounting info, Passwords – Statistical data – Info on indices
Indexing and Hashing • 2 basic types of indices: ordered indices (based on a sorted ordering of the values) and hash indices (based on a uniform distribution of values across a range of buckets) • Indexing techniques must be evaluated on these factors: – Access time – Access type (range of values or point lookup) – Insertion time – Deletion time – Space overhead for storing the index • Ordered Indices: – Clustering or primary index: if the file containing the records is sequentially ordered, a clustering (primary) index is an index whose search key also defines the sequential order of the file – Non-clustering or secondary index: an index whose search key specifies an order different from the sequential order of the file • Indexed Sequential File: a file with a clustering index on the search key is called an indexed sequential file – Dense index: an index record appears for every search-key value in the file – Sparse index: an index record appears for only some of the search-key values (some sequential scanning is required to locate a record) – Main disadvantages of the indexed sequential file: performance degrades (both for sequential scans and for index lookups) as the file grows; this can be remedied by periodic reorganization of the file, but that is expensive • Multi-level Index:
– n-level sparse indices – If an index occupies b = 100 blocks, binary search requires ⌈log2(b)⌉ = 7 disk accesses per lookup – Closely related to tree structures. Read the index-update pseudocode from the notes / book • Secondary Indices: – Cannot be sparse (must be dense) – Pointers in a secondary index (on search keys that are not candidate keys) do not point directly to the file (instead, each points to a bucket that contains pointers to the file) – Disadvantages: sequential scan in secondary-key order is very slow; they impose significant overhead on modification of the file (note that when a file is modified, every index on it must be updated)
B+-Tree Index Files Node structure: P1, K1, P2, K2, ..., K(n-1), Pn • Most widely used index structure • Maintains its efficiency despite insertions and deletions of data • A B+-tree index takes the form of a balanced tree in which every path from the root to a leaf is of the same length • Each non-leaf node has between ⌈n/2⌉ and n children, where n is fixed for a particular tree • Each leaf must have at least ⌈(n-1)/2⌉ values and at most n-1 values • Each non-leaf node must have at least ⌈n/2⌉ pointers and at most n pointers • Imposes a performance overhead on insertion and deletion and a space overhead (as much as half of a node may be empty), but is still preferred (since periodic file reorganization is not needed) • A B+-tree index is like a multi-level index • Queries on B+-trees: – If there are K search-key values in the file, then the path from the root to a leaf is no longer than ⌈log⌈n/2⌉(K)⌉. For example, if K = 1,000,000 and n = 100, then ⌈log50(1,000,000)⌉ = 4, so at most 4 nodes need to be accessed. Binary search would require ⌈log2(1,000,000)⌉ = 20 block accesses. Algo for B+-tree from the book
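The height bound above can be checked numerically; a small Python sketch using the quoted figures (K = 1,000,000, n = 100):

```python
import math

# Path length bound for a B+-tree with K search-key values and fanout n,
# compared against plain binary search over the file.

def bptree_max_height(K: int, n: int) -> int:
    return math.ceil(math.log(K, math.ceil(n / 2)))

def binary_search_accesses(K: int) -> int:
    return math.ceil(math.log2(K))

print(bptree_max_height(1_000_000, 100))   # 4   (log base 50 of 10^6)
print(binary_search_accesses(1_000_000))   # 20
```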
B+-Tree File Organization • Leaf nodes store the records themselves rather than pointers to them • Need to consider the capacity of a node while splitting and coalescing (since records are stored in the leaf nodes) • SQL CLOBs and other large objects are split into a sequence of smaller records and organized in a B+-tree file organization • Formula here • Indexing strings: strings are variable length (need to consider node capacity); prefix compression can be used to reduce the size • B-Tree Index Files: – Advantages: no repetition of search-key values; non-leaf nodes contain pointers to data (an additional pointer is required in non-leaf nodes, so the height of the tree may increase compared to a B+-tree); not much gain compared to a B+-tree since the majority of the data is in the leaf nodes anyway – Disadvantages: deletion and other operations are more complex
Other properties • Multiple-Key Access: 1. Use multiple single-key indices: perform an intersection; performance is poor if there are many records satisfying each condition individually, but few satisfying both conditions 2. Use an index on multiple keys (composite search key): queries with conjunctive predicates with equality on the first attribute are still okay (since we can treat the range as (P, -inf) to (P, inf)); but an inequality on the first attribute is inefficient 3. Bitmap indices can be used: existence bitmaps and the presence of NULLs need to be handled 4. R-trees (an extension of B+-trees) can handle indexing on multiple dimensions (e.g., for geographical data) • Non-unique search keys: add a unique record id to make the key unique, to avoid buckets and extra page lookups; a search for customer-name = 'X' internally becomes the range ('X', -inf) to ('X', inf) • Covering indices: store multiple (extra) attributes along with the pointer to the record (e.g., balance can be stored if it is required frequently); saves one disk access • Secondary indices and index relocation: – Some file organizations (such as the B+-tree file organization) change the location of records even when the records have not been updated – To overcome the resulting problems in secondary indices, we can store the values of the primary-index search key (instead of pointers) in the secondary index and use the primary index for the lookup – The cost of access increases, but no change is required on file reorganization
Hashing • The hash function must be chosen so that the distribution is uniform and random • Hashing can be used for: – Hash file organization: compute the address of the block directly – Hash index organization: organizes the index into a hash file structure • Example hash function for a string s[0] to s[n-1] of n characters (see the sketch below): h(s) = (s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1]) mod (number of buckets) • Bucket overflows can still occur due to insufficient buckets and skew • Overflow can be handled through overflow chaining, or open hashing (linear or quadratic probing, etc.); open hashing is not good for databases since deletion is troublesome • Dynamic hashing: extendable hashing is one form; uses a hash prefix; the bucket address table grows and shrinks • Advantages of dynamic hashing: no space reservation required; performance does not degrade as the file grows or shrinks • Disadvantage of dynamic hashing: an additional lookup in the bucket address table is required • Linear hashing avoids the extra level of indirection at the possible cost of more buckets • Ordered indexing: can handle range queries better • Hashing: bad for range queries; suitable for single-value comparisons; good for temporary files during query processing • Bitmap Index Structure: bitmaps and B+-trees can be combined
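A quick Python sketch of the string hash function quoted above, evaluated with Horner's rule; the bucket count and the sample string are arbitrary.

```python
# h(s) = (s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]) mod num_buckets

def string_hash(s: str, num_buckets: int) -> int:
    h = 0
    for ch in s:                # Horner's rule: h = h*31 + character code
        h = h * 31 + ord(ch)
    return h % num_buckets

print(string_hash("Perryridge", 8))
```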
Query Processing • Steps in query processing: parser and translator (gives a relational-algebra expression), optimizer (consults statistics about the data and gives an execution plan), evaluation engine (evaluates the plan and outputs the results of the query). Mention join and sorting techniques here
Cost of Selections tS: seek time, br: number of blocks in the file, tT: transfer time for one block. Index structures are called access paths, since they provide a path through which data can be located and accessed. • Linear Search (A1): can be applied to any file – Cost = tS + (br * tT) (one seek + scan of all blocks) – For key attributes, we can stop after finding the match: average cost = tS + (br/2) * tT; worst-case cost = tS + br * tT • Binary Search (A2): – Cost for key searches: ⌈log2(br)⌉ * (tS + tT) – Cost for non-key searches: ⌈log2(br)⌉ * (tS + tT) plus the transfer of the additional blocks that contain matching duplicates • Primary index, equality on key (A3): for a B+-tree, if hi is the height of the tree, Cost = (hi + 1) * (tT + tS) • Primary index, equality on non-key (A4): Cost = hi * (tT + tS) + tS + b * tT, where b is the number of blocks containing the matching duplicates • Secondary index, equality (A5): – For a key: cost is the same as for A3 – For a non-key: Cost = (hi + n) * (tT + tS), where n is the number of matching records (each may reside on a different block) • Primary index, comparison (A6): – For A > v, first locate v in the index and then scan sequentially; cost is similar to A4 – For A < v or A ≤ v, no index is used; similar to A1 • Secondary index, comparison (A7): – Searching the index is similar to A6 – But retrieving each record may require access to a different block – Therefore, a linear search may be better • Conjunctive selection using one index (A8): – Use one index to retrieve (using A2 through A7) – Test each retrieved record for the remaining conditions – To reduce the cost, choose a θi and one of A1 through A7 such that the combination results in the least cost for σθi(r) – Cost is the cost of the chosen algorithm • Conjunctive selection using a composite index (A9): – Use a composite index (same as A3, A4 or A5) • Conjunctive selection by intersection of identifiers (A10): – Cost is the sum of the costs of the individual index scans + the cost of retrieving the records in the intersection – Sorting can be used so that all pointers to a block come together; blocks are then read in sorted physical order to minimize disk-arm movement • Disjunctive selection by union of pointers (A11): – If access paths are available on all conditions, each index is scanned to get the pointers, the union is taken, and the records are retrieved – If even one of the conditions does not have an access path, the most efficient method may be a linear scan
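The formulas above can be packaged as a rough cost calculator. The sketch below uses the cheat sheet's cost expressions; the parameter values in the example call are made-up figures.

```python
import math

# t_S = seek time, t_T = transfer time per block, b_r = blocks in the file,
# h_i = height of the B+-tree index.

def a1_linear(b_r, t_S, t_T, key=False):
    blocks = b_r / 2 if key else b_r              # can stop early on a key match
    return t_S + blocks * t_T

def a2_binary(b_r, t_S, t_T):
    return math.ceil(math.log2(b_r)) * (t_S + t_T)

def a3_primary_index_key(h_i, t_S, t_T):
    return (h_i + 1) * (t_S + t_T)

def a4_primary_index_nonkey(h_i, b, t_S, t_T):
    # b = blocks holding the matching duplicates (read sequentially)
    return h_i * (t_S + t_T) + t_S + b * t_T

def a5_secondary_index_nonkey(h_i, n, t_S, t_T):
    # n = matching records, each possibly on a different block
    return (h_i + n) * (t_S + t_T)

# Example figures: 4 ms seek, 0.1 ms per block, 10,000-block file, height-3 index
print(a1_linear(10_000, 4, 0.1), a2_binary(10_000, 4, 0.1), a3_primary_index_key(3, 4, 0.1))
```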
Cost of Joins • Two reasons why sorting is important: the query may require the output to be sorted, and joins and some other operations can be implemented efficiently if the input relations are first sorted. Sorting physically is more important than sorting logically (to reduce disk-arm movement) • A natural join can be expressed as a θ-join followed by elimination of repeated attributes by a projection • nr: number of tuples in r, ns: number of tuples in s, br: number of blocks of r, bs: number of blocks of s • Nested-loop Join: – If both relations can be read into memory, cost = (br + bs) – Else, if only one block of each relation fits into memory, cost = nr * bs + br, assuming r is the outer relation • Block Nested-loop Join: – Cost = ⌈br / (M - 1)⌉ * bs + br, assuming M + 1 blocks are available in total (1 for output, 1 for the inner relation, M - 1 for the outer relation r); if only M blocks are available in total, the denominator becomes M - 2 • Sort-Merge Join: – Sorting cost of each relation, assuming it is not already sorted (M is the number of pages available for sorting; 1 for output and M - 1 for input during merging): ∗ for r: br * (2⌈log(M-1)(br / M)⌉ + 1) ∗ for s: bs * (2⌈log(M-1)(bs / M)⌉ + 1) – After sorting, assuming that all the tuples with the same value for the join attributes fit in memory, the cost is: (sorting cost) + br + bs • Hash Join: – Assume no overflow occurs – Use the smaller relation (say r) as the build relation and the larger relation (say s) as the probe relation – If M > br / M, no recursive partitioning is needed and Cost = 3(br + bs) – Else, if recursive partitioning occurs: Cost = 2(br + bs) * ⌈log(M-1)(br) - 1⌉ + br + bs (this includes the cost of reading and writing the partitions) • A B-tree file organization holds about ⌊(m - 1)n/m⌋ entries per node
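A hedged sketch of the join cost formulas above as Python functions (block transfers only, ignoring seeks). Here M is taken as the total number of memory blocks, so the block nested-loop denominator is M - 2, matching the second convention mentioned above; the example figures are arbitrary.

```python
import math

# b_r, b_s = blocks of r and s; n_r = tuples of r; M = memory blocks.

def nested_loop(n_r, b_r, b_s):
    return n_r * b_s + b_r                        # worst case, r as outer relation

def block_nested_loop(b_r, b_s, M):
    return math.ceil(b_r / (M - 2)) * b_s + b_r   # M-2 blocks hold the outer relation

def external_sort(b, M):
    # b * (2 * ceil(log_{M-1}(b / M)) + 1) block transfers
    return b * (2 * math.ceil(math.log(b / M, M - 1)) + 1)

def sort_merge(b_r, b_s, M, already_sorted=False):
    sort_cost = 0 if already_sorted else external_sort(b_r, M) + external_sort(b_s, M)
    return sort_cost + b_r + b_s

def hash_join(b_r, b_s, M):
    if M > b_r / M:                               # no recursive partitioning needed
        return 3 * (b_r + b_s)
    passes = math.ceil(math.log(b_r, M - 1) - 1)
    return 2 * (b_r + b_s) * passes + b_r + b_s

print(block_nested_loop(400, 100, 22), sort_merge(400, 100, 22), hash_join(400, 100, 22))
```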
Query Optimization Equivalence Rules (Set Version) • Cascade of σ for conjunctive selections: σθ1∧θ2(E) = σθ1(σθ2(E)) • Commutativity of selection operations: σθ1(σθ2(E)) = σθ2(σθ1(E)) • Only the final projection in a sequence of projections is needed: πL1(πL2(... πLn(E) ...)) = πL1(E) • Selections can be combined with Cartesian products and theta joins: – σθ(E1 × E2) = E1 ⋈θ E2 – σθ1(E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2 • Theta joins and natural joins are commutative: E1 ⋈ E2 = E2 ⋈ E1 • Associativity of joins: – Natural joins are associative: (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3) – Theta joins are associative in the following manner: if θ2 involves only attributes from E2 and E3, then (E1 ⋈θ1 E2) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3) – Cartesian products are also associative • Distributivity of selection over theta join: – If θ0 involves only attributes of E1, then σθ0(E1 ⋈θ E2) = (σθ0(E1)) ⋈θ E2 – If θ1 involves only attributes of E1 and θ2 only attributes of E2, then σθ1∧θ2(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2)) • Distributivity of projection over theta join: if L1 contains only attributes of E1 and L2 only attributes of E2 (and θ uses only attributes in L1 ∪ L2), then πL1∪L2(E1 ⋈θ E2) = (πL1(E1)) ⋈θ (πL2(E2))
• The set operations union and intersection are commutative: E1 ∪ E2 = E2 ∪ E1, E1 ∩ E2 = E2 ∩ E1. However, set difference is not commutative. • Set union and intersection are associative: (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3), (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3) • The selection operation distributes over union, intersection and set difference: σP(E1 − E2) = σP(E1) − σP(E2). Also, σP(E1 − E2) = σP(E1) − E2 (this second form also holds for intersection, but not for union) • The projection operation distributes over union: πL(E1 ∪ E2) = (πL(E1)) ∪ (πL(E2)) Join ordering: choose an order such that the sizes of the temporary results are reduced Enumeration of Equivalent Expressions: • Space requirements can be optimized by pointing to shared subexpressions • Time requirements can be reduced by optimization (dynamic programming, etc.)
Estimating Statistics of Expr Results • Catalog information: – nr = number of tuples in r – br = number of blocks containing tuples of r – lr = size of a tuple of r in bytes – fr = blocking factor of r (= number of tuples of r that fit in one block) – V(A, r) = number of distinct values that appear in r for attribute A – If A is a key for r, then V(A, r) = nr – If tuples of r are physically stored together, br = ⌈nr / fr⌉ – A histogram over the range of values of an attribute can be used for estimating (histograms can be equi-width or equi-depth) • Selection size estimation: – Equality (σA=a(r)): ∗ Assuming values are equi-probable, num = nr / V(A, r) ∗ With a histogram, num = n(range) / V(A, range), using the counts for the range containing the value – Comparison (σA≤v(r)): ∗ If v < min(A, r), num = 0 ∗ If v ≥ max(A, r), num = nr ∗ Else, num = nr * (v − min(A, r)) / (max(A, r) − min(A, r)) (this can be modified to use a histogram, where available: use the counts within the relevant ranges instead of the whole relation) ∗ If v is not known, as in the case of stored procedures, assume num = nr / 2 – Complex selections: ∗ Conjunctions (σθ1∧θ2∧...∧θn(r)): num = nr * (s1 * s2 * ... * sn) / nr^n, where si is the number of tuples that satisfy the selection σθi(r); si / nr is called the selectivity of σθi(r) ∗ Disjunctions (σθ1∨θ2∨...∨θn(r)): num = nr * [1 − (1 − s1/nr)(1 − s2/nr) ... (1 − sn/nr)] ∗ Negations: num = nr − num(σθ(r)); if NULLs are present, num = nr − num(σθ(r)) − num(NULLs) • Join size estimation: – Cartesian product: num(r × s) = nr * ns – Natural joins:
∗ R ∩ S = ∅: same as the Cartesian product ∗ R ∩ S is a key for R: num ≤ ns (similarly when it is a key for S) ∗ R ∩ S is a foreign key of S referencing R: num = ns ∗ R ∩ S is neither a key for R nor for S: choose the minimum of the following, where R ∩ S = {A}: · num = nr * ns / V(A, s) · num = nr * ns / V(A, r) • Size estimation for other operations: – Projection (πA(r)): num = V(A, r) (since projection eliminates duplicates) – Aggregation (A G F(r), grouping on A): num = V(A, r) (one tuple in the output for each distinct value of A) – Set operations: ∗ Operations on the same relation: rewrite as conjunctions, disjunctions or negations and use the previous results (e.g., σθ1(r) ∪ σθ2(r) = σθ1∨θ2(r)) ∗ Operations on different relations: inaccurate, but provides an upper bound: · num(r ∪ s) = nr + ns · num(r ∩ s) = min(nr, ns) · num(r − s) = nr – Outer joins: inaccurate, but provides an upper bound: ∗ num(r left-outer-join s) = num(r ⋈ s) + nr (similarly for r right-outer-join s) ∗ num(r full-outer-join s) = num(r ⋈ s) + nr + ns • Estimation of the number of distinct values: – If the selection condition θ forces A to take a single value, V(A, σθ(r)) = 1 – If θ restricts A to a set of specified values, the estimate is the number of specified values – If the selection condition is of the form A op v, V(A, σθ(r)) = V(A, r) * s, where s is the selectivity of the selection – In all other cases, the estimate is min(V(A, r), nσθ(r)) – For joins: ∗ If all attributes in A are from r, then V(A, r ⋈ s) = min(V(A, r), n(r ⋈ s)) ∗ If A has attributes A1 from r and A2 from s, then the estimate is min(V(A1, r) * V(A2 − A1, s), V(A2, s) * V(A1 − A2, r), n(r ⋈ s)) – For projections: the estimate is the number of tuples in the projection result – For aggregates like sum, count and avg, assume all the result values are distinct – For min(A) and max(A): num = min(V(A, r), V(G, r)), where G denotes the grouping attributes
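The selection- and join-size rules above translate directly into small estimator functions. A sketch, assuming uniformly distributed values and the catalog statistics named in the parameters:

```python
# n_r = tuples in r, v_a = V(A, r) = distinct values of A in r.

def eq_selection(n_r, v_a):                     # sigma_{A=a}(r)
    return n_r / v_a

def range_selection(n_r, v, lo, hi):            # sigma_{A<=v}(r)
    if v < lo:
        return 0
    if v >= hi:
        return n_r
    return n_r * (v - lo) / (hi - lo)

def conjunction(n_r, sizes):                    # sizes s_i of the individual selections
    est = n_r
    for s in sizes:
        est *= s / n_r                          # multiply the selectivities
    return est

def disjunction(n_r, sizes):
    miss = 1.0
    for s in sizes:
        miss *= 1 - s / n_r
    return n_r * (1 - miss)

def natural_join(n_r, n_s, v_a_r, v_a_s):       # A = common attribute, not a key
    return min(n_r * n_s / v_a_s, n_r * n_s / v_a_r)

print(eq_selection(10_000, 50), natural_join(10_000, 5_000, 50, 100))
```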
Choice of Evaluation Plans • To choose the best overall algorithm, we must consider even non-optimal algorithms for individual operations • Cost-based optimization: with n relations, there are (2(n−1))! / (n−1)! different join orders • With dynamic programming, the time complexity is O(3^n) • Dynamic Programming Algo: Outline algo here (see the sketch below)
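A hedged sketch of the dynamic-programming join-order search the note above says to outline: memoize the best plan for every subset of relations and try all two-way splits. The result_size and join_cost functions are crude stand-ins for a real cost model, not the algorithm from the book.

```python
from functools import lru_cache

def best_plan(relations, sizes):
    """relations: tuple of names; sizes: dict name -> estimated tuple count."""

    def result_size(subset):
        size = 1
        for r in subset:
            size *= sizes[r]
        return size * 0.01 ** (len(subset) - 1)     # crude selectivity guess

    def join_cost(left, right):
        return result_size(left) + result_size(right)   # placeholder cost model

    @lru_cache(maxsize=None)
    def solve(subset):
        if len(subset) == 1:
            return 0, subset[0]
        items = list(subset)
        best_cost, best_tree = float("inf"), None
        # each unordered split appears once: the last relation stays on the right
        for mask in range(1, 1 << (len(items) - 1)):
            left = tuple(sorted(it for i, it in enumerate(items) if mask >> i & 1))
            right = tuple(sorted(it for it in items if it not in left))
            cost = solve(left)[0] + solve(right)[0] + join_cost(left, right)
            if cost < best_cost:
                best_cost, best_tree = cost, (solve(left)[1], solve(right)[1])
        return best_cost, best_tree

    return solve(tuple(sorted(relations)))

print(best_plan(("r", "s", "t"), {"r": 10_000, "s": 100, "t": 1_000}))
```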
Heuristics in Optimization 1. Perform selection operations as early as possible (may cause problems if there is no index on the selection attribute and the relation r is small in σθ(r ⋈ s)) 2. Perform projections early (similar problems as in 1 above) 3. Left-deep join orders: convenient for pipelining 4. Avoid Cartesian products 5. Cached plans can be reused
Optimizing Nested Queries A "where exists" type of query can be optimized by using "decorrelation": rewrite it as a join with a temporary table (remember to use select distinct and to take care of NULL values to preserve the number of tuples)
Materialized Views Normally, only the query definition is stored. In a materialized view, we compute the contents of the view and store them. View maintenance is required to keep the materialized view up-to-date. View maintenance can be relegated to the programmer or be taken care of by the system (immediate or deferred) • Incremental View Maintenance: an update can be treated conceptually as a delete followed by an insert • Join operation: – For inserts, v_new = v_old ∪ (i_r ⋈ s) – For deletes, v_new = v_old − (d_r ⋈ s) • Selection and projection operations: – For selection inserts, v_new = v_old ∪ σθ(i_r) – For selection deletes, v_new = v_old − σθ(d_r) – For projection, duplicates need to be handled (see the sketch below): ∗ Keep a count for each tuple in the projection πA(r) ∗ Decrement the count on a delete and remove the tuple from the view when the count reaches 0 ∗ Increment the count on an insert, or add the tuple to the view if it is not present • Aggregation operations: – Count: similar to projection – Sum: similar to count (but need to keep the sum as well as the count) – Avg: keep the sum as well as the count – Min, Max: insertion is easy; deletion is expensive (need to find the new min / max) • Other operations: – Set intersection (r ∩ s): ∗ On insertion into r, check if the tuple is in s; if so, add it to the view ∗ On deletion from r, check if the tuple is in s; if so, delete it from the view – Outer joins (r outer-join s): ∗ On insertion into r, check if it matches in s; if so, add the joined result to the view; if not, still add it to the view, padded with NULLs ∗ On deletion from s, pad with NULLs if the matching r tuple no longer joins with any tuple of s ∗ On deletion from r, remove the corresponding tuples from the view
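As a small illustration of the count-based projection maintenance described above, here is a Python sketch; ProjectionView and its methods are hypothetical names.

```python
from collections import Counter

# Maintain pi_A(r) incrementally: keep a count per projected tuple, add on
# insert, and drop the tuple from the view only when its count reaches zero.

class ProjectionView:
    def __init__(self, project):
        self.project = project            # function extracting the projected attrs
        self.counts = Counter()

    def insert(self, tuple_):
        self.counts[self.project(tuple_)] += 1

    def delete(self, tuple_):
        key = self.project(tuple_)
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.counts[key]

    def contents(self):
        return set(self.counts)

v = ProjectionView(lambda t: t[0])        # pi_{branch}(account)
v.insert(("Downtown", 500)); v.insert(("Downtown", 700)); v.insert(("Mianus", 200))
v.delete(("Downtown", 500))
print(v.contents())                        # Downtown stays: its count is still 1
```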
Query Optimization using Materialized Views • The optimizer may need to substitute a materialized view for a query (or sub-query), if one exists • It may also replace a use of a materialized view by the view definition: for example, for σA=10(V), where V is defined as r ⋈ s, if there is an index on A in r but not on the materialized view V
Transactions Transaction: a set of operations that form a single logical unit of work. • ACID properties of a transaction: – A - Atomicity: all or none (handled by the Transaction Management component) – C - Consistency: if the db was consistent before the xact, it should be consistent after the xact (handled by the programmer or by constraints) – I - Isolation: an xact does not see the effects of a concurrently running xact (handled by the Concurrency Control component) – D - Durability: once committed, stays committed (handled by the Recovery Management component) • Transaction states: Active, Failed, Aborted (perform rollback), Partially committed, Committed • Shadow-copy technique: ensures atomicity and durability, and is used by text editors. Disadvantage: very expensive to copy the entire db; no support for concurrent xacts • Need for concurrent executions: improved throughput (tps), improved resource utilization, reduced waiting time (e.g., smaller xacts queued up behind a large xact), reduced average response time • Schedules: represent the chronological order in which the instructions are executed in the system. For a set of n transactions, there exist n! different serial schedules
• Consistency of the db under concurrent execution can be ensured by making sure that any schedule that is executed has the same effect as a serial schedule (that is, one w/o concurrent execution)
Conflict Serializability: • Instructions I1 and I2 conflict if they are operations by different xacts on the same data item and at least one of them is a write operation • If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, S and S' are said to be conflict equivalent • A schedule S is conflict serializable if it is conflict equivalent to some serial schedule • This prohibits certain schedules even though they would cause no problem (e.g., operations that simply add and subtract); however, such cases are harder to analyze
View Serializability: • Less stringent than conflict serializability • View equivalence: two schedules S and S' are view equivalent if ALL 3 of the following conditions are met: – For each data item Q, if xact Ti reads the initial value of Q in S, then Ti must also read the initial value of Q in S' – For each data item Q, if xact Ti executes read(Q) in S and that value was produced by xact Tj, then the read(Q) of Ti in S' must also read the value of Q produced by that same write of Tj – For each data item Q, the xact (if any) that performs the final write(Q) in S must also perform the final write(Q) in S' • A schedule is view serializable if it is view equivalent to some serial schedule • Blind write: writing a value without reading it first • Blind writes appear in any view-serializable schedule that is not conflict serializable
Other properties • Recoverable schedule: one where, for each pair of xacts Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the commit operation of Tj • Cascading rollback is undesirable since it can lead to undoing a significant amount of work • Cascadeless schedule: one where, for each pair of xacts Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti occurs before the read operation of Tj • A cascadeless schedule is also recoverable, but not vice versa • The goal of concurrency-control schemes is to provide a high degree of concurrency, while ensuring that all schedules that can be generated are conflict or view serializable, and are cascadeless • Testing for conflict serializability (to show that the generated schedules are serializable): – Construct the precedence graph for a schedule S (vertices are xacts, edges indicate read/write dependencies) – If the graph contains no cycles, the schedule S is conflict serializable – A serializability order of the xacts can be obtained through topological sorting of the precedence graph – Cycle-detection algorithms are O(n^2) • Testing for view serializability: – An NP-complete problem – Sufficient conditions can be used: if the sufficient conditions are satisfied, the schedule is view serializable; but there may be view-serializable schedules that do not satisfy the sufficient conditions See examples and exercises of schedules from the book
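The precedence-graph test above is straightforward to sketch in Python: build edges from conflicting operations and run a depth-first cycle check. The (xact, op, item) schedule encoding is an assumption made for the example.

```python
# Ops are 'r' (read) and 'w' (write); an edge Ti -> Tj means Ti must precede Tj.

def precedence_graph(schedule):
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and 'w' in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    state = {}                               # 0 = visiting, 1 = done

    def visit(node):
        if state.get(node) == 0:
            return True                      # back edge -> cycle
        if state.get(node) == 1:
            return False
        state[node] = 0
        if any(visit(nxt) for nxt in graph.get(node, ())):
            return True
        state[node] = 1
        return False

    return any(visit(n) for n in graph)

s = [("T1", "r", "A"), ("T2", "w", "A"), ("T2", "r", "B"), ("T1", "w", "B")]
print(has_cycle(precedence_graph(s)))        # True -> not conflict serializable
```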
Concurrency Control Shared locks (S) and Exclusive locks (X): Compatibility matrix: (S,S) true, (S,X) false, (X,S) false, (X,X) false Starvation can be avoided: by processing the lock requests in the order in which they were made
2PL • Ensures serializability • Growing phase, shrinking phase • Does not prevent deadlock • Cascading rollbacks may occur (e.g., if T7 reads a data item that was written by T5 and then T5 aborts) – To avoid cascading rollbacks, strict 2PL can be used, where exclusive locks must be held till the xact aborts or commits (prevents xacts from reading uncommitted writes) – Rigorous 2PL can be used, where ALL locks are held till the xact aborts or commits; xacts are then serialized in their commit order • Upgrading and downgrading of locks can be done; upgrading should be allowed only in the growing phase, and downgrading only in the shrinking phase (e.g., a series of reads followed by a write to a data item - in the other forms of 2PL above, the xact must obtain an X lock on the data item to be updated, even if the update comes much later)
Implementation of locking: Hash table for data items with linked list (of xacts that have been granted locks for that data item plus those that are waiting). Overflow chaining can be used.
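A minimal Python sketch of such a lock table: a hash map from data item to a FIFO queue of requests, with S/X compatibility only (no intention modes, no deadlock handling).

```python
from collections import defaultdict, deque

class LockTable:
    def __init__(self):
        self.queues = defaultdict(deque)     # item -> queue of request entries

    def _compatible(self, mode, item):
        granted_modes = [e["mode"] for e in self.queues[item] if e["granted"]]
        return all(mode == "S" and g == "S" for g in granted_modes)

    def request(self, xact, item, mode):
        entry = {"xact": xact, "mode": mode, "granted": False}
        # grant only if nobody is waiting ahead of us and the modes are compatible
        no_waiters = all(e["granted"] for e in self.queues[item])
        if no_waiters and self._compatible(mode, item):
            entry["granted"] = True
        self.queues[item].append(entry)
        return entry["granted"]              # False -> the xact must wait

    def release(self, xact, item):
        self.queues[item] = deque(e for e in self.queues[item] if e["xact"] != xact)
        for e in self.queues[item]:          # wake waiters in FIFO order
            if e["granted"]:
                continue
            if self._compatible(e["mode"], item):
                e["granted"] = True
            else:
                break

locks = LockTable()
print(locks.request("T1", "A", "S"))   # True
print(locks.request("T2", "A", "X"))   # False, T2 waits
locks.release("T1", "A")               # T2's X lock is now granted
print(locks.queues["A"][0]["granted"]) # True
```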
Graph-based Protocols • An acyclic graph defines the order in which data items may be locked • A data item can be locked by Ti only if its parent is currently locked by Ti • Locks can be released earlier, so shorter waiting times and increased concurrency • Deadlock free, so no rollbacks are required • Disadvantages: may need to lock more data items than needed (locking overhead and increased waiting time); without prior knowledge of which data items will be needed, xacts may have to lock the root of the tree, and that can reduce concurrency greatly • Cascadelessness can be obtained by tracking commit dependencies, so that a transaction is not allowed to commit until all the transactions whose writes it has read have committed
Timestamp-based Protocols • Determine the serializability order by selecting the order in advance • Timestamps: the system clock or a logical counter can be used • Each xact is given a timestamp when it enters the system • Each data item has 2 timestamps: W-timestamp (the largest ts of any xact that wrote the data item successfully) and R-timestamp (the largest ts of any xact that read the data item successfully) • The Timestamp-Ordering Protocol is: – If Ti issues read(Q): ∗ If TS(Ti) < W-timestamp(Q), reject the read and roll back Ti ∗ If TS(Ti) ≥ W-timestamp(Q), execute the read and set R-timestamp(Q) to the maximum of TS(Ti) and R-timestamp(Q) – If Ti issues write(Q): ∗ If TS(Ti) < R-timestamp(Q), reject the write and roll back Ti ∗ If TS(Ti) < W-timestamp(Q), reject the write and roll back Ti ∗ In all other cases, execute the write and set W-timestamp(Q) to TS(Ti) – Rolled-back xacts get a new timestamp when they are restarted – Freedom from deadlocks
– However, xacts could starve (e.g., a long-duration xact getting restarted repeatedly due to conflicts with short-duration xacts) – Recoverability and cascadelessness can be ensured by: ∗ Performing all writes together at the end of the xact; while they are being written, no other xact is permitted to access any of the data items being written ∗ Using a limited form of locking, whereby uncommitted reads are postponed until the xact that updated the item commits – Recoverability alone can be guaranteed by using commit dependencies, that is, tracking uncommitted writes and allowing an xact Ti to commit only after the commit of all xacts that wrote a value that Ti read • Thomas' Write Rule: – Allows greater potential concurrency – Ignores a write if TS(Ti) < W-timestamp(Q), instead of rolling Ti back
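The read/write tests of the timestamp-ordering protocol (including the Thomas write rule variant) can be sketched compactly; the return values "ok" / "rollback" / "ignore" are just illustrative signals.

```python
# Each item keeps an R-timestamp and a W-timestamp; a request either
# proceeds or forces the issuing xact to be rolled back.

class TimestampOrdering:
    def __init__(self, thomas_write_rule=False):
        self.r_ts = {}
        self.w_ts = {}
        self.thomas = thomas_write_rule

    def read(self, ts, item):
        if ts < self.w_ts.get(item, 0):
            return "rollback"                 # would read an already-overwritten value
        self.r_ts[item] = max(ts, self.r_ts.get(item, 0))
        return "ok"

    def write(self, ts, item):
        if ts < self.r_ts.get(item, 0):
            return "rollback"                 # a younger xact already read Q
        if ts < self.w_ts.get(item, 0):
            return "ignore" if self.thomas else "rollback"
        self.w_ts[item] = ts
        return "ok"

proto = TimestampOrdering(thomas_write_rule=True)
print(proto.write(5, "Q"), proto.read(3, "Q"), proto.write(4, "Q"))
# -> ok rollback ignore
```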
Validation-based Protocols • Also called optimistic concurrency control • Each xact goes through 3 phases (for update xacts; 2 for read-only xacts): – Read phase: the system executes xact Ti; it reads all data items and performs all write operations on temporary local variables, without updating the actual db – Validation phase: checks whether the updates can be copied over to the db without conflict – Write phase: done only if the xact succeeds in the validation phase; if so, the system applies the updates to the db, otherwise the xact is rolled back • The validation test for xact Tj requires that for all xacts Ti with TS(Ti) < TS(Tj), one of the following conditions must hold: – Finish(Ti) < Start(Tj) – The set of data items written by Ti does not intersect the set of data items read by Tj, and Ti completes its write phase before Tj starts its validation phase (Start(Tj) < Finish(Ti) < Validation(Tj)); this ensures that the writes of Ti and Tj do not overlap
Multiple Granularity • Hierarchy of granularity: DB, areas, files, records; visualize it as a tree with the DB at the root • Explicit locking at one level implies implicit locking of all nodes below it • Care must be taken not to grant an explicit lock on a node when a conflicting lock has already been granted at a level above it (e.g., a record cannot be locked explicitly if the file has been locked); the tree must be traversed from the root down to the required level to find out • Also, the db cannot be locked if someone else is holding a lock at a lower level. Instead of searching the entire tree to determine this, intention lock modes are used: – When an xact locks a node, it acquires an intention lock on all the nodes from the root to that node – IS (Intention-Shared) lock: if a node is locked in IS mode, explicit shared locking is being done at a lower level – IX (Intention-Exclusive) lock: if a node is locked in IX mode, explicit exclusive or shared locking is being done at a lower level – SIX (Shared and Intention-Exclusive) lock: the subtree rooted at that node is locked explicitly in shared mode, and explicit exclusive locking is being done at a lower level – Compatibility matrix:
        IS     IX     S      SIX    X
  IS    true   true   true   true   false
  IX    true   true   false  false  false
  S     true   false  true   false  false
  SIX   true   false  false  false  false
  X     false  false  false  false  false
– Multiple-granularity protocol: ∗ The compatibility matrix above must be followed when granting locks ∗ A transaction must lock the root of the tree first, and it can lock it in any mode ∗ It can lock a node Q in S or IS mode only if it currently has the parent of Q locked in IS or IX mode ∗ It can lock a node Q in X, SIX, or IX mode only if it currently has the parent of Q locked in IX or SIX mode ∗ It can lock a node only if it has not previously unlocked any node (that is, Ti is two-phase) ∗ It can unlock a node Q only if it currently has none of the children of Q locked ∗ Locking is done top-down, whereas unlocking is done bottom-up – This protocol enhances concurrency and reduces lock overhead and is good for apps that include a mix of: ∗ Short xacts that access only a few data items ∗ Long xacts that produce reports from an entire file or set of files – Deadlock is still possible
Multiversion Schemes Instead of delaying reads or aborting xacts, these schemes use old copies of the data. Each write produces a new version of a data item, and a read is given one of the versions of the data item. The protocol must ensure that the version given preserves serializability, and that an xact can easily determine which version to read. • Multiversion Timestamp Ordering: – Each xact has a unique ts as before (as in the TS scheme) – Each version of a data item has the content of the data item, an R-ts and a W-ts – Whenever an xact Ti writes Q, a new version of Q is produced whose R-ts and W-ts are initialized to TS(Ti) – Whenever an xact Ti reads Q, the R-ts of the version is set to TS(Ti) if R-ts(Q) < TS(Ti) – The protocol is (xact Ti wants to read or write Q): ∗ Find the version Qk whose W-ts is the largest ts ≤ TS(Ti) ∗ If xact Ti issues read(Q), the value returned is the content of Qk ∗ If xact Ti issues write(Q) and TS(Ti) < R-ts(Qk), then roll back Ti (some other xact has already read the value, so we cannot change it now); on the other hand, if TS(Ti) = W-ts(Qk), overwrite the contents of Qk (without creating a new version); else, create a new version – Older versions of a data item are removed as follows: if there are 2 versions of a data item, both with W-ts less than the ts of the oldest transaction in the system, the older of these 2 versions can be removed – A read request never fails and is never made to wait – Disadvantages: reading requires updating the R-ts (two disk accesses rather than one), and conflicts between xacts are resolved through rollbacks rather than waits (multiversion 2PL addresses the rollback problem) – Does not ensure recoverability and cascadelessness; it can be extended in the same manner as the basic TS-ordering scheme • Multiversion 2PL: attempts to combine the advantages of multiversion concurrency control with 2PL; it differentiates between read-only xacts and update xacts. TODO
Deadlock Handling 2 methods to deal with deadlocks: Deadlock prevention, and Deadlock detection and recovery. Deadlock prevention is used if the probability of deadlocks is relatively high; otherwise detection and recovery are more efficient. Detection scheme requires overhead to maintain information while running to detect deadlocks as well as losses that can occur due to recovery from deadlocks. • Deadlock Prevention using partial ordering: Use partial ordering technique like tree protocol
• Deadlock Prevention using total ordering and 2PL: use a total ordering of data items together with 2PL; an xact cannot request a lock on an item that precedes (in the ordering) an item it has already locked • Deadlock Prevention using wait-die (non-preemptive): older xacts are made to wait for younger ones; a younger xact requesting a lock held by an older one is rolled back; the older an xact gets, the more it tends to wait • Deadlock Prevention using wound-wait (preemptive): a younger xact holding a lock needed by an older one is wounded (rolled back) by the older xact; a younger xact requesting a lock held by an older xact is made to wait; there may be fewer rollbacks in this scheme. Both wait-die and wound-wait avoid starvation, and both may cause unnecessary rollbacks • Timeout-based schemes: in between deadlock prevention and detection; allow an xact to wait for some time; on timeout, assume that a deadlock may have occurred and roll back the xact; easy to implement, but it is difficult to determine the correct duration to wait; suitable for short xacts • Deadlock Detection and Recovery: check periodically whether a deadlock has occurred (detection) by looking for cycles in a wait-for graph; selection of a victim can be done on the basis of minimum cost (how many xacts will be involved, how many data items have been used, how much longer each needs to complete, etc.); total rollback or partial rollback (roll back just far enough to release the lock that breaks the deadlock); starvation can be prevented by including the number of times an xact has already been rolled back in the cost factor while choosing the victim
Insert and Delete Operations • Delete operation similar to write (X lock for delete op in 2PL; treated similar to write op in TS-ordering) • Insert operation: X-lock in 2PL; in TS-ordering, assign TS of the xact that is inserting the item to the R-ts and W-ts of the data item
Phantom Phenomenon Consider computing a sum with a select while another xact performs an insert; this can result in a non-serializable schedule if locking is done at the granularity of individual data items: the two xacts do not access any tuple in common, so the conflict would go undetected. It can be alleviated by: • Associating a virtual data item with every relation and having xacts lock it (in addition to the tuples) if they update or read information about the relation • An index-locking protocol with 2PL: lookups must lock the relevant nodes of the index in shared mode; writes must lock the appropriate nodes of the index in exclusive mode • Variants of index locking can be used to implement the other schemes (apart from 2PL)
Weak Levels of Consistency Serializability allows programmers to ignore issues related to concurrency when they code xacts. • Degree-Two Consistency: the purpose is to avoid cascading aborts without necessarily ensuring serializability; S locks may be acquired and released at any time; X locks can be acquired at any time, but cannot be released until the xact aborts or commits; results in non-serializable schedules; therefore, this approach is undesirable for many apps • Cursor Stability: a form of degree-two consistency for programs written in host languages that iterate over tuples using a cursor; instead of locking the entire relation, the tuple that is currently being processed is locked in S mode; any modified tuples are locked in X mode until the xact commits; 2PL is not used, so serializability is not guaranteed; heavily accessed relations gain increased concurrency and improved system performance. Programmers must take care at the app level so that db consistency is ensured. • Weak levels of consistency in SQL (SQL-92 isolation levels): – Serializable (the default)
– Repeatable Read (xact may not be serializable wrt other xacts; e.g., when an xact is searching for records satisfying some conditions, the xact may find some records inserted by a committed xact, but not others) – Read committed – Read uncommitted (lowest level of consistency allowed in SQL-92)
Concurrency in Index Structures Since indices are accessed frequently, they would become a point of great lock contention, leading to a low degree of concurrency. It is acceptable to have nonserializable concurrent access to an index, as long as the accuracy of the index is maintained. 2 techniques: • Crabbing Protocol: – When searching, lock the root node in shared mode, then the child node; after acquiring the lock on the child, release the lock on the parent – When inserting, traverse the tree as in search mode; then lock the affected leaf node in X mode; if coalescing, splitting or redistribution is required, lock the parent in X mode, perform the operations on the node(s), and release the locks on the node and its siblings; retain the lock on the parent if the parent itself needs further splitting, coalescing, or redistribution – Locking progresses top-down while searching, and bottom-up when splitting, coalescing or redistributing • B-link-tree locking protocol: achieves more concurrency by avoiding holding a lock on one node while acquiring a lock on another, using a modified version of B+-trees called B-link trees; these require that every node (internal nodes as well as leaf nodes) maintain a pointer to its right sibling – Lookup: each node must be locked in S mode before accessing it; a split may occur concurrently with a lookup, so the search value may have moved to the right sibling; leaf nodes are locked in 2PL to avoid the phantom phenomenon – Insertion and deletion: follow the lookup rules to locate the leaf node into which the insertion or deletion will take place; upgrade the shared lock to an X lock on the affected leaf; leaf nodes are locked in 2PL to avoid the phantom phenomenon – Split: create the new node (split); change the right-sibling pointers accordingly; release the X lock on the original node (if it is a non-leaf node; leaf nodes are locked in 2PL to avoid the phantom phenomenon) – Coalescing: the node into which coalescing will be done must be locked in X mode; once coalescing has been done, the parent node is locked in X mode to remove the deleted node; then the xact releases the locks on the coalesced nodes; if the parent is not to be coalesced further, the lock on the parent can be released – Note: an insertion or deletion may lock a node, unlock it, and subsequently relock it. Furthermore, a lookup that runs concurrently with a split or coalescence operation may find that the desired value has shifted to the right-sibling node; it can be reached by following the right-sibling pointer – Coalescing of nodes can cause inconsistencies; lookups may have to be restarted – Instead of 2PL on leaf nodes, key-value locking on individual key values can be done; however, this must be done carefully, else the phantom phenomenon can occur for range lookups; this can be taken care of by locking one more key value than the range (next-key locking)
Recovery System • Fail-stop assumption: Hardware errors and bugs in software bring the system to a halt, but do not corrupt the nonvolatile storage contents. • Stable Storage Implementation: Keep 2 physical blocks for each logical database block. Write the info onto the first physical block.
When the first write completes successfully, write the same info onto the second physical block. The o/p is completed only after the second write completes successfully. During recovery, the system examines each pair of physical blocks. If contents same, nothing to be done. If error in one, replace with the other. If contents differ, replace first’s contents with the contents of the second block. Number of blocks to compare can be reduced by keeping list of ongoing writes in NVRAM (so that only these need to be compared).
• Dump database procedure: output all log records to stable storage, then the buffer blocks, then copy the contents of the db to stable storage, and finally output a <dump> log record onto stable storage. To recover, only records after the <dump> record must be redone. But copying the entire db is impractical, and xact processing must be halted during the dump. Fuzzy dumps can be used to allow xacts to be active while the dump is in progress.
Advanced Recovery Techniques
Using logical logging for the undo process to achieve more concurrency (earlier release of locks on certain structures such as B+-tree index pages)
Log-Based Recovery
Recovery is used for rolling back transactions as well as for crash recovery. An update log record has: Xact Id, Data-item Id, Old Value, New Value
• <Ti start>, <Ti commit> and <Ti abort> records are written at the start, commit or abort of transaction Ti • Deferred database modification: only the new values need to be stored (for redo; no undo is needed) • Immediate database modification: requires both old and new values to be stored (for undo and redo) – Undo is performed before redo • Checkpoints: help reduce the amount of log that must be scanned after a crash to locate the transactions to be undone and redone; they also reduce the time to redo (since changes before the checkpoint have already been applied). Transactions are not allowed to perform any update actions, such as writing to a buffer block or writing a log record, while a checkpoint is in progress.
– Output all log records to stable storage – Output all modified buffer blocks to stable storage – Write the <checkpoint> record to stable storage For recovery, the log must be scanned backward to find the most recent checkpoint record, and further backward until all the transactions that have some record after the most recent checkpoint have been found. Only these transactions need to be redone or undone: if there is no commit record for a transaction, undo it; else redo it. • Recovery with Concurrency Control: – The list of active transactions is stored as part of the checkpoint record – The log can be used to roll back even failed xacts – If strict 2PL is used (that is, exclusive locks are held till the end of the xact), the locks held by an xact may be released only after the xact has been rolled back. So, when an xact is being rolled back, no other xact can have updated the same data item (the xact must have locked the data item, since it was going to update it). Therefore, restoring the old value of a data item will not erase the effects of any other xact. – Undo must be done by processing the log backward – Redo must be done by processing the log forward – For recovery, scan the log backward until the <checkpoint L> record is found, performing the following steps as each record is read (see the sketch at the end of this section): ∗ If a <Ti commit> record is found, add Ti to the redo list ∗ If a <Ti start> record is found and Ti is not on the redo list, add Ti to the undo list ∗ Finally, every Ti in the checkpoint's list L that does not appear in the redo list is added to the undo list (this takes care of long-running xacts that may not have written anything since the checkpoint record was written) ∗ Undo must be done prior to redo • WAL (Write-Ahead Logging): before a block of data in main memory can be output to the database (in non-volatile storage), all log records pertaining to data in that block must have been output to stable storage. Strictly speaking, the WAL rule requires only that the undo information in the log has been output to stable storage; the redo information may be written later. This distinction is relevant only in systems where undo and redo information are stored in separate log records.
Fuzzy Checkpointing:
• Normal checkpointing may halt xact processing for a long time if the number of pages to be written is large • Fuzzy checkpointing allows xacts to modify buffer blocks once the checkpoint record has been written • While performing a fuzzy checkpoint, xact processing is halted only briefly, to make a list of the buffers modified so far; the checkpoint record is written before the buffers are written out • The locks are then released and xacts can modify the buffer blocks; the checkpointing process outputs the modified blocks on its list in parallel. However, the block currently being written out by the checkpointing process still needs to be locked; the other blocks need not be. • The concept of a "last-checkpoint" record at a fixed position on disk can be used to guard against failures; this record should be updated only after ALL the buffers in the checkpoint's list have been written to stable storage.
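A Python sketch of the backward log scan described under "Recovery with Concurrency Control" above: build the redo and undo lists from a toy in-memory log. The dictionary-based log-record format is an assumption made for the example.

```python
# The log is a list of records, oldest first; the checkpoint record stores
# the list of transactions that were active when it was written.

def redo_undo_lists(log):
    redo, undo = set(), set()
    for rec in reversed(log):
        kind = rec["type"]
        if kind == "commit":
            redo.add(rec["xact"])
        elif kind == "start" and rec["xact"] not in redo:
            undo.add(rec["xact"])
        elif kind == "checkpoint":
            for t in rec["active"]:          # long-running xacts with no record
                if t not in redo:            # after the checkpoint
                    undo.add(t)
            break
    return redo, undo

log = [
    {"type": "start", "xact": "T1"},
    {"type": "checkpoint", "active": ["T1"]},
    {"type": "start", "xact": "T2"},
    {"type": "commit", "xact": "T2"},
    {"type": "start", "xact": "T3"},
]
print(redo_undo_lists(log))   # redo = {'T2'}, undo = {'T1', 'T3'}; undo runs first
```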
ARIES • Features: – Uses LSNs (log sequence numbers) – Physiological redo – Dirty Page Table – Fuzzy checkpointing (allows dirty pages to be written continuously in the background, removing the bottleneck of writing all pages at once) • LSN: – Every log record has an LSN that uniquely identifies it – An LSN is most often a file number and an offset within that file – Each page has a PageLSN that records the LSN of the last log record that modified that page – The PageLSN is essential to ensure idempotence in the presence of physiological redo operations: a physiological redo must not be reapplied to a page that already reflects it, since reapplying it would make incorrect changes to the page – Each log record contains a field called PrevLSN that points to the previous log record of the same transaction (helps in locating a transaction's log records without reading the whole log) – CLRs (Compensation Log Records) have an additional field, UndoNextLSN, used in operation-abort log records to point to the log record that is to be undone next • Dirty Page Table: – Stores the list of pages that have been updated in the buffer – For each page, the PageLSN and the RecLSN are also stored – The RecLSN indicates that log records prior to it have already been applied to the disk version of the page – Initially, when the page is brought in from disk, the RecLSN is set to the current end of the log • Checkpointing: – A checkpoint log record contains the Dirty Page Table and the list of active transactions – For each active transaction, the checkpoint record also stores the last LSN of that transaction – A fixed position on disk notes the LSN of the last complete checkpoint log record • Recovery: 3 phases:
– Analysis Pass: Determines which xacts to undo, which pages were dirty at the time of the crash, and the LSN from which the redo pass should start. – Redo Pass: Starts from a position determined during the analysis phase, and performs a redo, repeating history, to bring the database to a state it was in before the crash. – Undo Pass: Rolls back all xacts that were incomplete at the time of the crash. Need to elaborate here about CLRs, etc. While undoing, if a CLR is found, it uses the UndoNextLSN to locate the next record to be undone; else it undoes the record whose number is found in the PrevLSN field • Advantages of ARIES: – Recovery is faster (no need to reapply already redone records; pages need not even be fetched if the changes are already applied) – Lesser data needs to be stored in the log – More concurrency is possible – Recovery Independence (e.g., for pages that are in error, etc.) – Savepoints (e.g., rolling back to a point where deadlock can be broken) – Allows fine-grained locking – Recovery optimizations (fetch-ahead of pages, out-of-order redo)
Remote Backup Systems Several issues must be addressed: • Detection of failure: using "heartbeat" messages and multiple communication links • Transfer of control: when the original site comes back up, it must catch up (by receiving the redo logs from the old backup site and replaying them locally); the old backup can then fail itself to allow the recovered primary to take over • Time to recover: a hot-spare configuration can be used • Time to commit: – One-safe: commit as soon as the commit log record is written to stable storage at the primary – Two-very-safe: commit only when both primary and secondary have written the log records to stable storage (a problem when the secondary is down) – Two-safe: same as two-very-safe when both primary and secondary are up; when the secondary is down, proceed as in one-safe
Database System Architectures Main Types: Client-Server, Parallel, Distributed
Centralized Systems: • Coarse-grained parallelism: A single query is not partitioned among multiple processors. Such systems support a higher throughput; that is, they allow a greater number of transactions to run per second, although individual transactions do not run any faster. • Fine-grained parallelism: Single tasks are parallelized (split) among multiple processors
Client-Server Systems: Clients access functionality through an API (JDBC, ODBC, etc.) or transactional remote procedure calls
Server System Architectures: 2 types: Transaction-server vs. Data-server systems • Transaction-server systems (aka query-server systems): – Components of a transaction-server system include: ∗ Server processes ∗ Lock manager process ∗ Log writer process ∗ Database writer process ∗ Process monitor process ∗ Checkpoint process – The shared memory contains all the shared data: ∗ Buffer pool ∗ Lock table ∗ Log buffer ∗ Cached query plans – Semaphores or atomic "test-and-set" operations must be used to ensure safe concurrent access to the shared memory – Even if the system handles lock requests through shared memory, it still uses the lock manager process for deadlock detection • Data-server systems: – This architecture is typically used when: ∗ There is a high-speed connection between clients and servers ∗ Client systems have computational power comparable to that of the server ∗ The tasks to be executed are computationally intensive – The client needs to have full back-end functionality
Parallel Systems • Speedup v/s Scaleup • Factors affecting Scaleup / Speedup • Interconnection Networks: – Bus √ √ – Mesh (max. distance is 2( n − 1) or n, if wrapping is allowed from the ends) – Hypercube (max. distance is log n) • Parallel System Architectures: Shared-memory, Shared-disks, Shared nothing, Hierarchical – Hierarchical: Share nothing at the top-level(???), but internally each node has either shared-memory or shared-disk architecture)
Distributed Systems • Reasons: Sharing data, Autonomy, Availability • Multidatabase or heterogeneous distributed database systems • Issues in distributed database systems: Software development cost, Greater potential for bugs, Increased processing overhead • Local-Area Networks, Storage Area Networks (SAN) • Wide-Area Networks: Disconintuous Connection WANs v/s Continuous Connection WANs
Distributed DB • Each site may participate in the execution of transactions that access data at one site, or several sites. • The difference between centralized and distributed databases is that, in the centralized case, the data reside in one location, whereas in the distributed case, the data reside in several locations. • Homogeneous Distributed DB: All sites have identical dbms software, are aware of one another, and agree to cooperate in processing users’ requests • Heterogeneous Distributed DB: Different sites may use different schemas and different dbms software, and may provide only limited facilities for cooperation in transaction processing
Client-Server Systems: Clients access functionality through API (JDBC, ODBC, etc.) or transactional remote procedure calls
Server System Architectures: 2 types: Transaction-server v/s Data-server systems • Transaction-server systems (aka query-server systems): – Components of a Transaction-server system include: ∗ Server Processes ∗ Lock Manager Process ∗ Log Writer Process
Distributed Data Storage Two approaches to storing a relation in a distributed db: • Replication: Several identical copies (replicas) of a relation are stored, each replica at a different site. Full replication: a copy is stored at every site – Advantages: Availability, Increased parallelism (minimizes movement of data between sites) – Disadvantages: Increased overhead on update – In general, replication increases the performance and availability of data for read operations, but update transactions incur greater overhead
– Concept of a primary copy of a relation
• Fragmentation (a small sketch follows this list): The relation is partitioned into several fragments, and each fragment is stored at a different site – Horizontal Fragmentation: Each tuple goes to one or more sites; ri = σPi(r); r is reconstructed using r = r1 ∪ r2 ∪ ... ∪ rn – Vertical Fragmentation: Decomposition of the schema of the relation (so that columns are at one or more sites); ri = ΠRi(r); the original relation can be obtained by taking the natural join of all the fragments. A primary key (e.g., a tuple id) needs to exist in each fragment. – For privacy reasons, vertical fragmentation can be used for hiding columns.
• Fragmentation and Replication can be combined
• Transparency: Users should get: Fragmentation transparency, Replication transparency, Location transparency
• To prevent name clashes: a name server can be used (single point of failure), or the site id can be prepended to each relation name. Aliases, stored at each site, can be used to map aliases to real names. This helps when the administrator decides to move a data item from one site to another.
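A toy illustration of horizontal and vertical fragmentation and their reconstruction; the relation, its attributes, and the fragmentation predicates are made up for the example.

# Horizontal / vertical fragmentation of a toy "account" relation.
account = [
    {"acc_no": "A-101", "branch": "Hillside", "balance": 500},
    {"acc_no": "A-215", "branch": "Valleyview", "balance": 700},
]

# Horizontal: ri = sigma_Pi(r); reconstruction is the union of the fragments.
r1 = [t for t in account if t["branch"] == "Hillside"]
r2 = [t for t in account if t["branch"] == "Valleyview"]
assert sorted(map(str, r1 + r2)) == sorted(map(str, account))

# Vertical: ri = Pi_Ri(r), each fragment keeping the key (acc_no);
# reconstruction is the natural join on that key.
v1 = [{"acc_no": t["acc_no"], "branch": t["branch"]} for t in account]
v2 = [{"acc_no": t["acc_no"], "balance": t["balance"]} for t in account]
joined = [{**a, **b} for a in v1 for b in v2 if a["acc_no"] == b["acc_no"]]
assert sorted(map(str, joined)) == sorted(map(str, account))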
Distributed Transactions • Need to worry about failure of a site or failure of a communication link while it is participating in a transaction • Transaction Manager (handles the ACID properties for local transactions) and Transaction Coordinator (coordinates the execution of both local and global transactions initiated at its site) • Transaction Coordinator responsibilities: start execution of a transaction, break an xact into sub-xacts and distribute them to the appropriate sites, coordinate the termination of the xact (commit or abort at all sites) • Failures: failure of a site, loss of messages, failure of a communication link, network partition
2PC • Protocol: When all sites at which a transaction T executed inform the coordinator that T has completed: – Phase 1: The coord sends <prepare T> to all sites; each site replies with <ready T> or <no T> – Phase 2: The coord sends <commit T> or <abort T> (based on whether all sites replied that they were ready to commit) – All such messages must be logged to stable storage before they are sent out, so that recovery is possible – In some implementations, each site sends an <ack T> msg to the coord; the coord records <complete T> after it receives <ack T> from all the sites (a sketch of the coordinator's side appears after this section) • Handling of failures: – Failure of a participating site: Handling by the coordinator: ∗ If the site failed before replying <ready T>, the coord treats it as a reply of <abort T> ∗ If the site failed after replying <ready T>, the coord ignores the failure and proceeds normally (the site takes care of recovery after it comes back up) Handling by the site: When the site comes back up, it checks its log: ∗ If the log contains no control records for T, execute undo(T) ∗ If <commit T> is in the log, redo (commit) T ∗ If <abort T> is in the log, execute undo(T) ∗ If <ready T> is present in the log, the site must find out the outcome from the coord. If the coord is down, it can ask the other sites. If this info is not available, the site can neither commit nor abort T; it must postpone the decision until it gets the needed info. – Failure of the coord: The participating sites must try to determine the outcome (but this cannot be done in all cases) ∗ If a site has <commit T>, it must redo (commit) T ∗ If a site has <abort T>, it must undo T ∗ If a site does not have <ready T>, it can undo T
∗ Otherwise, the site has <ready T> but no decision record. In this case it must wait for the coord to recover. This is the "blocking problem": if locking is used, other transactions may be forced to wait for the locked data items. – Network Partition: ∗ If the coord and all participants are in the same partition, there is no effect. ∗ Otherwise, the sites in the partition not containing the coord treat the failure as if the coord had failed, while the coord and the sites in its partition treat it as if the sites in the other partition had failed. • To allow a recovered site to proceed, the list of locked items can also be recorded with the <ready T> record in the log. On recovery, those items are relocked, and other xacts can proceed.
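A minimal sketch of the coordinator's side of 2PC. The messaging layer (send, recv and its timeout parameter), the stable_log object and the message tuples are all assumptions for illustration, not a real implementation.

# Coordinator side of two-phase commit (sketch; messaging and logging are stubs).
def two_phase_commit(txn, sites, send, recv, stable_log):
    # Phase 1: ask every participant to prepare.
    stable_log.append(("prepare", txn))            # force the log before sending
    for s in sites:
        send(s, ("prepare", txn))
    votes = [recv(s, timeout=5.0) for s in sites]  # <ready T> / <no T> / timeout

    # Phase 2: commit only if every site voted ready; otherwise abort.
    decision = "commit" if all(v == ("ready", txn) for v in votes) else "abort"
    stable_log.append((decision, txn))             # the decision is now durable
    for s in sites:
        send(s, (decision, txn))

    # Optionally wait for acks and record completion.
    if all(recv(s, timeout=5.0) == ("ack", txn) for s in sites):
        stable_log.append(("complete", txn))
    return decision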
3PC • Tries to avoid blocking in certain cases by having the coordinator inform at least "k" other sites of its decision before acting on it • It is assumed that no network partition occurs and that no more than "k" sites fail, where "k" is a predetermined number • If the coord fails, the remaining sites elect a new coord. The new coord tries to find out whether any site knows the old coord's intentions; if it finds one, it starts the third phase (to commit or abort); if it cannot, the new coord aborts the xact. • If a network partition occurs, it may appear the same as "k" sites failing, and blocking may still occur. • 3PC has overheads, so it is not widely used. • It must also be implemented carefully; otherwise the same xact may be committed in one partition and aborted in another.
Alternative methods of xact processing Using persistent messaging; this requires complicated error handling (e.g., by using compensating xacts). Persistent messaging can be used for xacts that cross organizational boundaries. Implementation of persistent messaging: • Sending-site protocol: The message must be logged to persistent storage within the context of the same xact as the originating xact before it is sent out; on receiving an ack from the receiver, the logged message can be deleted. If no ack is recd., the site retries repeatedly; after a predetermined number of failures, an error is reported to the application (and a compensating xact must be applied). • Receiving-site protocol: On receipt, the receiver must first log the message into persistent storage; duplicates must be rejected; after the xact that logs the message into the log relation commits, the receiver sends an ack (an ack is also sent for duplicates). Deleting received messages at the receiver must be done carefully, since the ack may not have reached the sender and a duplicate may still arrive. Each message can be given a timestamp to deal with this: if the timestamp of a recd. msg. is older than some predetermined cutoff, the msg is discarded as a duplicate, and all recorded messages with timestamps older than the cutoff can be deleted.
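A minimal sketch of the two site protocols, assuming the "messages_to_send" and "received_messages" relations are plain dicts and the cutoff value is made up; in a real system these are relations updated within the same transaction as the application's own updates.

# Persistent messaging sketch (tables, ids and the cutoff are assumptions).
import time

messages_to_send = {}      # msg_id -> (dest, payload), logged by the sender
received_messages = {}     # msg_id -> timestamp of receipt, logged by the receiver
CUTOFF = 3600.0            # discard / purge messages older than this (seconds)

def send_site_write(msg_id, dest, payload):
    # Logged as part of the sending xact; deleted only after an ack arrives.
    messages_to_send[msg_id] = (dest, payload)

def send_site_ack(msg_id):
    messages_to_send.pop(msg_id, None)

def receive(msg_id, payload, ts):
    now = time.time()
    if now - ts > CUTOFF:
        return "discard"                 # too old: must be a duplicate
    if msg_id in received_messages:
        return "ack"                     # duplicate: ack again, do not reprocess
    received_messages[msg_id] = ts       # log the message, then process it
    return "ack"

def purge_old_received():
    # Safe to forget messages older than the cutoff: any resend of them
    # will be discarded by the timestamp check above.
    now = time.time()
    for mid, ts in list(received_messages.items()):
        if now - ts > CUTOFF:
            del received_messages[mid]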
Concurrency Control in Dist. DB • Each site participates in the execution of a commit protocol to ensure global transaction atomicity.
Locking Protocols: • Single Lock-Manager Approach: – A single lock manager (residing at a single chosen site) handles lock and unlock requests for the entire system – A request for a lock is delayed until it can be granted; when it is granted, a message is sent to the site from which the lock request was initiated – An xact can read a replicated data item from any one of the sites at which a replica is available, but all sites holding a replica must be involved in writing it
– Advantages: Simple implementation, simple deadlock handling – Disadvantages: Bottleneck, vulnerability (single point of failure)
• Distributed Lock Manager: – The lock-manager function is distributed over several sites – Each site maintains a lock manager that administers the lock and unlock requests for the data items stored at that site – When a data item is not replicated, this works as in the single lock-manager case – For replicated data, one of the methods below is used – Advantages: Simple implementation; reduces the degree to which the coordinator is a bottleneck; reasonably low overhead, requiring 2 messages for lock requests and 1 for unlock requests – Disadvantages: Deadlock handling is more complex, since the lock/unlock requests are not made at a single site; there may be intersite deadlocks even when there is no deadlock within a single site
• Primary Copy: – A single primary copy is designated for each replicated data item – Lock / unlock requests are always made to the site that holds the primary copy – Handled just like the case of unreplicated data – Advantages: Simple implementation – Disadvantages: Single point of failure (if the site that has the primary copy fails, the data item is inaccessible, even though other sites containing a replica may be accessible)
• Majority Protocol: – If a data item Q is replicated at n different sites, then a lock request must be sent to more than one-half of the n sites at which Q is stored; the transaction proceeds only when more than one-half of those n sites grant the lock; otherwise, it is delayed – Writes are performed on all replicas – The protocol can be extended to deal with site failures (see the majority-based approach under Availability, below) – Advantages: Distributed lock-manager functionality – Disadvantages: More complicated to implement; requires at least 2(n/2 + 1) messages for handling lock requests and at least (n/2 + 1) messages for handling unlock requests; deadlock handling is more complicated - deadlocks can occur even when a single data item is being locked (unless all sites make their requests in the same predetermined order)
• Biased Protocol: – Requests for shared locks are given more favorable treatment – Shared locks: request a lock from any one site that has a replica of Q – Exclusive locks: request locks at all sites that have a replica of Q – Advantages: Less overhead on read operations than the majority protocol; the savings are significant when reads dominate – Disadvantages: Writes require more overhead; the same complexity of deadlock handling as the majority protocol
• Quorum Consensus Protocol: – A generalization of the majority protocol – Each site is assigned a nonnegative weight; read and write operations are assigned 2 integers, called the read quorum Qr and the write quorum Qw – The following conditions must be satisfied: Qr + Qw > S and 2 * Qw > S, where S is the total weight of all the sites at which the data item exists – For read locks, enough replicas must be locked so that their total weight ≥ Qr – For write locks, enough replicas must be locked so that their total weight ≥ Qw
– Advantages: The cost of read or write locking can be selectively reduced by choosing the read and write quorums; by setting appropriate weights, this protocol can simulate the majority and biased protocols • Timestamping: – Each xact is given a unique timestamp that the system uses to decide the serialization order – 2 methods for generating unique timestamps: (1) Centralized, or (2) Distributed (concatenate the site id to the end of the locally unique timestamp, in the least significant position, so that the timestamps generated at one site are not always greater than those generated at other sites) – Handling of faster clocks: Use a logical clock counter; whenever a transaction with timestamp <x,y> visits a site and x is greater than the current value of the local clock counter, set the local clock counter to x + 1 (a similar technique can be used for system-clock-based timestamps)
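A minimal sketch of distributed timestamp generation as just described: pairs (local counter, site id) compared lexicographically, with the counter bumped when a larger remote timestamp is observed. The class and method names are made up for the example.

# Distributed unique timestamps: (local counter, site id).
class TimestampSource:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = 0

    def next(self):
        self.counter += 1
        return (self.counter, self.site_id)   # compared lexicographically

    def observe(self, ts):
        # Called when a transaction with timestamp ts visits this site.
        x, _ = ts
        if x > self.counter:
            self.counter = x + 1              # keep a slow site from lagging forever

s1, s2 = TimestampSource(1), TimestampSource(2)
t = s2.next()            # (1, 2)
s1.observe(t)            # the slower site catches up
assert s1.next() > t     # (3, 1) > (1, 2) lexicographically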
Replication with weak degrees of consistency • Master-slave replication: updates are performed only at a primary site; xacts can read the replicas at any site • Multimaster (update-anywhere) replication • Lazy propagation: updates are propagated to the replicas outside the scope of the xact performing the update, instead of updating all replicas as part of that xact. Two approaches: – Updates at replicas are translated into updates at a primary site, which are then propagated lazily to all replicas. (This ensures that updates to an item are ordered serially, although serializability problems can still occur, since an xact may read an old value of some other data item and use it to perform an update.) – Updates are performed at any replica and propagated to all other replicas. This can cause even more problems, since the same data item may be updated concurrently at multiple sites.
Deadlock Handling • Deadlock can occur if the union of the local wait-for graphs contains a cycle (even though each local wait-for graph is acyclic) • Centralized deadlock detection: A global wait-for graph is maintained at a central site and is updated whenever an edge is inserted into or removed from one of the local wait-for graphs (or periodically, or whenever the coord needs to invoke the cycle-detection algo) • When a cycle is detected, the coord selects a victim to be rolled back and notifies all sites; the sites, in turn, roll back the victim xact • May produce unnecessary rollbacks: – False cycles: the message for adding one edge arrives before the message for removing another – The victim was already going to be aborted: if an xact was to be aborted for reasons other than the deadlock, the deadlock may already have been broken, and there is no need to select (another) victim • Deadlock detection can also be done in a distributed manner, but it is more complicated.
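A minimal sketch of centralized detection: union the local wait-for graphs at the coordinator and look for a cycle in the global graph. The graph encoding and the naive victim choice are assumptions for illustration only.

# Centralized deadlock detection sketch.
def global_wait_for(local_graphs):
    g = {}
    for lg in local_graphs:                      # lg: {txn: set of txns waited for}
        for t, waits in lg.items():
            g.setdefault(t, set()).update(waits)
    return g

def find_cycle(g):
    WHITE, GRAY, BLACK = 0, 1, 2
    color, stack = {t: WHITE for t in g}, []
    def dfs(u):
        color[u] = GRAY
        stack.append(u)
        for v in g.get(u, ()):
            if color.get(v, WHITE) == GRAY:
                return stack[stack.index(v):]    # the cycle found
            if color.get(v, WHITE) == WHITE and (c := dfs(v)):
                return c
        color[u] = BLACK
        stack.pop()
        return None
    for t in list(g):
        if color[t] == WHITE and (c := dfs(t)):
            return c
    return None

site1 = {"T1": {"T2"}}
site2 = {"T2": {"T3"}, "T3": {"T1"}}             # no local cycle at either site
cycle = find_cycle(global_wait_for([site1, site2]))
print(cycle)                                     # ['T1', 'T2', 'T3']
victim = cycle[0] if cycle else None             # roll back and notify all sites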
Availability Multiple links can be used between sites; however, all of them may still fail, so there are cases where we cannot distinguish between site failure and network partition. Care must be taken to avoid these situations: 2 or more central servers being elected in distinct partitions, and more than one partition updating a replicated data item • Majority-based Approach (a sketch appears after this list): – Each data item stores with it a version number to detect when it was last written; the version number is updated on every write – If a data item is replicated at n sites, then a lock request must be sent to more than one-half of those n sites, and the xact does not proceed until it has obtained locks from a majority of them – Read operations look at all replicas on which a lock has been obtained and read the value with the highest version number (the sites with lower-numbered versions can be informed of the new version)
– Write ops write to all replicas that have been locked; the version number is set to one more than the highest version number among them – This works even when a failed site comes back up (it will be told about its stale data). Site reintegration is trivial - nothing needs to be done, since writes update a majority while reads read a majority of the replicas and thus find at least one replica with the highest version. – The same version-numbering technique can be used with quorum consensus to make it work in the presence of failures; however, failures may prevent xacts from proceeding if some failed sites were given higher weights.
• Read One, Write All Approach: – Unit weights for all sites, read quorum = 1 and write quorum = n (all sites) – No need for version numbers, since a write cannot happen if even one site has failed – To allow this to work in the presence of failures, we could use "read one, write all available" - but several complications can arise in the case of network partitions or temporary site failures (a site that was down will not know about missed writes and may have to explicitly catch up). Inconsistencies can arise in the case of network partitions.
• Site Reintegration: The recovering site must ensure that it gets the latest values and, in addition, must continue to receive updates as it is recovering. An easy solution is to halt the entire system temporarily, but this is usually not feasible. Recovery of a link must likewise be made known to all sites.
• Comparison with Remote Backup: In remote backup, concurrency control and recovery are performed at a single site (the overheads of 2PC are avoided); only data and log records are shipped across, and transaction code runs at only one site. Remote backup systems offer a lower-cost approach to availability than replication. On the other hand, replication can provide greater availability by having multiple replicas and using the majority protocol.
• Coordinator Selection: – When there is not enough information available to continue from the failed coord, the backup coord can abort all (or several) current xacts and restart them under its own control – Bully algorithm: a site that does not hear from the coord within a predetermined interval sends election messages and tries to elect itself the new coord; if it does not hear from any higher-numbered site within another predetermined time interval, it becomes the coord, otherwise it waits for that site's announcement and restarts the election algo if none arrives.
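A minimal sketch of the majority-based approach with version numbers (referenced above). Replicas are plain dicts and the quorum is always the first majority, which is only a toy simplification: correctness relies on the fact that any two majorities of the same replica set overlap.

# Majority-based replication with version numbers (sketch).
replicas = [{"value": None, "version": 0} for _ in range(5)]     # 5 sites

def majority(sites):
    return sites[: len(sites) // 2 + 1]      # toy: always the first majority

def read(item_sites):
    quorum = majority(item_sites)
    best = max(quorum, key=lambda r: r["version"])   # highest version wins
    # (stale members of the quorum could be refreshed here)
    return best["value"], best["version"]

def write(item_sites, value):
    quorum = majority(item_sites)
    new_version = max(r["version"] for r in quorum) + 1
    for r in quorum:                                 # all locked replicas
        r["value"], r["version"] = value, new_version

write(replicas, "x=42")
print(read(replicas))        # ('x=42', 1) even though the other replicas are stale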
Distributed Query Processing Must take into account: the cost of data transmission over the network, the cost of disk accesses, and the potential gain in performance from having several sites process parts of the query in parallel • Query Transformation: Choose the replica for which the transmission cost is lowest; can make use of the fact that a selection only fetches tuples from a (fragmented) replica • Simple Join Processing (e.g., r1 at S1, r2 at S2, r3 at S3, result needed at S1): – Ship copies of all relations to S1 and compute the join there – Ship r1 to S2, compute the join there, ship the result to S3, join it with r3, and ship the final result to S1 (or with the roles interchanged) – Need to worry about the volume of data being shipped; also, indices may have to be re-created at the destination site • Semijoin Strategy: To compute r1 ⋈ r2, ship ΠR1∩R2(r1) to S2, compute temp = r2 ⋈ ΠR1∩R2(r1), ship temp back to S1, and compute r1 ⋈ temp. Semijoin: r1 ⋉ r2 = ΠR1(r1 ⋈ r2); that is, the semijoin selects those tuples of r1 that contribute to r1 ⋈ r2 (a sketch follows) • Join Strategies that exploit parallelism: a pipelined-join technique can be used, for example, for r1 ⋈ r2 ⋈ r3 ⋈ r4
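A toy illustration of the semijoin strategy above; the relations, the "branch" join attribute, and the site labels in the comments are made up for the example.

# Semijoin strategy sketch for computing r1 ⋈ r2 with r1 at S1 and r2 at S2.
r1 = [{"branch": "Hillside", "assets": 9.0}, {"branch": "Downtown", "assets": 4.5}]
r2 = [{"branch": "Hillside", "acc_no": "A-101"}]

# Step 1 (at S1): project r1 on the join attributes and ship the projection.
proj = [{"branch": t["branch"]} for t in r1]

# Step 2 (at S2): temp = r2 ⋈ proj  (the semijoin of r2 with r1 -- only the
# r2 tuples that will contribute to the final join are shipped back).
temp = [t for t in r2 if any(t["branch"] == p["branch"] for p in proj)]

# Step 3 (at S1): final result = r1 ⋈ temp.
result = [{**a, **b} for a in r1 for b in temp if a["branch"] == b["branch"]]
print(result)    # [{'branch': 'Hillside', 'assets': 9.0, 'acc_no': 'A-101'}]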
Heterogeneous Dist. DB • Unified view of data: Difficult because of differences in endianness, ASCII v/s EBCDIC, units of measurement, and strings in different languages ("Cologne" v/s "Koln") • Query Processing: Wrappers map the local schema to the global schema and back; mediator systems do not concern themselves with xact processing
LDAP • Can be used for storing bookmarks, browser settings, etc. • Provides a simple mechanism to name objects in a hierarchical fashion • RDN=value pairs are concatenated to form the full distinguished name • Querying consists of just selections and projections; no joins • Distributed Directory Trees: A node in a DIT (directory information tree) may contain a referral to a node in another DIT; this is how distributed trees are supported • Many LDAP implementations support master-slave and multimaster replication, even though replication is not part of the current std.
Parallel Databases Parallelism is used to: speedup (queries are executed faster because more resources, such as processors and disks, are provided) and scaleup (increasing workloads are handled without increased response time, via an increase in the degree of parallelism)
I/O Parallelism • Horizontal Partitioning: The tuples of a relation are divided (or declustered) among many disks, so that each tuple resides on one disk • Partitioning Techniques (a sketch of the three techniques follows this list): – Round-robin: the ith tuple goes to disk D(i mod n); ensures an even distribution of tuples across disks (each disk has approx. the same number of tuples) – Hash partitioning: hashing is on the chosen partitioning attributes of the tuples; if the hash function returns i, the tuple is placed on disk Di – Range partitioning: contiguous attribute-value ranges are assigned to the disks based on a partitioning attribute • Comparison of the partitioning techniques based on the access pattern (scanning the entire relation, point queries, range queries): – Round-robin: good for a sequential scan of the entire relation; bad for range and point queries (since each of the n disks must be searched) – Hash partitioning: best for point queries; also suited for sequential scans of the entire relation (since a good hash function ensures that the data are evenly distributed); not good for range queries (since all disks must be searched) – Range partitioning: suited for range as well as point queries; point queries can be answered by looking at the partition vector. Range partitioning gives higher throughput while maintaining good response time when a queried range maps to one disk (only a few tuples in the queried range - the other disks remain free for other queries). On the other hand, when many tuples are to be fetched from a few disks, this can result in an I/O bottleneck (hotspot) at those disks. • The choice of partitioning affects other operations such as joins; in general, range or hash partitioning is preferred to round-robin. • If a relation consists of m disk blocks and there are n disks, then the relation should be allocated to min(m, n) disks (relations that fit in a single block are best placed on a single disk). • Handling of skew: attribute-value skew and partition skew – Attribute-value skew: all tuples with the same value for the partitioning attribute end up in the same partition; can occur regardless of whether range or hash partitioning is used. – Partition skew: load imbalance in the partitioning even when there is no attribute skew; range partitioning may result in partition skew if the partition vector is not chosen carefully; partition skew is less likely with hash partitioning if a good hash function is chosen. • The loss of speedup due to skew increases with the degree of parallelism • Techniques to overcome skew: – Balanced range-partitioning vector: sort by the partitioning attribute and distribute equally (1/n to each partition); can still result in skew; also, the cost of sorting is high
– Use a histogram to reduce skew further – Use the concept of virtual processors: map virtual processors to real processors in round-robin fashion
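A minimal sketch of the three partitioning techniques above; the number of disks and the range-partition vector are made-up values.

# Round-robin, hash and range partitioning of tuples across n disks (sketch).
n = 4

def round_robin(i):                 # i = position of the tuple in the relation
    return i % n

def hash_partition(tuple_key):
    return hash(tuple_key) % n      # any hash function on the partitioning attr

partition_vector = [100, 200, 300]  # boundaries for range partitioning (assumed)

def range_partition(key):
    for disk, bound in enumerate(partition_vector):
        if key < bound:
            return disk
    return len(partition_vector)    # the last partition takes the tail

print([round_robin(i) for i in range(6)])          # [0, 1, 2, 3, 0, 1]
print(range_partition(150), range_partition(999))  # 1 3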
Interquery Parallelism • Different queries or xacts execute in parallel with one another • Xact throughput can be increased, but the response times of individual xacts are no faster than they would be if the xacts were run in isolation • The primary use of interquery parallelism is to scale up a xact-processing system to support a larger number of xacts per second • Not useful for speeding up long-running tasks, since each task is still executed sequentially • The easiest form to support (esp. in a shared-memory parallel system) • Cache-coherency problem: can be solved by locking a page in memory before any read or write access and flushing the page to the shared disk before releasing the lock • Another way is to read the latest value of the page from the buffer pool of some other processor
Intraquery Parallelism • Execution of a single query in parallel on multiple processors and disks • Useful for speeding up long-running queries • Intraoperation Parallelism: Speed up processing of a query by parallelizing the execution of each individual operation (such as sort, select, project, and join) • Interoperation Parallelism: Speed up processing by executing in parallel the different operations in a query expression • Both can be used simultaneously on a query • Since the number of operations in a typical query is small compared to the number of tuples processed by each operation, intraoperation parallelism scales better with increasing parallelism; however, with a relatively small number of processors, both forms of parallelism are important
Intraoperation Parallelism: • Parallel Sort using Range-Partitioning Sort (a sketch follows this list): – Range-partition the data on the sorting attribute and send each range to its respective processor – Each processor sorts its range locally – The final merge is trivial, since the range partitioning in the first phase ensures that all key values at processor Pi are less than those at Pj for all i < j • Parallel Sort using Parallel External Sort-Merge: – Each processor locally sorts the data on its disk – The sorted runs are then merged much as in external sort-merge: merging can be done by range-partitioning the sorted data at each processor and having each processor send the values in each partition to the corresponding processor. This can result in execution skew, where each receiving processor becomes a hotspot when it is its turn to receive tuples; to avoid this, each processor sends the first block of every partition, then the second block of every partition, and so on, so that all processors receive data in parallel. • Parallel Join using Partitioned Join: – Works only for equi-joins and natural joins – Range partitioning or hash partitioning can be used to partition the 2 relations (r and s) to be joined – Each processor performs the join of the ith partitions of r and s – To prevent skew, the range-partitioning vector must be such that the sum of the sizes of ri and si is roughly equal over all i • Parallel Join using Fragment-and-Replicate Join: – Works for any kind of join
– Asymmetric frag-and-rep join: Fragment one of the relations "r" using any partitioning technique; replicate the other relation "s" to all processors. Each processor performs the join of ri and s using any join technique. – (Symmetric) frag-and-rep join: Fragment both relations using any partitioning technique; the partitions need not be of the same size. Each processor performs the join of ri and sj. – The asymm. frag-and-rep join is useful when one of the relations ("s") is small, so it can be replicated to all processors.
• Parallel Join using Partitioned Parallel Hash Join: – Hash-partition each relation and send each partition to its respective processor – As each processor receives its tuples, it performs a local hash join (build and probe) – A hybrid hash join could also be used locally to cache the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
• Parallel Join using Parallel Nested-Loop Join: – The asymm. frag-and-rep. technique can be used along with an indexed nested-loop join at each processor – The indexed nested-loop join can be overlapped with the distribution of the tuples of "s", to reduce the cost of writing the tuples of "s" to disk and reading them back.
• Other operations: – Selection on a range can proceed in parallel at each processor whose range partition overlaps the specified range of values in the selection. – Duplicate elimination can be parallelized using the parallel sorting technique, or by hash / range partitioning and eliminating duplicates locally at each processor. – Projection w/o duplicate elimination can be done as tuples are read in from the disks in parallel. – Aggregation can be done in parallel by partitioning on the grouping attributes and then computing the aggregates locally at each processor (either range or hash partitioning can be used); the cost of transferring tuples can be reduced by partly computing the aggregate values before partitioning (and then partitioning as before).
• Cost of Parallel Evaluation of Operations: start-up costs, skew, contention for resources, cost of assembling the final result. TotalTime = Tpart + Tasm + max(T0, T1, ..., Tn−1), where Tpart is the time for partitioning the relations and Tasm the time for assembling the result. A partitioned parallel evaluation is only as fast as the slowest of the parallel executions.
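The range-partitioning parallel sort referenced earlier in this list can be pictured with a small simulation. Threads stand in for processors and in-memory lists for the interconnect; the boundary values and data are made up.

# Range-partitioning parallel sort (sketch): partition by sort key, sort each
# partition "at its processor", then simply concatenate -- no final merge needed.
from concurrent.futures import ThreadPoolExecutor

def parallel_range_sort(values, boundaries):
    # boundaries like [100, 200] give 3 partitions: <100, 100..199, >=200
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        i = sum(v >= b for b in boundaries)      # index of the target partition
        parts[i].append(v)
    with ThreadPoolExecutor() as pool:           # each partition sorted in parallel
        sorted_parts = list(pool.map(sorted, parts))
    out = []
    for p in sorted_parts:                       # concatenation preserves order
        out.extend(p)
    return out

data = [250, 17, 199, 3, 101, 420]
print(parallel_range_sort(data, [100, 200]))     # [3, 17, 101, 199, 250, 420]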
Interoperation Parallelism: • 2 forms: Pipelined Parallelism and Independent Parallelism • Pipelined: Major advantage is that the intermediate results are not written to disk; they are just fed to the other processors in the pipeline • Independent: r1 join r2 can be computed independently of r3 join r4; has lower degree of parallelism
Query Optimization: • Avoid long pipelines (resources are hoarded, and it takes time for the first input to reach the last processor in the pipeline) • The advantage of parallelism can be negated by the overhead of communication • Heuristic 1: Consider only evaluation plans that parallelize every operation across all processors and that do not use any pipelining • Heuristic 2: Exchange-operator model: exchange operators can be introduced into an evaluation plan to transform it into a parallel evaluation plan A large parallel database system must also address these availability issues: resilience to the failure of some processors or disks; online reorganization of data and schema changes. Online index construction: should not lock the entire relation in shared mode, as is usually done; instead, it should keep track of updates that occur while it is active and incorporate those changes into the index being constructed.
XML
• Self-documenting (because of the presence of tags) • The format of the document is not rigid (e.g., extra tags are allowed) • XML allows nested structures • A wide variety of tools is available • XML Schema Definitions: – DTD (Document Type Definition) – XSD (XML Schema Definition) – Relax NG • DTD: – ELEMENT, ATTLIST, default values supported, #PCDATA, (+, *, ? for repetitions), empty and any – Attributes can be #REQUIRED or #IMPLIED – ID, IDREF and IDREFS for uniqueness, references and lists of references (e.g., "owns") – Limitations of DTD: ∗ Text elements and attributes cannot be constrained to be of specific types ∗ Difficult to specify unordered sets of subelements ∗ Lack of typing in IDs and IDREF or IDREFS ∗ No support for user-defined types • XSD: – Specified in XML syntax – Support for type checking (simple as well as user-defined types, via complexType and sequence) – Specification of keys and key references using xs:key and xs:keyref – Benefits over DTD: ∗ Text can be constrained to specific types or sequences ∗ Allows user-defined types ∗ Allows uniqueness and foreign-key constraints ∗ Integrated with namespaces to allow different parts of a document to conform to different schemas ∗ Allows maximum and minimum value checking ∗ Allows complex types to be inherited through a form of inheritance
Querying and Transformation: • XPath, XQuery (FLWOR expressions), XSLT • XPath: – Nodes are returned in the same order as they appear in the document – @ is used for attributes – /bank/account[bal > 400]/@account_no – the count function counts the nodes matched – the | operator takes the union of results – // skips multiple levels of nodes, .. specifies the parent – the function doc(name) allows looking into the named document (e.g., doc("bank.xml")/bank/account) • XQuery: – Uses XPath and is based on XQL and XML-QL – Uses FLWOR expressions: for, let, where, order by, return – the for clause is like "from" in SQL – the return clause treats its content as text to be output, except for expressions enclosed in { }, which are evaluated – return can contain nested queries – User-defined functions and types are allowed – some and every can be used for testing existential and universal quantification • XSLT: – Templates are used – "match" and "xsl:value-of select" are used – xsl:key and xsl:sort
• API for XML Processing: DOM (Document Object Model) and SAX (Simple API for XML) – SAX is useful when the application needs to create its own data representation of the data read from the XML document
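A tiny DOM-style example using Python's ElementTree (its XPath support is limited, so the balance comparison from the XPath example above is done in Python); the bank/account document is made up to mirror that example.

# DOM-style XML processing with ElementTree.
import xml.etree.ElementTree as ET

doc = """<bank>
  <account account_no="A-101"><bal>500</bal></account>
  <account account_no="A-215"><bal>300</bal></account>
</bank>"""

root = ET.fromstring(doc)
rich = [a.get("account_no")
        for a in root.findall("./account")        # simple path expression
        if int(a.findtext("bal")) > 400]
print(rich)                                       # ['A-101']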
Storage of XML Data • Non-relational data stores: Flat files (suffer from lack of support for atomicity, transactions, concurrency, data isolation and security) or a special-purpose XML database • Relational databases: – Store as a string: the database does not know the schema of the stored elements, so searching is inefficient; additional fields may be stored (at the cost of redundancy) for indexing; function indices can also be used – Tree representation: nodes(id, type, label, value) and child(child_id, parent_id); a position column can be added if order must be preserved; many XML queries can be converted to relational ones; the disadvantage is the large number of joins – Map to relations: attributes are stored as string-valued attributes of the relation; a subelement of simple type is added as an attribute of the relation; otherwise it becomes a separate relation with a parent_id column; a position column can be added to record order; subelements that can occur at most once can be "flattened" into the parent relation by moving all their attributes into the parent relation – Publishing and shredding XML data: "publishing" means "to XML from relational"; "shredding" means "to relational from XML" ∗ Publishing: an XML element for every tuple, with every column of the relation as a subelement of that XML element (more complicated when nesting is wanted) ∗ Shredding: similar to "Map to Relations" – Native storage within a relational database: using CLOBs and BLOBs; a binary representation of the XML can be stored directly as a BLOB; some dbs provide an xml data type; XQuery can be executed on the XML document within a row, with a SQL query iterating over the required rows – SQL/XML: XML extensions to SQL; xmlelement, xmlattributes, xmlforest, xmlagg, xmlconcat
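A minimal sketch of shredding into the tree representation above. The column layouts nodes(id, type, label, value) and child(child_id, parent_id, position) come from the notes; the traversal code and id assignment are assumptions for illustration.

# Shredding an XML document into nodes/child rows (sketch).
import xml.etree.ElementTree as ET

def shred(xml_text):
    nodes, child = [], []
    counter = 0
    def visit(elem, parent_id):
        nonlocal counter
        my_id = counter; counter += 1
        nodes.append((my_id, "element", elem.tag, (elem.text or "").strip()))
        if parent_id is not None:
            position = len([c for c in child if c[1] == parent_id])
            child.append((my_id, parent_id, position))
        for sub in elem:
            visit(sub, my_id)
    visit(ET.fromstring(xml_text), None)
    return nodes, child

nodes, child = shred("<bank><account><bal>500</bal></account></bank>")
print(nodes)   # [(0,'element','bank',''), (1,'element','account',''), (2,'element','bal','500')]
print(child)   # [(1, 0, 0), (2, 1, 0)]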
XML Applications: • Storing Data with complex structure (such as bookmarks) • Standardized Data Exchange Formats (e.g., ChemML, RosettaNet) • Web Services (SOAP) - Web Services provide a RPC call interface with XML as the mechanism for encoding parameters as well as results • Data Mediation (collecting data from various web sites / sources and presenting a single XML view to the user; e.g., showing user’s bank account details from various banks)
Additional Research Papers Mention about tree algo here...keeping id and pre-order numbering, etc.
Advanced Transaction Processing TP-Monitor Architectures: • Process-per-client model • Single-server model (multithreaded; a bug in one app can affect all other apps; not suited for parallel or distributed databases) • Many-server, single-router model (PostgreSQL, Web apps) • Many-server, many-router model (very high performance web systems, Tandem Pathway)
Main Memory DB: Since disk I/O is often the bottleneck for reads/writes, we can make the db system less disk bound by increasing the size of the database buffer. Since memory sizes are increasing and costs are decreasing, an increasing number of apps can be expected to have their data fit into main memory. Larger main memories allow faster processing of transactions, since data are memory resident. But there are still disk-related limitations: • Log records must be written to stable storage (the logging process can become a bottleneck); NVRAM or group commit can be used to reduce the overhead imposed by logging. • Buffer blocks marked as modified by committed xacts still have to be written out, so that the amount of log that needs to be replayed at recovery time is reduced. • After a crash, even once recovery is complete, it takes some time before the db is fully loaded into main memory. Opportunities for optimization: • Data structures with pointers can be used across pages (unlike those on disk) • There is no need to pin pages in memory before they are accessed, since buffer pages will never be replaced • Query-processing techniques should be designed to minimize space overhead (otherwise main-memory limits may be exceeded and swapping will slow down query processing) • Operations such as locking and latching may become bottlenecks and should be improved • Recovery algos can be optimized, since pages rarely need to be written out to make space for other pages. "Group commit" reduces the overhead of logging by delaying the write of log records to stable storage until a batch of commits is ready; it results in a slight delay in the commit of transactions that perform updates.
Thus, it appears that the enforcement of xact atomicity must either lead to an increased probability of long-duration waits or create a possibility of cascading rollback. • Concurrency Control: – Correctness may be achievable without serializability – Could split db into sub-dbs on which concurrency can be managed separately – Could use concurrency techniques that exploit multiple versions • Nested and Multilevel xacts: – A long-duration xact may be viewed as a set of sub xacts – If a sub xact of T is permitted to release locks on completion, T is called a “multilevel xact” – If locks held by a sub xact of T are automatically assigned to T on completion of the sub xact, it is called “nested xact” • For large data items: – Difficult to store both old and new values; therefore, we can use the concept of logical logging – Shadow-copy technique can be used to keep copies of pages that have been modified
Xact Mgmt in Multidatabases (see Practice Exercises 25.5 and 25.8) • Strong correctness • Two-Level Serializability (2LSR): ensure serializability at 2 levels:
• Each local db system ensures local serializability among its local xacts, including those that are part of a global xact
• The multidb system ensures serializability among the global xacts alone, ignoring the orderings induced by the local xacts
Real-Time Xact Systems: Systems with deadlines are called "real-time systems". • Hard deadline: Serious problems, such as a system crash, may occur if a task is not completed by its deadline. • Firm deadline: The task has zero value if it is completed after its deadline. • Soft deadline: The task has diminishing value the later it completes after its deadline. • Pre-emption of locks or rolling back an xact may be required • Variance in xact execution time (disk access v/s in-memory, locking, xact aborts, etc.) makes it difficult to support real-time constraints.
Long-Duration Xacts: • Properties: – Long duration (human interaction) – Exposure of uncommitted data – Subtasks: the user may want to abort a subtask without rolling back the entire xact – Recoverability: aborting a long-duration interactive xact because of a system crash is unacceptable – Performance: fast response time is expected, in contrast to throughput (number of xacts per second) • Nonserializable executions: enforcement of serializability causes problems for long-duration xacts – 2PL: longer waiting times (since locked data items are not released until no further data items need to be locked), which in turn lead to longer response times and an increased chance of deadlock – Graph-based protocols: an xact may have to lock more data than it needs; long-duration lock waits are likely to occur – Timestamp-based protocols: no waiting for locks, but an xact can get aborted, and the cost of aborting a long-duration xact may be prohibitive – Validation protocols: the same issue as for timestamp-based protocols
Protocols ensuring 2LSR in a multidatabase: • Global-read protocol: allows global xacts to read, but not to update, local data items, while disallowing all access to global data by local xacts – Local xacts access only local data items – Global xacts may access global data items and may read local data items (though they must not write local data items) – There are no consistency constraints between local and global data items • Local-read protocol: allows local xacts to read global data, but disallows all access to local data by global xacts – Local xacts may access local data items and may read global data items stored at their site (though they must not write global data items) – Global xacts access only global data items – No xact may have a value dependency (an xact has a value dependency if the value that it writes to a data item at one site depends on a value that it read for a data item at another site) • Global-read-write / local-read protocol: the most generous; allows global xacts to read and write local data, and allows local xacts to read global data – Local xacts may access local data items and may read global data items stored at their site (though they must not write global data items) – Global xacts may read and write global as well as local data items – There are no consistency constraints between local and global data items – No xact may have a value dependency (as defined above) Ticket-based schemes can also be used.
Data Warehousing • A data warehouse is a repository of data gathered from multiple sources and stored under a common, unified database schema. • 2 types of db apps: Transaction Processing and Decision Support • Transaction Processing: record information about transactions • Decision Support (DSS): aims to get high-level information out of the detailed information stored in transaction-processing systems and to use that high-level information to make decisions • Issues related to DSS: – OLAP deals with tools and techniques that can give nearly instantaneous answers to queries requesting summarized data, even though the database may be extremely large – Database query languages are not suited to performing detailed statistical analyses of data (SAS and S++ do much better) – For performance as well as organizational-control reasons, data sources may not permit other parts of the organization to retrieve their data directly; a DW gathers data from multiple sources under a unified schema at a single site – Data Mining combines knowledge discovery with efficient implementations that can be used on extremely large databases • Measure attributes: those that can be measured and aggregated (e.g., price, quantity sold) • Dimension attributes: the other attributes; these are the dimensions on which the measure attributes, and summaries of the measure attributes, are viewed • Multidimensional data: data that can be modeled as dimension attributes and measure attributes • Cross-tabulation (aka cross-tab or pivot table): a table where values of one attribute form the row headers, values of another attribute form the column headers, and the cell values are some aggregate - for example, a table with item name as row headers, color as column headers, and the sum of quantity sold (over all sizes) in the cells • A change in the data may result in more columns being added to the cross-tab (e.g., when an item with a new color is added to the data, it appears as a new column) • Data Cube: generalization of a cross-tab to "n" dimensions • For a table with n dimensions, aggregation can be performed with grouping on each of the 2^n subsets of the n dimensions (a sketch of computing all 2^n groupings follows this list); grouping on the set of all n dimensions is useful only if the table may contain duplicates • Operations on a data cube: – Pivoting: changing the dimensions used in a cross-tab – Slicing / Dicing: viewing the data cube for a particular value of a dimension (called dicing particularly when the values of multiple dimensions are fixed) – Rollup: moving from finer to coarser granularity (e.g., rolling a table up on the size attribute); Drill Down: moving from coarser to finer granularity • Hierarchy on dimensions: Date/Time (with Hour of day), Date, Month (and Day of week), Quarter, Year • OLAP Implementation: – MOLAP: the OLAP cube is stored in multidimensional arrays – ROLAP: Relational OLAP (data stored in a relational database) – HOLAP: Hybrid OLAP (some data in multidimensional arrays and some in a relational database) • Simple optimization: compute an aggregate from an already computed, finer-grained aggregate instead of from the original relation; this does not work for non-decomposable aggregate functions such as "median" • For n dimension attributes, there can be 2^n groupings • SQL:1999 constructs: – rank, dense_rank, stddev, variance, group by cube, group by rollup, percent_rank, ntile, cume_dist – Windowing: rows unbounded preceding; rows between 1 preceding and 1 following; range between 10 preceding and current row
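A toy sketch of computing the aggregates for all 2^n groupings of a relation (the essence of "group by cube"); the sales rows, dimension names and measure are made up for the example.

# All 2^n groupings of a tiny sales relation.
from itertools import combinations
from collections import defaultdict

sales = [("shirt", "blue", "S", 10), ("shirt", "red", "S", 7), ("pant", "blue", "M", 5)]
dims = ("item", "color", "size")            # measure = quantity (last column)

cube = {}
for k in range(len(dims) + 1):
    for group in combinations(range(len(dims)), k):      # one of the 2^n subsets
        agg = defaultdict(int)
        for row in sales:
            key = tuple(row[i] for i in group)
            agg[key] += row[3]
        cube[tuple(dims[i] for i in group)] = dict(agg)

print(cube[()])                    # {(): 22}  -- the grand total
print(cube[("item",)])             # {('shirt',): 17, ('pant',): 5}
print(cube[("item", "color")])     # {('shirt', 'blue'): 10, ('shirt', 'red'): 7, ('pant', 'blue'): 5}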
Example: Finding the cumulative balance in an account, given a relation specifying the deposits and withdrawals on an account
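The cumulative-balance example amounts to a running sum ordered by date, which is what the windowing clause "sum(amount) over (order by date rows unbounded preceding)" computes; the sketch below does the same with a running sum over made-up deposit/withdrawal rows.

# Cumulative account balance as a running sum (sketch; data is made up).
from itertools import accumulate

txns = [("2024-01-01", +500), ("2024-01-03", -120), ("2024-01-07", +75)]
txns.sort(key=lambda t: t[0])                        # order by date
balances = list(accumulate(amount for _, amount in txns))
for (date, amount), bal in zip(txns, balances):
    print(date, amount, bal)                         # running balance: 500, 380, 455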
Data Warehousing: • When and what data to gather: source-driven / destination-driven (push / pull) • What schema to use • Data transformation and cleansing: merge-purge, deduplication, householding, other transformations such as units of measurement • How to propagate updates: the same as the view-maintenance problem • What data to summarize • ETL (Extract, Transform, Load) • Warehouse schemas: – Fact tables: tables containing the multidimensional data – Dimension tables: used to minimize storage requirements (the fact table stores foreign keys that are looked up in the dimension tables) – Star schema, Snowflake schema • Components of a DW: data loaders, DBMS, query and analysis tools (+ the data sources)
Data Mining: • Classifiers: – Decision-tree classifiers – Bayesian classifiers (easier to construct than decision-tree classifiers and work better in the case of null or missing attribute values) • Other types of data mining: clustering, text mining, data visualization • TODO: Details about classifiers here
Advanced App Dev • Benchmarks are standardized sets of tasks that help to characterize the performance of db systems. They help to get a rough idea of the hardware and software requirements of an app, even before the app is built. • Tunable parameters at 3 levels: – Hardware level: CPU, memory, adding disks or using RAID – DB system params: buffer sizes, checkpointing intervals – Higher level: schema (indices), transactions. These must be considered together; tuning at one level may create a bottleneck at another (e.g., tuning at a higher level may shift the bottleneck to the CPU) • Tuning of hardware: – For today's disks, the average access time is about 10 ms and the avg. transfer rate about 25 MB/s – A reduction of 1 I/O per second saves: (price per disk drive) / (accesses per second per disk) – Storing a page in memory costs: (price per MB of memory) / (pages per MB of memory) – Break-even point: a page accessed n times per second is worth keeping in memory when n * (price per disk drive) / (accesses per second per disk) = (price per MB of memory) / (pages per MB of memory)
– This gives the 5-minute rule (for randomly accessed pages) – For sequentially accessed data, we get a 1-minute rule – RAID 5 is much slower than RAID 1 on random writes: RAID 5 requires 2 reads and 2 writes to execute a single random write – If an app performs r random reads and w random writes per second, RAID 5 will require r + 4w I/O ops per second, whereas RAID 1 will require r + w I/O ops per second (a worked example follows the tuning list below) – Taking the performance of a current disk as 100 I/Os per second, we can find the number of disks required (e.g., (r+w)/100 for RAID 1); this is often more than enough disks to hold 2 copies of all the data, and for such apps RAID 1 actually requires fewer disks than RAID 5
– RAID 5 is useful only when the data storage requirements are large and the data transfer and I/O rates are small (that is, for very large and very "cold" data)
• Tuning of schema: use denormalized relations or materialized views
• Tuning of indices: – Removing indices may speed up updates – For range queries, B+-tree indices are preferable to hash indices – If most queries and updates benefit from clustering, clustered indices could be used
• Materialized Views: using deferred view maintenance reduces the burden on updates
• Automated tuning of physical design: – Greedy heuristics: estimate the benefit of materializing different indices / views and the cost of maintaining them – Choose the candidate that provides the maximum benefit per unit storage space – Once it has been chosen, recompute the costs and benefits of the other candidate indices / views – Continue the process until the space available for storing the materialized indices / views is exhausted or the cost of maintaining the remaining candidates exceeds the benefit to the queries that could use them
• Tuning of transactions: – Improve set orientation (e.g., by using proper group by or stored procedures) – Reduce lock contention (maybe use weaker levels of consistency) – Minibatch transactions
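A worked version of the RAID arithmetic referenced above, with assumed workload numbers (r = 800 random reads/s, w = 300 random writes/s) and the 100 I/Os per second per disk figure from the notes.

# RAID 1 vs RAID 5 disk counts for a random-I/O workload (numbers assumed).
r, w, iops_per_disk = 800, 300, 100

raid5_ios = r + 4 * w          # each random write costs 2 reads + 2 writes
raid1_ios = r + w              # figure used in the tuning notes above

disks_raid5 = -(-raid5_ios // iops_per_disk)    # ceiling division
disks_raid1 = -(-raid1_ios // iops_per_disk)
print(disks_raid5, disks_raid1)                 # 20 vs 11 disks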
Performance Benchmarks Use the harmonic mean of the throughputs when a workload contains different xact types: overall throughput = n / (1/t1 + 1/t2 + ... + 1/tn), where ti is the throughput on the ith xact type alone
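A tiny helper for the harmonic-mean formula above; the throughput numbers are made up.

# Harmonic mean of per-type throughputs (three xact types measured at
# 90, 60 and 30 xacts per second when run alone).
def harmonic_mean(throughputs):
    return len(throughputs) / sum(1.0 / t for t in throughputs)

print(round(harmonic_mean([90, 60, 30]), 1))   # 49.1 xacts/sec, well below the
                                               # 60 an arithmetic mean would suggest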
• TPC-A: A single xact type that models cash withdrawal and deposit at a bank teller (not widely used any more) • TPC-B: Same as TPC-A, but focuses only on the back-end db server (not widely used any more) • TPC-C: A more complex system model: order entry, etc. • TPC-D: For decision support (a scale factor is used - scale factor 1 represents the benchmark on a 1 GB db)
• TPC-R: The db is permitted to use mat. views and other redundant info • TPC-H: Ad hoc (prohibits mat. views and other redundant info) • TPC-W: Web commerce (the performance metric is WIPS - web interactions per second) • App migration: big-bang approach v/s chicken-little approach
Spatial Data • Nearness Queries and Region queries (inside a region, etc.) • Hash joins and sort-merge joins cannot be used on spatial data
Indexing of Spatial Data • k-d Trees: – Partitioning is done along one dimension at the nodes at the top level of the tree, along another dimension at the nodes at the next level, and so on, cycling through all the dimensions – One half of the points go into one partition and one half into the other – Partitioning stops when a node has fewer than a given maximum number of points – Each partitioning line corresponds to a node in the k-d tree – The k-d-B tree extends the k-d tree to allow multiple child nodes for each internal node (just as a B-tree extends a binary tree), to reduce the height of the tree; k-d-B trees are better suited to secondary storage than k-d trees • Quadtrees: – Each node of a quadtree is associated with a rectangular region of space – Each non-leaf node divides its region into 4 equal-sized quadrants • R-trees: – A rectangular bounding box is associated with each tree node – The bounding boxes of siblings may overlap (unlike in B+-trees, k-d trees and quadtrees) – A search for objects containing a given point has to follow all child nodes whose bounding boxes contain the point – The storage efficiency of R-trees is better than that of k-d trees or quadtrees, since an object is stored only once; however, searching is less efficient, since multiple paths may have to be followed. In spite of this, R-trees are popular (because of their space efficiency and similarity to B-trees) Read insertion, deletion and searching from the book
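A minimal sketch of k-d tree construction as described above: cycle through the dimensions level by level and split at the median so that roughly half the points go to each side. The leaf capacity and the point set are made-up values.

# k-d tree construction sketch.
LEAF_CAPACITY = 2

def build_kd(points, depth=0):
    if len(points) <= LEAF_CAPACITY:
        return {"leaf": points}
    axis = depth % len(points[0])                   # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                             # one half on each side
    return {"axis": axis, "split": pts[mid][axis],
            "left": build_kd(pts[:mid], depth + 1),
            "right": build_kd(pts[mid:], depth + 1)}

tree = build_kd([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["axis"], tree["split"])                  # 0 7 -> the root splits on x at 7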