Transaction Processing - University of Kentucky


Jinze Liu

   Where does recovery start?

 Naïve approach: stop accepting new transactions (lame!), finish all active transactions, take a database dump
 Fuzzy checkpointing
  Determine S, the set of currently active transactions, and log ⟨begin-checkpoint S⟩
  Flush all blocks (dirty at the time of the checkpoint) at your leisure
  Log ⟨end-checkpoint begin-checkpoint_location⟩
  Between begin and end, continue processing old and new transactions

 Need to determine U, the set of active transactions at the time of the crash
  Scan the log backward to find the last end-checkpoint record and follow its pointer to find the corresponding ⟨begin-checkpoint S⟩
  Initially, let U be S
  Scan the log forward from that begin-checkpoint to the end of the log:
   For a log record ⟨T, start⟩, add T to U
   For a log record ⟨T, commit | abort⟩, remove T from U
   For a log record ⟨T, X, old, new⟩, issue write(X, new)
  Basically repeats history!

 Scan the log backward
  Undo the effects of transactions in U: for each log record ⟨T, X, old, new⟩ where T is in U, issue write(X, old), and log this operation too (part of the repeating-history paradigm)
  Log ⟨T, abort⟩ when all effects of T have been undone
 An optimization: each log record stores a pointer to the previous log record for the same transaction; follow the pointer chain during undo
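Putting the redo and undo passes together, here is a minimal sketch of the whole recovery procedure, assuming a hypothetical log format (records as plain Python tuples, with the end-checkpoint record storing the location of its begin-checkpoint) and a write(X, value) callback; a real system would read the log from disk and log the undo writes as it performs them:

```python
# Minimal recovery sketch (hypothetical log format).
# Records: ("begin-checkpoint", S), ("end-checkpoint", begin_location),
#          (T, "start"), (T, "commit"), (T, "abort"), or (T, X, old, new).

def recover(log, write):
    # Analysis: find the last end-checkpoint, follow its pointer to the begin.
    end = max(i for i, rec in enumerate(log) if rec[0] == "end-checkpoint")
    begin = log[end][1]                 # location of <begin-checkpoint S>
    U = set(log[begin][1])              # initially, U = S

    # Redo: scan forward from the begin-checkpoint, repeating history.
    for rec in log[begin:]:
        if len(rec) == 4:               # <T, X, old, new>
            T, X, old, new = rec
            write(X, new)               # reissue the write
        elif rec[1] == "start":
            U.add(rec[0])
        elif rec[1] in ("commit", "abort"):
            U.discard(rec[0])

    # Undo: scan backward, rolling back transactions still in U.
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] in U:
            T, X, old, new = rec
            write(X, old)               # would also be logged in a real system
    for T in U:
        log.append((T, "abort"))        # log <T, abort> once T is undone
```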

 Many different ways of processing the same query  Scan? Sort? Hash? Use an index?

 All have different performance characteristics and/or make different assumptions about the data
 Best choice depends on the situation
 Implement all alternatives, and let the query optimizer choose at run-time

 Relations: R, S
 Tuples: r, s
 Number of tuples: |R|, |S|
 Number of disk blocks: B(R), B(S)
 Number of memory blocks available: M
 Cost metric
  Number of I/O’s
  Memory requirement

 Scan table R and process the query
  Selection over R
  Projection of R without duplicate elimination

[Figure: table scan; blocks flow from disk through an input buffer and an output buffer in main memory, and back to disk.]

 I/O’s: B(R)
 Trick for selection: stop early if it is a lookup by key (sketched below)
 Memory requirement: 2 (+1 for double buffering)
 Not counting the cost of writing the result out
  Same for any algorithm!
  Maybe not needed: results may be pipelined into another operator
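A minimal sketch of scan-based selection with the stop-early trick, assuming blocks arrive as lists of tuples (a hypothetical stand-in for reading disk blocks):

```python
# Sketch of scan-based selection: B(R) I/O's in the worst case.

def scan_select(blocks, pred, key_lookup=False):
    for block in blocks:                # one I/O per block
        for t in block:
            if pred(t):
                yield t
                if key_lookup:          # lookup by key: at most one match,
                    return              # so stop early
```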

 R ⋈p S (nested-loop join)
  For each block of R, and for each r in the block:
   For each block of S, and for each s in the block:
    Output rs if p evaluates to true over r and s
  R is called the outer table; S is called the inner table

[Figure: nested-loop join; two input buffers (one for R, one for S) and one output buffer in main memory, between disk reads and writes.]

 I/O’s: B(R) + |R| × B(S)
 Memory requirement: 3 (+1 for double buffering)
 Improvement: block-based nested-loop join

  For each block of R, and for each block of S:
   For each r in the R block, and for each s in the S block: …
  I/O’s: B(R) + B(R) × B(S)
  Memory requirement: same as before

 Stop early if the key of the inner table is being matched
 Make use of available memory
  Stuff memory with as much of R as possible, stream S by, and join every S tuple with all R tuples in memory (sketched below)
  I/O’s: B(R) + ⌈B(R) / (M – 2)⌉ × B(S), or roughly B(R) × B(S) / M
  Memory requirement: M (as much as possible)
 Which table would you pick as the outer?
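A minimal sketch of the memory-stuffing variant, assuming R and S arrive as lists of blocks (each block a list of tuples, a hypothetical stand-in for disk I/O) and p is the join predicate; M – 2 memory blocks hold R, one streams S, and one buffers output:

```python
# Sketch of block nested-loop join with memory stuffing.

def block_nested_loop_join(R_blocks, S_blocks, M, p):
    chunk = max(M - 2, 1)                       # blocks of R held in memory
    for i in range(0, len(R_blocks), chunk):
        # Stuff memory with as much of R as possible ...
        r_tuples = [r for block in R_blocks[i:i + chunk] for r in block]
        # ... then stream S by, joining every S tuple with the held R tuples.
        for s_block in S_blocks:                # B(S) I/O's per chunk of R
            for s in s_block:
                for r in r_tuples:
                    if p(r, s):
                        yield r + s             # concatenated result tuple
```

Each chunk of R triggers one full scan of S, which is where the ⌈B(R) / (M – 2)⌉ × B(S) term comes from; picking the smaller relation as the outer minimizes the number of chunks.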

 A classic problem in computer science!

 Data requested in sorted order
  e.g., find students in increasing gpa order
 Sorting is the first step in bulk loading a B+-tree index

 Sorting is useful for eliminating duplicate copies in a collection of records (Why?)
 Sorting is useful for summarizing related groups of tuples
 The sort-merge join algorithm involves sorting

 Problem: sort 100 GB of data with 1 GB of RAM

 Why not virtual memory?

 Pass 0: read a page, sort it, write it
  Only one buffer page is used (as in the previous slide)
 Passes 1, 2, 3, …, etc.:
  Merge pairs of runs into runs twice as long
  Three buffer pages used (two for input, one for output)

[Figure: two-way merge; two input buffers and one output buffer in main memory, between disk and disk.]

   Each pass we read + write each page in the file
  N pages in the file ⇒ the number of passes = ⌈log₂ N⌉ + 1
  So total cost is: 2N (⌈log₂ N⌉ + 1)
 Idea: divide and conquer: sort subfiles and merge

[Figure: two-way merge sort of the file 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2. PASS 0 sorts each page into 1-page runs (3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2); PASS 1 merges pairs into 2-page runs (2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2); PASS 2 yields 4-page runs (2,3,4,4,6,7,8,9 | 1,2,3,5,6); PASS 3 yields the final sorted 8-page run 1,2,2,3,3,4,4,5,6,6,7,8,9.]

Remember (internal-memory) merge sort?

Problem: sort R, but R does not fit in memory
 Pass 0: read M blocks of R at a time, sort them, and write out a level-0 run
  There are ⌈B(R) / M⌉ level-0 sorted runs
 Pass i: merge (M – 1) level-(i–1) runs at a time, and write out a level-i run
  (M – 1) memory blocks for input, 1 to buffer output
  # of level-i runs = ⌈# of level-(i–1) runs / (M – 1)⌉
 Final pass produces one sorted run

 Input: 1, 7, 4, 5, 2, 8, 3, 6, 9 (one number per block; M = 3)
 Pass 0
  1, 7, 4 → 1, 4, 7
  5, 2, 8 → 2, 5, 8
  9, 6, 3 → 3, 6, 9
 Pass 1
  1, 4, 7 + 2, 5, 8 → 1, 2, 4, 5, 7, 8
  3, 6, 9 (left as is: only M – 1 = 2 runs merge at a time)
 Pass 2 (final)
  1, 2, 4, 5, 7, 8 + 3, 6, 9 → 1, 2, 3, 4, 5, 6, 7, 8, 9
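The example above maps directly onto code. Below is a minimal sketch of external merge sort, assuming a hypothetical block-at-a-time interface where the input is a list of blocks (each a list of items) and runs simply live in Python lists rather than on disk; it requires M ≥ 3 (at least a two-way merge plus an output buffer):

```python
# Sketch of external merge sort (in-memory stand-in for disk I/O).
import heapq

def external_merge_sort(blocks, M):
    # Pass 0: read M blocks at a time, sort, and write out a level-0 run.
    runs = []
    for i in range(0, len(blocks), M):
        runs.append(sorted(x for b in blocks[i:i + M] for x in b))

    # Pass i: merge (M - 1) level-(i-1) runs at a time (1 block buffers output).
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + M - 1]))
                for i in range(0, len(runs), M - 1)]
    return runs[0] if runs else []

# Mirrors the example above (one number per block, M = 3):
# external_merge_sort([[1], [7], [4], [5], [2], [8], [3], [6], [9]], 3)
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```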

 Number of passes: ⌈log_{M–1} ⌈B(R) / M⌉⌉ + 1
 I/O’s: multiply by 2 B(R): each pass reads the entire relation once and writes it once
  Subtract B(R) for the final pass
  Roughly, this is O(B(R) × log_M B(R))
 Memory requirement: M (as much as possible)

 Double buffering
  Allocate an additional block for each run
  Trade-off: smaller fan-in (more passes)
 Blocked I/O
  Instead of reading/writing one disk block at a time, read/write a bunch (“cluster”)
  More sequential I/O’s
  Trade-off: larger cluster → smaller fan-in (more passes)

 R ⋈_{R.A = S.B} S
 Sort R and S by their join attributes, and then merge (see the sketch after the example below)
  r, s = the first tuples in sorted R and S
  Repeat until one of R and S is exhausted:
   If r.A > s.B then s = next tuple in S
   else if r.A < s.B then r = next tuple in R
   else output all matching tuples, and r, s = next in R and S
 I/O’s: sorting + 2 B(R) + 2 B(S)
  In most cases (e.g., join of key and foreign key)
  Worst case is B(R) × B(S): everything joins

R: r1.A = 1, r2.A = 3, r3.A = 3, r4.A = 5, r5.A = 7, r6.A = 7, r7.A = 8
S: s1.B = 1, s2.B = 2, s3.B = 3, s4.B = 3, s5.B = 8
R ⋈ S: r1s1, r2s3, r2s4, r3s3, r3s4, r7s5
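A minimal sketch of the merge step (after both inputs are sorted), with R and S as in-memory lists and hypothetical key_r/key_s accessors for A and B; equal-key groups produce their cross product, as in the example above:

```python
# Sketch of the merge step of sort-merge join on R.A = S.B.

def merge_join(R, S, key_r, key_s):
    i = j = 0
    while i < len(R) and j < len(S):
        a, b = key_r(R[i]), key_s(S[j])
        if a < b:
            i += 1                      # r = next tuple in R
        elif a > b:
            j += 1                      # s = next tuple in S
        else:
            # Output all matching tuples: cross product of the R-group
            # and the S-group that share this join value.
            j_end = j
            while j_end < len(S) and key_s(S[j_end]) == a:
                j_end += 1
            while i < len(R) and key_r(R[i]) == a:
                for k in range(j, j_end):
                    yield R[i] + S[k]
                i += 1
            j = j_end

# With R.A = 1, 3, 3, 5, 7, 7, 8 and S.B = 1, 2, 3, 3, 8, this yields exactly
# the six pairs listed above: r1s1, r2s3, r2s4, r3s3, r3s4, r7s5.
```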

 Idea: combine join with the merge phase of merge sort
 Sort: produce sorted runs of size M for R and S
 Merge and join: merge the runs of R, merge the runs of S, and merge-join the result streams as they are generated!

[Figure: sorted runs of R and S on disk are merged in memory, and the two merged streams are merge-joined directly.]

 I/O’s: 3 × (B(R) + B(S))
 Memory requirement
  To be able to merge in one pass, we should have enough memory to accommodate one block from each run: M > B(R) / M + B(S) / M, i.e., M > sqrt(B(R) + B(S))

 Union (set), difference, intersection
  More or less like SMJ
 Duplicate elimination
  External merge sort
  Eliminate duplicates in sort and merge
 GROUP BY and aggregation
  External merge sort
  Produce partial aggregate values in each run
  Combine partial aggregate values during merge
  Partial aggregate values don’t always work though
   Examples: SUM(DISTINCT …), MEDIAN(…)

 R ⋈_{R.A = S.B} S
 Main idea
  Partition R and S by hashing their join attributes, and then consider corresponding partitions of R and S
  If r.A and s.B get hashed to different partitions, they don’t join

[Figure: a grid of R partitions (1-5) versus S partitions (1-5); nested-loop join considers all slots, hash join considers only those along the diagonal.]

 Partition R and S according to the same hash function on their join attributes

[Figure: R is read from disk, hashed in memory into M – 1 partitions, and written back to disk; same for S.]

 Read in each partition of R, stream in the corresponding partition of S, join (a sketch of both phases follows the analysis below)
  Typically build a hash table for the partition of R
   Not the same hash function used for partitioning, of course!

[Figure: each R partition is loaded into memory, the corresponding S partition is streamed past it; for each S tuple, probe the hash table and join.]

 I/O’s: 3 × (B(R) + B(S))
 Memory requirement
  In the probing phase, we should have enough memory to fit one partition of R: M – 1 ≥ B(R) / (M – 1), i.e., M > sqrt(B(R))
  We can always pick R to be the smaller relation, so: M > sqrt(min(B(R), B(S)))
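Both phases fit in a short sketch, with Python lists standing in for disk-resident partitions and a dict playing the in-memory hash table on each R partition; key_r, key_s, and num_partitions are hypothetical parameters, and each partition of R is assumed to fit in memory (no recursive partitioning):

```python
# Sketch of two-pass hash join on R.A = S.B.
from collections import defaultdict

def hash_join(R, S, key_r, key_s, num_partitions):
    # Partitioning phase: the same hash function h1 for both relations.
    h1 = lambda k: hash(k) % num_partitions
    R_parts, S_parts = defaultdict(list), defaultdict(list)
    for r in R:
        R_parts[h1(key_r(r))].append(r)
    for s in S:
        S_parts[h1(key_s(s))].append(s)

    # Probing phase: load an R partition, stream the matching S partition.
    for p in range(num_partitions):
        table = defaultdict(list)       # hash table on this R partition
        for r in R_parts[p]:            # (dict hashing plays the role of h2)
            table[key_r(r)].append(r)
        for s in S_parts[p]:
            for r in table.get(key_s(s), ()):
                yield r + s
```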

 What if a partition is too large for memory?

 Read it back in and partition it again!

 See the duality in multi-pass merge sort here?

(Assuming two-pass)
 I/O’s: same
 Memory requirement: hash join is lower
  sqrt(min(B(R), B(S))) < sqrt(B(R) + B(S))
  Hash join wins when the two relations have very different sizes
 Other factors
  Hash join performance depends on the quality of the hash function
   Might not get evenly sized buckets
  SMJ can be adapted for inequality join predicates
  SMJ wins if R and/or S are already sorted
  SMJ wins if the result needs to be in sorted order

 Nested-loop join may be best if many tuples join
  Example: non-equality joins that are not very selective
 Necessary for black-box predicates
  Example: … WHERE user_defined_pred(R.A, S.B)

 Union (set), difference, intersection
  More or less like hash join
 Duplicate elimination
  Check for duplicates within each partition/bucket
 GROUP BY and aggregation
  Apply the hash function to the GROUP BY attributes
  Tuples in the same group must end up in the same partition/bucket
  Keep a running aggregate value for each group
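For instance, a minimal sketch of hash-based GROUP BY with a running SUM, assuming hypothetical group_key and value accessors (a real system would first partition to disk so that each bucket's running aggregates fit in memory):

```python
# Sketch of hash aggregation: one running SUM per group.
# Running aggregates don't work for SUM(DISTINCT ...) or MEDIAN(...), as noted.
from collections import defaultdict

def hash_group_by_sum(tuples, group_key, value):
    agg = defaultdict(int)          # all tuples of a group hash to one bucket
    for t in tuples:
        agg[group_key(t)] += value(t)
    return dict(agg)
```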

 Divide-and-conquer paradigm
  Sorting: physical division, logical combination
  Hashing: logical division, physical combination
 Handling very large inputs
  Sorting: multi-level merge
  Hashing: recursive partitioning
 I/O patterns
  Sorting: sequential write, random read (merge)
  Hashing: random write, sequential read (partition)

 Equality predicate: σ_{A = v}(R)
  Use an ISAM, B+-tree, or hash index on R(A)
 Range predicate: σ_{A > v}(R)
  Use an ordered index (e.g., ISAM or B+-tree) on R(A)
  Hash index is not applicable
 Indexes other than those on R(A) may be useful
  Example: B+-tree index on R(A, B)
  How about a B+-tree index on R(B, A)?

Situations where index clearly wins:
 Index-only queries which do not require retrieving actual tuples
  Example: π_A(σ_{A > v}(R))
 Primary index clustered according to search key
  One lookup leads to all result tuples in their entirety

BUT(!): consider σ_{A > v}(R) and a secondary, non-clustered index on R(A)
 Need to follow pointers to get the actual result tuples
 Say that 20% of R satisfies A > v
  Could happen even for equality predicates
 I/O’s for index-based selection: lookup + 20% × |R|
 I/O’s for scan-based selection: B(R)
 Table scan wins if a block contains more than 5 tuples, since then B(R) = |R| / (tuples per block) < 20% × |R|

 R ⋈_{R.A = S.B} S
 Idea: use the value of R.A to probe the index on S(B)
 For each block of R, and for each r in the block:
  Use the index on S(B) to retrieve s with s.B = r.A
  Output rs
 I/O’s: B(R) + |R| · (cost of an index lookup)
  Typically, the cost of an index lookup is 2-4 I/O’s
  Beats other join methods if |R| is not too big
   Better pick R to be the smaller relation
 Memory requirement: 2
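A minimal sketch, assuming a hypothetical index object whose lookup(key) method returns the matching S tuples (each call standing in for the 2-4 I/O's of a real index lookup):

```python
# Sketch of index nested-loop join on R.A = S.B.

def index_nested_loop_join(R_blocks, s_index, key_r):
    for block in R_blocks:              # one pass over R: B(R) I/O's
        for r in block:
            for s in s_index.lookup(key_r(r)):   # one lookup per R tuple
                yield r + s
```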

 R ⋈_{R.A = S.B} S
 Idea: use the ordering provided by the indexes on R(A) and S(B) to eliminate the sorting step of sort-merge join
 Trick: use the larger key to probe the other index
  Possibly skipping many keys that don’t match

[Figure: a B+-tree on R(A) with keys 1, 2, 3, 4, 7, 9, 18 and a B+-tree on S(B) with keys 1, 7, 9, 11, 12, 17, 19; the join zig-zags between the two leaf levels.]
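A minimal sketch of the zig-zag idea over two sorted key streams, with a linear seek standing in for what would really be a logarithmic B+-tree probe; the key lists below are the ones in the figure:

```python
# Sketch of zig-zag join over the sorted leaf levels of two B+-trees.

def seek(keys, i, k):
    # Stand-in for probing an index: first position with key >= k.
    while i < len(keys) and keys[i] < k:
        i += 1
    return i

def zig_zag_join(A_keys, B_keys):
    i = j = 0
    while i < len(A_keys) and j < len(B_keys):
        if A_keys[i] == B_keys[j]:
            yield A_keys[i]                  # matching join value
            i, j = i + 1, j + 1
        elif A_keys[i] < B_keys[j]:
            i = seek(A_keys, i, B_keys[j])   # use the larger key to probe,
        else:
            j = seek(B_keys, j, A_keys[i])   # skipping non-matching keys

# zig_zag_join([1, 2, 3, 4, 7, 9, 18], [1, 7, 9, 11, 12, 17, 19])
# yields 1, 7, 9.
```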

 Scan
  Selection, duplicate-preserving projection, nested-loop join
 Sort
  External merge sort, sort-merge join, union (set), difference, intersection, duplicate elimination, GROUP BY and aggregation
 Hash
  Hash join, union (set), difference, intersection, duplicate elimination, GROUP BY and aggregation
 Index
  Selection, index nested-loop join, zig-zag join