ARIES
ARIES is complex because it combines no-force and steal, the two buffer-management choices that require both REDO and UNDO
Buffer manager
- force or no force? (when a XACT commits, do you write its dirty pages to disk?)
    - force: write dirty pages to disk at commit (simple, slow)
        - avoids the need to REDO on restart
    - no force: do not flush dirty pages at commit (fast)
        - needs REDO on restart
- steal or no steal? (can you evict a dirty page of a running XACT?)
    - no steal: you cannot evict a dirty page of a running XACT (simple, slow)
        - avoids the need to UNDO on restart
    - steal: pages with uncommitted updates may be written to disk (fast)
        - needs UNDO to correct the data on disk
Basic idea 1: Write-Ahead Log (WAL)
do not force data pages at commit; instead force the log records (see the sketch below)
- every data update generates a log record with enough information to REDO or UNDO the update
- before stealing a page, make sure all log records for updates to that page are on disk
- before a XACT commits, flush all of its log records to disk
- after the XACT commits, write a commit log record
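A minimal sketch of these WAL rules, assuming a toy in-memory Log and Page; every class and function here is hypothetical, for illustration only:

class Page:
    def __init__(self, pid):
        self.pid = pid
        self.data = {}
        self.page_lsn = -1      # LSN of the last update applied to this page

class Log:
    def __init__(self):
        self.records = []       # in-memory log tail
        self.flushed_lsn = -1   # highest LSN durably on disk

    def append(self, rec):
        self.records.append(rec)
        return len(self.records) - 1

    def flush(self, up_to_lsn):
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)   # pretend I/O

log = Log()

def update(xact, page, key, old, new):
    # rule 1: write the log record (redo + undo info) before changing the page
    lsn = log.append(("UPDATE", xact, page.pid, key, old, new))
    page.data[key] = new
    page.page_lsn = lsn

def steal(page, disk):
    # rule 2: before stealing a dirty page, force the log up to its pageLSN
    log.flush(page.page_lsn)
    disk[page.pid] = dict(page.data)

def commit(xact):
    # rule 3: force all of the XACT's log records, up to and including its commit record
    lsn = log.append(("COMMIT", xact))
    log.flush(lsn)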
3 phases of recovery
- committed != written to disk, and uncommitted != not written to disk
- REDO fixes the case where some XACTs committed but their updates are not all on disk
- UNDO fixes the case where some XACTs did not commit but some of their updates are already on disk
--- start of oldest “failed” XACT --- “firstLSN” --- most recent CKP --- end of log
(the start of the oldest “failed” XACT, “firstLSN” from the DPT, and the most recent CKP can appear in any order)
- analysis
- determine starting point for Redo
- identify set of “might have been dirty at crash” pages
- identify “loser” XACTs (those that did not make it to the commit point)
- scans log forward from most recent CheckPoint
- initialize XACT table & DPT to values in CheckPoint (Recover from CPT state)
- process log records
- adding/deleting XACTs as they come and go (remove a XACT from the XACT table once it committed)
- add entries to the DPT for additional pages being updated (update the DPT)
    - we add all of these pages because ARIES does not log when pages are written back to disk, so the reconstructed DPT is a superset of the truly dirty pages
- “firstLSN” = earliest recovery_LSN in the DPT; this is the place to begin the REDO pass
    - the recovery_LSN could be before the CKP (arbitrary order)
- redo: repeat history, bringing the crashed system back to the state it was in at the crash (what it wanted to do)
- scans log forward from firstLSN
- for each log record, redo operation
- apply logged update to the page it applies to
- set pageLSN of the page to that of the log record
- undo: roll back the failed (“loser”) transactions
    - scan backwards from the end of the log, undoing the updates of uncommitted transactions
- To apply an undo
- apply undo from log record
- write a CLR (compensation log record) with:
- undo information (what was done)
- undoNextLSN = LSN of next older log record that must be undone for XACT
- When CLRs are encountered during REDO, redo them like any other log record (redo the undo)
- When CLRs are encountered during UNDO, do not undo them; follow undoNextLSN to find the next log record to undo
- so the undo work, and the number of CLRs, stays bounded even across repeated crashes
- “nested top action”: work that should not be undone even if the XACT later fails (undoing it would be dangerous, e.g. a page split)
    - UNDO skips over them
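Putting the three passes together, here is a minimal sketch of recovery over a simplified log. The record layout, the pages dict, and the assumption that an LSN is simply an index into the log list are simplifications for illustration; real ARIES also handles fuzzy checkpoints, nested top actions, END records after undo, etc.

# assumed record layout (a simplification): each record is a dict with
# "lsn" (its index in the log list), "type" in {"UPDATE", "CLR", "COMMIT", "END"},
# "xact", "page", "key", "old", "new", "prev_lsn", and for CLRs "undo_next_lsn".
# pages is a dict: page id -> {"page_lsn": int, "data": dict}.

def analysis(log, ckpt_xact_table, ckpt_dpt, ckpt_lsn):
    xact_table = dict(ckpt_xact_table)        # xact -> lastLSN (from the CKP)
    dpt = dict(ckpt_dpt)                      # page -> recovery_LSN (from the CKP)
    for rec in log[ckpt_lsn:]:
        if rec["type"] in ("UPDATE", "CLR"):
            xact_table[rec["xact"]] = rec["lsn"]
            dpt.setdefault(rec["page"], rec["lsn"])   # first record that dirtied it
        elif rec["type"] in ("COMMIT", "END"):
            xact_table.pop(rec["xact"], None)         # XACT made it: not a loser
    losers = set(xact_table)                  # XACTs that did not reach the commit point
    first_lsn = min(dpt.values(), default=len(log))   # earliest recovery_LSN
    return losers, xact_table, dpt, first_lsn

def redo(log, dpt, pages, first_lsn):
    for rec in log[first_lsn:]:
        if rec["type"] not in ("UPDATE", "CLR"):
            continue
        pid = rec["page"]
        if pid not in dpt or dpt[pid] > rec["lsn"]:
            continue                          # cannot need redo: skip without reading the page
        page = pages[pid]
        if page["page_lsn"] >= rec["lsn"]:
            continue                          # the update is already on the page
        page["data"][rec["key"]] = rec["new"] # repeat history
        page["page_lsn"] = rec["lsn"]

def undo(log, losers, xact_table, pages):
    to_undo = {xact_table[x] for x in losers} # lastLSN of every loser
    while to_undo:
        lsn = max(to_undo)                    # undo in reverse LSN order
        to_undo.discard(lsn)
        rec = log[lsn]
        if rec["type"] == "UPDATE":
            page = pages[rec["page"]]
            page["data"][rec["key"]] = rec["old"]      # apply the undo
            clr = {"lsn": len(log), "type": "CLR", "xact": rec["xact"],
                   "page": rec["page"], "key": rec["key"],
                   "old": None, "new": rec["old"],     # CLRs are redo-only
                   "prev_lsn": None, "undo_next_lsn": rec["prev_lsn"]}
            log.append(clr)
            page["page_lsn"] = clr["lsn"]
            nxt = rec["prev_lsn"]
        else:                                 # a CLR: never undo it,
            nxt = rec["undo_next_lsn"]        # just follow its undoNextLSN
        if nxt is not None:
            to_undo.add(nxt)                  # None means this XACT is fully undone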
key data structures
- XACT table
    - one entry per active XACT; keeps the LogSequenceNumber (LSN) of the last log record generated by the XACT (lastLSN)
    - the LSN tells you where that record is within the log
- Dirty Page Table (DPT)
    - entries for all dirty buffer-resident pages; each entry contains a recovery_LSN field
    - recovery_LSN: the LSN of the log record that first dirtied the page
Checkpoint records
- recovery does not have to go back beyond a certain point (rather than scanning from the very start of the log)
- it is written periodically
- save current XACT table, DPT
Think Cases: start_before_CKP/start_after_CKP & finish_before_crash/not_yet_finished
Buffer management
Query Locality Set Model (QLSM)
- sequential patterns
    - straight sequential (file scan)
        - locality set: one page is enough
        - replacement: replace it with the next page
    - clustered sequential
        - example: the inner relation in a merge join (sequential scan with back-up)
        - locality set: the number of pages in the largest cluster
        - replacement: LRU (kick out the oldest)
Love/Hate hints
Love: I expect to come back to this page; Hate: I would rather not see it again
2 LRU chains: a love chain and a hate chain (evict from the hate chain first)
LRU-k (Least Recently Used, considering the last k references)
LRU-2: evict the page whose 2nd most recent reference is oldest; LRU-k: evict the page whose kth most recent reference is oldest (see the sketch below)
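A minimal LRU-2 sketch under that definition; the class and its interface are hypothetical. It evicts the resident page whose 2nd most recent reference is oldest, treating pages referenced only once as infinitely old.

import math

class LRU2:
    def __init__(self, capacity):
        self.capacity = capacity
        self.history = {}                     # resident page -> last <= 2 reference times
        self.clock = 0

    def access(self, page):
        self.clock += 1
        hist = self.history.setdefault(page, [])
        hist.append(self.clock)
        if len(hist) > 2:
            hist.pop(0)                       # keep only the 2 most recent reference times
        if len(self.history) > self.capacity:
            self.evict(keep=page)

    def evict(self, keep):
        # evict the page whose 2nd most recent reference is oldest;
        # pages referenced only once count as infinitely old
        def second_ref(p):
            h = self.history[p]
            return h[0] if len(h) == 2 else -math.inf
        victim = min((p for p in self.history if p != keep), key=second_ref)
        del self.history[victim]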
Join algorithms
Nested loop join
for each r in R:
for each s in S:
if r.A == s.B:
output(r, s)
# O(Time) = len(R) + number_of_R_rows_per_page * len(R) * len(S)   (len(X) = number of pages of X)
An optimization is block nested loops join: read a block of R (as many pages as fit in the buffer, possibly thousands) and scan S once per block; this gets rid of the number_of_R_rows_per_page factor
for each block of R:
    - build a hash table in memory on R.A for all the rows of R in the block
    - for each page of S, probe the hash table with each s for matches
# O(Time) = len(R) + (len(R) / block_size_in_pages) * len(S)
# with a block of a single page this is len(R) + len(R) * len(S)
Notice the hash table only works for equality joins; it does not help with arbitrary comparison predicates (see the runnable sketch below)
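A runnable sketch of block nested loops with a per-block hash table; the relations are plain Python lists and the key extractors are assumptions for illustration.

from collections import defaultdict

def block_nested_loops(R, S, block_size, r_key, s_key):
    """R, S: lists of rows; r_key/s_key extract the join attributes A and B."""
    for start in range(0, len(R), block_size):
        block = R[start:start + block_size]
        # build an in-memory hash table on R.A for the rows of R in this block
        table = defaultdict(list)
        for r in block:
            table[r_key(r)].append(r)
        # one full scan of S per block, probing the hash table
        for s in S:
            for r in table.get(s_key(s), []):
                yield (r, s)

# usage: join on R.A == S.B where A and B are the first fields of each row
# list(block_nested_loops(R, S, block_size=1000,
#                         r_key=lambda r: r[0], s_key=lambda s: s[0]))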
Index nested loops
if you have an index on S.B then:
for each r in R:
lookup r.A in index on S.B
# O(Time) = len(R) + number_of_R_rows_per_page * len(R) * cost_of_index_lookup
# cost_of_index_lookup is roughly 1.2 I/Os for a hash index, 2-4 for a B+ tree
Sort-Merge
- sort R on R.A
- sort S on S.B
- merge
External Sorting
- form runs the size of the buffer pool (n)
- merge runs n-1 at a time to form runs of length n * (n-1)
How big a file can you sort in 2-passes? n * (n-1)
if len(R) < n * n and len(S) < n * n:
    sort R: 4 * len(R)   # this counts I/Os, not in-memory time; since len(R) > n a single in-memory pass is not enough, and each of the two external-sort passes reads and writes R once
    sort S: 4 * len(S)
    merge: len(R) + len(S)
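A minimal sketch of the merge step of sort-merge join, assuming R and S are already sorted on the join attribute and fit in memory as lists; duplicate join keys are handled by joining the two matching groups.

def merge_join(R, S, r_key, s_key):
    i = j = 0
    out = []
    while i < len(R) and j < len(S):
        if r_key(R[i]) < s_key(S[j]):
            i += 1
        elif r_key(R[i]) > s_key(S[j]):
            j += 1
        else:
            # equal keys: output the cross product of the two matching groups
            key = r_key(R[i])
            i2 = i
            while i2 < len(R) and r_key(R[i2]) == key:
                i2 += 1
            j2 = j
            while j2 < len(S) and s_key(S[j2]) == key:
                j2 += 1
            for r in R[i:i2]:
                for s in S[j:j2]:
                    out.append((r, s))
            i, j = i2, j2
    return out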
GRACE Hash Join
This was originally a parallel idea.
- partition R into R1, R2, …, Rk using hashing on R.a
- partition S into S1, S2, …, Sk using hashing on S.b
- for i = 1 to k: Ri join Si
- sub-join:
    - read Ri into memory
    - build a hash table on the Ri tuples
    - stream Si, probing the hash table on Ri
- use a different hash function for partitioning and for the in-memory join
    - partitioning hash, with a “big” range: h1(x) = (h(x) mod k) + 1, where k is the number of partitions; pick k big enough that each Ri fits in memory
    - in-memory hash: h2(x) = (h(x) mod kk) + 1, where kk is the number of in-memory buckets, big enough that about 1 row lands in each bucket
How big can R be to do this in 2 passes? Suppose you have M memory pages:
- k <= M - 1 (one output buffer page per partition, plus one input page while scanning R)
- len(R) / k <= M - 2 (each Ri must fit in memory alongside one input page for Si and one output page for the join result)
- so we can get len(R) <= (M - 1)(M - 2), roughly M * M (see the sketch below)
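A sketch of GRACE hash join where in-memory lists stand in for the on-disk partitions; k and the key extractors are assumptions, and Python's built-in hash/dict play the roles of h1 and h2.

from collections import defaultdict

def grace_hash_join(R, S, k, r_key, s_key):
    # pass 1: partition both relations with h1 on the join attribute
    r_parts = [[] for _ in range(k)]
    s_parts = [[] for _ in range(k)]
    for r in R:
        r_parts[hash(r_key(r)) % k].append(r)    # h1 on R.a
    for s in S:
        s_parts[hash(s_key(s)) % k].append(s)    # h1 on S.b
    # pass 2: join each (Ri, Si) pair with an in-memory hash table (h2)
    out = []
    for ri, si in zip(r_parts, s_parts):
        table = defaultdict(list)
        for r in ri:
            table[r_key(r)].append(r)            # dict buckets stand in for h2
        for s in si:
            for r in table.get(s_key(s), []):
                out.append((r, s))
    return out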
Simple Hash Join
# a big k is bad: in each pass most rows hash to other buckets and must be written back to disk
for i = 1 to k:
    read R, hashing each row r
        if r hashes to bucket i:
            leave it in memory (build a hash table on it)
        else:
            write it back to disk
    read S, hashing each row s
        if s hashes to bucket i:
            probe the in-memory R partition
        else:
            write it back to disk
# the rows written back are the input to the next pass, so R and S shrink each round
Hybrid Hash
- scan smaller relation R, partitioning into k buckets but leave bucket 1 in memory
- scan larger relation S, partitioning into k buckets, except for bucket 1 just probe R1 in memory
- then, as in GRACE hash join, finish the remaining partitions (see the sketch below):
    for i = 2 to k: Ri join Si
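A sketch of hybrid hash join along the same lines; lists stand in for disk, k >= 2 is assumed, and bucket 0 plays the role of the in-memory partition R1.

from collections import defaultdict

def hybrid_hash_join(R, S, k, r_key, s_key):
    out = []
    r_mem = defaultdict(list)                 # bucket 1 of R: kept in memory
    r_spill = [[] for _ in range(k - 1)]      # buckets 2..k: "written to disk"
    for r in R:
        b = hash(r_key(r)) % k
        if b == 0:
            r_mem[r_key(r)].append(r)
        else:
            r_spill[b - 1].append(r)
    s_spill = [[] for _ in range(k - 1)]
    for s in S:
        b = hash(s_key(s)) % k
        if b == 0:                            # probe the in-memory bucket right away
            out.extend((r, s) for r in r_mem.get(s_key(s), []))
        else:
            s_spill[b - 1].append(s)
    for ri, si in zip(r_spill, s_spill):      # finish the rest as in GRACE
        table = defaultdict(list)
        for r in ri:
            table[r_key(r)].append(r)
        for s in si:
            out.extend((r, s) for r in table.get(s_key(s), []))
    return out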
Cost Model of Hybrid Hash
- {R}, {S}: # of tuples in R and S
- |R|, |S|: # of pages in R and S
- hash: time to hash a tuple
- move: time to move a tuple to a buffer
- comp: cost of comparing 2 tuples
- I/O: cost of reading or writing a page
- F: > 1 (usually 1.1); a universal fudge factor
- q: fraction of R in the first (in-memory) partition
    - if q is 1: all of R fits in memory and there is only one partition
    - if R is 1000 pages and there are 100 memory pages, the biggest q is 0.1, because the first partition (q * |R|) has to fit in memory
Total cost:
({R} + {S}) * hash +              // partition cost: determine which partition each tuple falls in
({R} + {S}) * (1 - q) * move +    // move spilled tuples to output buffers
(|R| + |S|) * (1 - q) * I/O +     // write the output buffers to disk
(|R| + |S|) * (1 - q) * I/O +     // read them back into memory
({R} + {S}) * (1 - q) * hash +    // build & probe hash tables for partitions 2..k
{R} * move +                      // move R tuples into the hash tables
{S} * comp * F                    // probe with each S tuple; on average slightly more than one comparison
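To see how the formula behaves, here is the same expression evaluated with made-up numbers; every value below is hypothetical, chosen only to illustrate how q shifts cost between the CPU and I/O terms.

# hypothetical sizes and unit costs, only to exercise the formula
R_tuples, S_tuples = 100_000, 400_000     # {R}, {S}
R_pages, S_pages = 1_000, 4_000           # |R|, |S| (100 tuples per page assumed)
hash_t = move = comp = 1e-6               # per-tuple CPU costs, in seconds
io = 0.01                                 # per-page I/O cost, in seconds
F, q = 1.1, 0.1                           # fudge factor; fraction of R kept in memory

cost = ((R_tuples + S_tuples) * hash_t                 # partition each tuple
        + (R_tuples + S_tuples) * (1 - q) * move       # move spilled tuples to output buffers
        + (R_pages + S_pages) * (1 - q) * io           # write the spilled partitions
        + (R_pages + S_pages) * (1 - q) * io           # read them back
        + (R_tuples + S_tuples) * (1 - q) * hash_t     # build & probe for partitions 2..k
        + R_tuples * move                              # move R tuples into the hash tables
        + S_tuples * comp * F)                         # probe with each S tuple
print(f"estimated hybrid hash cost: {cost:.1f} s")     # I/O terms dominate unless q is large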
Symmetric Hash Join
No need to study this as much as the others.
- assume everything fits in memory; build 2 hash tables, one on R and one on S
- read r: insert it into R's hash table, then probe S's hash table
- read s: insert it into S's hash table, then probe R's hash table
How do we know it is correct? For any matching pair (r, s) it is symmetric: no matter whether r or s arrives first, the later one finds the earlier one in the other table, so the pair is output
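A sketch of symmetric hash join as a generator over an arbitrary interleaving of the two inputs; the stream format and key extractors are assumptions.

from collections import defaultdict

def symmetric_hash_join(stream, r_key, s_key):
    """stream yields ('R', row) or ('S', row) pairs in any interleaving."""
    r_table, s_table = defaultdict(list), defaultdict(list)
    for side, row in stream:
        if side == 'R':
            key = r_key(row)
            r_table[key].append(row)             # insert into own table first
            for s in s_table.get(key, []):       # then probe the other table
                yield (row, s)
        else:
            key = s_key(row)
            s_table[key].append(row)
            for r in r_table.get(key, []):
                yield (r, row)

# usage: list(symmetric_hash_join(stream, r_key=lambda r: r[0], s_key=lambda s: s[0]))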
Comparison
GRACE join is faster:
    GRACE join                               Sort-Merge
    read(seq) R, write(rand) R               read(seq) R, write(seq) R    (initial runs)
    read(seq) S, write(rand) S               read(seq) S, write(seq) S    (initial runs)
    read(seq) R, read(seq) S  (join)         read R, write R              (finish sort)
                                             read S, write S              (finish sort)
                                             read(rand) R, read(rand) S   (merge-join)
# GRACE writes randomly because rows are scattered across different partition files (hashed); sort-merge writes initial runs sequentially because each run is read from one place, sorted, and written back
# However, sort-merge can skip the two “finish sort” passes: you do not need fully sorted R and S, you can merge the runs of both relations and join the smallest R and S keys on the fly
Parallel DB
Goals for parallelism
- linear speedup, system with 2X nodes is 2X as fast
- linear scale up, system 2X as big solves problem 2X as big in same time
Barriers to these
- startup: time that cannot be parallelized
- interference: the slowdown each new process adds to the other processors
- skew: stragglers or unbalanced work load
Types of systems
- shared-memory: does not scale very far
- shared-disk: scales better
- shared-nothing: Teradata(Gamma), DB2 PE, Oracle Exadata, MS PDW
DATAllegro
It builds a glue layer that connects multiple Ingres instances
Type of parallel joins
- redistribute 2 relations
- redistribute 1 relation
- redistribute none
broadcast join
broadcast the smaller relation, then do local joins: only the small relation moves, S stays in place (but this is bad if R and S are both equally big); see the sketch below
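A sketch of a broadcast join where "nodes" are just the lists holding each node's partition of S; the names and data layout are hypothetical.

from collections import defaultdict

def broadcast_join(R, S_partitions, r_key, s_key):
    out = []
    for S_local in S_partitions:          # each node holds one partition of S
        table = defaultdict(list)
        for r in R:                       # every node builds its own copy of the broadcast R
            table[r_key(r)].append(r)
        for s in S_local:                 # local join against the node's S partition
            out.extend((r, s) for r in table.get(s_key(s), []))
    return out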
skew problem
- Suppose S is highly skewed on S.b: if we do a repartitioning join, some nodes will be heavily loaded => skew in execution time
- If we do broadcast R join => no skew
How to solve: split on whether the join value is frequent in S (assume R is not skewed):
- R tuples whose value is frequent in S: broadcast them
- R tuples whose value is not frequent in S: partition join
- S tuples with frequent values: leave them in place
- S tuples with infrequent values: repartition them
Partitioning strategies
- round-robin
- no skew
- no useful properties for query processing (there is no partitioning scheme to exploit)
- hash on some attribute
- might skew (some values occur more frequently than others)
- useful for query processing
- range partitioning
- still might skew
How about R using round-robin and S using hash partitioning? R has to be repartitioned in the end
- If R is hash partitioned on R.a, and S is hash partitioned on S.b => no redistribute join (redistribute zero)
- If neither is partitioned on join attributes => redistribute both
Example
how to evaluate SELECT R.A, AVG(R.D) FROM R GROUP BY R.A in parallel?
- first approach
    - repartition R on R.A
    - compute the group-by locally on each node
- second approach (see the sketch below)
    - group by locally first (pre-aggregate: keep a per-group sum and count)
    - repartition the partial results on R.A
    - finish the group-by: combine the partial sums and counts to get AVG(R.D)
    - note: local pre-aggregation does not work for every aggregate (e.g. MEDIAN cannot be computed this way)
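A sketch of the second approach for AVG, with each "node" represented by a list of (a, d) rows; the data layout and the use of Python's hash for repartitioning are assumptions.

from collections import defaultdict

def local_preagg(rows):                       # rows: iterable of (a, d) pairs
    acc = defaultdict(lambda: [0.0, 0])       # a -> [sum(d), count]
    for a, d in rows:
        acc[a][0] += d
        acc[a][1] += 1
    return acc

def parallel_avg(node_rows, n_nodes):
    # phase 1: local group-by (pre-aggregation) on every node
    partials = [local_preagg(rows) for rows in node_rows]
    # phase 2: repartition the partial results by hash(R.A)
    shuffled = [defaultdict(lambda: [0.0, 0]) for _ in range(n_nodes)]
    for part in partials:
        for a, (s, c) in part.items():
            dest = shuffled[hash(a) % n_nodes]
            dest[a][0] += s
            dest[a][1] += c
    # phase 3: finish the group-by: AVG = total sum / total count
    return {a: s / c for node in shuffled for a, (s, c) in node.items()}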
Summary
- The difference between locking records, latching data structures and pinning pages in DBMS
- latching is like a semaphore: a short-term physical lock on a data structure
- locking provides multiple granularities: a logical lock on database records
- TODO
- In the DBMin algorithm, consider the case when query Q2 finds a page p in some other query Q1’s locality set. When Q2 accesses the page, DB Min does not update the statistics for this page in Q1’s locality set. Is it good?
ARIES
- What is “redo”, and why is it needed?
    - redo repeats history: it reapplies what actually happened, not just the updates of successful transactions. This lets CLRs be handled during recovery and helps enable logical redo/undo
- What if we did no logging during UNDO?
    - There may be a subsequent crash, and the system would not know how many undos had already been performed
- What about writing normal log records instead of CLRs?
    - If crashes repeat, the log space needed becomes unbounded
- How does ARIES use bounded log space for recovery even across repeated crashes?
    - ARIES writes at most one log record (a CLR) per undone update during the UNDO phase, and CLRs themselves are never undone, so the space is bounded
- Two conditions under which there is no need to redo an update to a data page, without even examining the page in question
    - the page is not in the DPT
    - the page is in the DPT but its recovery_LSN > LogRecLSN
    - (a third check, LogRecLSN <= pageLSN of the affected page, also lets you skip the redo, but it requires reading the page; these checks are for efficiency, not necessity)
- Why is no log written during the REDO phase?
    - because the existing log already records what was done; logging during REDO would just duplicate it
- Which transactions can be ignored during recovery without causing a problem?
    - transactions that started before the crash but none of whose log records reached disk (by WAL, none of their updates reached disk either)
- Does the pageLSN of a page modified during the REDO phase need to be updated to account for this modification? Why or why not?
- Yes. If not, there is no record of whether or not the REDO has been performed, which can lead to errors if there are subsequent crashes.
2PC
- What if the coordinator fails before making the commit/abort decision? Does it need to force a “begin” log record?
    - In presumed abort, we do not need to force a “begin” record; the coordinator just aborts by default
- Why is it OK not to force the coordinator's end record?
    - if the end record is flushed before the coordinator crashes, everything is fine
    - if it is not flushed, the coordinator recovers, finds that it had instructed commit/abort, and resends the “commit/abort” message to make sure all subordinates know
- Any impact of turning on forcing of the end record?
    - during normal operation it slows things down, since there are extra flushes to disk
    - during recovery it speeds things up, since the coordinator can skip resending the commit/abort message
- Why is 2PC a blocking protocol; how can a participant get blocked by others?
    - if a participant votes YES and the coordinator then crashes, the participant is blocked until the coordinator recovers
- Why is the “collecting” record required in presumed-commit 2PC but not in presumed-abort 2PC?
    - In presumed commit, the collecting record logs which subordinates are involved. If the coordinator crashes after sending “prepare”, it recovers, finds out which subordinates were participating, and tells them to abort; without the record they would presume commit, which would be wrong
    - In presumed abort, if the coordinator crashes and there is no collecting record, it does not know the transaction and answers abort, which is the correct default
- In the “read-only” optimization, if a subordinate is read-only, why can't the coordinator just ignore it during the commit protocol, since it really does not matter whether the read-only subordinate commits or not?
    - the coordinator may not know ahead of time which subordinates are read-only
    - the subordinate itself may not know in advance that it is read-only: it could have a conditional update whose condition is never satisfied, so it ends up doing no update
- State the 3 different 2PC versions. When and why can the coordinator forget a transaction?
    - forgetting means deleting the transaction from the in-memory data structures while still being able to give the right answer later
- Standard 2PC: after receiving all “ACK” from subordinates (all are forced)
- Presumed-Abort:
- commit case: after receiving all “ACK”s; until then a subordinate might still ask again, so the coordinator cannot forget earlier
- abort case: after sending out “abort”
- Presumed-Commit:
- commit case: after coordinator forcing commit record
- abort case: after receiving all “ACK”
- In presumed-commit:
- In abort case, why coordinator needs to block on receiving ACKs and subordinate needs to force abort?
- if the subordinate did not force the abort, it could crash, recover, and ask again; since the coordinator may already have received its ACK and forgotten the transaction, the answer would be the default “commit”, which is wrong (for the same reason the coordinator must block until it has all the forced ACKs before forgetting)
- In commit case, in contrast, why coordinator needs not block on ACKs and subordinate needs not to force?
- because even if the coordinator forget about this transaction, the default behavior is “commit”
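A tiny sketch of how a recovered coordinator answers a subordinate's inquiry about a transaction it no longer remembers, under the two presumptions; the log-lookup interface is hypothetical.

def answer_inquiry(protocol, logged_outcome):
    """protocol: 'presumed-abort' or 'presumed-commit';
    logged_outcome: 'commit', 'abort', or None if no log record survives."""
    if logged_outcome is not None:
        return logged_outcome                 # a surviving (forced) record decides it
    # nothing remembered: fall back to the protocol's presumption
    return "abort" if protocol == "presumed-abort" else "commit"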
Join Algorithm
- The GRACE hash join algorithm can be done with one partitioning pass and one join pass if len(R) < M * M (roughly). Can hybrid hash join do the same? If not, what is its bound?
- Compare block nested loops and hybrid hash: when would block nested loops be faster?
    - the key point of hybrid hash is to partition everything into buckets but keep only 1 bucket in memory to match at a time
- considering only 0.5R would fit into memory:
- block nested loops: read 0.5R, read S, join; read 0.5R, read S, join; total R + 2S
- hybrid hash: read R, write 0.5R, read S, write 0.5S, join; read 0.5R, read 0.5S, join; total 2R + 2S
- bitmap indices vs B-tree indices
- Since in GRACE, len(R) <= M * M, how about in Hybrid?
- In GRACE, assume we divide R into k buckets, so len(R)/k <= M and k <= M
- In Hybrid, aside from the first bucket (kept in memory, roughly len(R)/k pages), we need another k output buffers. So len(R)/k + k <= M, which is best at k = M/2, giving len(R) <= M * M / 4
- Describe symmetric hash join and argue why it is correct
    - scan R and S concurrently; when processing an R tuple, insert it into R's hash table and then probe S's hash table, and symmetrically for S tuples
    - consider any matching pair (r, s): if r arrives first it is already in R's table when s arrives, so the pair is output then; the same argument applies if s arrives first
Parallel DB
- Reason for using shared-nothing over shared-disk architecture
- shared-disk needs a high-speed network to reach the disks, compared with directly attached disks in shared-nothing
- shared-disk must keep data consistent across nodes (multiple nodes access and cache the same pages)
- Give an example of when a repartition operation is required, and describe the three conceptual stages of the repartitioning.
- When joining R and S, but neither R nor S is partitioned on the join attribute. Then we need to repartition both R and S before the join.
- The three phases:
- split (each node splits its portion of the table into fragments)
- shuffle (redistribute the fragments)
- merge (combine the shuffled fragments at their destinations.)
- Define linear scaleup, and give one factor that can impede it
    - a task that finishes in T seconds on a single machine still takes T seconds on n machines when the task grows n times larger
    - contention for a shared resource, since its cost does not grow linearly with the number of nodes
- Draw a picture of the operators and their connections for a partitioned parallelism, dataflow evaluation of the query “σ(R) Join σ(S)” on two processors.
- The idea here is to show the select (or scan) operators for R and S, and the join operators, running on both processors. Also, we need the split operators and merge operators to redistribute the rows to these operators.