DB topics - Recovery

By Zhaoyu Luo

ARIES

ARIES is complex because it combines the no-force and steal buffer policies

Buffer manager

  1. force or no force? (when a XACT commits, do you write its dirty pages to disk?)
    • force: write dirty pages to disk at commit (simple, slow)
      • it avoids the need to REDO on restart
    • no force: do not force them at commit (fast, but needs REDO on restart)
  2. steal or no steal? (can you evict a dirty page of a running XACT?)
    • no steal: you cannot evict a dirty page of a running XACT (simple, slow)
    • steal: uncommitted pages may be written to disk
      • it needs UNDO to correct the data on disk

Basic idea 1: Write-Ahead Log (WAL)

no-force applies to flushing the data pages; instead, force-flush the log records (a sketch follows the list below)

  • every data update generates a log record with enough information to REDO or UNDO the update
  • before stealing a page, make sure all log records for updates to that page are on disk
  • before a XACT commits, flush all its log records to disk
  • after the XACT commits, write a commit log record
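
A minimal sketch of these rules, where every name is an illustrative stub rather than a real API:

def write_log_to_disk(records):      # stand-in for a real log-device write
    pass

def write_page_to_disk(page):        # stand-in for a real data-disk write
    pass

class LogManager:
    def __init__(self):
        self.records = []            # in-memory log tail
        self.flushed_lsn = -1        # highest LSN known to be on disk

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1           # the record's LSN

    def flush_up_to(self, lsn):
        if lsn > self.flushed_lsn:             # force the log prefix to disk
            write_log_to_disk(self.records[self.flushed_lsn + 1:lsn + 1])
            self.flushed_lsn = lsn

def steal_page(page, log):
    log.flush_up_to(page.page_lsn)   # WAL: the page's log records first...
    write_page_to_disk(page)         # ...then the stolen dirty page

def commit(xact_id, log):
    lsn = log.append(("COMMIT", xact_id))      # write the commit record
    log.flush_up_to(lsn)                       # force the XACT's log at commit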

3 phases of recovery

  • committed != written to disk && uncommitted != not written to disk
  • REDO fixes XACTs that committed but whose effects are not yet on disk
  • UNDO fixes XACTs that did not commit but whose effects are already on disk

----start of oldest "failed" XACT----"firstLSN"----most recent CKP----end of log

the start of the oldest "failed" XACT, "firstLSN" (from the DPT), and the most recent CKP can appear in arbitrary order. A sketch of all three recovery passes follows the outline below.

  1. analysis
    • determine the starting point for REDO
    • identify the set of pages that "might have been dirty at crash"
    • identify "loser" XACTs (which did not make it to the commit point)
    • scan the log forward from the most recent CheckPoint
      • initialize the XACT table & DPT to the values in the CheckPoint (recover the CKP state)
      • process log records
        • add/delete XACTs as they come and go (remove a XACT from the TT if it committed)
        • add entries to the DPT for additional pages being updated (update the DPT)
          • we add all these pages because ARIES does not log whether a page is dirty, so the DPT is a superset
        • "firstLSN" = earliest recovery_LSN in the DPT; this is the place to begin the REDO pass
          • a recovery_LSN could be before the CKP (arbitrary order)
  2. redo: bring the crashed system back to normal (repeat what it wanted to do)
    • scan the log forward from firstLSN
    • for each log record, redo the operation:
      • apply the logged update to the page it applies to
      • set the pageLSN of the page to that of the log record
  3. undo: roll back the failed transactions
    • scan backwards from the end of the log, undoing the updates of uncommitted transactions
    • to apply an undo:
      • apply the undo from the log record
      • write a CLR (compensation log record) with:
        • undo information (what was done)
        • undoNextLSN = the LSN of the next older log record that must be undone for this XACT
      • when CLRs are encountered during REDO, redo the logged undo
      • when CLRs are encountered during UNDO, ignore them; follow undoNextLSN to find the next log record to undo
        • so the number of CLRs is bounded
      • "nested top action": something that should not be undone (dangerous), even if the XACT fails
        • skip them
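
A compact sketch of the three passes, assuming the log is a Python list of dicts and each page object exposes lsn and apply(); the record fields ("type", "xact", "page", "prev", "redo", "undo", "undo_next") are simplified stand-ins for the real ARIES layouts:

def analysis(log, ckp_lsn, ckp_tt, ckp_dpt):
    tt, dpt = dict(ckp_tt), dict(ckp_dpt)      # recover the CKP state
    for lsn in range(ckp_lsn, len(log)):
        rec = log[lsn]
        if rec["type"] in ("update", "clr"):
            tt[rec["xact"]] = lsn              # lastLSN of the XACT
            dpt.setdefault(rec["page"], lsn)   # recovery_LSN: first dirtier wins
        elif rec["type"] == "commit":
            tt.pop(rec["xact"], None)          # committed => not a loser
    first_lsn = min(dpt.values(), default=len(log))
    return tt, dpt, first_lsn                  # losers, dirty pages, REDO start

def redo(log, dpt, pages, first_lsn):
    for lsn in range(first_lsn, len(log)):
        rec = log[lsn]
        if rec["type"] not in ("update", "clr"):
            continue                           # CLRs are redone like updates
        page = pages[rec["page"]]
        if rec["page"] in dpt and dpt[rec["page"]] <= lsn and page.lsn < lsn:
            page.apply(rec["redo"])
            page.lsn = lsn                     # set pageLSN to the record's LSN

def undo(log, losers, pages):
    to_undo = set(losers.values())             # lastLSN of each loser XACT
    while to_undo:
        lsn = max(to_undo)                     # walk backwards through the log
        to_undo.remove(lsn)
        rec = log[lsn]
        if rec["type"] == "clr":
            nxt = rec["undo_next"]             # skip work already undone
        else:
            pages[rec["page"]].apply(rec["undo"])
            log.append({"type": "clr", "xact": rec["xact"],
                        "page": rec["page"], "redo": rec["undo"],
                        "undo_next": rec["prev"]})   # at most one CLR per undo
            nxt = rec["prev"]                  # next older record of this XACT
        if nxt is not None:
            to_undo.add(nxt)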

Key data structures

  1. XACT table (TT)
    • one entry per active XACT; keeps the LogSequenceNumber (LSN) of the last log record generated by the XACT (lastLSN)
    • the LSN tells you where it is within the log
  2. Dirty Page Table (DPT)
    • entries for all dirty buffer-resident pages; each entry contains a recovery_LSN field
      • recovery_LSN: the LSN of the log record that first dirtied the page
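
One plausible in-memory shape for these two tables (a sketch; the field names follow these notes, not any particular implementation):

from dataclasses import dataclass

@dataclass
class XactEntry:
    xact_id: int
    last_lsn: int            # LSN of the last log record this XACT wrote

@dataclass
class DirtyPageEntry:
    page_id: int
    recovery_lsn: int        # LSN of the record that first dirtied the page

xact_table: dict[int, XactEntry] = {}              # keyed by XACT id
dirty_page_table: dict[int, DirtyPageEntry] = {}   # keyed by page id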

Checkpoint records

  • so recovery does not have to scan from the very start of the log, only back to a certain point
  • written periodically
  • saves the current XACT table and DPT

Think Cases: start_before_CKP/start_after_CKP & finish_before_crash/not_yet_finished

Buffer management

Query Locality Set Model (QLSM)

  • sequential patterns
    • straight sequential (file scan)
      • needs only one page
      • replace it with the next one
    • clustered sequential
      • example: the inner relation in a merge join (sequential + back-up)
      • needs the number of pages in the largest cluster
      • LRU (kick out the oldest)

Love/Hate hints

Love: I expect to come back to this page; Hate: I would rather not see it again

2 LRU chains: 1 love chain, 1 hate chain

LRU-k (Least Recently Used)

LRU-2: rank each page Pi by its 2nd most recent reference; LRU-k: evict the page whose k-th most recent reference is oldest. A sketch follows.
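
A small sketch of LRU-k eviction, tracking the last k reference times per page (the bookkeeping is simplified relative to the LRU-k paper):

import itertools
from collections import defaultdict, deque

class LRUK:
    """Evict the page whose k-th most recent reference is oldest."""
    def __init__(self, k):
        self.k = k
        self.clock = itertools.count()
        self.history = defaultdict(deque)    # page -> last k reference times

    def access(self, page):
        h = self.history[page]
        h.append(next(self.clock))
        if len(h) > self.k:
            h.popleft()                      # keep only the last k references

    def victim(self):
        def kth_ref(page):                   # time of the k-th most recent ref;
            h = self.history[page]           # fewer than k refs counts as the
            return h[0] if len(h) == self.k else float("-inf")   # oldest possible
        return min(self.history, key=kth_ref)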

Join algorithms

Nested loop join

for r in R:
    for s in S:
        if r.A == s.B:
            output(r, s)
# O(Time) = len(R) + number_of_R_rows_per_page * len(R) * len(S)

An optimization is page-oriented (block) nested loops join: read a block of R (up to thousands of pages) at a time and check S against it, which gets rid of the number_of_R_rows_per_page factor:

for R_block in blocks(R):             # blocks() is illustrative: as many R pages as fit in memory
    ht = {r.A: r for r in R_block}    # in-memory hash table on R.A (assumes unique keys for brevity)
    for s in S:                       # scan S once per block
        if s.B in ht:
            output(ht[s.B], s)
# O(Time) = len(R) + (len(R) / block_size) * len(S); with 1-page blocks this is len(R) + len(R) * len(S)

Note that a hash table only works for equijoins; it is not good for arbitrary comparison predicates

Index nested loops

if you have an index on S.B then:

for r in R:
    look up r.A in the index on S.B
# O(Time) = len(R) + number_of_R_rows_per_page * len(R) * cost_of_index_lookup
# cost_of_index_lookup could be just ~1.2 for a hash index, 2-4 for a B+ tree

Sort-Merge

  1. sort R on R.A
  2. sort S on S.B
  3. merge (sketched below)
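
A sketch of the merge step on the inputs sorted in steps 1-2 (list-based for brevity; out is a caller-supplied callback and rows expose .A/.B, both assumptions):

def merge_join(R, S, out):
    i = j = 0                            # R sorted on A, S sorted on B
    while i < len(R) and j < len(S):
        if R[i].A < S[j].B:
            i += 1
        elif R[i].A > S[j].B:
            j += 1
        else:
            k = j                        # emit R[i] against the whole
            while k < len(S) and S[k].B == R[i].A:
                out(R[i], S[k])          # equal-key group of S
                k += 1
            i += 1                       # keep j: the next R row may match
                                         # the same S group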

External Sorting

  1. form runs the size of the buffer pool (n pages)
  2. merge runs n-1 at a time to form runs of length n * (n-1)

How big a file can you sort in 2 passes? n * (n-1) pages.

if len(R) < n * n and len(S) < n * n:
    sort R: 4 * len(R)  # this counts I/Os, not in-memory time: since len(R) > n,
                        # R cannot be sorted in one pass, so external sort is needed;
                        # each pass reads and writes every page once, 4 I/Os per page over 2 passes
    sort S: 4 * len(S)
    merge: len(R) + len(S)
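
A toy two-pass external sort matching the cost above, with in-memory lists standing in for disk pages (an assumption for brevity) and heapq doing the (n-1)-way merge:

import heapq

def external_sort(pages, n):
    # pages: a list of pages, each page a list of keys; n: buffer size in pages.
    # Two passes suffice when len(pages) <= n * (n - 1).
    runs = []
    for i in range(0, len(pages), n):        # pass 1: sorted runs of n pages each
        runs.append(sorted(x for p in pages[i:i + n] for x in p))
    assert len(runs) <= n - 1, "would need more than 2 passes"
    return list(heapq.merge(*runs))          # pass 2: merge up to n-1 runs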

GRACE Hash Join

This is a parallelizable idea.

  • partition R into R1, R2, … , Rk using hashing on R.a
  • partition S into S1, S2, … , Sk using hashing on S.b

      for i = 1 to k:
          Ri join Si
    
  • sub-join (a sketch of the full algorithm follows this list):
    1. read Ri into memory
    2. build a hash table on the Ri tuples
    3. stream Si, probing the hash table on Ri
  • use a different hash function for partitioning and for the in-memory join:
    • partitioning hash: h1(x) = (h(x) mod k) + 1, with a "big" range; pick k big enough that each Ri fits in memory (k is the number of partitions)
    • in-memory hash: h2(x) = (h(x) mod kk) + 1, with kk big enough that ~1 row lands in each bucket (kk is the number of in-memory buckets)
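
A sketch of the whole algorithm, with in-memory lists standing in for the on-disk partition files (the mod hash and the attribute names r.A / s.B are illustrative):

def grace_hash_join(R, S, k, out):
    h1 = lambda key: hash(key) % k           # partitioning hash
    R_parts = [[] for _ in range(k)]
    S_parts = [[] for _ in range(k)]
    for r in R:                              # pass 1: partition both inputs
        R_parts[h1(r.A)].append(r)
    for s in S:
        S_parts[h1(s.B)].append(s)
    for Ri, Si in zip(R_parts, S_parts):     # pass 2: per-partition hash join
        ht = {}
        for r in Ri:                         # build on the R side (h2 is the
            ht.setdefault(r.A, []).append(r) # dict's own hashing here)
        for s in Si:                         # stream Si, probing the table
            for r in ht.get(s.B, []):
                out(r, s)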

How big can R be to do this in 2 passes? Suppose you have M memory pages:

  1. k <= M - 1 (one output page per partition, plus one input page for R)
  2. len(R) / k <= M - 2 (one page for the input of Si, one page for the join output)
  3. so len(R) <= (M-1)(M-2)

Simple Hash Join

# big k is bad: on every pass, rows that do not hash to bucket i must be
# written back to disk and read again
for i = 1 to k:
    read R, hashing each row r
        if r hashes to bucket i:
            leave it in memory
        else:
            write it to disk
    read S, hashing each row s
        if s hashes to bucket i:
            probe the in-memory R partition
        else:
            write it to disk

Hybrid Hash

  1. scan the smaller relation R, partitioning it into k buckets, but leave bucket 1 in memory
  2. scan the larger relation S, partitioning it into k buckets, except for bucket 1: just probe R1 in memory
  3. (as in GRACE hash join) for i = 2 to k: Ri join Si

Cost Model of Hybrid Hash

  • {R}, {S}: # of tuples in R and S
  • |R|, |S|: # of pages in R and S
  • hash: time to hash a tuple
  • move: time to move a tuple to a buffer
  • comp: cost of comparing 2 tuples
  • I/O: cost of reading or writing a page
  • F: > 1 (usually 1.1); a universal fudge factor
  • q: fraction of R in the first partition
    • if q is 1: all of R fits in memory and there is only one partition
    • e.g., with 1000 records and memory for 100, the biggest q is 0.1, because the resident partition must fit in memory

Total cost:

({R} + {S}) * hash +            // partition cost: determine which partition a tuple falls in
({R} + {S}) * (1 - q) * move +  // move tuples to output buffers
(|R| + |S|) * (1 - q) * I/O  +  // write the spilled partitions from output buffers to disk
(|R| + |S|) * (1 - q) * I/O  +  // read them back into memory
({R} + {S}) * (1 - q) * hash +  // build & probe hash tables for partitions 2..k
{R} * move                   +  // move tuples into the in-memory hash table for R
{S} * comp * F                  // probe with each tuple in S; may take more than one comparison
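
A quick numeric check of the formula; every constant below is made up purely to exercise the arithmetic (including the 100-tuples-per-page assumption used to convert the I/O terms to pages):

R_t, S_t = 100_000, 400_000              # {R}, {S}: tuple counts
hash_c, move, comp = 1e-7, 5e-8, 2e-7    # seconds per tuple
io = 1e-2                                # seconds per page I/O
F, q = 1.1, 0.25                         # fudge factor, resident fraction of R
pages = lambda tuples: tuples / 100      # assumed tuples per page

total = ((R_t + S_t) * hash_c                   # partition cost
       + (R_t + S_t) * (1 - q) * move           # move tuples to output buffers
       + pages((R_t + S_t) * (1 - q)) * io      # write spilled partitions
       + pages((R_t + S_t) * (1 - q)) * io      # read them back
       + (R_t + S_t) * (1 - q) * hash_c         # build & probe partitions 2..k
       + R_t * move                             # move R tuples into the hash table
       + S_t * comp * F)                        # probe with each S tuple
print(f"estimated join time: {total:.1f} s")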

Symmetric Hash Join

No need to study this one as much as the others.

  1. assume everything fits in memory; build 2 hash tables (one on R, one on S)
  2. read r: insert it into R's hash table, then probe S's hash table
  3. read s: insert it into S's hash table, then probe R's hash table

How to ensure it is correct? Consider any matching pair (Ri, Si): the algorithm is symmetric, so no matter whether Ri or Si arrives first, the pair will eventually be found.
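
A minimal sketch of the streaming version, assuming rows arrive as one interleaved stream of ("R"/"S", row) pairs and expose .A/.B join keys (all assumptions):

def symmetric_hash_join(stream, out):
    R_ht, S_ht = {}, {}
    for side, row in stream:
        if side == "R":
            R_ht.setdefault(row.A, []).append(row)   # insert into own table
            for s in S_ht.get(row.A, []):            # then probe the other
                out(row, s)
        else:
            S_ht.setdefault(row.B, []).append(row)
            for r in R_ht.get(row.B, []):
                out(r, row)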

Comparison

GraceJoin is faster:

GraceJoin                           SortMerge
read(seq) R, write(rand) R          read(seq) R, write(seq) R (initial runs)
read(seq) S, write(rand) S          read(seq) S, write(seq) S (initial runs)
read(seq) R, read(seq) S            read R, write R (finish sort)
                                    read S, write S (finish sort)
                                    read(rand) R, read(rand) S (merge-join)

Notes: GraceJoin's writes are random because there are different (hashed) output files; SortMerge's initial-run writes are sequential because it reads one file, sorts, and writes it back. SortMerge can also skip the two "finish sort" steps: a fully sorted R and S is not needed, since the smallest elements of R and S can be joined during the heap-based merge.

Parallel DB

Goals for parallelism

  • linear speed-up: a system with 2X nodes is 2X as fast
  • linear scale-up: a system 2X as big solves a problem 2X as big in the same time

Barriers to these

  • startup: time that cannot be parallelized
  • interference: the "slowdown each new process adds to other processors"
  • skew: stragglers or an unbalanced workload

Types of systems

  • shared-memory: does not scale very far
  • shared-disk: scales better
  • shared-nothing: Teradata (Gamma), DB2 PE, Oracle Exadata, MS PDW

DATAllegro

It built a glue layer to connect multiple Ingres instances

Types of parallel joins

  1. redistribute 2 relations
  2. redistribute 1 relation
  3. redistribute none

broadcast join

broadcast the smaller relation, then do local joins: there is no need to do the full join everywhere (but this is bad if R and S are equally big)

skew problem
  • Suppose S is highly skewed on S.b: if we do a repartitioning join, some nodes will be heavily loaded => skew in execution time
  • If we instead broadcast R => no skew

How to solve: split R and S using frequency information about S.b (assume R is not skewed); a sketch follows the list:

  • R rows whose key is frequent in S: broadcast
  • R rows whose key is not frequent in S: partition join
  • S rows with keys frequent in S: leave in place
  • S rows with keys not frequent in S: repartition
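
A sketch of that routing decision, assuming the set of frequent S.b values is already known (the hash routing and node count are illustrative):

def route_for_skewed_join(R, S, frequent_keys, n_nodes):
    part = lambda key: hash(key) % n_nodes   # ordinary repartitioning
    plan = []                                # (row, destination) pairs
    for r in R:
        if r.A in frequent_keys:
            plan.append((r, "broadcast"))    # send to every node
        else:
            plan.append((r, part(r.A)))      # normal partitioned join
    for s in S:
        if s.B in frequent_keys:
            plan.append((s, "stay"))         # join locally where it lives
        else:
            plan.append((s, part(s.B)))
    return plan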

Partitioning strategies

  • round-robin
    • no skew
    • no useful properties for query processing
  • hash on some attribute
    • might skew (some values occur more frequently than others)
    • useful for query processing
  • range partitioning
    • still might skew

What if R uses round-robin and S uses hash partitioning? R needs to be repartitioned in the end

  • If R is hash-partitioned on R.a and S is hash-partitioned on S.b => no redistribution for the join (redistribute zero)
  • If neither is partitioned on its join attribute => redistribute both

Example

how to evaluate SELECT R.A, Avg(R.D) FROM R GROUP BY R.A in parallel?

  • first approach
    1. repartition on R.A
    2. compute the local group-by
  • second approach (sketched below)
    1. group by locally (pre-aggregate); this works for AVG via local (sum, count) pairs, but cannot be done directly for aggregates like MEDIAN
    2. repartition the partial results
    3. combine the partial results to finish the group-by and compute Avg(R.D)
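
A sketch of the second approach for AVG, using (sum, count) partial aggregates; the two-"node" layout in the usage lines is made up:

from collections import defaultdict

def local_preaggregate(rows):                # rows: (R.A, R.D) pairs on one node
    partial = defaultdict(lambda: [0.0, 0])
    for a, d in rows:
        partial[a][0] += d                   # running sum per group key
        partial[a][1] += 1                   # running count per group key
    return partial

def merge_and_finish(partials_per_node):     # after repartitioning on R.A
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials_per_node:
        for a, (s, c) in partial.items():
            merged[a][0] += s
            merged[a][1] += c
    return {a: s / c for a, (s, c) in merged.items()}

node1 = local_preaggregate([(1, 10.0), (2, 4.0)])
node2 = local_preaggregate([(1, 20.0)])
print(merge_and_finish([node1, node2]))      # {1: 15.0, 2: 4.0}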

Summary

  • The difference between locking records, latching data structures, and pinning pages in a DBMS
    • latching is like a semaphore: a physical lock
    • locking provides multiple granularities: a logical lock
    • TODO
  • In the DBMin algorithm, consider the case where query Q2 finds a page p in some other query Q1's locality set. When Q2 accesses the page, DBMin does not update the statistics for this page in Q1's locality set. Is this good?

ARIES

  • What is "redo", and why is it needed?
    • redo repeats what actually happened, not just the successful transactions. This allows CLRs during recovery and helps enable logical redo/undo
  • What if we did no logging during UNDO?
    • there may be a subsequent crash, and the system would not know how many UNDOs have already been performed
  • What about writing normal log records instead of CLRs?
    • if repeated crashes happen, the log space becomes unbounded
  • How does ARIES use bounded space for recovery even across repeated crashes?
    • ARIES writes at most one log record per undone update (due to the use of CLRs) during the UNDO phase, so the space is bounded
  • Two conditions under which there is no need to redo an update to a data page, without even examining the page in question:
    1. the page is not in the DPT
    2. the page's recovery_LSN in the DPT > the log record's LSN
      • (the remaining check, logRecLSN <= pageLSN, does require examining the page)
    • these checks are for efficiency, not necessity
  • Why not write any log records during the REDO phase?
    • because the existing log already records what was done; logging during REDO would just duplicate it
  • Which transactions can be ignored during recovery without causing a problem?
    • transactions that started before the crash but wrote no log records to disk
  • Does the pageLSN of a page modified during the REDO phase need to be updated to account for this modification? Why or why not?
    • Yes. If not, there is no record of whether the REDO has been performed, which can lead to errors if there are subsequent crashes.

2PC

  • What if the coordinator fails before making the commit/abort decision? Does it need to force a "begin" log record?
    • In presumed-abort, we do not need to force the "begin" record; the coordinator just aborts by default
  • Why is it OK not to force the coordinator's end record?
    • if the end record is flushed before the coordinator crashes, everything is fine
    • if the end record is not flushed before the crash, the coordinator recovers, finds that it had instructed commit/abort, and resends the "commit/abort" message to ensure all subordinates know
  • Any impact of turning on a forced end record?
    • during normal operation it slows things down, since there are extra flushes to disk
    • during recovery it speeds things up, since resending the commit/abort message is skipped
  • Why is 2PC a blocking protocol? How can a participant get blocked by others?
    • if a participant votes YES and the coordinator crashes, the participant is blocked until the coordinator recovers
  • Why is the "collecting" record required in presumed-commit 2PC but not in presumed-abort 2PC?
    • in presumed-commit, the begin (collecting) record is logged, so if the coordinator crashes after sending "prepare", it recovers, finds out which subordinates were participating, and aborts them; otherwise they would commit by default
    • in presumed-abort, if the coordinator crashes and there is no collecting record, it does not know the transaction and answers abort, which is the right answer
  • In the "read-only" optimization, if a subordinate is read-only, why can't the coordinator just ignore it during the commit protocol, given that it doesn't matter whether the read-only subordinate commits or not?
    • the coordinator may not know ahead of time which subordinates are read-only
    • a subordinate may not know that it is read-only, since it could have an unsatisfied conditional update and hence perform no update
  • State 3 different 2PC versions. Why and when can the coordinator forget a transaction?
    • forget means deleting a transaction from the in-memory data structures while still being able to give the right answer
    • Standard 2PC: after receiving all "ACK"s from subordinates (all records are forced)
    • Presumed-Abort:
      • commit case: after receiving all "ACK"s; if the coordinator forgot earlier, a subordinate might ask again and get the wrong (abort) presumption
      • abort case: after sending out "abort"
    • Presumed-Commit:
      • commit case: after the coordinator forces the commit record
      • abort case: after receiving all "ACK"s
  • In presumed-commit:
    • In the abort case, why does the coordinator need to block on receiving ACKs, and why does a subordinate need to force its abort record?
      • if the abort were not forced, a subordinate could crash, wake up, ask, and get a "commit" (since the coordinator would already have its ACK and have forgotten the transaction); for the same reason, the coordinator should block on receiving the "forced" ACKs
    • In the commit case, in contrast, why does the coordinator not need to block on ACKs, and why do subordinates not need to force?
      • because even if the coordinator forgets the transaction, the default answer is "commit"

Join Algorithm

  • The GRACE hash join can be accomplished with one partitioning pass and one join pass if len(R) < M * M. Can hybrid hash join? If not, how?
  • Compare block nested loops and hybrid hash: when would block nested loops be faster?
    • the key point of hybrid hash is to partition everything into buckets, but only 1 bucket is matched in memory at a time
    • suppose only 0.5R fits into memory:
      • block nested loops: read 0.5R, read S, join; read 0.5R, read S, join; total R + 2S
      • hybrid hash: read R, write 0.5R, read S, write 0.5S, join; read 0.5R, read 0.5S, join; total 2R + 2S
  • bitmap indices vs B-tree indices
  • Since in GRACE len(R) <= M * M, what is the bound in Hybrid?
    • in GRACE, assume we divide R into k buckets, so len(R)/k <= M and k <= M
    • in Hybrid, aside from the first (resident) bucket, we need another k buffer pages, so len(R) / k + k <= M, hence len(R) <= M * M / 4 (the product k(M-k) peaks at k = M/2)
  • Describe symmetric hash join and argue why it is correct
    • scan both R and S concurrently; when processing an R row, insert it into R's hash table, then probe S's hash table; symmetrically for S
    • consider a matching pair (r, s): if r comes first, it is in R's hash table, and the pair is generated when s arrives; the same holds if s comes first

Parallel DB

  • Reasons for using a shared-nothing over a shared-disk architecture
    • shared-disk needs a high-speed network, compared with directly attached disks
    • shared-disk needs to ensure data consistency (multiple accessors with caches)
  • Give an example of when a repartition operation is required, and describe the three conceptual stages of the repartitioning.
    • When joining R and S, but neither R nor S is partitioned on the join attribute. Then we need to repartition both R and S before the join.
    • The three phases:
      1. split (each node splits its portion of the table into fragments)
      2. shuffle (redistribute the fragments)
      3. merge (combine the shuffled fragments at their destinations)
  • Define linear scale-up; name one factor that can impede it
    • a task that finishes in T seconds on a single machine takes the same time on n machines when the task grows n times larger
    • accessing a shared resource, since that consumes non-linear time
  • Draw a picture of the operators and their connections for a partitioned-parallelism, dataflow evaluation of the query "σ(R) Join σ(S)" on two processors.
    • The idea is to show the select (or scan) operators for R and S, and the join operators, running on both processors. We also need split operators and merge operators to redistribute the rows among these operators.