BlinkDB

By Zhaoyu Luo

Approximation is hard

  1. sampling and sketches
    • low space and time complexity
    • make strong assumptions about the query workload
  2. on-line aggregation (OLA)
    • highly variable performance
    • make fewer assumptions about the query workload
      • OLA may need to read the entire table to compute a result with satisfactory error bounds

BlinkDB

  • a distributed sampling-based approximate query processing system
  • allow users to pose SQL-based aggregation queries over stored data
  • make no assumptions about the attribute values in the WHERE, GROUP BY, and HAVING clauses
    • only assume query column sets or QCSs are stable over time
    • e.g., the QCS <genre, city> determines a set of unique value combinations, which define the strata of the samples
    • sample based on each column set, not on values
      • the samples may be too large, so a sample needs to reuse other samples’ common data (common fields)
  • two main modules:
    • sample creation
    • sample selection
  • contributions
    • use a column-set based optimization framework to compute a set of stratified samples, which takes into account:
      • the frequency of rare subgroups
      • the column sets in past queries
      • the storage overhead of each sample
    • create error-latency profiles (ELPs) for each query at runtime to estimate its error or response time on each available sample
    • could integrate into Hive
  • Supported Queries:
    • support standard SQL aggregate queries: COUNT, AVG, SUM and QUANTILE
      • in contrast, bootstrapping is a more general approach that supports UDFs, but it also needs many iterations of resampling (BlinkDB instead relies on closed-form statistics for SUM, AVG, COUNT, QUANTILE, etc.)
    • does not support arbitrary joins and nested SQL queries
      • any nested queries or joins can be flattened to run
    • when joining with a large table, the smaller tables (dimension tables) should fit into the main memory of any single node in the cluster
      • dimension tables are all tables except the largest
    • about 70% of dimension tables are smaller than 100 GB and can easily be cached
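
Concretely, aggregate queries can carry either an error bound or a response-time bound. The block below is just a sketch of what such bounded queries look like, with the clause syntax and the Sessions/Genre/OS names paraphrased from the paper's running example rather than quoted verbatim:

```python
# Two BlinkDB-style aggregate queries (paraphrased syntax, not verbatim):
# one bounded by relative error at a confidence level, one by response time.
ERROR_BOUNDED_QUERY = """
SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
ERROR WITHIN 10% AT CONFIDENCE 95%
"""

TIME_BOUNDED_QUERY = """
SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
WITHIN 5 SECONDS
"""
```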

Key ideas

  1. builds and maintains a set of multi-dimensional stratified samples from the original data over time
  2. uses a dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy or response time requirements

Sample Creation

  • the larger the sample, the smaller the error
  • stratified sampling: ensures rare subgroups are sufficiently represented
    • uniform sampling may miss some groups by chance
    • the best choice is to assign an equal sample size to each group
      • the assignment of sample sizes is deterministic, not random
  • Goal function: aims to maximize the weighted sum of the coverage of the QCSs of the queries (the more QCSs covered, the better)
    • G = Σ over past QCSs of (probability of the QCS) × (coverage of the QCS by the chosen samples, i.e., samples built on supersets of it) × (sparsity of the QCS, i.e., its number of rare groups)
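
A toy sketch of how such a weighted-coverage objective could be evaluated for one candidate set of stratified samples. The names (workload, sparsity, goal) and the binary coverage score are illustrative simplifications, not BlinkDB internals:

```python
def coverage(qcs, chosen_qcss):
    """Simplified coverage score: 1.0 if some chosen sample is built on a
    superset of `qcs` (so it can answer queries that group/filter on it)."""
    return 1.0 if any(qcs <= s for s in chosen_qcss) else 0.0

def goal(chosen_qcss, workload, sparsity):
    """G = sum over past QCSs q of p(q) * coverage(q) * sparsity(q),
    where p(q) is q's query probability and sparsity(q) counts its rare groups."""
    return sum(p * coverage(q, chosen_qcss) * sparsity[q]
               for q, p in workload.items())

# Example: two past QCSs with their query probabilities and rare-group counts.
workload = {frozenset({"city"}): 0.6, frozenset({"genre", "city"}): 0.4}
sparsity = {frozenset({"city"}): 10, frozenset({"genre", "city"}): 200}
print(goal([frozenset({"genre", "city"})], workload, sparsity))  # 86.0
```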

Workload taxonomy

  1. predictable queries: traditional DBs use such a model for lossless synopses
  2. predictable query predicates
  3. predictable QCSs: assumes the frequency of the sets of columns used for grouping and filtering does not change over time
    • the exact values appearing in the queries are still unpredictable
    • this model can decide the columns on which building indices would optimize data access
    • samples built under this model are both small and general-purpose
    • QCSs are relatively stable over time in the Facebook workload
  4. unpredictable queries: you have no choice but to sample the data uniformly

Algorithm

  1. approximately reduces to finding the largest per-group cap K (a target average number of rows per group; refer to the paper for details); a toy version is sketched below
    • for each group: if the group has more than K rows, pick K rows at random as its sample; otherwise keep the whole group
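
A minimal Python sketch of this capped, per-group (stratified) sampling. The binary search for K is just one simple way to fit a row budget and is an assumption here, not necessarily the paper's exact procedure; all names are illustrative:

```python
import random
from collections import defaultdict

def find_cap(group_sizes, target_rows):
    """Largest cap K such that sum(min(size, K)) stays within target_rows."""
    lo, hi = 0, max(group_sizes)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if sum(min(s, mid) for s in group_sizes) <= target_rows:
            lo = mid
        else:
            hi = mid - 1
    return lo

def stratified_sample(rows, qcs, target_rows, seed=0):
    """Group rows by their values on the QCS columns, then keep
    min(K, group size) randomly chosen rows from each group."""
    random.seed(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qcs)].append(row)
    k = find_cap([len(g) for g in groups.values()], target_rows)
    sample = []
    for g in groups.values():
        sample.extend(g if len(g) <= k else random.sample(g, k))
    return sample

# Rare (genre, city) groups keep all of their rows; popular ones are capped at K.
rows = ([{"genre": "jazz", "city": "NYC", "x": i} for i in range(1000)]
        + [{"genre": "folk", "city": "SF", "x": i} for i in range(5)])
print(len(stratified_sample(rows, ("genre", "city"), target_rows=105)))  # 105
```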

Optimizations

  • it is not necessary that the samples in the family be selected independently
    • sample A could reuse sample B’s items if they have overlap
  • samples are stored sequentially according to the order of columns
  • if there are multiple blocks for a sample, the rows are distributed randomly
    • we only need to read a subset of the blocks to satisfy the required sample size
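
A rough sketch of why this block layout helps, under the assumption stated above that rows were assigned to a sample's blocks at random, so any prefix of blocks is itself a random subsample:

```python
def read_subsample(blocks, target_rows):
    """Read whole blocks in order, stopping once the requested sample size
    is reached; no need to scan the remaining blocks of the sample."""
    out = []
    for block in blocks:
        if len(out) >= target_rows:
            break
        out.extend(block)
    return out[:target_rows]

# Example: a sample stored as 4 blocks of 250 rows; a 600-row subsample
# touches only the first 3 blocks.
blocks = [[("row", b, i) for i in range(250)] for b in range(4)]
print(len(read_subsample(blocks, 600)))  # 600
```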

Runtime Sample Selection

  1. run Q in parallel on in-memory subsets of all samples currently maintained by the system
    • select the samples with high selectivity, i.e., a high fraction of the rows read are selected by Q
    • based on the results of running the query on these subsamples (response time and relative error with respect to sample size), construct the ELP of the query for each sample
    • after picking the sample, pick the sample size (a toy ELP is sketched after this list)
    • error depends on:
      • the variance of the underlying data distribution
      • the actual number of tuples processed in the sample
        • which in turn depends on the selectivity of the query’s predicates
    • latency depends on:
      • the sample size
      • a scaling rate that depends on the exact query structure (JOINs, GROUP BYs, etc.)
      • the physical placement of its input
      • the underlying data distribution
    • answers computed on stratified samples need bias correction for popular subgroups, since those subgroups have been shrunk to only K rows
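
A toy error-latency profile under two common scaling assumptions (relative error shrinking roughly as 1/sqrt(n), latency growing roughly linearly in n), with the constants fitted from a pilot run on a small subsample; the model and names are illustrative, not the system's actual ELP construction:

```python
import math

def fit_elp(pilot_rows, pilot_error, pilot_latency):
    """Fit error(n) = a / sqrt(n) and latency(n) = b * n from one pilot run."""
    a = pilot_error * math.sqrt(pilot_rows)
    b = pilot_latency / pilot_rows
    return (lambda n: a / math.sqrt(n)), (lambda n: b * n)

def pick_sample_size(error_fn, latency_fn, max_error=None, max_latency=None,
                     available_sizes=(10_000, 100_000, 1_000_000)):
    """Smallest available size meeting the error bound, or the largest
    available size meeting the latency bound."""
    if max_error is not None:
        for n in available_sizes:
            if error_fn(n) <= max_error:
                return n
        return None
    feasible = [n for n in available_sizes if latency_fn(n) <= max_latency]
    return max(feasible) if feasible else None

# A pilot run on 10k rows gave 5% relative error and took 0.2 seconds.
err, lat = fit_elp(10_000, 0.05, 0.2)
print(pick_sample_size(err, lat, max_error=0.01))   # 1000000
print(pick_sample_size(err, lat, max_latency=5.0))  # 100000
```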

Implementation Concerns

  • multiple queries might use the same sample, so the answers are correlated
    • periodically replace the set of samples in use, in the background
  • workload might change over time
    • keep track of statistical properties of the underlying data (e.g., variance and percentiles)
    • periodically runs the sample creation module to recompute these properties and decide whether the set of samples needs to be changed
  • uniform samples are generally created in a few hundred seconds, while stratified samples take between 5 and 30 minutes