the set of samples may take too much storage: a sample needs to reuse data (common fields) shared with other samples
two main modules:
sample creation
sample selection
contributions
uses a column-set based optimization framework to compute a set of stratified samples, taking into account:
the frequency of rare subgroups
the column sets in the past queries
the storage overhead of each sample
create error-latency profiles (ELPs) for each query at runtime to estimate its error or response time on each available sample
can be integrated into Hive
Supported Queries:
supports standard SQL aggregate queries: COUNT, AVG, SUM, and QUANTILE
these aggregates have closed-form error estimates; bootstrap, by contrast, is a more general approach that also supports UDFs, but needs many more runs over resampled data (BlinkDB instead relies on closed-form statistics for COUNT, AVG, SUM, QUANTILE, etc.; see the sketch after this list)
does not support arbitrary joins or nested SQL queries
however, many nested queries and joins can be flattened into a supported form
when joining with a large table, the smaller (dimension) table must fit in the main memory of a single node in the cluster
dimension tables are all tables except the largest
70% of dimension tables are < 100 GB, so they can easily be cached
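To make the closed-form error estimation concrete, here is a minimal sketch (not BlinkDB's code; the CLT-based bound is the standard one) of an approximate AVG with an error bound computed from a uniform sample:

```python
import math
import random

def approx_avg(rows, sample_frac=0.01, z=1.96):
    """AVG over a uniform sample with a closed-form 95% error bound:
    by the CLT, stderr(mean) ~= s / sqrt(n) for sample std dev s."""
    sample = [r for r in rows if random.random() < sample_frac]
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, z * math.sqrt(var / n)   # estimate, +/- bound

est, err = approx_avg(list(range(1_000_000)))  # est ~= 499999.5 +/- err
```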
Key ideas
builds and maintains a set of multi-dimensional stratified samples from the original data over time
a dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy or response time requirements
Sample Creation
the larger the sample, the smaller the error
stratified sampling: ensures rare subgroups are sufficiently represented
uniform sampling may miss some groups by chance
the best choice is to assign an equal sample size to each group
the assignment of sample sizes is deterministic, not random
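Quick sanity check with made-up numbers: under a 1% uniform sample, a group with only 50 rows is missed entirely with probability (1 - 0.01)^50 ≈ 0.61, i.e., more often than not; a stratified sample keeps the group (all 50 rows, assuming 50 ≤ the per-group cap K).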
Goal function: maximize the weighted sum of the coverage of the queries' QCSs (the more QCSs covered, the better):
G = Σ_j p_j · y_j · Δ(q_j)
p_j: probability that a query's QCS is q_j
y_j: coverage of q_j by the selected samples (a sample built on a superset of q_j's columns also covers q_j)
Δ(q_j): the number of groups under q_j, i.e., its sparsity
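A toy greedy sketch of this trade-off (the paper formulates and solves it as a proper optimization problem; `qcs_stats`, its data shape, and the greedy strategy here are all illustrative assumptions):

```python
def choose_samples(qcs_stats, budget):
    """Greedily pick column sets to stratify on, maximizing
    G = sum(p_j * y_j * sparsity_j) under a storage budget.
    qcs_stats: {frozenset(columns): (prob, sparsity, storage_cost)}."""
    chosen, covered, used = [], set(), 0

    def gain(cand):
        # G contribution of every not-yet-covered QCS that cand would cover
        # (a sample on a superset of a QCS's columns covers that QCS)
        return sum(p * spars for q, (p, spars, _) in qcs_stats.items()
                   if q <= cand and q not in covered)

    while True:
        fits = [q for q in qcs_stats
                if q not in chosen and used + qcs_stats[q][2] <= budget]
        best = max(fits, key=gain, default=None)
        if best is None or gain(best) == 0:
            return chosen
        chosen.append(best)
        used += qcs_stats[best][2]
        covered |= {q for q in qcs_stats if q <= best}
```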
workload taxonomy
predictable queries: future queries are known exactly; traditional DBs assume this model when building lossless synopses
predictable query predicates
predictable QCSs: assuming the frequency of the sets of columns used for grouping and filtering does not change over time
though the exact values appearing in the queries' predicates are unpredictable
this is the same model used to decide which columns to build indexes on to optimize data access
samples built this way are both small and general (reusable across many queries)
QCSs were observed to be relatively stable over time in Facebook's workload
- unpredictable queries: no choice but to sample the data uniformly
Algorithm
approximately: find the largest per-group cap K such that taking min(K, |group|) rows from every group fits the sample's size budget (refer to the paper for details)
```python
import random

for group, rows in groups.items():       # groups: {group_key: [rows]}
    if len(rows) > K:
        sample[group] = random.sample(rows, K)   # cap large groups at K rows
    else:
        sample[group] = rows                     # rare group: keep all rows
```
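K itself can be found with a simple search over the monotone function Σ min(K, |group|); a sketch of one way to do it, not necessarily the paper's exact procedure:

```python
def find_cap(group_sizes, budget):
    """Largest K with sum(min(K, size) for size in group_sizes) <= budget."""
    lo, hi = 0, max(group_sizes)
    while lo < hi:                       # binary search; the sum is monotone in K
        mid = (lo + hi + 1) // 2
        if sum(min(mid, s) for s in group_sizes) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo
```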
Optimizations
the stratified samples in the family need not be drawn independently
sample A can reuse sample B's rows where their strata overlap
a sample's rows are stored sequentially, ordered by its QCS columns
if a sample spans multiple blocks, rows are distributed across those blocks randomly
so only a subset of the blocks needs to be read to satisfy a required sample size
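A small sketch of why random placement across blocks works (`block_size` and the layout are illustrative): because rows are shuffled across blocks, any prefix of the blocks is itself a uniform subsample:

```python
import random

def build_blocks(rows, block_size):
    random.shuffle(rows)                 # randomize row-to-block assignment
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]

def read_sample(blocks, wanted):
    out = []
    for block in blocks:                 # read blocks until we have enough rows
        out.extend(block)
        if len(out) >= wanted:
            break
    return out[:wanted]
```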
Runtime Sample Selection
run Q in parallel on in-memory subsets of all samples currently maintained by the system
select those with high selectivity (a high number of rows selected by Q)
- based on the results of running the query on these subsamples (its response time and relative error as functions of sample size),
construct the query's ELP for each sample (a sketch follows the two lists below)
- after picking a sample, pick the sample size (how much of that sample to read)
error depends on:
variance of the underlying data distribution
actual number of tuples processed in the sample
which in turn depends on the selectivity of the query's predicates
latency depends on:
sample size
scaling rate depending on the exact query structure (JOINS, GROUP BYs etc)
physical placement of its input
underlying data distribution
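A minimal sketch of using an ELP to pick a sample size (the scaling models error ∝ 1/√n and latency ∝ n are common closed-form assumptions, not BlinkDB's exact profile):

```python
import math

def pick_sample_size(profile, max_rel_error):
    """profile: [(n_rows, rel_error, latency_s), ...] from trial runs on
    small subsamples. Fit error ~ a/sqrt(n) and latency ~ b*n, then return
    the smallest n meeting the error bound and its predicted latency."""
    a = sum(e * math.sqrt(n) for n, e, _ in profile) / len(profile)
    b = sum(t / n for n, _, t in profile) / len(profile)
    n = math.ceil((a / max_rel_error) ** 2)
    return n, b * n
```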
- do bias correction for popular subgroups (since popular groups were capped at only K rows in the stratified sample)
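For instance, an unbiased per-group COUNT has to scale the in-sample count back up by the group's inverse sampling rate (a sketch; `group_total` would come from the sample's metadata):

```python
def corrected_count(matches_in_sample, rows_in_sample, group_total):
    # a group capped at K rows is sampled at rate rows_in_sample/group_total,
    # so scale the observed count by the inverse of that rate
    return matches_in_sample * (group_total / rows_in_sample)
```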
Implementation Concerns
multiple queries might use the same sample, so their answers are correlated
mitigated by periodically replacing the set of samples in the background
workload might change over time
keeps track of statistical properties of the underlying data (e.g., variance and percentiles)
periodically runs the sample creation module to recompute these properties and decide whether the set of samples needs to be changed
uniform samples are generally created in a few hundred seconds, while stratified samples take between 5 and 30 minutes