Impala
- Business intelligent analytics
- An MPP database, strength on multi-user
- query plan generation
- single node plan
- do partial aggregation “combines” in plan tree
- resource allocation
- multiple queries
- admission control
- it sacrifice resource efficiency for performance
- shared-nothing
- I/O operations are always full
- Parquet
- loads working set in memory
- streaming data
- llama: low latency application master
- virtual function
- loop unrolling
- LLVM: code generation atop LLVM
- quasi quotes
- get AST
- get optimized code
Design
- Each node is able to accept and execute queries
- each node is ready, so no overhead of scheduling a map task
- read throughput could scale with number of disks
- avoid remote read, don’t need to go to name node, data node
- read directly and locally
- All nodes’ system catalogs are up to date
- Coordination and synchronization cluster-wide metadata
- Does not support
UPDATE or DELETE
, only supports bulk insertions
- avoid synchronous RPCs wherever possible on the critical path of any query
- pub-sub service: statestore
- push updates to all interested parties, e.g., metadata changes to all subscribers
- construct a bloom filter to implement a simple version of a semi-join
Datastore
- Parquet: row groups, columns stored sequentially on disk (pages, compression at pages)
- ORCFile: stripes, row groups, columnas storage
Statestore
- it is cluster metadata: load on machines, liveness of nodes, catalogs for underlying data
- catalogs: physical plan and well-form-ness
- topics: arrays of (key, value, version) triplets
- persistent throught the lifetime of the statestore
- not persisted across service restarts
- need registration
- send delta changes every 2s
- if ping time-out need re-register
- failed subscriber would be removed
Resource/Workload management
- YARN is centralized scheduling
- decision is made with full knowledge of cluster state
- latency is too high
- Impala needs to handle thousands of queries per second
- New complementary but independent admission control: allow users to control their workloads without centralized decision-making
- A service between Impala and YARN: resource caching, gang scheduling, incremental allocation changes