Impala
- Business intelligence analytics
- An MPP database whose strength is multi-user (concurrent) workloads
- query plan generation
  - single-node plan
  - partial aggregation “combines” are done in the plan tree
- resource allocation
  - multiple queries
  - admission control
  - it sacrifices resource efficiency for performance
- shared-nothing
- I/O is kept saturated: scans aim to use the full available disk bandwidth
- Parquet
- loads working set in memory
- streaming data
- Llama: Low Latency Application Master
- virtual function
- loop unrolling
  - LLVM: code generation atop LLVM
  - quasi quotes
    - get AST
  - get optimized code
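
The partial aggregation “combines” noted under query plan generation can be sketched as a two-phase sum. This is a minimal illustration, assuming a `SUM ... GROUP BY` style query; the function names `partial_aggregate` and `merge_aggregate` and the in-memory lists standing in for node-local data are hypothetical and not part of Impala's API.

```python
from collections import defaultdict

def partial_aggregate(rows):
    """Phase 1 (runs on each node): pre-aggregate local (key, value) rows."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)

def merge_aggregate(partials):
    """Phase 2 (runs after the exchange): merge the per-node partial sums."""
    final = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            final[key] += value
    return dict(final)

# Hypothetical data: each inner list is the slice of a table held by one node.
node_data = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]

partials = [partial_aggregate(rows) for rows in node_data]
result = merge_aggregate(partials)
print(result)
```

Only one partial row per key leaves each node, which is why pushing the combine step down the plan tree reduces the data sent over the network.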
 
Design
- Each node is able to accept and execute queries
  - every node is already running, so there is no overhead of scheduling a map task
  - read throughput can scale with the number of disks
- avoid remote reads; no need to go through the NameNode or DataNode
  - read directly and locally
- All nodes’ system catalogs are up to date
- Coordination and synchronization of cluster-wide metadata
- Does not support UPDATE or DELETE, only supports bulk insertions
- avoid synchronous RPCs wherever possible on the critical path of any query
- pub-sub service: statestore
  - pushes updates to all interested parties, e.g., metadata changes to all subscribers
- construct a bloom filter to implement a simple version of a semi-join
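
The Bloom-filter semi-join mentioned above can be sketched as follows. This is a toy version under stated assumptions: the `BloomFilter` class, the `semi_join` helper, the filter size, and the sample tables are all hypothetical and chosen for illustration, not taken from Impala.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def semi_join(probe_rows, build_rows, key):
    """Return the probe rows whose key also appears in build_rows."""
    bloom = BloomFilter()
    build_keys = set()
    for row in build_rows:
        bloom.add(row[key])
        build_keys.add(row[key])
    # Step 1: the compact filter cheaply discards rows that cannot match.
    candidates = [row for row in probe_rows if bloom.might_contain(row[key])]
    # Step 2: an exact check removes any Bloom-filter false positives.
    return [row for row in candidates if row[key] in build_keys]

customers = [{"customer_id": 1}, {"customer_id": 3}]
orders = [
    {"order": 1, "customer_id": 1},
    {"order": 2, "customer_id": 2},
    {"order": 3, "customer_id": 3},
]
matched = semi_join(orders, customers, "customer_id")
print([row["order"] for row in matched])
```

In a distributed setting the filter, rather than the full key set, is what would be shipped to the nodes scanning the probe side; here both steps run in one process for brevity.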
Datastore
- Parquet: row groups; columns stored sequentially on disk (pages, with compression applied per page)
- ORCFile: stripes, row groups, columnar storage
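
The row-group idea shared by both formats can be sketched as below. This is a deliberately simplified layout, assuming one compressed "page" per column per row group; the helpers `to_row_groups` and `read_column` are hypothetical, and the real Parquet and ORC formats add encodings, multiple pages per column chunk, statistics, and footers that are omitted here.

```python
import json
import zlib

def to_row_groups(rows, columns, rows_per_group=2):
    """Split rows into row groups; inside each group store values column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        group = {}
        for col in columns:
            values = [row[col] for row in chunk]
            # Each column's values form one page, compressed independently.
            group[col] = zlib.compress(json.dumps(values).encode())
        groups.append(group)
    return groups

def read_column(groups, col):
    """Read one column, decompressing only that column's pages."""
    values = []
    for group in groups:
        values.extend(json.loads(zlib.decompress(group[col])))
    return values

rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Pune"},
]
groups = to_row_groups(rows, columns=["id", "city"])
print(len(groups), read_column(groups, "city"))
```

Because each column is stored and compressed separately, a query that touches only `city` never decompresses the `id` pages.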
Statestore
- it holds cluster metadata: load on machines, liveness of nodes, catalogs for the underlying data
  - catalogs: used for physical planning and well-formedness checks
- topics: arrays of (key, value, version) triplets
  - persistent throughout the lifetime of the statestore
    - not persisted across service restarts
- subscribers need to register
  - delta changes are sent every 2s
  - if a ping times out, the subscriber needs to re-register
  - a failed subscriber is removed
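
The topic and delta-update behaviour described above can be sketched with an in-memory toy. This is an illustration under stated assumptions: the `Statestore` class, its method names, and the single global version counter are hypothetical and not Impala's actual interface; timers, pings, and networking are left out.

```python
class Statestore:
    """In-memory pub-sub store; state lives only as long as this object does."""

    def __init__(self):
        self.topics = {}       # topic -> {key: (value, version)}
        self.version = 0       # assumed: one monotonically increasing counter
        self.subscribers = {}  # subscriber id -> {topic: last version seen}

    def register(self, subscriber_id, topics):
        self.subscribers[subscriber_id] = {topic: 0 for topic in topics}

    def publish(self, topic, key, value):
        self.version += 1
        self.topics.setdefault(topic, {})[key] = (value, self.version)

    def delta_for(self, subscriber_id):
        """Return only the (key, value, version) triplets this subscriber has not seen."""
        seen = self.subscribers[subscriber_id]
        delta = {}
        for topic, last_seen in seen.items():
            entries = self.topics.get(topic, {})
            changed = [(k, v, ver) for k, (v, ver) in entries.items() if ver > last_seen]
            if changed:
                delta[topic] = sorted(changed, key=lambda entry: entry[2])
                seen[topic] = max(ver for _, _, ver in changed)
        return delta

    def remove_failed(self, subscriber_id):
        """Drop a subscriber that stopped responding; it must register again."""
        self.subscribers.pop(subscriber_id, None)

store = Statestore()
store.register("impalad-1", ["catalog"])
store.publish("catalog", "table_a", "schema v1")
first = store.delta_for("impalad-1")
second = store.delta_for("impalad-1")
store.publish("catalog", "table_b", "schema v1")
third = store.delta_for("impalad-1")
store.remove_failed("impalad-1")
print(first, second, third)
```

Sending only entries newer than the subscriber's last seen version keeps each periodic update small, and holding everything in memory mirrors the note that topics are not persisted across restarts.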
 
Resource/Workload management
- YARN uses centralized scheduling
  - decisions are made with full knowledge of cluster state
  - latency is too high
- Impala needs to handle thousands of queries per second
  - new, complementary but independent admission control: lets users control their workloads without centralized decision-making
  - Llama: a service between Impala and YARN providing resource caching, gang scheduling, and incremental allocation changes
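
Admission control without centralized decision-making can be sketched as each node deciding locally from its own load plus a possibly stale view of the rest of the cluster. This is a minimal sketch, assuming the only limit is a maximum number of concurrently running queries; the class `LocalAdmissionController` and its methods are hypothetical names, not Impala's implementation.

```python
class LocalAdmissionController:
    """Per-node admission decisions based on local state and a cached cluster view."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.local_running = 0
        self.remote_running = 0  # last known count on other nodes (may be stale)
        self.queue = []

    def update_remote_view(self, remote_running):
        # Assumed to arrive periodically, e.g. through a pub-sub service.
        self.remote_running = remote_running

    def _has_room(self):
        return self.local_running + self.remote_running < self.max_running

    def submit(self, query_id):
        """Admit immediately if there is room, otherwise queue the query."""
        if self._has_room():
            self.local_running += 1
            return "admitted"
        self.queue.append(query_id)
        return "queued"

    def finish(self):
        """A running query finished; admit the oldest queued query if there is room."""
        self.local_running -= 1
        if self.queue and self._has_room():
            self.local_running += 1
            return self.queue.pop(0)
        return None

ctrl = LocalAdmissionController(max_running=2)
ctrl.update_remote_view(1)      # one query is running elsewhere
r1 = ctrl.submit("q1")
r2 = ctrl.submit("q2")
released = ctrl.finish()
print(r1, r2, released)
```

No round trip to a central scheduler is needed per query, which keeps admission latency low; the trade-off is that a stale remote view can briefly over- or under-admit.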