SQL on Hadoop or Spark
Hive - MR/Tez
SparkSQL - Spark
- Tailoring to execution engine
- General query optimization + some other optimization
- Unnecessary Maps, RSOps, Pipelining
- Vectorized expressions
Query optimization procedure
- SQL query (what is different from SQL) (the original Hive paper only covers this)
- partitioned table
- map, reduce programs
- No insert, delete, or modify
- Distribute By vs Sort By
- parser (AST)
- semantic analysis (offline information stored in metastore)
- CBO: Optiq/Calcite
- join ordering
- logical optimizer
- predicate pushdown
- simplification of predicates
- physical optimizer
- take out unnecessary map
- pipeline map into same stage
- take out unnecessary Reduce Sink Operators
- execution framework (Tez/MR)
- Answer
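Predicate pushdown from the logical optimizer step can be sketched on a toy plan tree. This is only an illustration: the Scan, Filter, and Join classes and the push_down function are hypothetical names, not Hive or Calcite operators, and the sketch assumes inner joins and column names that are unique apart from the join key.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class Scan:
    table: str
    rows: List[Row]


@dataclass
class Filter:
    column: str
    predicate: Callable[[int], bool]
    child: object


@dataclass
class Join:
    key: str
    left: object
    right: object


def columns(plan) -> set:
    """Column names produced by a plan node (toy schema inference)."""
    if isinstance(plan, Scan):
        return set(plan.rows[0]) if plan.rows else set()
    if isinstance(plan, Filter):
        return columns(plan.child)
    return columns(plan.left) | columns(plan.right)


def push_down(plan):
    """Move a Filter below an inner Join onto the side that owns its column."""
    if isinstance(plan, Filter):
        child = push_down(plan.child)
        if isinstance(child, Join):
            if plan.column in columns(child.left):
                pushed = push_down(Filter(plan.column, plan.predicate, child.left))
                return Join(child.key, pushed, child.right)
            if plan.column in columns(child.right):
                pushed = push_down(Filter(plan.column, plan.predicate, child.right))
                return Join(child.key, child.left, pushed)
        return Filter(plan.column, plan.predicate, child)
    if isinstance(plan, Join):
        return Join(plan.key, push_down(plan.left), push_down(plan.right))
    return plan


def execute(plan) -> List[Row]:
    """Naive interpreter, used here only to check both plans agree."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.predicate(r[plan.column])]
    left, right = execute(plan.left), execute(plan.right)
    return [{**l, **r} for l in left for r in right if l[plan.key] == r[plan.key]]


orders = Scan("orders", [{"uid": 1, "amount": 5}, {"uid": 2, "amount": 7}])
users = Scan("users", [{"uid": 1, "age": 25}, {"uid": 2, "age": 40}])
plan = Filter("age", lambda age: age > 30, Join("uid", orders, users))
optimized = push_down(plan)  # the age filter now sits directly above the users scan
```

After the rewrite the join sees fewer rows from users, while execute returns the same result for both plans.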
Data format
- ORC file
- faster some query performance
- storage efficiency
- data-type-specific approaches to compress individual columns
- need to compress/decompress when writing/reading column data
- column oriented
- stripe data:
- stored in columnar format
- align with HDFS blocks
- Index: statistics about table, stripe, index group
- position pointer
- min, max, avg, cardinality, number of distinct values, length of data TD
- find the minimum selectivity for join order
- pruning unnecessary reduces
- skipping unnecessary index groups
- position pointer
- DataFrame
- Parquet
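How the ORC min/max statistics allow skipping data can be sketched as follows. This is a simplified model, not the ORC reader API: Stripe, write_stripes, and scan_greater_than are hypothetical names, and each "stripe" here stands in for whatever unit (stripe or index group) carries the statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    values: List[int]  # one column's values for this chunk of rows
    min_value: int     # statistics kept in the index
    max_value: int


def write_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split a column into fixed-size chunks and record min/max for each."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, reading only stripes that can contain a match."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # statistics prove no row qualifies, so skip the data
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = write_stripes(list(range(100)), stripe_size=10)
matches, stripes_read = scan_greater_than(stripes, threshold=84)
```

With sorted or clustered data most chunks are skipped; with randomly distributed values the min/max ranges overlap and the index helps much less.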
- Join-order selection
- Greedily
- Exhaustive search
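The two join-order strategies can be compared on a toy cost model. Everything here is an assumption made for illustration: plans are left-deep, cost is the sum of intermediate result sizes, a missing selectivity means a cross product (1.0), and plan_cost, greedy_order, and exhaustive_order are hypothetical names rather than optimizer APIs.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence, Tuple

Selectivity = Dict[FrozenSet[str], float]


def plan_cost(order: Sequence[str], card: Dict[str, int], sel: Selectivity) -> float:
    """Sum of intermediate result sizes for a left-deep join in this order."""
    joined = [order[0]]
    size = float(card[order[0]])
    cost = 0.0
    for table in order[1:]:
        factor = 1.0
        for earlier in joined:
            factor *= sel.get(frozenset((earlier, table)), 1.0)
        size = size * card[table] * factor
        cost += size
        joined.append(table)
    return cost


def exhaustive_order(card: Dict[str, int], sel: Selectivity) -> Tuple[str, ...]:
    """Try every permutation: optimal under the model, but factorial in table count."""
    return min(permutations(sorted(card)), key=lambda o: plan_cost(o, card, sel))


def greedy_order(card: Dict[str, int], sel: Selectivity) -> Tuple[str, ...]:
    """Start from the smallest table, then repeatedly add the cheapest next join."""
    remaining = sorted(card)
    order = [min(remaining, key=lambda t: card[t])]
    remaining.remove(order[0])
    while remaining:
        best = min(remaining, key=lambda t: plan_cost(order + [t], card, sel))
        order.append(best)
        remaining.remove(best)
    return tuple(order)


card = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
greedy = greedy_order(card, sel)
best = exhaustive_order(card, sel)
```

Greedy runs in polynomial time but can miss the optimum; exhaustive search is only practical for a handful of tables.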
- Join algorithm
- sort-merge: big shuffle footprint
- one of the tables is small
- Broadcast Join: send replica to all nodes
- hash(small) - map join
- semi-join big table 1 and table 2:
- table 1 gets its unique keys
- match all rows in table 2 against those unique keys
- send the matching rows back to big table 1 and do the join there
- Table A join Table B: the two tables could have different partition strategies
- if they are not partitioned the same way: use RSOp to re-partition
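The broadcast (map) join and the semi-join reduction from the notes above can be sketched in plain Python. This is a single-process illustration with hypothetical function names; lists of partitions stand in for nodes, and rows are dicts.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

Row = Dict[str, Hashable]


def broadcast_hash_join(big_partitions: List[List[Row]], small_table: List[Row], key: str) -> List[Row]:
    """Map-side join: every partition probes a hash table built from the small table."""
    hashed = defaultdict(list)
    for row in small_table:  # this hash table is what gets replicated to all nodes
        hashed[row[key]].append(row)
    out = []
    for partition in big_partitions:  # each partition would run on its own node
        for row in partition:
            for match in hashed.get(row[key], []):
                out.append({**row, **match})
    return out


def semi_join_reduce(table1: List[Row], table2: List[Row], key: str) -> List[Row]:
    """Keep only the rows of table 2 whose key appears in table 1."""
    unique_keys = {row[key] for row in table1}  # step 1: table 1 gets its unique keys
    return [row for row in table2 if row[key] in unique_keys]  # step 2: match table 2


big_partitions = [
    [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}],
    [{"k": 1, "a": "z"}, {"k": 3, "a": "w"}],
]
table2 = [{"k": 1, "b": "p"}, {"k": 2, "b": "q"}, {"k": 9, "b": "r"}]

# Step 3: ship only the matching rows back and join against them.
table1 = [row for partition in big_partitions for row in partition]
reduced = semi_join_reduce(table1, table2, "k")
joined = broadcast_hash_join(big_partitions, reduced, "k")
```

The semi-join pays for an extra pass over the keys in exchange for shipping fewer rows, which only helps when many rows of table 2 have no match.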
- Simplifications
- different nodes do: join, map
- Reduce Sink Operators
- re-partition data: align with downstream operation
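What a Reduce Sink Operator achieves, re-partitioning rows so that a downstream join sees matching keys in the same partition, can be sketched as hash partitioning. The function names are hypothetical and the keys are assumed to be integers so that Python's built-in hash is stable.

```python
from typing import Dict, List

Row = Dict[str, int]


def repartition(rows: List[Row], key: str, num_partitions: int) -> List[List[Row]]:
    """Route each row to a partition chosen by hashing its key."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def partitioned_join(left: List[Row], right: List[Row], key: str, num_partitions: int) -> List[Row]:
    """Join two tables partition by partition after aligning both on the key."""
    left_parts = repartition(left, key, num_partitions)
    right_parts = repartition(right, key, num_partitions)
    out = []
    for l_part, r_part in zip(left_parts, right_parts):  # one reducer per pair
        for l in l_part:
            for r in r_part:
                if l[key] == r[key]:
                    out.append({**l, **r})
    return out


table_a = [{"k": i, "a": i * 10} for i in range(8)]
table_b = [{"k": i, "b": i * 100} for i in range(0, 8, 2)]
result = partitioned_join(table_a, table_b, "k", num_partitions=4)
```

If both inputs already arrive partitioned on the join key with the same partition count, the repartition step is redundant, which is the case the optimizer tries to detect.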
Tree optimization
- pipeline multiple small map-joins into one task
- eliminate unnecessary RSOps, FileSinkOps
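Pipelining several small map-joins into one task amounts to a single pass over the big table that probes each small table's hash table in turn, with no intermediate result written out between joins. A minimal sketch, using hypothetical function names and inner-join semantics:

```python
from typing import Dict, Hashable, List, Tuple

Row = Dict[str, Hashable]


def build_hash(table: List[Row], key: str) -> Dict[Hashable, List[Row]]:
    """Hash a small table on its join key."""
    index: Dict[Hashable, List[Row]] = {}
    for row in table:
        index.setdefault(row[key], []).append(row)
    return index


def pipelined_map_joins(big_rows: List[Row], small_tables: List[Tuple[List[Row], str]]) -> List[Row]:
    """Run every map-join in one pass over big_rows.

    small_tables is a list of (rows, join_key) pairs, applied in order.
    """
    hashes = [(build_hash(rows, key), key) for rows, key in small_tables]
    out = []
    for row in big_rows:
        partial = [row]
        for index, key in hashes:
            partial = [{**p, **m} for p in partial for m in index.get(p[key], [])]
            if not partial:
                break  # inner join: no match in this small table, drop the row
        out.extend(partial)
    return out


fact = [
    {"cust": 1, "prod": 10, "qty": 2},
    {"cust": 2, "prod": 11, "qty": 1},
    {"cust": 3, "prod": 10, "qty": 5},
]
customers = [{"cust": 1, "name": "ann"}, {"cust": 2, "name": "bob"}]
products = [{"prod": 10, "title": "pen"}]
joined = pipelined_map_joins(fact, [(customers, "cust"), (products, "prod")])
```

This only works when all the hash tables fit in memory together, which is why the optimization is limited to small tables.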