Hive

Hive

SQL on Hadoop or Spark
Hive - MR/Tez
SparkSQL - Spark
Impala

Tailoring to excution engine
- General query optimization + some other optimization

idea

Unnecessary Maps, RSOps, Pipelining
Vectorized expressions

query optimization procedure

SQL query (what is different from SQL) (original Hive paper just talks this)
- partitioned table
- map, reduce programs
- No insert, delete modify
- Distributed By vs Sort By - parser (AST) - semantic analysis (offline information stored in metastore)
- CBO optiq/calcite
- join ordering - logical optimizer
- predicate pushdown
- simplification of predicates - physical optimize
- take out unnecessary map
- pipeline map into same stage
- take out unnecessary Reduce Sink Operators - execution framework (Tez/MR) - Answer

Data format

ORC file
- faster some query performance
- storage efficiency
  - data type specified approaches to compress individual columns
    - need compress/uncompress when write/read column data
- column oriented
- strip data:
  - stored in columnar format
  - align with HDFS blocks
- Index: statiscitcs about table, stripe, index-group
  1. position pointer
    - min, max, avg, cardinality, number of distinct values, length of data TD
    - find the minium selectivity for join order
    - pruning unnecessary reduces
    - skipping unnecessary index groups
DataFrame
Parquet

CBO

Join-order selection
- Greedily
- Exhaustive search
Join algorithm
- sort-merge: big shuffle foot print
- one of the tables is small
- Broadcast Join: send replica to all nodes
  - hash(small) - map join
  - semi-join big table 1 and table 2:
    1. table 1 gets its unique keys
  - match all rows in table 2 with unique key
  - send these matching rows back to big table 1, do join again
- Table A and Table B join, could have two different partition strategies
- if not: use RSOp
Simplifications

DAG

different nodes do: join, map
Reduce Sink Operators
- re-partition data: aligh with downstream operation

Tree optimization

pipeline multiple small map-joins into one task
eliminate unnecessary RSOps, FileSops

Reference