Hive

By Zhaoyu Luo

SQL on Hadoop or Spark
Hive - MR/Tez
SparkSQL - Spark
Impala
  1. Tailoring to execution engine
    • General query optimization + some other optimization

idea

  • Remove unnecessary Maps and RSOps (Reduce Sink Operators); pipelining
  • Vectorized expressions

query optimization procedure

  1. SQL query (how it differs from standard SQL; the original Hive paper only covers this part)
    • partitioned table
    • map, reduce programs
    • No insert, delete, or modify
    • Distribute By vs Sort By
  2. parser (AST)
  3. semantic analysis (offline information stored in metastore)
    • CBO optiq/calcite
    • join ordering
  4. logical optimizer
    • predicate pushdown
    • simplification of predicates
  5. physical optimizer
    • take out unnecessary map
    • pipeline map into same stage
    • take out unnecessary Reduce Sink Operators
  6. execution framework (Tez/MR)
  7. Answer
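Predicate pushdown from the logical optimizer step can be sketched on toy data. This is a minimal illustration, not Hive code: the tables, the `hash_join` helper and the probe counter are all hypothetical. The point is only that filtering before the join gives the same answer while fewer rows reach the join.

```python
# Toy illustration of predicate pushdown (hypothetical tables, not Hive internals).
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 50},
    {"order_id": 2, "cust_id": 11, "amount": 500},
    {"order_id": 3, "cust_id": 10, "amount": 700},
]
customers = [
    {"cust_id": 10, "name": "ann"},
    {"cust_id": 11, "name": "bob"},
]


def hash_join(left, right, key):
    """Inner join two lists of dicts on `key`.

    Returns (joined_rows, number_of_left_rows_probed).
    """
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in index.get(l[key], []):
            out.append({**l, **r})
    return out, len(left)


def filter_after_join():
    # Naive plan: join everything, then apply the predicate.
    joined, probed = hash_join(orders, customers, "cust_id")
    return [row for row in joined if row["amount"] > 100], probed


def filter_before_join():
    # Pushed-down plan: apply the predicate at the scan, then join.
    filtered = [o for o in orders if o["amount"] > 100]
    return hash_join(filtered, customers, "cust_id")


naive_rows, naive_probed = filter_after_join()
pushed_rows, pushed_probed = filter_before_join()
print(naive_rows == pushed_rows, naive_probed, pushed_probed)
```

Both plans return the same rows; the pushed-down plan probes the join with fewer rows.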

Data format

  • ORC file
    • faster performance for some queries
    • storage efficiency
      • data-type-specific approaches to compress individual columns
        • column data must be compressed on write and decompressed on read
    • column oriented
    • stripe data:
      • stored in columnar format
      • align with HDFS blocks
    • Index: statistics about table, stripe, index-group
      1. position pointer
        • min, max, avg, cardinality, number of distinct values, length of data TD
        • find the minimum selectivity for join order
        • pruning unnecessary reduces
        • skipping unnecessary index groups
  • DataFrame
  • Parquet
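The "skipping unnecessary index groups" idea from the ORC notes can be sketched with per-group min/max statistics. This is a rough sketch in plain Python, not the ORC reader API: `IndexGroup`, `build_groups` and `scan_greater_than` are hypothetical names, and the group size of 4 is chosen only to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class IndexGroup:
    """Hypothetical stand-in for one index group of a single column."""
    min_val: int
    max_val: int
    rows: list


def build_groups(values, group_size):
    """Split a column into fixed-size groups and record min/max per group."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append(IndexGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups, threshold):
    """Evaluate `col > threshold`, reading only groups that can contain matches."""
    matches, groups_read = [], 0
    for g in groups:
        if g.max_val <= threshold:
            continue  # statistics prove no row qualifies: skip the group
        groups_read += 1
        matches.extend(v for v in g.rows if v > threshold)
    return matches, groups_read


values = [1, 3, 5, 7, 20, 22, 24, 26, 40, 41, 42, 43]
groups = build_groups(values, 4)
matches, groups_read = scan_greater_than(groups, 25)
print(matches, groups_read)
```

Here the first group is never read because its maximum is below the threshold.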

CBO

  • Join-order selection
    • Greedily
    • Exhaustive search
  • Join algorithm
    • sort-merge: large shuffle footprint
    • one of the tables is small
    • Broadcast Join: send replica to all nodes
      • hash(small) - map join
      • semi-join big table 1 and table 2:
        1. table 1 gets its unique keys
        2. match all rows in table 2 against those unique keys
        3. send the matching rows back to big table 1 and join again
    • Joining Table A and Table B: the two tables may be partitioned with different strategies
    • if the partitioning does not line up: use an RSOp to re-partition
  • Simplifications
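The difference between greedy and exhaustive join-order selection can be sketched with a toy cost model. Everything here is an assumption made for illustration: the table names, the cardinalities, the selectivities, and the cost function (sum of estimated intermediate result sizes). It is not how Calcite computes cost.

```python
from itertools import permutations


def estimate(order, card, sel):
    """Sum of estimated intermediate sizes when joining tables in `order`.

    Assumed model: |R join T| = |R| * |T| * product of the selectivities
    between T and every table already joined (1.0 when no predicate exists).
    """
    joined = [order[0]]
    size = card[order[0]]
    cost = 0.0
    for t in order[1:]:
        s = 1.0
        for j in joined:
            s *= sel.get(frozenset((j, t)), 1.0)
        size = size * card[t] * s
        cost += size
        joined.append(t)
    return cost


def greedy_order(card, sel):
    """Start from the smallest table, then always add the cheapest next table."""
    remaining = set(card)
    start = min(remaining, key=lambda t: card[t])
    order = [start]
    remaining.remove(start)
    while remaining:
        nxt = min(remaining, key=lambda t: estimate(order + [t], card, sel))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def exhaustive_order(card, sel):
    """Try every permutation; only feasible for a handful of tables."""
    return list(min(permutations(card), key=lambda o: estimate(o, card, sel)))


# Hypothetical statistics, e.g. as a metastore might provide them.
card = {"orders": 1000, "customers": 100, "regions": 10}
sel = {
    frozenset(("orders", "customers")): 0.01,
    frozenset(("customers", "regions")): 0.1,
}

greedy = greedy_order(card, sel)
best = exhaustive_order(card, sel)
print(greedy, estimate(greedy, card, sel), estimate(best, card, sel))
```

On this input the greedy order happens to match the exhaustive optimum in cost. In general greedy is only a heuristic, while exhaustive search grows factorially with the number of tables.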

DAG

  • different nodes do: join, map
  • Reduce Sink Operators
    • re-partition data: align with downstream operation
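The re-partitioning role of a Reduce Sink Operator can be sketched as hash partitioning on the join key, so that matching rows from both inputs land in the same downstream task. This is a simplified model, not Hive's operator code: `reduce_sink`, `join_partition` and the sample tables are hypothetical, and integer keys are assumed so that `hash()` is stable.

```python
from collections import defaultdict


def reduce_sink(rows, key, num_reducers):
    """Route each row to a partition chosen by hashing its key (assumes int keys)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_reducers].append(row)
    return partitions


def join_partition(left, right, key):
    """Join the rows of one partition locally, as a single reducer would."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


table_a = [{"k": i, "a": f"a{i}"} for i in range(6)]
table_b = [{"k": i, "b": f"b{i}"} for i in range(0, 6, 2)]

num_reducers = 3
parts_a = reduce_sink(table_a, "k", num_reducers)
parts_b = reduce_sink(table_b, "k", num_reducers)

# Each reducer only sees its own partition of both inputs.
result = []
for p in range(num_reducers):
    result.extend(join_partition(parts_a.get(p, []), parts_b.get(p, []), "k"))

print(sorted(row["k"] for row in result))
```

Because both inputs use the same partitioning function, the union of the per-partition joins equals the global join.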

Tree optimization

  • pipeline multiple small map-joins into one task
  • eliminate unnecessary RSOps (Reduce Sink Operators) and FileSinkOps
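Pipelining several small map-joins into one task can be sketched as a single pass over the big table that probes every small table's hash map in turn, with no intermediate output between joins. This is a sketch under assumptions: the function names and tables are hypothetical, join keys in the small tables are assumed unique, and inner-join semantics are assumed.

```python
def build_hash(rows, key):
    """Build an in-memory hash table for a small table (assumes unique keys)."""
    return {row[key]: row for row in rows}


def pipelined_map_joins(big_rows, small_tables):
    """Stream the big table once, probing each (key, hash_table) pair in order."""
    for row in big_rows:
        out = dict(row)
        for key, table in small_tables:
            match = table.get(out.get(key))
            if match is None:
                out = None  # inner join: drop rows without a match
                break
            out.update(match)
        if out is not None:
            yield out


sales = [
    {"sale_id": 1, "store_id": 1, "item_id": 7},
    {"sale_id": 2, "store_id": 2, "item_id": 8},
    {"sale_id": 3, "store_id": 9, "item_id": 7},  # no matching store
]
stores = [{"store_id": 1, "city": "x"}, {"store_id": 2, "city": "y"}]
items = [{"item_id": 7, "item": "pen"}, {"item_id": 8, "item": "ink"}]

small_tables = [
    ("store_id", build_hash(stores, "store_id")),
    ("item_id", build_hash(items, "item_id")),
]
result = list(pipelined_map_joins(sales, small_tables))
print(result)
```

Running the two joins as separate tasks would write and re-read the intermediate result; the pipelined form avoids that extra step.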
