Spark SQL

By Zhaoyu Luo

Why Spark SQL

Previous: Shark

  • Could not be applied to intermediate results of a Spark program
  • A single SQL string is hard to debug and hard to modularize
  • The Hive optimizer targets only MapReduce; new data types and new data sources require a new optimizer

ORM

  • An ORM translates an entire object into a different format, which is expensive
  • Spark SQL accesses native objects in-place, extracting only the fields that are actually used

New features

  • Provide tighter integration between relational and procedural processing
  • A highly extensible optimizer, Catalyst
  • Support advanced analytics on top of declarative queries
  • Support UDFs
    • without a complicated packaging and registration process (see the sketch after this list)

What is Spark SQL

  • Concise and declarative like SQL
    • but intermediate results can also be named, which gives it a procedural flavor
  • Pipelines as many operations as possible
  • Lazy evaluation: computation only runs when an output operation is invoked (see the sketch after this list)
  • Catalyst is written in Scala
    • Scala's pattern matching acts as a concise, DSL-like way to specify existing and future rules
    • the task is as complex as writing a compiler
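
A small sketch of named intermediate results and lazy evaluation; the dataset and column names (events, kind, count) are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LazyPipelineExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-pipeline").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = Seq(("click", 3), ("view", 10), ("click", 7)).toDF("kind", "count")

    // Each intermediate result gets a name, as in procedural code,
    // but nothing is computed yet -- these are lazy, declarative definitions.
    val clicks = events.filter($"kind" === "click")
    val summed = clicks.groupBy($"kind").agg(sum($"count").as("total"))

    // Only an output operation (show/collect/write) triggers execution,
    // letting the optimizer pipeline the filter and aggregation into one plan.
    summed.show()

    spark.stop()
  }
}
```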

DataFrame

  • A collection of structured records, i.e. a table in a DBMS
    • an RDD plus a schema for its rows
    • a distributed collection of rows
  • It keeps track of its schema and supports various relational operations
    • a plain RDD does not
  • It allows richer optimization, since it can be manipulated with Spark's procedural API while still going through the relational optimizer
  • Stores data in a columnar format when cached in memory
    • which supports compression (see the sketch after this list)
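
A sketch of "RDD + schema": building a DataFrame from an RDD of Scala objects and running relational operations on it. The Employee case class and its fields are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

case class Employee(name: String, dept: String, salary: Double)

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Start from a plain RDD of Scala objects...
    val rdd = spark.sparkContext.parallelize(Seq(
      Employee("alice", "eng", 100.0),
      Employee("bob",   "ops",  80.0)))

    // ...and attach a schema to get a DataFrame: distributed rows plus column types.
    val df = rdd.toDF()
    df.printSchema()

    // Relational operations the optimizer can now reason about.
    df.groupBy("dept").avg("salary").show()

    // cache() stores the DataFrame in a compressed, columnar in-memory format.
    df.cache()

    spark.stop()
  }
}
```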

Crash recovery of RDDs

  • Spark uses the lineage graph of the RDDs
    • a missing partition can be rebuilt by rerunning the operations that produced it, such as a filter (see the sketch below)
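
A minimal sketch of lineage: the derived RDD records how it was produced, so a lost partition can be recomputed from its parent. The variable names are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

object LineageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

    // The lineage graph records that `evens` is derived from `numbers` via filter;
    // if a partition of `evens` is lost, Spark reruns the filter on just the
    // corresponding partition of the parent RDD.
    val evens = numbers.filter(_ % 2 == 0)

    println(evens.toDebugString)  // prints the lineage graph
    println(evens.count())

    spark.stop()
  }
}
```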

Catalyst Optimization

  • query plans are trees, and the optimized plan is generated by tree transformation rules that match on subtrees
    • rules are applied by pattern matching, in multiple passes, until the tree no longer changes (see the sketch below)
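
A toy illustration of this idea, in the spirit of Catalyst but not using Spark's actual classes: a small expression tree, one constant-folding rule written with Scala pattern matching, and a loop that reapplies it until the tree stops changing.

```scala
// Illustrative tree types -- not Spark's real Catalyst classes.
sealed trait Expr
case class Literal(value: Int)          extends Expr
case class Attribute(name: String)      extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ToyOptimizer {
  // One transformation rule, expressed with pattern matching on subtrees.
  def foldConstants(e: Expr): Expr = e match {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
    case Add(l, r)                   => Add(foldConstants(l), foldConstants(r))
    case other                       => other
  }

  // Apply the rule in repeated passes until the tree reaches a fixpoint.
  def optimize(e: Expr): Expr = {
    val next = foldConstants(e)
    if (next == e) e else optimize(next)
  }

  def main(args: Array[String]): Unit = {
    val plan = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(optimize(plan))  // the inner Add is folded into Literal(3)
  }
}
```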