Why Spark SQL
Previous: Shark
- Could not be applied to the intermediate results of a Spark program
- A single SQL string is hard to debug and hard to modularize
- The Hive optimizer was tailored to MapReduce and hard to extend with new data types and new data sources
ORM
- An ORM translates an entire object into a different format, which is expensive
- Spark SQL accesses the native objects in place, extracting only the fields that each query actually uses (see the sketch below)
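For intuition, a minimal sketch of that difference using the Dataset/DataFrame API (the User case class and its values are made up for illustration): the native Scala objects are used directly, and the query below references only the name field.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; Spark infers its schema via reflection.
case class User(id: Long, name: String, email: String)

object InPlaceAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("in-place-access").master("local[*]").getOrCreate()
    import spark.implicits._

    // Native Scala objects; no ORM-style wholesale translation is required.
    val users = Seq(
      User(1L, "ada", "ada@example.com"),
      User(2L, "bob", "bob@example.com")
    ).toDS()

    // The query references only `name`, so only that field is needed.
    users.select($"name").show()

    spark.stop()
  }
}
```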
New features
- Provide tighter integration between relational and procedural processing
- A highly extensible optimizer, Catalyst
- Support advanced analytics (e.g., machine learning) on top of declarative queries
- Support UDFs without a complicated packaging and registration process (example below)
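As a concrete illustration of the UDF point, an ordinary Scala closure can be registered inline and called from SQL right away (the squared function and the nums view are hypothetical examples):

```scala
import org.apache.spark.sql.SparkSession

object UdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inline-udf").master("local[*]").getOrCreate()

    // An ordinary Scala closure registered inline as a UDF; no separate
    // packaging or external registration step is needed.
    spark.udf.register("squared", (x: Long) => x * x)

    // The UDF is immediately usable from SQL over a temp view.
    spark.range(1, 5).createOrReplaceTempView("nums")
    spark.sql("SELECT id, squared(id) AS id_squared FROM nums").show()

    spark.stop()
  }
}
```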
What is Spark SQL
- Concise and declarative like SQL
- but intermediate results can also be named, which gives it a procedural flavor
- Pipelines as many operations as possible
- Evaluation is lazy: computation happens only when an output operation is invoked (see the example after this list)
- Catalyst is written in Scala
- Scala serves as a concise domain-specific language for specifying existing rules and adding new ones
- its pattern matching makes tree-rewrite rules easy to express (see the sketch after this list)
- building an optimizer is about as complex as building a compiler, so having a full programming language available helps
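A common illustration of such a rule is constant folding written with pattern matching; a self-contained toy version might look like the following, where Expr, Literal, Attribute, and Add are stand-ins for Catalyst's internal tree nodes rather than its real classes:

```scala
// Toy expression tree standing in for Catalyst's internal tree nodes.
sealed trait Expr
case class Literal(value: Int)          extends Expr
case class Attribute(name: String)      extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // A rule is expressed with ordinary Scala pattern matching:
  // an Add of two literals is replaced by a single folded literal.
  def fold(e: Expr): Expr = e match {
    case Add(Literal(a), Literal(b)) => Literal(a + b)         // fold constants
    case Add(l, r)                   => Add(fold(l), fold(r))  // recurse into children
    case leaf                        => leaf                   // leave other nodes alone
  }

  def main(args: Array[String]): Unit = {
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(fold(tree)) // Add(Attribute(x),Literal(3))
  }
}
```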
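Returning to the bullets on naming intermediate results and lazy evaluation, a small sketch (the log data and column names are invented): each assignment only extends the logical plan, and execution is deferred until an output operation such as show().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object LazyPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-pipeline").master("local[*]").getOrCreate()
    import spark.implicits._

    val logs = Seq(("INFO", 12), ("ERROR", 95), ("ERROR", 40)).toDF("level", "latency_ms")

    // Intermediate results get names, as in ordinary procedural code,
    // but each step only extends the logical plan; nothing runs yet.
    val errors  = logs.filter($"level" === "ERROR")
    val summary = errors.groupBy($"level").agg(avg($"latency_ms").as("avg_latency_ms"))

    // The pipelined plan executes only when an output operation is invoked.
    summary.show()

    spark.stop()
  }
}
```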
DataFrame
- A collection of structured records: the equivalent of a table in a DBMS
- Essentially an RDD plus a schema, i.e., a distributed collection of rows that share the same schema
- Unlike a plain RDD, it keeps track of its schema and supports various relational operations
- Enables richer optimizations, while still being manipulable through Spark's procedural API (see the sketch below)
- Cached data is stored in an in-memory columnar format
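A minimal sketch of the "RDD + schema" idea (the Employee case class and the sample rows are hypothetical): an RDD of native objects becomes a DataFrame, relational operators build a plan for Catalyst, and cache() keeps the data in an in-memory columnar format.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; its fields become the DataFrame's schema.
case class Employee(name: String, dept: String, salary: Double)

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // An ordinary RDD of native objects ...
    val rdd = spark.sparkContext.parallelize(Seq(
      Employee("ann", "eng", 100.0),
      Employee("bo", "sales", 80.0)))

    // ... plus a schema gives a DataFrame (a table-like, distributed collection of rows).
    val df = rdd.toDF()

    // Relational operations are recorded in a logical plan for Catalyst to optimize.
    val highPaid = df.filter($"salary" > 90).select($"name", $"dept")

    // cache() stores the data in an in-memory columnar format.
    highPaid.cache()
    highPaid.show()

    spark.stop()
  }
}
```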
Crash recovery of RDDs
- Uses the lineage graph of the RDDs
- A missing partition can be rebuilt by rerunning the operations (e.g., filter) that produced it, as in the example below
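A small illustration (the data and transformations are arbitrary): toDebugString prints the lineage that Spark would replay to rebuild a lost partition of the final RDD.

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each transformation remembers its parent, forming the lineage graph.
    val base    = sc.parallelize(1 to 1000, numSlices = 4)
    val evens   = base.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // If a partition of `doubled` is lost, Spark reruns only the
    // parallelize -> filter -> map chain for that partition.
    println(doubled.toDebugString)
    println(doubled.count())

    spark.stop()
  }
}
```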
Catalyst Optimization
- Optimization is expressed as tree transformation rules: each rule generates a new tree from the previous one (e.g., folding constant expressions in predicates)
- Rules pattern-match on subtrees and are applied in multiple passes until the tree no longer changes (a fixed point), as in the sketch below
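A minimal, self-contained sketch of this multi-pass rule application; the Plan classes, the combineFilters rule, and the executor below are illustrative stand-ins, not Catalyst's actual classes:

```scala
// Toy plan tree and rule executor in the spirit of Catalyst (illustrative only).
sealed trait Plan
case class Scan(table: String)                     extends Plan
case class Filter(pred: String, child: Plan)       extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan

object RuleExecutorSketch {
  type Rule = Plan => Plan

  // Example rule: merge two adjacent Filters into one.
  val combineFilters: Rule = {
    case Filter(p1, Filter(p2, child)) => Filter(s"($p1) AND ($p2)", child)
    case other                         => other
  }

  // Apply one rule to every node of the tree, children first.
  def transformUp(plan: Plan, rule: Rule): Plan = {
    val withNewChildren = plan match {
      case Filter(p, c)   => Filter(p, transformUp(c, rule))
      case Project(cs, c) => Project(cs, transformUp(c, rule))
      case leaf           => leaf
    }
    rule(withNewChildren)
  }

  // Run all rules in passes until the tree stops changing (fixed point).
  def execute(plan: Plan, rules: Seq[Rule]): Plan = {
    val next = rules.foldLeft(plan)((p, r) => transformUp(p, r))
    if (next == plan) plan else execute(next, rules)
  }

  def main(args: Array[String]): Unit = {
    val plan = Filter("a > 1", Filter("b < 2", Filter("c = 3", Scan("t"))))
    println(execute(plan, Seq(combineFilters)))
    // Filter((a > 1) AND ((b < 2) AND (c = 3)),Scan(t))
  }
}
```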