Why Spark SQL
Previous: Shark
- Could not be applied to the intermediate results of a Spark program
- A single SQL string is hard to debug and hard to modularize
- The Hive optimizer was tailored to MapReduce and hard to extend with new data types and new data sources
ORM
- An ORM translates an entire object into a different format, which is expensive
- Spark SQL accesses the native objects in place, extracting only the fields that each query actually uses (see the sketch below)
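For intuition, a minimal sketch of that difference using the Dataset/DataFrame API (the User case class and its values are made up for illustration): the native Scala objects are used directly, and the query below references only the name field.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; Spark infers its schema via reflection.
case class User(id: Long, name: String, email: String)

object InPlaceAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("in-place-access").master("local[*]").getOrCreate()
    import spark.implicits._

    // Native Scala objects; no ORM-style wholesale translation is required.
    val users = Seq(
      User(1L, "ada", "ada@example.com"),
      User(2L, "bob", "bob@example.com")
    ).toDS()

    // The query references only `name`, so only that field is needed.
    users.select($"name").show()

    spark.stop()
  }
}
```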
New features
- Provide tighter integration between relational and procedural processing
- A highly extensible optimizer, Catalyst
- Support advanced analytics (e.g., machine learning) on top of declarative queries
- Support UDFs without a complicated packaging and registration process (example below)
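As a concrete illustration of the UDF point, an ordinary Scala closure can be registered inline and called from SQL right away (the squared function and the nums view are hypothetical examples):

```scala
import org.apache.spark.sql.SparkSession

object UdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inline-udf").master("local[*]").getOrCreate()

    // An ordinary Scala closure registered inline as a UDF; no separate
    // packaging or external registration step is needed.
    spark.udf.register("squared", (x: Long) => x * x)

    // The UDF is immediately usable from SQL over a temp view.
    spark.range(1, 5).createOrReplaceTempView("nums")
    spark.sql("SELECT id, squared(id) AS id_squared FROM nums").show()

    spark.stop()
  }
}
```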
What is Spark SQL
- Concise and declarative like SQL
- but intermediate results can also be named, which gives it a procedural flavor
- Pipelines as many operations as possible
- Evaluation is lazy: computation happens only when an output operation is invoked (see the example after this list)
- Catalyst is written in Scala
- Scala serves as a concise domain-specific language for specifying existing rules and adding new ones
- its pattern matching makes tree-rewrite rules easy to express (see the sketch after this list)
- building an optimizer is about as complex as building a compiler, so having a full programming language available helps
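A common illustration of such a rule is constant folding written with pattern matching; a self-contained toy version might look like the following, where Expr, Literal, Attribute, and Add are stand-ins for Catalyst's internal tree nodes rather than its real classes:

```scala
// Toy expression tree standing in for Catalyst's internal tree nodes.
sealed trait Expr
case class Literal(value: Int)          extends Expr
case class Attribute(name: String)      extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // A rule is expressed with ordinary Scala pattern matching:
  // an Add of two literals is replaced by a single folded literal.
  def fold(e: Expr): Expr = e match {
    case Add(Literal(a), Literal(b)) => Literal(a + b)         // fold constants
    case Add(l, r)                   => Add(fold(l), fold(r))  // recurse into children
    case leaf                        => leaf                   // leave other nodes alone
  }

  def main(args: Array[String]): Unit = {
    val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(fold(tree)) // Add(Attribute(x),Literal(3))
  }
}
```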
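Returning to the bullets on naming intermediate results and lazy evaluation, a small sketch (the log data and column names are invented): each assignment only extends the logical plan, and execution is deferred until an output operation such as show().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object LazyPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-pipeline").master("local[*]").getOrCreate()
    import spark.implicits._

    val logs = Seq(("INFO", 12), ("ERROR", 95), ("ERROR", 40)).toDF("level", "latency_ms")

    // Intermediate results get names, as in ordinary procedural code,
    // but each step only extends the logical plan; nothing runs yet.
    val errors  = logs.filter($"level" === "ERROR")
    val summary = errors.groupBy($"level").agg(avg($"latency_ms").as("avg_latency_ms"))

    // The pipelined plan executes only when an output operation is invoked.
    summary.show()

    spark.stop()
  }
}
```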
DataFrame
- A collection of structured records: the equivalent of a table in a DBMS
- Essentially an RDD plus a schema, i.e., a distributed collection of rows that share the same schema
- Unlike a plain RDD, it keeps track of its schema and supports various relational operations
- Enables richer optimizations, while still being manipulable through Spark's procedural API (see the sketch below)
- Cached data is stored in an in-memory columnar format
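A minimal sketch of the "RDD + schema" idea (the Employee case class and the sample rows are hypothetical): an RDD of native objects becomes a DataFrame, relational operators build a plan for Catalyst, and cache() keeps the data in an in-memory columnar format.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; its fields become the DataFrame's schema.
case class Employee(name: String, dept: String, salary: Double)

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // An ordinary RDD of native objects ...
    val rdd = spark.sparkContext.parallelize(Seq(
      Employee("ann", "eng", 100.0),
      Employee("bo", "sales", 80.0)))

    // ... plus a schema gives a DataFrame (a table-like, distributed collection of rows).
    val df = rdd.toDF()

    // Relational operations are recorded in a logical plan for Catalyst to optimize.
    val highPaid = df.filter($"salary" > 90).select($"name", $"dept")

    // cache() stores the data in an in-memory columnar format.
    highPaid.cache()
    highPaid.show()

    spark.stop()
  }
}
```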
Crash recovery of RDDs
- Uses the lineage graph of the RDDs
- A missing partition can be rebuilt by rerunning the operations (e.g., filter) that produced it, as in the example below
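A small illustration (the data and transformations are arbitrary): toDebugString prints the lineage that Spark would replay to rebuild a lost partition of the final RDD.

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each transformation remembers its parent, forming the lineage graph.
    val base    = sc.parallelize(1 to 1000, numSlices = 4)
    val evens   = base.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // If a partition of `doubled` is lost, Spark reruns only the
    // parallelize -> filter -> map chain for that partition.
    println(doubled.toDebugString)
    println(doubled.count())

    spark.stop()
  }
}
```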
Catalyst Optimization
- Optimization is expressed as tree transformation rules: each rule generates a new tree from the previous one (e.g., folding constant expressions in predicates)
- Rules pattern-match on subtrees and are applied in multiple passes until the tree no longer changes (a fixed point), as in the sketch below
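A minimal, self-contained sketch of this multi-pass rule application; the Plan classes, the combineFilters rule, and the executor below are illustrative stand-ins, not Catalyst's actual classes:

```scala
// Toy plan tree and rule executor in the spirit of Catalyst (illustrative only).
sealed trait Plan
case class Scan(table: String)                     extends Plan
case class Filter(pred: String, child: Plan)       extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan

object RuleExecutorSketch {
  type Rule = Plan => Plan

  // Example rule: merge two adjacent Filters into one.
  val combineFilters: Rule = {
    case Filter(p1, Filter(p2, child)) => Filter(s"($p1) AND ($p2)", child)
    case other                         => other
  }

  // Apply one rule to every node of the tree, children first.
  def transformUp(plan: Plan, rule: Rule): Plan = {
    val withNewChildren = plan match {
      case Filter(p, c)   => Filter(p, transformUp(c, rule))
      case Project(cs, c) => Project(cs, transformUp(c, rule))
      case leaf           => leaf
    }
    rule(withNewChildren)
  }

  // Run all rules in passes until the tree stops changing (fixed point).
  def execute(plan: Plan, rules: Seq[Rule]): Plan = {
    val next = rules.foldLeft(plan)((p, r) => transformUp(p, r))
    if (next == plan) plan else execute(next, rules)
  }

  def main(args: Array[String]): Unit = {
    val plan = Filter("a > 1", Filter("b < 2", Filter("c = 3", Scan("t"))))
    println(execute(plan, Seq(combineFilters)))
    // Filter((a > 1) AND ((b < 2) AND (c = 3)),Scan(t))
  }
}
```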