Spark Streaming

Data Parallel
- custom-built for stream analytics (Storm/Heron/MillWheel)
- batch systems turned into stream analytics (Spark Streaming/MapReduce Online)
Parallel DB
- Aurora / Borealis, single node streaming DB

MapReduce: mappers inform the JobTracker, the JobTracker calls the Reducer, and the Reducer pulls the results from a Mapper after that Mapper finishes
Stream each <w, 1> pair individually and pipe it into the downstream Reducer
- there is no fault tolerance here
- too many packets over the network
How about: the Mapper writes spill files as chunks of memory fill up, then pushes each spill to the appropriate Reducer
- map fails:
  - persist spill locally to HDFS
- prepare 2x mappers to perform tasks
  - wasteful most of the time
  - does not help with straggler mappers
- reduce fails:
  - restart the reducer and read the data from the mappers again
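
A minimal sketch of the spill-and-push idea above, in plain Python rather than Hadoop. The names `Reducer`, `run_mapper`, `rebuild_reducer` and the toy `partition` function are hypothetical, made up for illustration. With `spill_size=1` it behaves like the per-record stream (one message per pair); larger spills send fewer messages, and because the mapper keeps its spills, a restarted reducer can read them again.

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Toy deterministic partitioner (an assumption, not Hadoop's):
    # sum of character codes modulo the number of reducers.
    return sum(ord(c) for c in key) % num_reducers

class Reducer:
    def __init__(self):
        self.counts = defaultdict(int)
        self.messages_received = 0

    def receive(self, pairs):
        # One call models one network message carrying a batch of pairs.
        self.messages_received += 1
        for word, n in pairs:
            self.counts[word] += n

def run_mapper(words, reducers, spill_size):
    """Buffer <w, 1> pairs; once spill_size pairs are buffered, spill and push."""
    buffer = []
    spills = []  # retained by the mapper so a restarted reducer can re-read them

    def spill():
        by_reducer = defaultdict(list)
        for pair in buffer:
            by_reducer[partition(pair[0], len(reducers))].append(pair)
        spills.append(dict(by_reducer))
        for r, pairs in by_reducer.items():
            reducers[r].receive(pairs)
        buffer.clear()

    for w in words:
        buffer.append((w, 1))
        if len(buffer) >= spill_size:
            spill()
    if buffer:
        spill()  # flush the final partial chunk
    return spills

def rebuild_reducer(spills, reducer_index):
    """Reduce failure: start a fresh reducer and replay the mapper's spills."""
    fresh = Reducer()
    for s in spills:
        if reducer_index in s:
            fresh.receive(s[reducer_index])
    return fresh
```

Usage: run `run_mapper("a b a c b a".split(), [Reducer(), Reducer()], spill_size=3)` and compare `messages_received` against the same run with `spill_size=1`.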

Divide the continuous job into many small interval jobs (micro-batches)
- the state at each timestep is fully deterministic given the input data: no need for a synchronization protocol
- dependencies between two states are smaller
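
A rough sketch of the interval idea, assuming events arrive as `(timestamp, word)` tuples. The helper names `split_into_batches`, `process_batch` and `run` are hypothetical, not Spark API. Each batch is processed by a pure function, so recomputing any interval from its input gives the same result.

```python
from collections import Counter, defaultdict

def split_into_batches(events, interval):
    """Group (timestamp, word) events by the interval they fall into."""
    batches = defaultdict(list)
    for ts, word in events:
        batches[int(ts // interval)].append(word)
    return dict(batches)

def process_batch(words):
    # Pure function: the output depends only on the batch contents,
    # so a lost result can be rebuilt by running it again.
    return dict(Counter(words))

def run(events, interval):
    batches = split_into_batches(events, interval)
    return {i: process_batch(batches[i]) for i in sorted(batches)}
```

Usage: `run([(0.1, "a"), (0.5, "b"), (1.2, "a")], 1.0)` yields one word count per one-second interval.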
short, stateless, deterministic tasks
reduceByKey: stateless operator
reduceByWindow: stateful operator
exactly-once processing (Storm's consistency model: at most once or at least once)
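
To make the stateless/stateful distinction concrete, here is a plain-Python sketch. `reduce_by_key` and `WindowReducer` are hypothetical stand-ins that only mimic the idea behind the Spark operators; they are not the Spark implementation. The first looks only at the current batch, the second carries the last few batches from one call to the next.

```python
from functools import reduce

def reduce_by_key(batch, func):
    """Stateless: combines values per key within a single batch only."""
    out = {}
    for k, v in batch:
        out[k] = func(out[k], v) if k in out else v
    return out

class WindowReducer:
    """Stateful: remembers the values of the last `window` batches."""

    def __init__(self, func, window):
        self.func = func
        self.window = window
        self.recent = []  # state carried across batches

    def update(self, batch_values):
        self.recent.append(list(batch_values))
        self.recent = self.recent[-self.window:]  # drop batches outside the window
        flat = [v for b in self.recent for v in b]
        return reduce(self.func, flat) if flat else None
```

Usage: call `reduce_by_key` once per batch with no memory of earlier batches, and call `WindowReducer.update` once per batch to get the reduction over the sliding window.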

When a node fails, every other node in the cluster recomputes part of the lost node’s RDDs in parallel: no replication cost
- stragglers: handled with speculative execution (a backup copy of the slow task is launched)
Recovery uses lineage: a graph of deterministic operations
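
A toy illustration of lineage-based recovery, under simplifying assumptions: one-to-one partition dependencies only, a source that is durable, and threads standing in for cluster nodes. The `RDD` class and `recover` function here are hypothetical and far simpler than Spark's. Each dataset records its parent and a deterministic function, so lost partitions are rebuilt by replaying the graph, and different partitions can be rebuilt by different workers at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

class RDD:
    """Toy dataset: cached partitions plus the lineage needed to rebuild them."""

    def __init__(self, num_partitions, parent=None, func=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent  # upstream RDD, or None for a source
        self.func = func      # deterministic per-partition function
        self.source = source  # input partitions (assumed durable)
        self.cache = {}

    def map_partitions(self, func):
        return RDD(self.num_partitions, parent=self, func=func)

    def compute(self, i):
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            # Walk up the lineage graph and re-apply the recorded operation.
            data = self.func(self.parent.compute(i))
        self.cache[i] = data
        return data

    def lose(self, i):
        # Simulate a node failure that held partition i in memory.
        self.cache.pop(i, None)

def recover(rdd, lost, workers=4):
    """Recompute the lost partitions in parallel, one task per partition."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(lost, pool.map(rdd.compute, lost)))
```

Usage: build a chain with `map_partitions`, call `lose(i)` on the partitions a failed node held, then `recover(rdd, lost)` to rebuild only those partitions.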
