Storm, Heron

By Zhaoyu Luo

Storm

  • Topology: a DAG whose vertices represent computation and whose edges represent the data flow
    • it is the logical plan, from a database system’s perspective
  • more than one worker process on the same machine may execute different parts of the same topology
  • one worker process is one JVM (whereas a Heron Instance is a JVM process that runs only a single spout or bolt task)
    • worker processes serve as containers
    • each worker node runs a Supervisor daemon that communicates with Nimbus
  • one JVM runs multiple threads (executors)
    • executors provide intra-topology parallelism
  • one thread runs multiple tasks
    • tasks provide intra-bolt/intra-spout parallelism (see the sketch after this list)
  • Nimbus is the “JobTracker”: it is the touchpoint between the user and the Storm system
    • Nimbus and Supervisor daemons are fail-fast and stateless, and all their state is kept in Zookeeper or on the local disk
    • Single point of failure in Storm, but not in Heron
  • Supervisor: every 3 seconds, it reads worker heartbeats from local state and classifies those workers as valid, timed out, not started, or disallowed
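
As a rough illustration of how workers, executors, and tasks map onto the Storm API, here is a minimal topology sketch. It is not taken from the paper: TweetSpout and SplitBolt are hypothetical placeholder components and the parallelism numbers are arbitrary.

    // Sketch only: TweetSpout and SplitBolt are hypothetical placeholders.
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // spout: 2 executors (threads); by default one task per executor
            builder.setSpout("tweets", new TweetSpout(), 2);

            // bolt: 4 executors but 8 tasks, i.e. 2 tasks multiplexed onto each thread
            builder.setBolt("split", new SplitBolt(), 4)
                   .setNumTasks(8)
                   .shuffleGrouping("tweets");

            Config conf = new Config();
            conf.setNumWorkers(3); // 3 worker processes (JVMs) act as containers for the executors

            StormSubmitter.submitTopology("parallelism-sketch", conf, builder.createTopology());
        }
    }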

Approximate operators

  • two ways:
    • sampling
      • sample-and-hold: finds heavy-hitter elements (sketched in code after this list)
      • identifies a subset of the elements, e.g., “give me 5% of tweets”
      • for each incoming element x: does it already have a hash table entry?
        • yes: increment its count
        • no: sample with probability p (how you choose p controls the sample size)
          • sampled: create a hash table entry for x
          • not sampled: skip this occurrence of the element
    • sketching (bloom filter)
      • loglog operator: estimating cardinality of a multi-set
      • bit vector
      • index = compute(hash(x))
      • count-min sketch: estimating frequency counts (also sketched after this list)
  • BlinkDB: runs queries on samples of the data and estimates the error of the approximate answers
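
A minimal, self-contained sketch of the sample-and-hold logic above (my own illustration, not code from any paper); the String element type and the 5% probability are arbitrary choices.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    /** Sample-and-hold: once an element is sampled, every later occurrence is counted. */
    public class SampleAndHold {
        private final Map<String, Long> counts = new HashMap<>();
        private final double p;                  // sampling probability, e.g. 0.05 for "5% of tweets"
        private final Random random = new Random();

        public SampleAndHold(double p) { this.p = p; }

        public void offer(String x) {
            Long c = counts.get(x);
            if (c != null) {
                counts.put(x, c + 1);            // already held: increment its count
            } else if (random.nextDouble() < p) {
                counts.put(x, 1L);               // sampled: create a hash table entry
            }                                    // not sampled: skip this occurrence
        }

        public Map<String, Long> heavyHitterEstimates() { return counts; }

        public static void main(String[] args) {
            SampleAndHold sh = new SampleAndHold(0.05);
            for (String x : new String[] {"a", "b", "a", "a", "c", "a", "b"}) sh.offer(x);
            System.out.println(sh.heavyHitterEstimates());
        }
    }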
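
A similarly illustrative count-min sketch for frequency counts (the simple seeded hash family is an arbitrary choice): hash collisions can only inflate a counter, so taking the minimum across rows gives an estimate that never undercounts.

    import java.util.Random;

    /** Count-min sketch: approximate frequency counts with bounded over-estimation. */
    public class CountMinSketch {
        private final int depth;      // number of hash functions (rows)
        private final int width;      // counters per row
        private final long[][] table;
        private final int[] seeds;

        public CountMinSketch(int depth, int width) {
            this.depth = depth;
            this.width = width;
            this.table = new long[depth][width];
            this.seeds = new int[depth];
            Random r = new Random(42);
            for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
        }

        private int index(String x, int row) {
            return Math.floorMod(x.hashCode() ^ seeds[row], width);
        }

        public void add(String x) {
            for (int row = 0; row < depth; row++) table[row][index(x, row)]++;
        }

        /** The estimate is the minimum counter across rows. */
        public long estimate(String x) {
            long min = Long.MAX_VALUE;
            for (int row = 0; row < depth; row++) min = Math.min(min, table[row][index(x, row)]);
            return min;
        }

        public static void main(String[] args) {
            CountMinSketch cms = new CountMinSketch(4, 256);
            for (int i = 0; i < 100; i++) cms.add("a");
            cms.add("b");
            System.out.println("a ~ " + cms.estimate("a") + ", b ~ " + cms.estimate("b"));
        }
    }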

Common

  • Topology: the interconnection between spouts and bolts
    • a spout pulls data from queues such as Kafka; the spout is the producer
    • a bolt is the consumer: it processes tuples and may emit new ones
  • each spout and bolt can map to multiple tasks
  • each stream is an unbounded sequence of tuples (e.g., tweets)
  • provides guarantees:
    • at least once: a naive implementation would track the full lineage of every tuple, but that leads to high memory usage, so Storm uses XOR-based acking instead (see the sketch after this list)
      • one tweet is considered one tuple, generated by a spout
      • there are multiple processing steps after the spout, each generating new tuples with their own IDs
      • an acker bolt is assigned to track the whole tuple tree of this tweet by XOR-ing the tuple IDs
      • if the XOR has not returned to 0 by the timeout (e.g., a tuple or its ack was lost), the tweet is replayed
    • at most once: just disable the ACK mechanism above
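
The XOR bookkeeping can be shown in a few lines (a simplified model of the acking idea, not Storm’s actual acker code): each tuple ID is XOR-ed into one tracked value when the tuple is emitted and again when it is acked, so the value returns to 0 exactly when the whole tuple tree for the tweet has been acked.

    import java.util.Random;

    /** Simplified model of XOR-based acking for one spout tuple (one tweet). */
    public class AckerSketch {
        public static void main(String[] args) {
            Random ids = new Random();
            long ackVal = 0;

            // The spout emits the root tuple; a downstream bolt emits two child tuples.
            long root = ids.nextLong();
            long child1 = ids.nextLong();
            long child2 = ids.nextLong();

            // Each ID is XOR-ed in when its tuple is emitted (anchored) ...
            ackVal ^= root ^ child1 ^ child2;

            // ... and XOR-ed in again when that tuple is acked.
            ackVal ^= root;
            ackVal ^= child1;
            ackVal ^= child2;   // comment this line out to simulate a lost ack

            // 0 means the whole tree was processed; a non-zero value at the
            // timeout means something was lost, so the tweet is replayed.
            System.out.println(ackVal == 0 ? "fully processed" : "replay the tweet");
        }
    }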

Failure

  • on failure, Storm replays tweets one by one from the start of the topology, whereas Spark Streaming can recover from an intermediate RDD
  • on node failure, Spark can recompute lost partitions from RDD lineage

Zookeeper

  • where to store the state of one tweet?
    • paper does not say
  • keeps the runtime state of Nimbus and the Supervisors

How Heron differs

  • Need for debug-ability and a clean mapping from the logical units of computation to each physical process
  • Scheduling uncertainty: JVM scheduler, thread scheduler
  • Storm: there is no resource isolation between different tasks:
    • multiple tasks may run in the same JVM
    • logs from multiple tasks are written into a single file, which complicates debugging
    • one task’s unhandled exception can bring down the whole worker JVM and kill the other tasks running in it
    • Heron: each Heron Instance runs a single task, which keeps the mapping to physical processes clean and makes debugging easier
  • each tuple has to pass through four threads (for queuing) in the Storm worker process, which adds overhead and queue contention
  • backpressure: Heron automatically adjusts the rate of incoming data so the topology keeps making progress
    • Storm just drops tuples under overload
  • if one task fails, the entire tuple tree for the tweet has to be replayed
    • Storm may end up consuming all its resources while making no progress
  • JVM GC pauses can produce latencies larger than 1 minute

Components

  • Topology Master: manages the topology throughout its existence and provides a single point of contact for discovering the status of the topology (analogous to YARN’s Application Master)
  • Stream Manager: manages the routing of tuples between Heron Instances

Reference