Topology: is a DAG where vertices represent computation and the edges represent the data flow
is a logical plan from a DB system perspective
more than one worker process on the same machine may be executing different part of the same topology
one process is one JVM (While Heron Instance is a JVM process, runs only a single task of spout or bolt)
worker processes serve as containers
each worker runs a supervisor that communicates with Nimbus
one JVM has multipe threads (executors)
executors provide intra-topology parallelism
one thread has multipe tasks
task provide intra-bolt/intra-spout parallelism
Nimbus is “JobTracker”: is the touchpoint between the user and the Storm system
Nimbus and Supervisor daemons are fail-fast and stateless, and all their state is kept in Zookeeper or on the local disk
Single point of failure in Storm, but not in Heron
Supervisor: every 3 seconds, it reads worker heartbeats from local state and classifies those workers as either valid, time out, not started, or disallowed
Storm: there is no resource isolation between different tasks:
multiple tasks may run in the same JVM
multiple tasks are written into a single file
one task’s exception may bring down one JVM and affect other topologies’ execution
Heron: each container runs the same task (for debug visibility)
Each tuple has to pass four threads (for queuing) in Storm
backpressure: Heron would automatic adjust network throughput to prevent from making no progress
Storm just drops tuples
one of task fails, entire tweet needs to be restarted
Storm may not make any progress while consuming all its resources
JVM GC generates high latencies larger than 1 minute
Components
Topology Master: managing the topology throughout its existence. It provides a single point of contact for discovering the status of topology. ~ YARN’s Application Master