do deadlock detection and preemption to take care of that
user level isolation: running tasks for single user in the containers of an application
each task runs in its own container process, the resource allocations are much finer grained
run spark on tez: use RDD’s lineage dependency to generate post-compilation Spark DAG, then generates Tez DAG
Component
DAG: what could not be converted to DAG? loop, recursive query
Task: represents a unit of work in a vertex
Vertex: core application logic: “processor”
each vertex can be associated with a VertexManager
can map to many “tasks”
parallelism <- static / dynamically controlled by a vertex manager
VertexManager:
it is provided a context object that notifies it about state changes like task completions
then it could make changes to its own vertex’s state
control vertex parallelism
Edges
connectivity pattern
one-one
broadcast
scatter-gather
transport mechanism: data format and physical transport mechanism (disk or memory)
Tez is data format agnostic: IPO can choose their own data format
scheduling
sequential (downstream node’s run may be blocked by previous)
concurrent
Data source
persisted
ephemeral
API
DAG API
create a rich work flow
data sources may be invoked at runtime to determine the optimal reading pattern for the initial input (since reading is always slow)
e.g., read data after knowing the join key space
Runtime API
Processor -> program runs in each task corresponding to vertex
initialize(context)
run(list, list
handle_events() -> taskfinish
interm.ouput(read/write)
Output
Init(context)
writes()
handle_events()
Input
Input
Init()
Reads()
handle_events()
Execution Efficiency
Locality Aware Scheduling (等一会): the framework automatically relaxes locality from node to rack and so on with delay scheduling used to add a wait period before each relaxation
Speculation: clone slow task
Container Reuse
Session: run AM in session mode, so that it allow tasks from multiple DAGs to reuse containers
Shared Object Registry
Fault Tolerance
Task re-execution based fault tolerance depends on deterministic and side-effect free task execution
it could not handle network streaming failure: input is lost