RDD
- read-only, partitioned collection of records
- a restricted form of shared memory
- coarse-grained transformations (e.g., map, filter and join) rather than fine-grained updates to shared state
- reuse intermediate results during iterative in memory not in external stable storage
- immutable data: could run bakcup copies of slow tasks to ease stragglers
- can schedule tasks based on data locality
- RDD degrade gracefully when memory is not enough, while DSM may get block
- RDD is best suited for batch applications that apply the same operation to all elements of a dataset
- general-purpose programming language
What RDD can’t in shared state
- asynchronous
- fine-grained updates
RDD fault tolerance efficiency
- lineage logs the transformations used to build it
- recompute from missing partition
RDD lazy
- only compute when the first time they are used in an action, so that it can pipeline transformations
- gain more times to do optimizations before executing
RDD representing
- a set of partitions, which are atomic pieces of the dataset
- a set of dependencies on parent RDDs
- a function for computing the dataset based on its parents
- metadata about its partitioning scheme
- data placement
RDD dependencies
- narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
- wide dependencies: multiple child partitions may depend on it
- checkpointing is useful for long lineage graphs containing wide dependencies