RDD

Zhaoyu Luo bio photo By Zhaoyu Luo

RDD

  • read-only, partitioned collection of records
  • a restricted form of shared memory
  • coarse-grained transformations (e.g., map, filter and join) rather than fine-grained updates to shared state
  • reuse intermediate results during iterative in memory not in external stable storage
  • immutable data: could run bakcup copies of slow tasks to ease stragglers
  • can schedule tasks based on data locality
  • RDD degrade gracefully when memory is not enough, while DSM may get block
  • RDD is best suited for batch applications that apply the same operation to all elements of a dataset
  • general-purpose programming language
What RDD can’t in shared state
  • asynchronous
  • fine-grained updates
RDD fault tolerance efficiency
  • lineage logs the transformations used to build it
  • recompute from missing partition
RDD lazy
  • only compute when the first time they are used in an action, so that it can pipeline transformations
  • gain more times to do optimizations before executing

RDD representing

  1. a set of partitions, which are atomic pieces of the dataset
    • a set of dependencies on parent RDDs
    • a function for computing the dataset based on its parents
    • metadata about its partitioning scheme
    • data placement
RDD dependencies
  • narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
  • wide dependencies: multiple child partitions may depend on it
    • checkpointing is useful for long lineage graphs containing wide dependencies