GraphX

Zhaoyu Luo bio photo By Zhaoyu Luo

Why GraphX?

  • Specialized graph processing systems: Google: Pregel (Page Rank, LDA, Graph traversal): Giraph; Graphlab
    • Graph processing systems often abandon fault tolerance in favor of snapshot recovery to pursuit performance
    • On-disk processing (e.g., MapReduce) is not suit for iterative graph algorithms
  • Early distributed dataflow frameworks
    • Directly implementing iterative graph algorithms using dataflow operators can be challenging
    • Did not expose fine-grained control over the data partitioning

GraphX

  • Essential dataflow patterns: join-map-group-by dataflow operators
    • join: 3 way join
    • map: send message from neighbour vertex to one
    • reduce: gather and accumulate values from different messages
  • Define computation at the granularity of vertices and their neighborhoods and exploit the sparse dependency structure pre-defined by the graph
    • general-purpose distributed dataflow frameworks define computation at granularity of individual items (e.g., filter, map) or across entire collections (e.g., non-broadcast join which requires a shuffle)
  • Enables composition of graphs with unstructured and tabular data
  • Same physical data could be viewed as a graph or collections
  • Focus on bulk-synchronous model and rely on system level techniques (e.g., pipelining and speculation) to address stragglers
    • In asynchronous case: too complex. Completed nodes may be woke by others due to asynchronous message

Property Graph Data Model

  • Single property graph with a pre-declared, sparse structure
    • Dataflow system’s operators (e.g., join) span multiple collections
    • can be logically represented as a pair of vertex and edge property collections
    • Custom data representation: Edges, Vertices, Routing Table

Partitioning

  • vertex-cut (in contrast to edge-cut) partitioning as horizontally partition

Fault Tolerance

  • logical partitioning
  • lineage

General Optimizations

  • GAS Decomposition: split vertex programs into: Gather, Apply, Scatter
    • a pull-based model of message computation: the system asks the vertex program for value between adjacent vertices rather user sending
    • steps:
      1. gather in-neighbour contributions
    • sum them, do apply (using shared-memory: shared KV store to apply deltas)
    • scatter
  • Graph Partitioning: vertex-cut partitioning
    • because there are more edges (O(Vertex)) than vertices (O(Edge)): sync vertices is easier
  • Mirror Vertices: if multiple neighbors are on the same remote machine: send a single message to that machine and then let it to be forwarded to all the neighbors
  • Active Vertices
GraphX Optimizations
  • index reuse
  • multicast join
  • join elimination
  • sequential and index scans of the vertex (depends on different queries)

References