GraphX

Specialized graph processing systems: Google: Pregel (Page Rank, LDA, Graph traversal): Giraph; Graphlab
- Graph processing systems often abandon fault tolerance in favor of snapshot recovery to pursuit performance
- On-disk processing (e.g., MapReduce) is not suit for iterative graph algorithms
Early distributed dataflow frameworks
- Directly implementing iterative graph algorithms using dataflow operators can be challenging
- Did not expose fine-grained control over the data partitioning

Essential dataflow patterns: join-map-group-by dataflow operators
- join: 3 way join
- map: send message from neighbour vertex to one
- reduce: gather and accumulate values from different messages
Define computation at the granularity of vertices and their neighborhoods and exploit the sparse dependency structure pre-defined by the graph
- general-purpose distributed dataflow frameworks define computation at granularity of individual items (e.g., filter, map) or across entire collections (e.g., non-broadcast join which requires a shuffle)
Enables composition of graphs with unstructured and tabular data
Same physical data could be viewed as a graph or collections
Focus on bulk-synchronous model and rely on system level techniques (e.g., pipelining and speculation) to address stragglers
- In asynchronous case: too complex. Completed nodes may be woke by others due to asynchronous message

Single property graph with a pre-declared, sparse structure
- Dataflow system’s operators (e.g., join) span multiple collections
- can be logically represented as a pair of vertex and edge property collections
- Custom data representation: Edges, Vertices, Routing Table

GAS Decomposition: split vertex programs into: Gather, Apply, Scatter
- a pull-based model of message computation: the system asks the vertex program for value between adjacent vertices rather user sending
- steps:
  1. gather in-neighbour contributions
- sum them, do apply (using shared-memory: shared KV store to apply deltas)
- scatter
Graph Partitioning: vertex-cut partitioning
- because there are more edges (O(Vertex)) than vertices (O(Edge)): sync vertices is easier
Mirror Vertices: if multiple neighbors are on the same remote machine: send a single message to that machine and then let it to be forwarded to all the neighbors
Active Vertices

希言自然