reduce: gather and accumulate values from different messages
Define computation at the granularity of vertices and their neighborhoods and exploit the sparse dependency structure pre-defined by the graph
general-purpose distributed dataflow frameworks define computation at granularity of individual items (e.g., filter, map) or across entire collections (e.g., non-broadcast join which requires a shuffle)
Enables composition of graphs with unstructured and tabular data
Same physical data could be viewed as a graph or collections
Focus on bulk-synchronous model and rely on system level techniques (e.g., pipelining and speculation) to address stragglers
In asynchronous case: too complex. Completed nodes may be woke by others due to asynchronous message
Property Graph Data Model
Single property graph with a pre-declared, sparse structure
can be logically represented as a pair of vertex and edge property collections
Custom data representation: Edges, Vertices, Routing Table
Partitioning
vertex-cut (in contrast to edge-cut) partitioning as horizontally partition
Fault Tolerance
logical partitioning
lineage
General Optimizations
GAS Decomposition: split vertex programs into: Gather, Apply, Scatter
a pull-based model of message computation: the system asks the vertex program for value between adjacent vertices rather user sending
steps:
gather in-neighbour contributions
sum them, do apply (using shared-memory: shared KV store to apply deltas)
scatter
Graph Partitioning: vertex-cut partitioning
because there are more edges (O(Vertex)) than vertices (O(Edge)): sync vertices is easier
Mirror Vertices: if multiple neighbors are on the same remote machine: send a single message to that machine and then let it to be forwarded to all the neighbors
Active Vertices
GraphX Optimizations
index reuse
multicast join
join elimination
sequential and index scans of the vertex (depends on different queries)