ML challenges
- needs enormous network bandwidth
- synchronization cost and machine latency are high, since many ML algorithms are sequential
- fault tolerance is critical: machines and jobs can be preempted
Parameter Server
- picking the right systems techniques, adapting them to the machine learning algorithms
- flexible consistency (sequential, eventual, bounded delay consistency)
- examples
- systems tricks / optimizations
- Parameter Server atop Tez, MR, Spark
- Spark could also do the same thing (partition model, fault tolerance)
- why not use Spark? is its graph more convenient?
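The three consistency models listed above differ only in how far a worker may run ahead of the slowest one. A minimal sketch, assuming a simplified per-worker iteration counter (the class `BoundedDelayClock` and its methods are hypothetical names for illustration, not the Parameter Server API):

```python
from typing import Optional


class BoundedDelayClock:
    """Counts finished iterations per worker and gates the start of the next one."""

    def __init__(self, num_workers: int, max_delay: Optional[int]):
        # max_delay = 0    -> sequential: nobody runs ahead of the slowest worker
        # max_delay = None -> eventual: workers never wait for each other
        # max_delay = k    -> bounded delay: at most k iterations ahead
        self.max_delay = max_delay
        self.finished = [0] * num_workers

    def can_start(self, worker: int) -> bool:
        """Return True if `worker` may begin its next iteration now."""
        if self.max_delay is None:
            return True
        lead = self.finished[worker] - min(self.finished)
        return lead <= self.max_delay

    def finish(self, worker: int) -> None:
        """Record that `worker` completed one iteration."""
        self.finished[worker] += 1
```

With `max_delay=1` and two workers, worker 0 can finish two iterations and is then blocked until worker 1 finishes one. A larger delay trades fresher parameters for less waiting.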
data
- k,v -> memcached
- algo-specific
- algo-agnostic
Developer Advantages
- enable application-specific code to remain concise
- provide robust, versatile, and high-performance ML implementation
Engineering Challenges
- Communication:
- sending individual key-value pairs is inefficient; a worker instead sends a segment of a vector or an entire row of a matrix (a part of the object)
- local filter + compression
- Fault tolerance: support dynamic scaling
- replicate the model; ack a worker's push only after replication completes (chain replication)
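The communication and fault-tolerance points above can be sketched together. This is an illustrative toy under stated assumptions: the threshold filter stands in for whatever local filter an application would use, the wire format is invented for the example, and an in-process method call stands in for the network hop between replicas. The names `encode_segment`, `decode_segment`, and `Server` are hypothetical.

```python
import struct
import zlib


def encode_segment(start, values, threshold=0.0):
    """Pack one contiguous segment as a single message.

    Local filter: keep only entries with |v| > threshold, then compress.
    """
    kept = [(i, v) for i, v in enumerate(values) if abs(v) > threshold]
    payload = struct.pack("<II", start, len(kept))
    for offset, v in kept:
        payload += struct.pack("<Id", offset, v)
    return zlib.compress(payload)


def decode_segment(message):
    """Inverse of encode_segment: return a list of (key, value) pairs."""
    payload = zlib.decompress(message)
    start, count = struct.unpack_from("<II", payload, 0)
    entries = []
    for n in range(count):
        # Header is 8 bytes; each entry is 4 (offset) + 8 (double) bytes.
        offset, v = struct.unpack_from("<Id", payload, 8 + 12 * n)
        entries.append((start + offset, v))
    return entries


class Server:
    """Holds a weight vector and forwards every push to the next replica."""

    def __init__(self, size, replica=None):
        self.weights = [0.0] * size
        self.replica = replica  # next server in the chain, if any

    def push(self, message):
        """Apply the update, replicate it, and only then acknowledge."""
        for key, delta in decode_segment(message):
            self.weights[key] += delta
        # The ack (True) is returned only after the rest of the chain acked.
        return self.replica is None or self.replica.push(message)
```

A worker would call `head.push(encode_segment(2, [0.5, 0.0, -1.5]))` once for the whole segment instead of sending three separate key-value messages; the zero entry is filtered out before compression.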
Risk minimization
Reference