Blog
Programming
Integrating my worldview — throwing out a brick to attract jade (rough thoughts to invite better ones)
Spanner, TAO, Paxos, Raft
CFS Chord Old time
- people were forced to use P2P to build large-scale networks
- the need for P2P has since been replaced by scalable internet services within datacenters
- P2P’s extremes
- scalability
- failure (frequent join and leave)
- security (anyone could access)
- wide area (geo-distributed)
- heterogeneity
BlinkDB Approximation is hard
- sampling and sketches
- low space and time complexity
- make strong assumptions about the query workload
- on-line aggregation
- highly variable performance
- make fewer assumptions about the query workload
- OLA may need to read the entire table to compute a result with satisfactory error bounds
Project Adam: Building an Efficient and Scalable Deep Learning Training System Why is Deep Learning hard?
- extremely hard to construct appropriate features
Parameter Server ML challenges
- needs enormous network bandwidth
- the cost of synchronization and machine latency is high, since ML algorithms are sequential
- fault tolerance is critical: machines and jobs can be preempted
Auth What does “secure” mean?
- Confidentiality
- Authentication
- Integrity. An intruder must not be able to change a message undetected
- Accountability. No party can deny having sent a message
- Availability (DoS)
Eventual Consistency Strong Consistency
- One copy, up to date
- Why do we need it? It makes certain applications easier to implement
- Why not? scale, latency, availability
Pregel-like Graph Processing Systems Comparison Pregel-like
- Bulk Synchronous Parallel model
- is a vertex state machine (think like a vertex)
- but may encounter the straggler problem
- use vertex-centric approach, graph parallel
- each vertex runs parallel: gather sum apply scatter
- address in-memory batch processing of large graphs
- terminate when all vertices are inactive and no more messages are in transit
- computation is performed on locally stored data
- Pregel only supports graphs that fit in memory
- master/workers model: the master partitions the input graph and assigns each partition to a worker
- Global state?
- global aggregators
- phases of the algorithm
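The vertex-centric superstep loop above can be sketched in a few lines. This is a toy, single-process illustration of the BSP "think like a vertex" model, not Pregel's actual API (`bsp_max` is a made-up name): each vertex adopts the largest value it hears, scatters it to its neighbors, and the computation terminates when no messages are in transit.

```python
def bsp_max(graph, values):
    """Propagate the maximum value through a graph, vertex-centric BSP style.
    graph: {vertex: [neighbors]}, values: {vertex: initial value}."""
    values = dict(values)
    # superstep 0: every vertex announces its value to its neighbors
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])
    # later supersteps: gather messages, apply the max, scatter on change;
    # a vertex that receives nothing stays halted, and the whole computation
    # terminates when no messages are in transit
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox   # synchronization barrier between supersteps
    return values
```

In a real system each partition of vertices runs on a different worker and the barrier is a global synchronization point, but the control flow is the same.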
GraphX Why GraphX?
- Specialized graph processing systems: Pregel from Google (PageRank, LDA, graph traversal), Giraph, GraphLab
- Graph processing systems often abandon fault tolerance in favor of snapshot recovery to pursue performance
- On-disk processing (e.g., MapReduce) is not suited for iterative graph algorithms
- Early distributed dataflow frameworks
- Directly implementing iterative graph algorithms using dataflow operators can be challenging
- Did not expose fine-grained control over the data partitioning
Storm, Heron Storm
- Topology: is a DAG where vertices represent computation and the edges represent the data flow
- is a logical plan from a DB system perspective
- more than one worker process on the same machine may be executing different part of the same topology
- one process is one JVM (While Heron Instance is a JVM process, runs only a single task of spout or bolt)
- worker processes serve as containers
- each worker runs a supervisor that communicates with Nimbus
- one JVM has multiple threads (executors)
- executors provide intra-topology parallelism
- one thread has multiple tasks
- tasks provide intra-bolt/intra-spout parallelism
- Nimbus is the “JobTracker”: the touchpoint between the user and the Storm system
- Nimbus and Supervisor daemons are fail-fast and stateless, and all their state is kept in Zookeeper or on the local disk
- Single point of failure in Storm, but not in Heron
- Supervisor: every 3 seconds, it reads worker heartbeats from local state and classifies those workers as either valid, time out, not started, or disallowed
Quorums Blocking
coordinator and servers could block -> we don’t need all replies; a majority of replies is enough
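Why is a majority enough? Any two majority quorums must overlap in at least one node, so a value acknowledged by one majority is seen by any later majority read. A brute-force check of this intersection property (a toy sketch, not a real quorum protocol; `majority` and `quorums_intersect` are made-up names):

```python
from itertools import combinations

def majority(n):
    """Smallest quorum size that guarantees any two quorums intersect."""
    return n // 2 + 1

def quorums_intersect(n):
    """Verify by brute force that every pair of majority quorums overlaps."""
    nodes = range(n)
    q = majority(n)
    return all(set(a) & set(b)
               for a in combinations(nodes, q)
               for b in combinations(nodes, q))
```

The overlap is the reason a blocked or failed minority cannot stall progress: the coordinator proceeds as soon as any majority replies.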
Spark Streaming Current distributed stream processing models
- require hot replication
- have long recovery times
- do not handle stragglers
- Continuous operator model: Storm, TimeStream, MapReduce Online and streaming DBs
- long-running, stateful operators receive each record, update internal state and send new records
- since it is stateful, it is hard to provide fault tolerance efficiently
- unlike a Map or Reduce job, there is no start and no stop: operators are always running
- how to do recovery?
- replication and also synchronization protocol
- to ensure replicas receive messages in the same order
- upstream backup: upstream nodes buffer sent messages and replay them to a new copy of a failed node
- this takes longer to recover: the whole system must wait while a new node serially rebuilds the failed node’s state by rerunning the buffered messages
2PC Distributed Atomic Commit
Why hard?
Remus and HA NFS Failures
- What kind of failures in hardware/software can it deal with?
- Fail-stop failure and only one of primary or backup fails
- it can even handle both crashing temporarily, since the data remains crash-consistent
- What kind can’t it deal with?
- Cannot recover from non-fail-stop conditions
- Both primary and backup fail permanently
- Software bugs, which can propagate application errors to the backup
- What happens when an unexpected failure arises?
- if primary fails
- the backup waits for the next checkpoint transmission until a timeout fires
- the surviving node then promotes itself to primary and resumes execution from the most recent checkpoint
- if backup fails
- the primary waits until the backup times out responding to commit requests
- the primary must then provision a new backup
Tachyon Problem
- I/O is the bottleneck of a system
- Reads are easy to deal with, but writes cannot scale out due to replication to multiple nodes
remote write protocols, state machine replication ReplicatedService
2 nodes at first
SQL-on-Hadoop SQL-on-Hadoop
- Both Hive-MR and Hive-Tez are CPU-bound during scan operations, mostly due to Java deserialization
- A faster Java deserialization lib or JSON lib would help a lot
Hive
- Tez eliminates the startup and materialization overheads of MapReduce
- However, it does not avoid Java deserialization overheads during scan operations
- Tez pipelines data through execution stages instead of creating temporary intermediate files
Grapevine, Leases Grapevine: An Exercise in Distributed Computing
- Compared with hardware, software failures can be more arduous: all servers running the same version can be crashed by the same bug, resulting in a total outage
nohup
is a must, to avoid having all the servers crash in a short period before anyone notices
Impala Impala
- Business intelligence analytics
- An MPP database, strong at multi-user workloads
- query plan generation
- single node plan
- do partial aggregation “combines” in plan tree
- resource allocation
- multiple queries
- admission control
- it sacrifices resource efficiency for performance
- shared-nothing
- I/O operations are always full
- Parquet
- loads working set in memory
- streaming data
- llama: low latency application master
- virtual function calls: inline the function body
- loop unrolling
- LLVM: code generation atop LLVM
- quasi quotes
- get AST
- get optimized code
Spark SQL Why Spark SQL
Previous: Shark
- Could not be applied to intermediate results of a Spark program
- A single SQL string is hard to debug and hard to modularize
- The Hive optimizer is MapReduce-specific; we need new data types and new data sources
Hive Hive
SQL on Hadoop or Spark: Hive (MR/Tez), SparkSQL (Spark), Impala
NFS NFS
Why we need NFS?
Sparrow Omega Sparrow
- Yarn: centralized scheduling: limited throughput
- scheduling without coordination
- Sparrow: distributed scheduling
- Omega: uncoordinated, multiple schedulers
Time, Clocks, and the Ordering of Events Time, Clocks, and the Ordering of Events in a Distributed System
- Process: stream of events
- Communicate: send message, receive
- Model: message send takes time, reliable
- Ordering:
- a, c: a is before c
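The happened-before ordering can be implemented with Lamport's logical clock rule: increment the clock on every local event, attach the clock to each sent message, and on receive take the max of the local clock and the message timestamp, plus one. A minimal sketch (class and method names are my own):

```python
class LamportClock:
    """Logical clock: timestamps respect the happened-before relation."""

    def __init__(self):
        self.t = 0

    def tick(self):
        """Local event: advance the clock."""
        self.t += 1
        return self.t

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def recv(self, msg_t):
        """Receive event: merge with the message timestamp."""
        self.t = max(self.t, msg_t) + 1
        return self.t
```

This guarantees that if event a happened before event b, then timestamp(a) < timestamp(b); the converse does not hold, which is exactly why the paper introduces the partial order first.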
SAMC, memory, flash, networks failure SAMC
- Using knowledge of a system to craft its semantics can help alleviate the state-space explosion in test cases
- Future testing work will center on pattern extraction, which is still manual so far — and we will not need low-level test engineers as much
- Future test engineers will need to learn how to model code as a specific protocol, then generate tests from these protocols
- This testing methodology seems more precise than fuzz testing and gives more insight into why the code fails
RDD RDD
- read-only, partitioned collection of records
- a restricted form of shared memory
- coarse-grained transformations (e.g., map, filter and join) rather than fine-grained updates to shared state
- reuse intermediate results during iterative in memory not in external stable storage
- immutable data: backup copies of slow tasks can be run to mitigate stragglers
- can schedule tasks based on data locality
- RDDs degrade gracefully when memory is insufficient, while DSM may block
- RDD is best suited for batch applications that apply the same operation to all elements of a dataset
- general-purpose programming language
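The lineage idea behind coarse-grained transformations can be sketched as a toy in-process model (not Spark's API; `ToyRDD` and `recompute` are made-up names): each RDD records the transformation that produced it, so a lost partition is rebuilt from its parent rather than restored from a replica.

```python
class ToyRDD:
    """Immutable-by-convention partitioned collection with lineage."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions       # list of lists
        self.parent, self.fn = parent, fn  # lineage: how to rebuild from parent

    def map(self, f):
        """Coarse-grained transformation: same f applied to every element."""
        out = [[f(x) for x in p] for p in self.partitions]
        return ToyRDD(out, parent=self, fn=lambda p: [f(x) for x in p])

    def recompute(self, i):
        """Rebuild partition i from the parent's data via the lineage fn."""
        return self.fn(self.parent.partitions[i])

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None                   # simulate a lost partition
doubled.partitions[1] = doubled.recompute(1)   # rebuild from lineage, no replica
```

Because the transformation is the same for every element, the lineage record is tiny — one function per RDD, not one log entry per update, which is what makes this cheaper than replicating fine-grained shared-memory writes.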
Tez Why Tez?
- allow users to model computation as a DAG
- runtime query optimization, such as pruning data partitions, based on online information
- YARN-compatible security, data-locality awareness, resource-reuse, fault-tolerance and speculation
- pluggable API
Disk failures, data corruptions An Analysis of Data Corruption in the Storage Stack
- Table 1: Corruption classes summary
- This summary overturns my illusion that “once I have a checksum and verified it, I can trust the data”. In fact, a block of data may pass checksum verification but still be inconsistent, since the firmware can return a false negative or a misdirected write can introduce a logic error
YARN Why YARN?
- Higher scalability: As the central arbiter of the compute cluster, the JobTracker was also responsible for (each of these limited its scalability):
- admission control,
- tracking the liveness of TaskTrackers (to re-execute running tasks or tasks whose output becomes unavailable),
- launching tasks speculatively to route around slow nodes,
- reporting job status to users through a web server,
- recording audit logs and aggregate statistics,
- authenticating users and many other functions;
- Decoupling from MapReduce programming model
Why it stops Why computers stop
- The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement.
- The top priority for improving system availability is to reduce administrative mistakes by making self-configured systems with minimal maintenance and minimal operator interaction.
- Bohr vs Heisenbugs
MapReduce MapReduce
- M/R programming model
- Execution issues, runtime problems
- Scheduling and fairness across jobs
- greedy
- fair scheduler (slot-based)
- flow-based fair scheduler
DCTCP, incast Basics of Distributed Computing
- communication
- failure
- time
HDFS Timeline
- How to build large-scale computer systems? 1993
- 1995: internet services (Brewer)
- 1996: how to do DB sorting on SMP (Jim Gray)
- 1997: clusters: Data; Hard to run at scale (performance)
- 1999: look for a job (google nowsort)
- 2003: GFS
- 2004: MR (start to care about node failure) (faults at scale)
- SMPs: shared-memory machines ~ multicore (but it scales)
- MPP: massively parallel processors
- How about commodity hardware?
- make a cluster
- key technology: networks
DMV Numbers
- Fire hydrant: 15ft
- park: not within 10ft
- Red flashing school bus: stopped at 20ft
- railroad parking: 25ft
- crosswalk parking: 15ft
- signal before turning: 100ft
- school zone when children are present: 15mph
- speed limit in business/residential areas: 25mph
U-Net, DCTCP Why this paper
- How to improve the performance and flexibility of the networking layer?
- move parts of the protocol processing into user space
- remove kernel completely from the critical path and allow the communication layers used by each process to be tailored to its demands
- the entire protocol stack should be placed at user level
- OS and hardware should allow protected user-level access directly to the network
Google Datacenter Network Network Construction Principles
- Clos topologies
- could scale to arbitrary size
- in-built path diversity and redundancy
- problems: managing fiber fanout, complex routing across multiple equal-cost paths
- Merchant silicon
- trade-offs
- failures are more frequent
- small shared buffer
- ECMP(equal cost multi path), network balancing (lots of contention)
- Centralized control protocols
- each switch calculates its forwarding table from a dynamically elected central switch
- central control and configuration (SDN)
The Datacenter as a computer, Above the Clouds The Datacenter as a Computer
- “DC” vs WSC
- key characteristics, goals
- details
- constraints’ impact on -> application design and datacenter design
- cloud computing
- EC2 EMR
- spot pricing: auction, spot instances run when bid price > spot price
- illusion of infinite capacity (CPU, storage)
- scale out
- pay-as-you-go
KVM VM operations
sudo virsh list --all
sudo virsh start ubuntu1404
virt-viewer ubuntu1404
Tao of Cooking Principles
- add salt early and the meat tightens up
Linux disk partition fdisk
handles disks smaller than 2TB
DB topics - Query Query optimization
Naive approaches and their problems
DB topics - Recovery ARIES
ARIES is complex because it combines no-force and steal
Buffer manager
- force or no force? (when XACT commits, do you write its dirty pages to disk?)
- force: write dirty pages to disk (simple, slow)
- it avoids the need to REDO on restart
- steal or no steal? (can you evict a dirty page from a running XACT?)
- no steal: you cannot evict a dirty page of a running XACT (simple, slow)
- steal: write uncommitted page into disk
- it needs UNDO to correct the data in the disk
Concurrency Models Message Passing
Message passing sends a message to a process (which may be an actor or object) and relies on the process and the supporting infrastructure to select and invoke the actual code to run
Topics in Database Management Systems Concurrency Control
- much deeper detail
- efficient algorithms (performance rather than correctness)
- degree of consistency (between chaos and perfect)
- What concurrent access to data?
- performance (e.g. transaction:
- T1: transfer money from A to B
- T2: giving 6% interest rate to all accounts)
- Idea: structure access to DB as a bunch of Transactions
- each transaction takes the DB from a consistent state to a consistent state
- then serial execution of a set of transactions guarantees consistency
- want to run transactions concurrently but it should “look like” they ran serially
- certain “conflicts” enforce an order among the transactions
- WR, RW, WW
- Techniques: build serialization graph
- Schedule is conflict serializable if and only if the serialization graph has no cycles
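The serialization-graph test above can be sketched directly: add an edge Ti → Tj for every WR/RW/WW conflict where Ti's operation precedes Tj's on the same item, then check the graph for cycles. This is a toy checker under an assumed input encoding (a schedule as a list of `(txn, op, item)` triples; the function names are made up):

```python
def serialization_graph(schedule):
    """Edge (Ti, Tj) when an op of Ti conflicts with a later op of Tj:
    same item, different transactions, at least one write."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and 'W' in (op1, op2):
                edges.add((t1, t2))
    return edges

def conflict_serializable(schedule):
    """Conflict serializable iff the serialization graph is acyclic."""
    graph = {}
    for a, b in serialization_graph(schedule):
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def acyclic_from(v):          # DFS cycle detection
        if v in visiting:
            return False          # back edge -> cycle
        if v in done:
            return True
        visiting.add(v)
        ok = all(acyclic_from(w) for w in graph.get(v, ()))
        visiting.remove(v)
        done.add(v)
        return ok

    return all(acyclic_from(v) for v in list(graph))
```

An acyclic graph also tells you an equivalent serial order: any topological sort of it.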
Principles of Programming Languages Lambda Calculus
- Full-beta reduction (random)
- Normal order (leftmost/outermost)
- call-by-value, most programming language, evaluate arguments before passing them, evaluate expression inside parenthesis first
- call-by-name, no reductions inside abstractions (lazy strategy, call-by-need)
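The difference between the strategies can be demonstrated in Python, whose calls evaluate arguments eagerly (call-by-value in this note's sense); call-by-name is simulated with thunks — zero-argument lambdas forced only when needed. Function names here are illustrative, not standard terminology from any library:

```python
def diverges():
    """Stands in for a non-terminating (or erroring) argument expression."""
    raise RuntimeError("argument was evaluated")

def const_cbv(x):
    """Call-by-value: the caller has already evaluated x before this runs."""
    return 42

def const_cbn(x_thunk):
    """Call-by-name: x arrives as an unevaluated thunk; we never force it."""
    return 42
```

`const_cbv(diverges())` raises before `const_cbv` even starts, while `const_cbn(lambda: diverges())` returns 42 because the unused thunk is never forced — the behavior the lazy (call-by-need) strategy generalizes by also caching the thunk's first result.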
Political Order in Changing Societies Although I have not read the book itself, after reading almost all of its reviews I find its inspiration for me is still enormous.
Supervised Learning K-NN Basic Approach
- store all training examples
- For new (i.e., test) examples, find the k nearest stored examples
- combine these k into the answer for the test example
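The three steps above fit in a few lines (a minimal sketch; `knn_predict` is a made-up name, distances are squared Euclidean, and ties among labels fall to `Counter`'s ordering):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs — the stored examples.
    Returns the majority label among the k nearest to `query`."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Storing all examples makes training trivial and prediction O(n) per query, which is the classic k-NN trade-off.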
Logic Relations
- Formal System = Formal Language + Inference Rules
- Propositional Calculus vs FOPC
- Propositional Calculus is Formal System
- FOPC is Formal System
- FOPC has quantifier while Propositional Calculus does not
- Once we have rules for inferring nonquantified sentences from quantified sentences, it becomes possible to reduce first-order inference to propositional inference.
- In Ontological Commitment, While propositional logic commits only to the existence of facts, first-order logic commits to the existence of objects and relations and thereby gains expressive power.
- Logical Consequence (entailment)
- Syntactic Consequence: “therefore”, ⊢
- KB ⊢i A: A is derived from KB by inference procedure i
- The study of the syntactic consequence (of a logic) is called (its) proof theory
- sequent
- Semantic Consequence: “entails”, “models”, ⊨
- The formal definition of entailment is this: α ⊨ β if and only if, in every model in which α is true, β is also true
- Material Consequence (Conditional): →; p → q; if p then q
- logic cs540 Chuck Dyer
- FOPC cs540 Chuck Dyer
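The definition of entailment lends itself to a brute-force model checker: enumerate every model over the symbols and verify β holds wherever α does. A toy sketch (my own encoding: sentences are Python predicates over a model dict, `entails` is a made-up name):

```python
from itertools import product

def entails(alpha, beta, symbols):
    """alpha |= beta: in every model where alpha is true, beta is true.
    alpha, beta: functions from a model (dict symbol -> bool) to bool."""
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if alpha(model) and not beta(model):
            return False   # counter-model found
    return True
```

This enumeration over 2^n models is exactly why, once quantified sentences are reduced to propositional ones, first-order inference becomes mechanical (if expensive).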
RPC Before RPC
- sockets were the fundamental mechanism, with a poor read/write-only interface
Kerberos Attacks
- tapping
- spurious messages injection
- replay
Google File System Consistency model
- defined: a mutation succeeds without interference from concurrent writers
- it may include mingled fragments from multiple mutations
- consistent: all clients will see the same data
- A global mutation order is defined first by the lease grant order chosen by the master
- then within a lease by the serial numbers assigned by the primary
Quant mining A day of brainstorming about quant “mining” jobs in finance; my thoughts follow
AFS Goals
- scale: how many clients can server serve?
- reasonable definition of consistency (top down)
- use local disk
LOCUS Reviews
I/O-lite, Zero-copy Design
- allow all subsystems and applications to share the same buffers: create an Abstract Data Type layer to represent I/O data
- each process uses the same virtual address for simplification
- so the number of stacks is limited
- use immutable I/O buffers
- avoid synchronization
- how to share memory? Capability-based addressing
- IPC mechanism: page remapping & shared memory
- provide ACL
Log-Structured File System Introduction
Principle: collect large amounts of new data in a file cache in main memory, then write the data to disk in a single large I/O that can use all of the disk bandwidth
Unix File System Introduction
Kernel Instrumentation, KernInst KernInst
structure
Exokernel Introduction
- A small kernel securely exports all hardware resources through a low-level interface to untrusted library operating systems
- The single overriding goal: to separate protection from management -> to provide an interface that is as low-level as possible, the approaches:
- give each application its own virtual machine
- exokernel way: export hardware resources rather than emulating
- three techniques to export resources securely:
- secure bindings
- visible resource revocation
- abort protocol
Synthesis kernel Design Goal
- high performance
- combine kernel code synthesis, which decreases kernel call overhead through specialization
- reduce synchronization
- self-tuning capability for dynamic load and configuration changes
- a simple, uniform and intuitive model of computation with a high-level interface
Fix UEFI Ubuntu's Grub in Boot Manager I bought an HP laptop that shipped with Windows 8.1, but it would skip over Ubuntu’s Grub after I installed it.
Virtualization Memory Resource Management in VMware ESX Server
- double paging problem
- use randomized page replacement policy, since paging will be uncommon
- ballooning: let the guest OS invoke its own native memory management algorithms; if that fails, fall back to the paging mechanism
- shares-per-page ratio, relative resource acquisition right
Zestybench, measure IPC This post is a digest of the first paper in my CS736 which will talk about the IPC benchmark briefly.
Review of First-Class User-Level Threads Two-minute warning
- User-level thread packages can set the actual duration of the warning by writing a value in the virtual processor data structure to avoid acquiring a spin lock near the end of the virtual processor quantum
- if an about-to-be-preempted thread is working on an important computation that needs to be continued on another physical processor, two-minute warning handler can save the state of the thread and then send an explicit interrupt to another virtual processor in the same address space, prompting it to migrate and run the preempted thread
Review of MACH vm_allocate
- User Process calls the Kernel with vm_allocate()
- Kernel calls the Pager about page initialization
- User Process hits a page fault
- Kernel requests the page from the Pager
- Pager provides the page
- Kernel resumes the User Process
Reviews of DEMOS DEMOS
- links provide both message paths and optional data sharing between tasks
- Switchboard task is provided to allow mutually consenting arbitrary tasks to communicate
- security is a big issue to Los Alamos
Reviews of Monitors Monitors
NLP 2014-09-11
- If L separates every pair of words in {w1, w2, …, wn}, then any M with L = L(M) must have at least n states
- regular language
- formal language
Review of Communicating Sequential Processes Introduction
Essential Proposals
AI Course Main Topics
- Learning from Data
- View problem solving as heuristic search through a large space of possible solutions
- Use probabilistic reasoning and learning
- Use logical reasoning – representing knowledge about the world in computers
- Neural networks
- Philosophical Foundation of AI
Review of A Hardware Architecture for Implementing Protection Rings Introduction
Mechanisms are applicable to any computer system which uses segmentation as a memory addressing scheme.
Using lvm to manage multi-disks Install
- insert disks in your free disk slots
sudo apt-get install lvm2
Configure nginx with rsyslog to assemble your log files Rsyslog Configurations
We need a local rsyslog to detect local nginx log change and a remote rsyslog to receive logs.
Use logrotate: rotate as minute and customize new log filename logrotate.d
Create a configuration file in
/etc/logrotate.d/
like:
Generating a Poisson process To simulate customer arrival times or server request arrival times, the usual approach is a Poisson process
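The standard way to generate such a process: the inter-arrival gaps of a Poisson process with rate λ are i.i.d. Exponential(λ), so summing exponential draws yields the arrival times. A sketch (`poisson_arrivals` is a made-up name):

```python
import random

def poisson_arrivals(rate, horizon, seed=0):
    """Arrival times of a Poisson process with the given rate on [0, horizon],
    built by summing i.i.d. Exponential(rate) inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # exponential gap, mean 1/rate
        if t > horizon:
            return times
        times.append(t)
```

The expected number of arrivals is rate × horizon, which makes a quick sanity check on any simulation run.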
Configuring wireless on Linux List the available wireless networks:
sudo iwlist wlan0 scan
Configuring an IP tunnel between IDCs on Linux Suppose machine 10.1.0.10/16 (public IP 1.2.3.4) in datacenter A and machine 192.168.1.10/24 (public IP 5.6.7.8) in datacenter B need an internal-network tunnel over the internet
Giving a DMZ machine a public IP via transparent NAT on Linux Machines behind the DMZ often have only one NIC, so they usually have only an internal IP (even if a public IP were configured, external machines could not ping it, so it would be useless — unless routes were configured upstream)
Setting up the fcitx pinyin input method on Linux The ibus pinyin input method bundled with Ubuntu 13.10 is broken — it keeps producing the wrong characters. The Sogou input method would not install either; after two hours I finally got fcitx googlepin working
A minimal keepalived configuration Background
keepalived is built on top of LVS and implements the VRRP protocol
Configuring haproxy and logging via rsyslog Installation
Recommended:
apt-get install haproxy
which ships with a restart script