Blog
Programming
Integrating my worldview — throwing out a brick to attract jade (rough thoughts to invite better ones)
Spanner, TAO, Paxos, Raft
CFS Chord Old time
- people were forced to use P2P to build large-scale networks
- the need for P2P has since been replaced by scalable internet services within datacenters
- P2P’s extremes
- scalability
- failure (frequent join and leave)
- security (anyone could access)
- wide area (geo-distributed)
- heterogeneity
BlinkDB Approximation is hard
- sampling and sketches
- low space and time complexity
- make strong assumptions about the query workload
- on-line aggregation
- highly variable performance
- make fewer assumptions about the query workload
- OLA may need to read the entire table to compute a result with satisfactory error bounds
Project Adam: Building an Efficient and Scalable Deep Learning Training System Why is Deep Learning hard?
- extremely hard to construct appropriate features
Parameter Server ML challenges
- needs enormous network bandwidth
- the cost of synchronization and machine latency is high, since ML algorithms are sequential
- fault tolerance is critical: machines and jobs can be preempted
Auth What does “secure” mean?
- Confidentiality
- Authentication
- Integrity. An intruder must not be able to change a message undetected
- Accountability. No party can deny having sent a message
- Availability (DoS)
Eventual Consistency Strong Consistency
- One copy, up to date
- Why do we need it? It makes certain applications easier to implement
- Why not? scale, latency, availability
Pregel-like Graph Processing Systems Comparison Pregel-like
- Bulk Synchronous Parallel model
- is a vertex state machine (think like a vertex)
- but may encounter the straggler problem
- use vertex-centric approach, graph parallel
- each vertex runs parallel: gather sum apply scatter
- address in-memory batch processing of large graphs
- terminate when all vertices are inactive and no more messages are in transit
- computation is performed on locally stored data
- Pregel only supports graphs that fit in memory
- master/workers model: the master partitions the input graph and assigns each partition to a worker
- Global state?
- global aggregators
- phases of the algorithm
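The vertex-centric superstep loop above can be sketched in a few lines. This is a toy, single-process illustration of the BSP "think like a vertex" model, not Pregel's actual API (`bsp_max` is a made-up name): each vertex adopts the largest value it hears, scatters it to its neighbors, and the computation terminates when no messages are in transit.

```python
def bsp_max(graph, values):
    """Propagate the maximum value through a graph, vertex-centric BSP style.
    graph: {vertex: [neighbors]}, values: {vertex: initial value}."""
    values = dict(values)
    # superstep 0: every vertex announces its value to its neighbors
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])
    # later supersteps: gather messages, apply the max, scatter on change;
    # a vertex that receives nothing stays halted, and the whole computation
    # terminates when no messages are in transit
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox   # synchronization barrier between supersteps
    return values
```

In a real system each partition of vertices runs on a different worker and the barrier is a global synchronization point, but the control flow is the same.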
GraphX Why GraphX?
- Specialized graph processing systems: Pregel from Google (PageRank, LDA, graph traversal), Giraph, GraphLab
- Graph processing systems often abandon fault tolerance in favor of snapshot recovery to pursue performance
- On-disk processing (e.g., MapReduce) is not suited for iterative graph algorithms
- Early distributed dataflow frameworks
- Directly implementing iterative graph algorithms using dataflow operators can be challenging
- Did not expose fine-grained control over the data partitioning
Storm, Heron Storm
- Topology: is a DAG where vertices represent computation and the edges represent the data flow
- is a logical plan from a DB system perspective
- more than one worker process on the same machine may be executing different part of the same topology
- one process is one JVM (While Heron Instance is a JVM process, runs only a single task of spout or bolt)
- worker processes serve as containers
- each worker runs a supervisor that communicates with Nimbus
- one JVM has multiple threads (executors)
- executors provide intra-topology parallelism
- one thread has multiple tasks
- tasks provide intra-bolt/intra-spout parallelism
- Nimbus is the “JobTracker”: the touchpoint between the user and the Storm system
- Nimbus and Supervisor daemons are fail-fast and stateless, and all their state is kept in Zookeeper or on the local disk
- Single point of failure in Storm, but not in Heron
- Supervisor: every 3 seconds, it reads worker heartbeats from local state and classifies those workers as either valid, time out, not started, or disallowed
Quorums Blocking
coordinator and servers could block -> we don’t need all replies; a majority of replies is enough
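Why is a majority enough? Any two majority quorums must overlap in at least one node, so a value acknowledged by one majority is seen by any later majority read. A brute-force check of this intersection property (a toy sketch, not a real quorum protocol; `majority` and `quorums_intersect` are made-up names):

```python
from itertools import combinations

def majority(n):
    """Smallest quorum size that guarantees any two quorums intersect."""
    return n // 2 + 1

def quorums_intersect(n):
    """Verify by brute force that every pair of majority quorums overlaps."""
    nodes = range(n)
    q = majority(n)
    return all(set(a) & set(b)
               for a in combinations(nodes, q)
               for b in combinations(nodes, q))
```

The overlap is the reason a blocked or failed minority cannot stall progress: the coordinator proceeds as soon as any majority replies.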
Spark Streaming Current distributed stream processing models
- require hot replication
- have long recovery times
- do not handle stragglers
- Continuous operator model: Storm, TimeStream, MapReduce Online and streaming DBs
- long-running, stateful operators receive each record, update internal state and send new records
- since it is stateful, it is hard to provide fault tolerance efficiently
- unlike a Map or Reduce job, there is no start and no stop: operators are always running
- how to do recovery?
- replication and also synchronization protocol
- to ensure replicas receive messages in the same order
- upstream backup: upstream nodes buffer sent messages and replay them to a new copy of a failed node
- this takes longer to recover: the whole system must wait while a new node serially rebuilds the failed node’s state by rerunning the buffered messages
2PC Distributed Atomic Commit
Why hard?
Remus and HA NFS Failures
- What kind of failures in hardware/software can it deal with?
- Fail-stop failure and only one of primary or backup fails
- it can even handle both crashing temporarily, since the data remains crash-consistent
- What kind can’t it deal with?
- Cannot recover from non-fail-stop conditions
- Both primary and backup fail permanently
- Software bugs, which can propagate application errors to the backup
- What happens when an unexpected failure arises?
- if primary fails
- the backup waits for the next checkpoint transmission until a timeout fires
- the surviving node then promotes itself to primary and resumes execution from the most recent checkpoint
- if backup fails
- the primary waits until the backup times out responding to commit requests
- the primary must then provision a new backup
Tachyon Problem
- I/O is the bottleneck of a system
- Reads are easy to deal with, but writes cannot scale out due to replication to multiple nodes
remote write protocols, state machine replication ReplicatedService
2 nodes at first
SQL-on-Hadoop SQL-on-Hadoop
- Both Hive-MR and Hive-Tez are CPU-bound during scan operations, mostly due to Java deserialization
- A faster Java deserialization lib or JSON lib would help a lot
Hive
- Tez eliminates the startup and materialization overheads of MapReduce
- However, it does not avoid Java deserialization overheads during scan operations
- Tez pipelines data through execution stages instead of creating temporary intermediate files
Grapevine, Leases Grapevine: An Exercise in Distributed Computing
- Compared with hardware, software failures can be more arduous: all servers running the same version can be crashed by the same bug, resulting in a total outage
nohup
is a must, to avoid having all the servers crash in a short period before anyone notices
Impala Impala
- Business intelligence analytics
- An MPP database, strong at multi-user workloads
- query plan generation
- single node plan
- do partial aggregation “combines” in plan tree
- resource allocation
- multiple queries
- admission control
- it sacrifices resource efficiency for performance
- shared-nothing
- I/O operations are always full
- Parquet
- loads working set in memory
- streaming data
- llama: low latency application master
- virtual function calls: inline the function body
- loop unrolling
- LLVM: code generation atop LLVM
- quasi quotes
- get AST
- get optimized code
Spark SQL Why Spark SQL
Previous: Shark
- Could not be applied to intermediate results of a Spark program
- A single SQL string is hard to debug and hard to modularize
- The Hive optimizer is MapReduce-specific; we need new data types and new data sources
Hive Hive
SQL on Hadoop or Spark: Hive (MR/Tez), SparkSQL (Spark), Impala
NFS NFS
Why we need NFS?
Sparrow Omega Sparrow
- Yarn: centralized scheduling: limited throughput
- scheduling without coordination
- Sparrow: distributed scheduling
- Omega: uncoordinated, multiple schedulers
Time, Clocks, and the Ordering of Events Time, Clocks, and the Ordering of Events in a Distributed System
- Process: stream of events
- Communicate: send message, receive
- Model: message send takes time, reliable
- Ordering:
- a, c: a is before c
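The happened-before ordering can be implemented with Lamport's logical clock rule: increment the clock on every local event, attach the clock to each sent message, and on receive take the max of the local clock and the message timestamp, plus one. A minimal sketch (class and method names are my own):

```python
class LamportClock:
    """Logical clock: timestamps respect the happened-before relation."""

    def __init__(self):
        self.t = 0

    def tick(self):
        """Local event: advance the clock."""
        self.t += 1
        return self.t

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def recv(self, msg_t):
        """Receive event: merge with the message timestamp."""
        self.t = max(self.t, msg_t) + 1
        return self.t
```

This guarantees that if event a happened before event b, then timestamp(a) < timestamp(b); the converse does not hold, which is exactly why the paper introduces the partial order first.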
SAMC, memory, flash, networks failure SAMC
- Using knowledge of a system to craft its semantics can help alleviate the state-space explosion in test cases
- Future testing work will center on pattern extraction, which is still manual so far — and we will not need low-level test engineers as much
- Future test engineers will need to learn how to model code as a specific protocol, then generate tests from these protocols
- This testing methodology seems more precise than fuzz testing and gives more insight into why the code fails
RDD RDD
- read-only, partitioned collection of records
- a restricted form of shared memory
- coarse-grained transformations (e.g., map, filter and join) rather than fine-grained updates to shared state
- reuse intermediate results during iterative in memory not in external stable storage
- immutable data: backup copies of slow tasks can be run to mitigate stragglers
- can schedule tasks based on data locality
- RDDs degrade gracefully when memory is insufficient, while DSM may block
- RDD is best suited for batch applications that apply the same operation to all elements of a dataset
- general-purpose programming language
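The lineage idea behind coarse-grained transformations can be sketched as a toy in-process model (not Spark's API; `ToyRDD` and `recompute` are made-up names): each RDD records the transformation that produced it, so a lost partition is rebuilt from its parent rather than restored from a replica.

```python
class ToyRDD:
    """Immutable-by-convention partitioned collection with lineage."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions       # list of lists
        self.parent, self.fn = parent, fn  # lineage: how to rebuild from parent

    def map(self, f):
        """Coarse-grained transformation: same f applied to every element."""
        out = [[f(x) for x in p] for p in self.partitions]
        return ToyRDD(out, parent=self, fn=lambda p: [f(x) for x in p])

    def recompute(self, i):
        """Rebuild partition i from the parent's data via the lineage fn."""
        return self.fn(self.parent.partitions[i])

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None                   # simulate a lost partition
doubled.partitions[1] = doubled.recompute(1)   # rebuild from lineage, no replica
```

Because the transformation is the same for every element, the lineage record is tiny — one function per RDD, not one log entry per update, which is what makes this cheaper than replicating fine-grained shared-memory writes.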
Tez Why Tez?
- allow users to model computation as a DAG
- runtime query optimization, such as pruning data partitions, based on online information
- YARN-compatible security, data-locality awareness, resource-reuse, fault-tolerance and speculation
- pluggable API
Disk failures, data corruptions An Analysis of Data Corruption in the Storage Stack
- Table 1: Corruption classes summary
- This summary overturns my illusion that “once I have a checksum and verified it, I can trust the data”. In fact, a block of data may pass checksum verification but still be inconsistent, since the firmware can return a false negative or a misdirected write can introduce a logic error
YARN Why YARN?
- Higher scalability: As the central arbiter of the compute cluster, the JobTracker was also responsible for (each of these limited its scalability):
- admission control,
- tracking the liveness of TaskTrackers (to re-execute running tasks or tasks whose output becomes unavailable),
- launching tasks speculatively to route around slow nodes,
- reporting job status to users through a web server,
- recording audit logs and aggregate statistics,
- authenticating users and many other functions;
- Decoupling from MapReduce programming model
Why it stops Why computers stop
- The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement.
- The top priority for improving system availability is to reduce administrative mistakes by making self-configured systems with minimal maintenance and minimal operator interaction.
- Bohr vs Heisenbugs
MapReduce MapReduce
- M/R programming model
- Execution issues, runtime problems
- Scheduling and fairness across jobs
- greedy
- fair scheduler (slot-based)
- flow-based fair scheduler
DCTCP, incast Basics of Distributed Computing
- communication
- failure
- time
HDFS Timeline
- How to build large-scale computer systems? 1993
- 1995: internet services (Brewer)
- 1996: how to do DB sorting on SMP (Jim Gray)
- 1997: clusters: Data; Hard to run at scale (performance)
- 1999: look for a job (google nowsort)
- 2003: GFS
- 2004: MR (start to care about node failure) (faults at scale)
- SMPs: shared-memory machines ~ multicore (but it scales)
- MPP: massively parallel processors
- How about commodity hardware?
- make a cluster
- key technology: networks
DMV Numbers
- Fire hydrant: 15ft
- park: not within 10ft
- Red flashing school bus: stopped at 20ft
- railroad parking: 25ft
- crosswalk parking: 15ft
- signal before turning: 100ft
- school zone when children are present: 15mph
- speed limit in business/residential areas: 25mph
U-Net, DCTCP Why this paper
- How to improve the performance and flexibility of the networking layer?
- move parts of the protocol processing into user space
- remove kernel completely from the critical path and allow the communication layers used by each process to be tailored to its demands
- the entire protocol stack should be placed at user level
- OS and hardware should allow protected user-level access directly to the network
Google Datacenter Network Network Construction Principles
- Clos topologies
- could scale to arbitrary size
- in-built path diversity and redundancy
- problems: managing fiber fanout, complex routing across multiple equal-cost paths
- Merchant silicon
- trade-offs
- failures are more frequent
- small shared buffer
- ECMP(equal cost multi path), network balancing (lots of contention)
- Centralized control protocols
- each switch calculates its forwarding table from a dynamically elected central switch
- central control and configuration (SDN)
The Datacenter as a computer, Above the Clouds The Datacenter as a Computer
- “DC” vs WSC
- key characteristics, goals
- details
- constraints’ impact on -> application design and datacenter design
- cloud computing
- EC2 EMR
- spot pricing: auction, spot instances run when bid price > spot price
- illusion of infinite capacity (CPU, storage)
- scale out
- pay-as-you-go
KVM VM operations
sudo virsh list --all
sudo virsh start ubuntu1404
virt-viewer ubuntu1404
Tao of Cooking Principles
- add salt early and the meat tightens up
Linux disk partition fdisk
handles disks smaller than 2TB
DB topics - Query Query optimization
Naive approaches and their problems
DB topics - Recovery ARIES
ARIES is complex because it combines no-force and steal
Buffer manager
- force or no force? (when XACT commits, do you write its dirty pages to disk?)
- force: write dirty pages to disk (simple, slow)
- it avoids the need to REDO on restart
- steal or no steal? (can you evict a dirty page from a running XACT?)
- no steal: you cannot evict a dirty page of a running XACT (simple, slow)
- steal: write uncommitted page into disk
- it needs UNDO to correct the data in the disk
Concurrency Models Message Passing
Message passing sends a message to a process (which may be an actor or object) and relies on the process and the supporting infrastructure to select and invoke the actual code to run
Topics in Database Management Systems Concurrency Control
- much deeper detail
- efficient algorithms (performance rather than correctness)
- degree of consistency (between chaos and perfect)
- What concurrent access to data?
- performance (e.g. transaction:
- T1: transfer money from A to B
- T2: giving 6% interest rate to all accounts)
- Idea: structure access to DB as a bunch of Transactions
- each transaction takes the DB from a consistent state to a consistent state
- then serial execution of a set of transactions guarantees consistency
- want to run transactions concurrently but it should “look like” they ran serially
- certain “conflicts” enforce an order among the transactions
- WR, RW, WW
- Techniques: build serialization graph
- Schedule is conflict serializable if and only if the serialization graph has no cycles
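The serialization-graph test above can be sketched directly: add an edge Ti → Tj for every WR/RW/WW conflict where Ti's operation precedes Tj's on the same item, then check the graph for cycles. This is a toy checker under an assumed input encoding (a schedule as a list of `(txn, op, item)` triples; the function names are made up):

```python
def serialization_graph(schedule):
    """Edge (Ti, Tj) when an op of Ti conflicts with a later op of Tj:
    same item, different transactions, at least one write."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and 'W' in (op1, op2):
                edges.add((t1, t2))
    return edges

def conflict_serializable(schedule):
    """Conflict serializable iff the serialization graph is acyclic."""
    graph = {}
    for a, b in serialization_graph(schedule):
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def acyclic_from(v):          # DFS cycle detection
        if v in visiting:
            return False          # back edge -> cycle
        if v in done:
            return True
        visiting.add(v)
        ok = all(acyclic_from(w) for w in graph.get(v, ()))
        visiting.remove(v)
        done.add(v)
        return ok

    return all(acyclic_from(v) for v in list(graph))
```

An acyclic graph also tells you an equivalent serial order: any topological sort of it.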
Principles of Programming Languages Lambda Calculus
- Full-beta reduction (random)
- Normal order (leftmost/outermost)
- call-by-value, most programming language, evaluate arguments before passing them, evaluate expression inside parenthesis first
- call-by-name, no reductions inside abstractions (lazy strategy, call-by-need)
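The difference between the strategies can be demonstrated in Python, whose calls evaluate arguments eagerly (call-by-value in this note's sense); call-by-name is simulated with thunks — zero-argument lambdas forced only when needed. Function names here are illustrative, not standard terminology from any library:

```python
def diverges():
    """Stands in for a non-terminating (or erroring) argument expression."""
    raise RuntimeError("argument was evaluated")

def const_cbv(x):
    """Call-by-value: the caller has already evaluated x before this runs."""
    return 42

def const_cbn(x_thunk):
    """Call-by-name: x arrives as an unevaluated thunk; we never force it."""
    return 42
```

`const_cbv(diverges())` raises before `const_cbv` even starts, while `const_cbn(lambda: diverges())` returns 42 because the unused thunk is never forced — the behavior the lazy (call-by-need) strategy generalizes by also caching the thunk's first result.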
Political Order in Changing Societies Although I have not read the book itself, after reading almost all of its reviews I find its inspiration for me is still enormous.
Supervised Learning K-NN Basic Approach
- store all training examples
- For new (i.e., test) examples, find the k nearest stored examples
- combine these k into the answer for the test example
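The three steps above fit in a few lines (a minimal sketch; `knn_predict` is a made-up name, distances are squared Euclidean, and ties among labels fall to `Counter`'s ordering):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs — the stored examples.
    Returns the majority label among the k nearest to `query`."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Storing all examples makes training trivial and prediction O(n) per query, which is the classic k-NN trade-off.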
Logic Relations
- Formal System = Formal Language + Inference Rules
- Propositional Calculus vs FOPC
- Propositional Calculus is Formal System
- FOPC is Formal System
- FOPC has quantifier while Propositional Calculus does not
- Once we have rules for inferring nonquantified sentences from quantified sentences, it becomes possible to reduce first-order inference to propositional inference.
- In Ontological Commitment, While propositional logic commits only to the existence of facts, first-order logic commits to the existence of objects and relations and thereby gains expressive power.
- Logical Consequence (entailment)
- Syntactic Consequence: “therefore”, ⊢
- KB ⊢i A: A is derived from KB by inference procedure i
- The study of the syntactic consequence (of a logic) is called (its) proof theory
- sequent
- Semantic Consequence: “entails”, “models”, ⊨
- The formal definition of entailment is this: α ⊨ β if and only if, in every model in which α is true, β is also true
- Material Consequence (Conditional): →; p → q; if p then q
- logic cs540 Chuck Dyer
- FOPC cs540 Chuck Dyer
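The definition of entailment lends itself to a brute-force model checker: enumerate every model over the symbols and verify β holds wherever α does. A toy sketch (my own encoding: sentences are Python predicates over a model dict, `entails` is a made-up name):

```python
from itertools import product

def entails(alpha, beta, symbols):
    """alpha |= beta: in every model where alpha is true, beta is true.
    alpha, beta: functions from a model (dict symbol -> bool) to bool."""
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if alpha(model) and not beta(model):
            return False   # counter-model found
    return True
```

This enumeration over 2^n models is exactly why, once quantified sentences are reduced to propositional ones, first-order inference becomes mechanical (if expensive).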
RPC Before RPC
- sockets were the fundamental mechanism, with a poor read/write-only interface
Kerberos Attacks
- tapping
- spurious messages injection
- replay
Google File System Consistency model
- defined: a mutation succeeds without interference from concurrent writers
- it may include mingled fragments from multiple mutations
- consistent: all clients will see the same data
- A global mutation order is defined first by the lease grant order chosen by the master
- then within a lease by the serial numbers assigned by the primary
Quant mining A day of brainstorming about quant “mining” jobs in finance; my thoughts follow
AFS Goals
- scale: how many clients can server serve?
- reasonable definition of consistency (top down)
- use local disk
LOCUS Reviews
I/O-lite, Zero-copy Design
- allow all subsystems and applications to share the same buffers: create an Abstract Data Type layer to represent I/O data
- each process uses the same virtual address for simplification
- so the number of stacks is limited
- use immutable I/O buffers
- avoid synchronization
- how to share memory? Capability-based addressing
- IPC mechanism: page remapping & shared memory
- provide ACL
Log-Structured File System Introduction
Principle: collect large amounts of new data in a file cache in main memory, then write the data to disk in a single large I/O that can use all of the disk bandwidth
Unix File System Introduction
Kernel Instrumentation, KernInst KernInst
structure
Exokernel Introduction
- A small kernel securely exports all hardware resources through a low-level interface to untrusted library operating systems
- The single overriding goal: to separate protection from management -> to provide an interface that is as low-level as possible, the approaches:
- give each application its own virtual machine
- exokernel way: export hardware resources rather than emulating
- three techniques to export resources securely:
- secure bindings
- visible resource revocation
- abort protocol
Synthesis kernel Design Goal
- high performance
- combine kernel code synthesis, which decreases kernel call overhead through specialization
- reduce synchronization
- self-tuning capability for dynamic load and configuration changes
- a simple, uniform and intuitive model of computation with a high-level interface
Fix UEFI Ubuntu's Grub in Boot Manager I bought an HP laptop that shipped with Windows 8.1, but it would skip over Ubuntu’s Grub after I installed it.
Virtualization Memory Resource Management in VMware ESX Server
- double paging problem
- use randomized page replacement policy, since paging will be uncommon
- ballooning: let the guest OS invoke its own native memory management algorithms; if that fails, fall back to the paging mechanism
- shares-per-page ratio, relative resource acquisition right
Zestybench, measure IPC This post is a digest of the first paper in my CS736 which will talk about the IPC benchmark briefly.
Review of First-Class User-Level Threads Two-minute warning
- User-level thread packages can set the actual duration of the warning by writing a value in the virtual processor data structure to avoid acquiring a spin lock near the end of the virtual processor quantum
- if an about-to-be-preempted thread is working on an important computation that needs to be continued on another physical processor, two-minute warning handler can save the state of the thread and then send an explicit interrupt to another virtual processor in the same address space, prompting it to migrate and run the preempted thread
Review of MACH vm_allocate
- User Process calls the Kernel with vm_allocate()
- Kernel calls the Pager about page initialization
- User Process hits a page fault
- Kernel requests the page from the Pager
- Pager provides the page
- Kernel resumes the User Process
Reviews of DEMOS DEMOS
- links provide both message paths and optional data sharing between tasks
- Switchboard task is provided to allow mutually consenting arbitrary tasks to communicate
- security is a big issue to Los Alamos
Reviews of Monitors Monitors
NLP 2014-09-11
- If L separates every pair of words in {w1, w2, …, wn}, then any M with L = L(M) must have at least n states
- regular language
- formal language
Review of Communicating Sequential Processes Introduction
Essential Proposals
AI Course Main Topics
- Learning from Data
- View problem solving as heuristic search through a large space of possible solutions
- Use probabilistic reasoning and learning
- Use logical reasoning – representing knowledge about the world in computers
- Neural networks
- Philosophical Foundation of AI
Review of A Hardware Architecture for Implementing Protection Rings Introduction
Mechanisms are applicable to any computer system which uses segmentation as a memory addressing scheme.
Using lvm to manage multi-disks Install
- insert disks in your free disk slots
sudo apt-get install lvm2
Configure nginx with rsyslog to assemble your log files Rsyslog Configurations
We need a local rsyslog to detect local nginx log change and a remote rsyslog to receive logs.
Use logrotate: rotate as minute and customize new log filename logrotate.d
Create a configuration file in
/etc/logrotate.d/
like:
Generating a Poisson process To simulate customer arrival times or server request arrival times, the usual approach is a Poisson process
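The standard way to generate such a process: the inter-arrival gaps of a Poisson process with rate λ are i.i.d. Exponential(λ), so summing exponential draws yields the arrival times. A sketch (`poisson_arrivals` is a made-up name):

```python
import random

def poisson_arrivals(rate, horizon, seed=0):
    """Arrival times of a Poisson process with the given rate on [0, horizon],
    built by summing i.i.d. Exponential(rate) inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # exponential gap, mean 1/rate
        if t > horizon:
            return times
        times.append(t)
```

The expected number of arrivals is rate × horizon, which makes a quick sanity check on any simulation run.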
Configuring wireless on Linux List the available wireless networks:
sudo iwlist wlan0 scan
Configuring an IP tunnel between IDCs on Linux Suppose machine 10.1.0.10/16 (public IP 1.2.3.4) in datacenter A and machine 192.168.1.10/24 (public IP 5.6.7.8) in datacenter B need an internal-network tunnel over the internet
Giving a DMZ machine a public IP via transparent NAT on Linux Machines behind the DMZ often have only one NIC, so they usually have only an internal IP (even if a public IP were configured, external machines could not ping it, so it would be useless — unless routes were configured upstream)
Setting up the fcitx pinyin input method on Linux The ibus pinyin input method bundled with Ubuntu 13.10 is broken — it keeps producing the wrong characters. The Sogou input method would not install either; after two hours I finally got fcitx googlepin working
A minimal keepalived configuration Background
keepalived is built on top of LVS and implements the VRRP protocol
Configuring haproxy and logging via rsyslog Installation
Recommended:
apt-get install haproxy
which ships with a restart script