The Datacenter as a Computer
- “DC” (datacenter) vs. WSC (warehouse-scale computer)
- key characteristics, goals
- details
- how constraints impact -> application design and datacenter design
- cloud computing
- EC2, EMR
- spot pricing: auction-based; spot instances run while the bid price exceeds the current spot price (see the sketch after this list)
- illusion of infinite capacity (CPU, storage)
- scale out
- pay-as-you-go
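A minimal sketch of the spot-pricing rule noted above; the run condition follows the note (the instance runs while the bid is at or above the spot price), and the bid and price values are made-up illustrative numbers, not real EC2 data.

```python
# Minimal sketch of the spot-instance rule noted above: the instance keeps
# running only while the user's bid is at or above the current spot price.
# The bid and price values below are made-up, not real EC2 data.

def instance_runs(bid_price: float, spot_price: float) -> bool:
    """True if a spot instance with this bid keeps running at this spot price."""
    return bid_price >= spot_price

bid = 0.10                                   # $/hour the user is willing to pay (assumed)
for spot in (0.04, 0.07, 0.12, 0.09):        # hypothetical hourly spot prices
    state = "running" if instance_runs(bid, spot) else "interrupted"
    print(f"spot=${spot:.2f} -> {state}")
```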
WSC
- homogeneous hardware
- small set of apps
- managed by one organization
- communication across the whole setup
- shared infrastructure
- massive scale (~100,000 servers)
- container-sized: ~2000 servers ≈ 40-50 racks (back-of-envelope check after this list)
- on-premise: hook up containers to form the DC/WSC
- rack-scale: ~3-4 racks (up to 10 racks)
commodity servers -> cost-efficient: high utilization of hardware
- performance, fault tolerance
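A quick back-of-envelope check of the container figure above; the per-rack server density is an assumed value for illustration, not a number from the notes.

```python
# Back-of-envelope: 2000 servers per container at an assumed 40-48 servers per
# rack lands in the 40-50 rack range quoted above.
servers_per_container = 2000
for servers_per_rack in (40, 48):
    racks = servers_per_container / servers_per_rack
    print(f"{servers_per_rack} servers/rack -> ~{racks:.0f} racks")
```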
Some common design patterns to ensure high performance/availability in the presence of flaky hardware and the local-vs-remote trade-off (a sketch of replication + sharding follows the table):
| Pattern | Performance | Availability |
|---|---|---|
| Replication | Y | Y |
| Partitioning (sharding) | Y | Y |
| Checksum / integrity / RAID |  | Y |
| Health monitoring / watchdog timer |  | Y |
| Data locality | Y |  |
| Priority | Y |  |
| Load balancing | Y |  |
| Compression (queries on compressed data) | Y |  |
| Checkpointing |  | Y |
| In-memory caching | Y |  |
| Diagnosis (X-Trace) | Y | Y |
| Eventual consistency | Y | Y |
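A minimal sketch of the first two rows of the table, hash partitioning (sharding) plus replication; the shard count, replication factor, and server names are arbitrary illustrative choices, not anything prescribed by the article.

```python
# Sketch of partitioning (sharding) + replication from the table above:
# a key is hashed to a shard, and each shard's data is kept on several servers
# so it survives individual machine failures. All sizes are illustrative.
import hashlib

NUM_SHARDS = 4
REPLICATION_FACTOR = 3
SERVERS = [f"server-{i}" for i in range(NUM_SHARDS * REPLICATION_FACTOR)]  # hypothetical names

def shard_of(key: str) -> int:
    """Hash partitioning: map a key deterministically to one of NUM_SHARDS shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_of(key: str) -> list[str]:
    """Replication: return the servers holding copies of this key's shard."""
    start = shard_of(key) * REPLICATION_FACTOR
    return SERVERS[start:start + REPLICATION_FACTOR]

print(replicas_of("user:42"))   # e.g. ['server-3', 'server-4', 'server-5']
```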
What is the theme
This article gives a foundational introduction to a new class of computing system: warehouse-scale computers (WSCs). These systems operate at massive scale across their software infrastructure, data repositories, and hardware platform.
The approach is innovative because it focuses on scaling out rather than scaling up.
Features
- large scale
- large numbers of component faults
- high availability
- run a smaller number of very large applications
- significant deployment flexibility
- replicas between datacenters
- emphasis on cost efficiency
How to do
- One user query tends to be fully processed within one datacenter
- view multiple datacenters as separate computing resources
- queries across datacenters are expensive
- A fault-tolerant file system
- it could lower hardware costs and networking fabric utilization
- but difficult to implement
- it can span different (heterogeneous) machines
- it would use more networking bandwidth to complete write operations
- Otherwise, use NAS, which provides extra reliability but is more expensive
- exploit rack-level networking locality (see the sketch after this list)
- higher port counts come at disproportionately (non-linearly) higher cost
- intra-rack connectivity is often cheaper than inter-rack connectivity
- a WSC is often organized as a two-level hierarchy
- rack level
- cluster level
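A sketch tying the two points above together: a GFS/HDFS-style fault-tolerant file system replicates each block across machines, and a rack-aware placement policy keeps one replica near the writer (cheap intra-rack traffic) while spreading the others across racks so a rack failure loses no copy. The topology and policy details are assumptions for illustration, not the exact algorithm from the article.

```python
# Rack-aware replica placement sketch (GFS/HDFS-style): one replica in the
# writer's rack for cheap intra-rack traffic, remaining replicas on other racks
# so a whole-rack failure cannot destroy all copies. Topology is hypothetical.
import random

RACKS = {                       # hypothetical rack -> machines topology
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
    "rack-c": ["c1", "c2", "c3"],
}

def place_replicas(writer_rack: str, num_replicas: int = 3) -> list[str]:
    """Pick one machine in the writer's rack, then one machine in each of
    (num_replicas - 1) other racks."""
    placement = [random.choice(RACKS[writer_rack])]
    other_racks = [r for r in RACKS if r != writer_rack]
    for rack in random.sample(other_racks, num_replicas - 1):
        placement.append(random.choice(RACKS[rack]))
    return placement

print(place_replicas("rack-a"))  # e.g. ['a2', 'c1', 'b3']
```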
Challenge
- The scale is too large to experiment with or simulate efficiently
- New programming challenges beyond individual servers
- How to smooth out the discrepancies in latency, bandwidth, and capacity across the levels of the hierarchy (local vs. rack vs. cluster)
Comments
- More computing resources are always needed, while cost-efficiency remains the key to designing a large WSC
- Large hardware savings are usually achieved by moving functionality into a software-level application; in the long run, the software solution becomes cheaper. Even when software and hardware approaches share the same idea for achieving higher availability or capacity, software is always easier to modify and maintain
- To lower the cost of purchasing machines, people tend to procure many machines of the same type at a time, while datacenter upgrades are carried out incrementally. So it is the high-level application that must take responsibility for handling the resulting heterogeneity
- In this way, no matter how hardware evolves (e.g., biological or quantum computing), each new computing unit should support basic stream reads and writes. A unified low-level communication protocol matters more than other design assumptions, and such a protocol would outlast any hardware revolution
- As the network grows, more levels of hierarchy will be adopted inside the WSC. If the estimated oversubscription is low, we may use software-level switches to forward data on top of uniform commodity switch devices (see the oversubscription sketch after this list)
- The datacenter itself could also evolve toward different use cases; e.g., several datacenters, or their smaller unit, the cluster, could be bound together to handle a specific computing problem
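A small sketch of the oversubscription point in the comments above: the ratio of aggregate server bandwidth in a rack to the rack's uplink bandwidth, and what it implies for per-server off-rack bandwidth. The port counts and link speeds are made-up illustrative values.

```python
# Oversubscription sketch: servers in a rack share a limited uplink, so when
# everyone talks off-rack, per-server bandwidth shrinks by the oversubscription
# factor. Port counts and link speeds below are illustrative assumptions.
servers_per_rack = 40
host_link_gbps = 1        # each server's link to the top-of-rack switch
uplinks = 2               # uplinks from the rack into the cluster fabric
uplink_gbps = 10          # speed of each uplink

downlink_total = servers_per_rack * host_link_gbps     # 40 Gbps into the rack switch
uplink_total = uplinks * uplink_gbps                    # 20 Gbps out of the rack
oversubscription = downlink_total / uplink_total        # 2.0 -> "2:1 oversubscribed"

print(f"oversubscription = {oversubscription:.0f}:1")
print(f"worst-case off-rack bandwidth per server = "
      f"{uplink_total / servers_per_rack:.2f} Gbps")
```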
Above the Clouds
- Data Lock-In
- This is definitely the small cloud providers’ opportunity to compete against AWS and other large cloud providers. Large providers such as AWS have no lucrative incentive to accommodate their potential competitors; AWS’s best strategy (to take it as an example) is to keep its API distinctive and make any future “standard” look as similar to its own as possible.
- Whatever happens, the smaller providers will be more enthusiastic about promoting standardization
- Data Transfer Bottlenecks
- It is surprising that CPU performance improves much faster than network bandwidth
- Transmission efficiency will remain a central topic: HTTP/2, long-lived connections, server push, compression, and new media formats will keep being hotly discussed, and hacking on TCP/IP will continue
- Bugs in Large-Scale Distributed Systems
- Personally, I don’t find this explanation convincing. Simulating the actual distributed environment is still exhausting and costly, even if one can create VMs as reliable as real servers
- Fuzz testing may be an alternative way to find bugs
- Reputation Fate Sharing
- Blacklisting IP addresses is itself the wrong approach; blacklist domains instead
- I am optimistic that if a lawsuit were brought against a cloud provider over spamming, the court would refer to the DMCA’s principal innovation: exemption from direct and indirect liability for ISPs and other intermediaries