The Datacenter as a computer, Above the Clouds

Zhaoyu Luo bio photo By Zhaoyu Luo

The Datacenter as a Computer

  • “DC” vs WSC
  • key characteristics, goals
  • details
  • constraints impact on -> application desgin and data center design
  • cloud computing
  • EC2 EMR
    • spot pricing: auction, spot instances run when bid price > spot price
  • illusion of infinite capacity (CPU, storage)
  • scale out
  • pay-as-you-go

WSC

  • homogeneous hardware
  • small set of apps
  • one organization management
  • communication across set-up
  • share infrastructure
  • massive scale (~100000 servers)
    • container-sized: 2000 servers ~ 40-50 racks
      • on-promise, hook-up containers into DC/WSC
    • rack-scale: ~ 3-4 (upto 10 rackes)

commodity servers -> cost-efficient: high utilization of hardware

  • performance, fault tolerant

Some common design patterns to ensure high perf/ availability in the presence of flaky hardware, local-remote trade-off

                                        Performance     Availability
Replication                             Y               Y
Partitioning(sharding)                  Y               Y
checksum/ integrity/ RAID                               Y
Health monitoring/ watchdog timer                       Y
data-locality                           Y
priority                                Y
load balancing                          Y
compression (queries on)                Y
checkpoint                                              Y
cache in-memory                         Y
diagnoses(x-trace)                      Y               Y
eventual consistency                    Y               Y

What is the theme

This article gives us a fundamental introduction about a new class of computing system warehouse-scale computers. This system has massive scale of their software infrastructure, data repositories and hardware platform.

This system is innovative since it focuses on scale-out rather than scale-up.

Features
  • large scale
    • large numbers of component faults
    • high availability
  • run a smaller number of very large applications
  • significant deployment flexibility
  • replicas between datacenters
  • emphasis on cost efficiency

How to do

  • One user query tends to be fully processed within one datacenter
    • view multi datacenters as seperate computing resources
    • query across datacenters is expensive
  • A fault-tolerant file system
    • it could lower hardware costs and networking fabric utilization
      • but difficult to implement
    • it could support different machines
    • it would use more networking bandwidth to complete write operations
    • Otherwise, use NAS, which provides extra reliability but more expensive
  • exploit rack-level networking locality
    • Higher port counts would cost exponential higher cost
    • intra rack connectivity is often cheaper than inter rack connectivity
  • WSC is often organized as two-level hierarchy
    • rack level
    • cluster level

Challenge

  • Size is too big to experiment with or simulate efficiently
  • New programming challenge over individual servers
  • How to smooth out the discrepancies which lie in latency, bandwidth and capacity of different racks

Comments

  • Larger computing resource is always needed, while cost-efficiency is always the key to design large WSC
  • Large hardware saving is usually achieved by implementing a software level application. In the long run, the software solution would become cheaper. Even though the software and hardware approaches may share the same idea to achieve higher availability or capacity, the software is always more easy to do modification and maintenance
  • To lower the cost of purchasing machines, people tend to procure more machines with a same type at a time. While datacenter upgrade may be conducted partially at a time. So it is the high level application which would take the responsiblity to handle the isomerism
    • In this way, no matter how the hardware techniques evolves (such as biology computing or quantum computing), each new computing unit should support basic stream read and write. An unified low level communication protocol is more important than other design assumptions. And this kind of protocol would last far more longer than the hardware revolution.
  • As the size of network increasing, there will be more hierarchies adopt inside WSC. If the estimated over-subscription is low, we may use some software level switch to transmit data above some uniformed switch device
  • Datacenter itself could also evolve into different kinds of use cases. e.g., some datacenters or their smaller unit, cluster could bind together to handle some specific computing problem

Above the Clouds

  • Data Lock-In
    • This is definitely the small clould providers’ opportunity to combat against AWS or other large cloud providers. I could say for sure those large clould providers such as AWS would not have any lucrative incentive to think for its potential competitors. AWS’s (just raise it as an example) best strategy is to keep its API exceptional and make the future “standard” looks as similar as its’.
    • Whatever happens, those small would be more enthusiasstic in promoting standarlization
  • Data Transfer Bottlenecks
    • It is surprising that CPU develops much faster than the bandwidth
    • Transmitting efficiency would always be the central topic. HTTP2, long live connection, multi-push, compression, new media format would be hotly discussed. Hacking on the TCP/IP would continue
  • Bugs in Large-Scale Distributed Systems
    • Personally I don’t think the explanation is viable. Simulating the actual distributed enviroment is still exhausting and costly, even though one could create those VMs as reliable as real servers
    • Fuzz test may be an alternative way to find bugs
  • Reputation Fate Sharing
    • Blacklisting of IP address itself is wrong. Please blacklist the domains instead
    • I am optimistic that if there exists lawsuit against cloud provider about spaming, the court would refer the DMCA’s principle inovation: exemption from direct and indirect liability of ISP and other intermediaries