By Zhaoyu Luo
September 14, 2015
Timeline
1993: how to build large-scale computer systems?
1995: internet services (Brewer)
1996: how to do DB sorting on SMP (Jim Gray)
1997: clusters: Data; Hard to run at scale (performance)
1999: look for a job (google NowSort)
2003: GFS
2004: MR (start to care about node failure) (faults at scale)
- SMPs: shared-memory machines, like multicore but scaled up to many processors
- MPP: massively parallel processor
- How about commodity hardware?
make a cluster
key technology: networks
Background, design goals and assumptions
store very large data sets reliably
streaming at high bandwidth (high throughput), not random file access
sequential read
data reads/writes should keep pace with the computation
large cluster
built for fault detection and quick recovery
write-once-read-often
files as large as GBs to TBs; tens of millions of such files spread across hundreds to thousands of nodes
Batch processing vs interactive queries
Directly attached storage
it is cheaper to move compute to data than the other way round
an API through which applications can learn where data blocks are located (see the sketch after this list)
iterative computation, e.g., a downstream map can read and start processing reduce output
no RAID
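As a concrete illustration of that block-location API, here is a minimal Java sketch (the path /data/input.txt and the classpath setup are assumptions; it uses Hadoop's public FileSystem client API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical file
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```

A scheduler can use exactly this information to run a map task on, or near, a node that already stores the block.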
NameNode
format NameNode
creates a file system instance whose namespace ID is persistently stored on all nodes of the cluster
different clusters have different namespace IDs
write data
the client buffers data locally; once a full block (default 64 MB) accumulates
the NameNode creates a block and nominates three DataNodes
(replicas go to random nodes on other racks if the replication factor > 3)
* the client writes to them in pipeline fashion
read data: the client tries to read from the replica closest to it
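A minimal client-side sketch of both paths, assuming a reachable cluster and a hypothetical /tmp/demo.txt path; the pipelined replication on write and the closest-replica choice on read both happen inside the client library:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteThenRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/demo.txt");   // hypothetical path

        // Write: the library buffers data into packets and streams them
        // down a pipeline of DataNodes nominated by the NameNode.
        try (FSDataOutputStream out = fs.create(p, true)) {   // true = overwrite
            out.writeBytes("hello hdfs\n");
        }

        // Read: the library fetches block locations from the NameNode
        // and reads each block from the closest replica.
        try (FSDataInputStream in = fs.open(p)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.print(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```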
restart
the NameNode initializes the namespace image from the checkpoint
then replays the journal, which is stored in the local host's native file system
Data
RAM: the namespace and block locations (built from DataNodes' hourly block reports)
Disk:
Checkpoint: a persistent record of the image, stored on disk
image: the metadata of each file: inode data and the list of blocks belonging to the file
DataNode
start
handshake to verify namespace ID and software version
registers with a storage ID that identifies the DataNode even if it restarts with a different IP address or port
liveness via heartbeats
sends a block report to the NameNode every hour
heartbeats every 3 s; a DataNode silent for 10 minutes is considered out of service
the NameNode piggybacks its instructions on heartbeat replies
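A toy sketch of just the liveness rule above (illustrative code, not the NameNode's implementation): heartbeats refresh a timestamp, and ten minutes of silence marks the node out of service:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Illustrative: the 3 s heartbeat / 10 min timeout rule as plain bookkeeping.
class HeartbeatMonitor {
    private static final long TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(10);
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Invoked for each heartbeat a DataNode sends (roughly every 3 s).
    void onHeartbeat(String storageId) {
        lastSeen.put(storageId, System.nanoTime());
    }

    // A DataNode silent for 10 minutes is considered out of service,
    // and the NameNode can schedule new replicas for its blocks.
    boolean isOutOfService(String storageId) {
        Long t = lastSeen.get(storageId);
        return t == null || System.nanoTime() - t > TIMEOUT_NANOS;
    }
}
```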
Block
each block is represented by two files in the DataNode's native file system: one for the data, one for the block's metadata (checksums and generation stamp)
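A sketch of how such a metadata file can be used, assuming HDFS's default scheme of one CRC32 per 512-byte chunk (the BlockVerifier class and the long[] of expected CRCs are illustrative, not HDFS code):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Illustrative: re-checksum a block file chunk by chunk and compare
// against the CRCs that the metadata file would hold.
class BlockVerifier {
    static final int BYTES_PER_CHECKSUM = 512;   // HDFS default chunk size

    static boolean verify(String blockFile, long[] expectedCrcs) throws IOException {
        try (FileInputStream in = new FileInputStream(blockFile)) {
            byte[] chunk = new byte[BYTES_PER_CHECKSUM];
            int i = 0, n;
            while ((n = in.read(chunk)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, n);
                if (i >= expectedCrcs.length || crc.getValue() != expectedCrcs[i++]) {
                    return false;                // corrupt chunk detected
                }
            }
            return i == expectedCrcs.length;     // every chunk matched
        }
    }
}
```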
Image and Journal
The namespace image is the file system metadata that describes the organization of application data as directories and files
A persistent record of the image written to disk is called a checkpoint
the checkpoint and journal can be stored in multiple storage directories, e.g., on different volumes
if either is lost or corrupt, the namespace information is lost partly or entirely
if writing the journal to one of the storage directories fails, the NameNode excludes that directory from the list of storage directories
NameNode shuts down if no storage directory is available
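A sketch of that exclusion policy (JournalWriter and the edits.log filename are illustrative): every record is flushed to every remaining directory, a failing directory is dropped, and the writer gives up only when none remain:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative: journal redundancy across storage directories.
class JournalWriter {
    private final List<Path> dirs = new CopyOnWriteArrayList<>();

    JournalWriter(List<Path> storageDirs) { dirs.addAll(storageDirs); }

    void append(byte[] record) throws IOException {
        for (Path dir : dirs) {
            try {
                Files.write(dir.resolve("edits.log"), record,   // hypothetical filename
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                        StandardOpenOption.SYNC);               // flush-and-sync the record
            } catch (IOException e) {
                dirs.remove(dir);   // exclude the failing directory from the list
            }
        }
        if (dirs.isEmpty()) {
            throw new IOException("no storage directory available: shut down");
        }
    }
}
```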
Batch commits for a multithreaded system
multiple transactions are committed together to reduce the number of flush-and-sync operations
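A sketch of that group-commit idea (illustrative, not the NameNode's code): if another thread's sync already covered a transaction, the committing thread returns without paying for its own flush-and-sync:

```java
import java.io.IOException;

// Illustrative group commit: one flush-and-sync covers every transaction
// appended to the buffer so far, however many threads produced them.
class GroupCommitLog {
    private long lastWrittenTxId = 0;   // highest txid appended to the buffer
    private long lastSyncedTxId = 0;    // highest txid durably on disk

    // Append a record to the in-memory journal buffer (buffer omitted).
    synchronized long append(byte[] record) {
        return ++lastWrittenTxId;
    }

    // Block until txId is durable, piggybacking on earlier syncs when possible.
    synchronized void commit(long txId) throws IOException {
        if (txId <= lastSyncedTxId) {
            return;                      // a previous batch already synced us: no I/O
        }
        long batchEnd = lastWrittenTxId; // sync everything buffered so far
        flushAndSync();                  // one expensive disk sync for the whole batch
        lastSyncedTxId = batchEnd;
    }

    private void flushAndSync() throws IOException { /* fsync the buffer */ }
}
```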
Checkpoint Node
downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode
good practice: create a daily checkpoint
Backup Node
maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode
can be viewed as a read-only NameNode. It contains all file system metadata information except for block locations (from DNs’ block reports)
Upgrade and Snapshot
Snapshot creation is an all-cluster effort rather than a node-selective event
NameNode would create a new checkpoint
DataNodes use copy-on-write: each creates a copy of its storage directories using hard links
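A sketch of that hard-link trick with the standard Java file API (the directory names current/previous are hypothetical, and a flat directory of block files is assumed): linking copies no data, and since block files are replaced rather than modified in place, the snapshot keeps seeing the old bytes:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative: snapshot a storage directory by hard-linking its block files.
class HardLinkSnapshot {
    static void snapshot(Path current, Path previous) throws IOException {
        Files.createDirectories(previous);
        try (DirectoryStream<Path> blocks = Files.newDirectoryStream(current)) {
            for (Path block : blocks) {
                // Both names point at the same inode, so no data is copied.
                Files.createLink(previous.resolve(block.getFileName()), block);
            }
        }
    }
}
```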
On upgrade failure
the namespace rolls back to the snapshot
the NameNode no longer recognizes files created after the snapshot, and their new block files on DataNodes are deleted
Block Placement
No DataNode contains > 1 replica of any block
No rack contains > 2 replicas of the same block, provided there are sufficient racks
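A toy sketch of these two rules for a replication factor of 3 (the Node record and pick helper are assumptions of the sketch, not HDFS classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative: the default placement heuristic for 3 replicas.
class DefaultPlacement {
    record Node(String name, String rack) {}

    static List<Node> choose(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                                       // 1st: writer's node
        Node second = pick(cluster, n -> !n.rack().equals(writer.rack()));
        replicas.add(second);                                       // 2nd: a different rack
        replicas.add(pick(cluster, n ->                             // 3rd: same rack as 2nd,
                n.rack().equals(second.rack()) && !n.equals(second))); // different node
        return replicas;  // no node holds two replicas; no rack holds more than two
    }

    private static Node pick(List<Node> nodes, Predicate<Node> ok) {
        for (Node n : nodes) if (ok.test(n)) return n;
        throw new IllegalStateException("not enough nodes/racks");
    }
}
```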
Problems
No authentication (you are who your host says you are); file permissions alone are not enough -> use Kerberos
Directory sub-trees should have quotas
Correlated failure:
failure of a rack or core switch
loss of electrical power to the cluster: a large cluster will lose a handful of blocks during a power-on restart
NameNode memory is a limitation
Java GC makes the NameNode unresponsive when memory usage is high
Read inconsistency
The writer’s lease does not prevent other clients from reading the file; a file may have many concurrent readers
HDFS permits a client to read a file that is open for writing
HDFS allows the user to retrieve data from the corrupt replicas if all replicas are corrupt
SPOF: the NameNode is a single point of failure
Overall, HDFS does not guarantee reliability
Secondary NameNode: it is not HA; it only periodically merges the edits and fsimage to shorten cluster startup time. When the NameNode (NN) fails, the Secondary NN cannot take over immediately; it cannot even guarantee data integrity: if the NN's data is lost, all file system changes since the last merge are lost.
Backup NameNode (HADOOP-4539): it keeps an in-memory copy of the NN's current state, which makes it a warm standby, but only that; there is no failover. It likewise only checkpoints periodically and cannot guarantee data integrity.
Avatar NameNode: Facebook's HA approach; it puts the edit log on NFS, since NFS offers higher reliability
ZooKeeper
Hadoop HA
A summary of single-point-of-failure solutions in Hadoop 2.0
Agree
Snapshots largely decrease the risk of upgrade failure. Since upgrades are infrequent, it is acceptable to perform a cluster-wide snapshot
The performance of this batch system is preserved by committing multiple transactions together
The single-writer model simplifies file operations, and in the intended use cases multiple writers rarely occur
Questionable
Pipelined writes are not fast enough when the pipeline is much longer than 3 DataNodes; in that case the client waits a long time for the packet ACK. I would personally configure dfs.replication.min
as 2, so a write returns as soon as the data is written to 2 DataNodes (a sketch follows below)
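A sketch of that setting using Hadoop's Configuration class, which is how Hadoop reads such keys from hdfs-site.xml; note this key is consumed by the NameNode, so in practice it belongs in the server-side config (key name per the note above; newer releases renamed it):

```java
import org.apache.hadoop.conf.Configuration;

public class MinReplicationDemo {
    public static void main(String[] args) {
        // In practice this key lives in the NameNode's hdfs-site.xml;
        // Configuration is simply Hadoop's programmatic view of those XML files.
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication.min", 2);
        // A block write is then acknowledged once 2 replicas are on disk;
        // remaining replicas are brought up to dfs.replication in the background.
        System.out.println("dfs.replication.min = " + conf.getInt("dfs.replication.min", 1));
    }
}
```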
The whole cluster is limited by the NameNode's memory size, so storing a large quantity of small files is very inefficient
The Secondary NameNode always falls behind the master, so data can be lost if the master fails permanently
Reference