HDFS

By Zhaoyu Luo

Timeline

  1. How to build large-scale computer systems? 1993
    • 1995: internet services (Brewer)
    • 1996: how to do DB sorting on SMP (Jim Gray)
    • 1997: clusters: Data; Hard to run at scale (performance)
    • 1999: look for a job (google nowsort)
    • 2003: GFS
    • 2004: MR (start to care about node failure; faults at scale)
      • SMPs: shared-memory machines ~ multicore (but it scales)
      • MPP: massively parallel processors
      • How about commodity hardware?
    • make a cluster
    • key technology: networks

Background, design goals and assumptions

  • store very large data sets reliably
  • streaming at high bandwidth (high throughput), not random file access
    • sequential read
    • data reads/writes should proceed as fast as the computation consumes or produces it
  • large cluster
    • fault detection and quick, automatic recovery
  • write-once-read-many
  • files are GBs to TBs in size; 10s of millions of such files spread across 100s to 1000s of nodes
  • Batch processing vs interactive queries
  • Directly attached storage
  • it is cheaper to move compute to data than the other way round
    • an API through which applications can learn where data blocks are located (see the sketch after this list)
    • iterative computation, e.g., a downstream map task reads and starts processing a previous job's reduce output
  • no RAID
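
A minimal sketch of the locality API mentioned above, using the standard Hadoop FileSystem client; the NameNode address and file path are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask the NameNode where each block of a file lives, so a scheduler can
// place tasks on (or near) the DataNodes that already hold the data.
public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/big-input.dat"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
        }
    }
}
```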

HDFS Architecture

NameNode

  • format NameNode
    • creates a file system instance; the namespace ID assigned at format time is persistently stored on all nodes of the cluster
    • different clusters have different IDs
  • write data (see the write/read sketch at the end of this section)
    1. when the data being written fills a block (64 MB by default)
      • the NameNode allocates a new block and nominates three DataNodes to host its replicas
      • if the replication factor is greater than 3, the extra replicas are placed on random other racks
      • the client writes to the chosen DataNodes in a pipeline fashion
  • read data: try to read from the closest node
  • restart
    1. NameNode initializes the namespace image from the checkpoint
      • then replays the journal, which is stored in the local host's native file system
Data
  • RAM: namespace, block locations (from DataNodes' hourly block reports)
  • Disk:
    • Checkpoint: a persistent record of the image, stored on disk
    • image: the metadata of each file: inode data and the list of blocks belonging to the file
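
A minimal client-side sketch of the write and read paths described above, using the standard Hadoop FileSystem API; the NameNode address and file path are assumptions, and the pipeline and locality details happen inside the client library.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal client-side write/read against HDFS. The NameNode only hands out
// block locations; the actual bytes flow between the client and DataNodes.
public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client streams packets to a pipeline of DataNodes
            // chosen by the NameNode (3 replicas by default).
            try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the NameNode for block locations and
            // reads each block from the closest replica.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```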

DataNode

  • start
    1. handshake to verify namespace ID and software version
      • registers a storage ID that identifies the DataNode
  • live heartbeat
    • sends a block report to the NameNode every hour
    • sends a heartbeat every 3 seconds; a DataNode silent for 10 minutes is considered out of service
    • NameNode instructions are sent as replies to heartbeats (the intervals map to the config keys sketched below)
Block
  • each block is represented by two files: one for data, one for block’s metadata (checksum)
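
The heartbeat and block-report cadence above corresponds to standard HDFS configuration keys. The sketch below only names them; in practice they are set in hdfs-site.xml, and exact key names can vary slightly across Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;

// Name the knobs behind the DataNode cadence described above. These are
// normally set in hdfs-site.xml; a Configuration object is used here only
// to show the key names and example values.
public class DataNodeIntervals {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.heartbeat.interval", 3);             // seconds between heartbeats
        conf.setLong("dfs.blockreport.intervalMsec", 3600000); // block report every hour, as in the paper
        System.out.println(conf.get("dfs.heartbeat.interval"));
    }
}
```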

Image and Journal

  • The namespace image is the file system metadata that describes the organization of application data as directories and files
  • A persistent record of the image written to disk is called a checkpoint
    • it can be stored in multiple storage directories, e.g., on different volumes
    • if the checkpoint or journal is lost or corrupted, the namespace information is lost partly or entirely
    • if writing the journal to one of the storage directories fails, the NameNode excludes that directory from the list of storage directories
    • the NameNode shuts down if no storage directory is available
  • Batch commits for a multithreaded system
    • multiple transactions initiated by different clients are committed together to reduce the number of flush-and-sync operations (see the group-commit sketch below)
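
A generic group-commit sketch of this batching idea, not Hadoop's actual FSEditLog code: appenders buffer journal records, and the first thread to reach commit() syncs on behalf of every record appended before it, so later committers usually find their transaction already durable and return immediately.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;

// Group-commit sketch: many threads append journal records, but one fsync
// covers every record appended since the previous sync, amortizing the
// flush-and-sync cost across transactions.
public class GroupCommitJournal {
    private final FileOutputStream out;
    private final FileDescriptor fd;
    private long lastWrittenTxId = 0; // highest transaction id buffered
    private long lastSyncedTxId = 0;  // highest transaction id made durable

    public GroupCommitJournal(String path) throws IOException {
        this.out = new FileOutputStream(path, true /* append */);
        this.fd = out.getFD();
    }

    // Append a record and return its transaction id.
    public synchronized long append(byte[] record) throws IOException {
        out.write(record);
        return ++lastWrittenTxId;
    }

    // Block until txId is durable; the syncing thread's single fsync also
    // covers everyone who appended before it.
    public synchronized void commit(long txId) throws IOException {
        if (txId <= lastSyncedTxId) {
            return; // an earlier sync already covered this transaction
        }
        long target = lastWrittenTxId;
        out.flush();
        fd.sync();              // one sync for the whole batch
        lastSyncedTxId = target;
    }
}
```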

Checkpoint Node

  • downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint to the NameNode
  • good practice: create a daily checkpoint

Backup Node

  • maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode
  • can be viewed as a read-only NameNode. It contains all file system metadata information except for block locations (from DataNodes' block reports)

Upgrade and Snapshot

  • Snapshot creation is an all-cluster effort rather than a node-selective event
    • the NameNode creates a new checkpoint
    • each DataNode does copy-on-write: it creates a copy of its storage directories using hard links (see the sketch below)
  • On upgrade failure
    1. the namespace rolls back to the snapshot
      • the NameNode no longer recognizes files created after the snapshot, so their blocks on DataNodes are deleted
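
An illustration of the copy-on-write trick with hard links, using plain java.nio rather than the real DataNode code; the directory names are hypothetical and real DataNodes use a deeper storage layout.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Instead of copying block files, the snapshot directory gets hard links to
// them, so creating the snapshot is cheap and the old bytes survive even if
// the upgraded software rewrites or deletes the originals.
public class HardLinkSnapshot {
    public static void snapshot(Path current, Path previous) throws IOException {
        Files.createDirectories(previous);
        try (DirectoryStream<Path> blocks = Files.newDirectoryStream(current)) {
            for (Path block : blocks) {
                if (Files.isRegularFile(block)) {
                    // Hard link: same inode, no data copied.
                    Files.createLink(previous.resolve(block.getFileName()), block);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical storage directories.
        snapshot(Paths.get("/data/dn/current"), Paths.get("/data/dn/previous"));
    }
}
```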

Block Placement

  • No DataNode contains > 1 replica of any block
  • No rack contains > 2 replicas of the same block, provided there are sufficient racks
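
A small, purely illustrative check of these two invariants for a single block, given the (node, rack) pairs holding its replicas; HDFS enforces them inside its block placement policy.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Verify the two placement rules for one block's replica set.
public class PlacementCheck {
    record Replica(String node, String rack) {}

    static boolean isValid(List<Replica> replicas, int totalRacks) {
        Map<String, Long> perNode = replicas.stream()
            .collect(Collectors.groupingBy(Replica::node, Collectors.counting()));
        Map<String, Long> perRack = replicas.stream()
            .collect(Collectors.groupingBy(Replica::rack, Collectors.counting()));

        boolean oneReplicaPerNode = perNode.values().stream().allMatch(c -> c == 1);
        // The rack limit only applies when there are enough racks to spread over.
        boolean rackLimitOk = totalRacks < 2
            || perRack.values().stream().allMatch(c -> c <= 2);
        return oneReplicaPerNode && rackLimitOk;
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
            new Replica("dn1", "rackA"),
            new Replica("dn2", "rackA"),
            new Replica("dn3", "rackB"));
        System.out.println(isValid(replicas, 2)); // true: default 2-rack layout
    }
}
```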

Problems

  • No authentication (you are who your host says you are) and no enforceable file permissions -> use Kerberos
  • Directory sub-trees should have quotas
  • Correlated failure:
    • failure of a rack or core switch
    • loss of electrical power to the cluster: a large cluster will lose a handful of blocks during a power-on restart
  • NameNode memory is a limitation
    • Java GC makes NameNode unresponsive when memory usage is high

Read inconsistency

  • The writer’s lease does not prevent other clients from reading the file; a file may have many concurrent readers
  • HDFS permits a client to read a file that is open for writing
  • HDFS allows the user to retrieve its data from the corrupt replicas if all replicas are corrupt

SPOF

  • Overall, HDFS does not guarantee reliability
  • Secondary NameNode: it is not HA; it only periodically merges the edits and fsimage to shorten cluster startup time. When the NameNode (NN) fails, the Secondary NN cannot immediately take over, and it cannot even guarantee data integrity: if the NN's data is lost, any file system changes made since the last merge are lost
  • Backup NameNode (HADOOP-4539): it keeps an in-memory copy of the NN's current state, making it a warm standby, but only that; there is no failover. It also checkpoints only periodically and cannot guarantee data integrity
  • Avatar NameNode: Facebook's HA approach; it puts the edit log on NFS, since NFS has higher reliability
  • ZooKeeper
  • Hadoop HA
  • A summary of single-point-of-failure solutions in Hadoop 2.0

Comments

Agree

  • Snapshots largely decrease the risk of upgrade failure. Since upgrades are infrequent, it is acceptable to perform a cluster-wide snapshot
  • The performance of this batched system is maintained by committing multiple transactions together
  • The single-writer model simplifies file operations, and in the target use cases, concurrent writes rarely happen

Questionable

  • Pipeline writes are not fast enough if the replication factor is much larger than 3; in that case, the client waits a long time for packet ACKs
    • I personally would configure dfs.replication.min as 2, so the write returns as soon as the data is written to 2 DataNodes (see the config sketch after this list)
  • The whole cluster's capacity is limited solely by the NameNode's memory size, so storing a large number of small files is very inefficient
  • The Secondary NameNode always falls behind the primary, so data can be lost if the primary fails permanently
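
Naming the knobs from the point above: dfs.replication is the target replica count, and the minimum-replication key is dfs.namenode.replication.min in recent Hadoop releases (dfs.replication.min in older ones). These normally live in hdfs-site.xml on the NameNode; a Configuration object is used here only to show the keys and values.

```java
import org.apache.hadoop.conf.Configuration;

// Target replication vs. the minimum replication before a block counts as
// successfully written. Shown programmatically only to name the keys.
public class ReplicationKnobs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);               // still replicate to 3 eventually
        conf.setInt("dfs.namenode.replication.min", 2);  // consider the block written after 2 replicas
        System.out.println(conf.getInt("dfs.namenode.replication.min", 1));
    }
}
```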

Reference