exploit the full network bandwidth even with small messages
facilitate the use of novel communication protocols
U-Net
each process has the illusion of owning the interface to the network
Model: LogP (costs of communication)
sender: cost to send message Os
latency l
receiver: cost to receive message Or
Round-trip time = 2l + 2Os + 2Or; assuming Os ≈ Or = O: Time = 2l + 4O
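The cost model above can be checked with a few lines of arithmetic. This is a minimal sketch that reads the formula as a request/reply round trip; the function name and the microsecond values are made up for illustration and are not measurements from the paper.

```python
def round_trip_time(latency, o_send, o_recv):
    """Request/reply time: each direction pays send overhead,
    network latency, and receive overhead once."""
    one_way = o_send + latency + o_recv
    return 2 * one_way


# Hypothetical values in microseconds, with Os == Or == O.
l, o = 10.0, 3.0
rtt = round_trip_time(l, o, o)
assert rtt == 2 * l + 4 * o  # matches Time = 2l + 4O
print(rtt)
```

With small messages the 4O term dominates once l is small, which is why the notes focus on cutting per-message overhead.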
History
Distributed Computing 80s
early Ethernet: bus-like, one communication at a time, which limits scale
a Hadoop-style reshuffle would be a very bad idea on such a network
Massively Parallel Processors (MPPs), early 90s (Culler, "Parallel Computer Architecture")
scalable networks, tree-based switched networks
high throughput, low latency (potential)
Active Messages 92
Lightweight RPC (core idea similar to RPC, but no compiler)
Philosophy
bare-minimum network layer
virtualized network interface
protection: because the network has multiple users
multiplexing/sharing
usually done by the OS, but here done at user level
OS resources are limited, unified, and constrained
performance gain
flexibility gain
flexibility
high performance
Basics
Endpoints: queues plus memory for data (communication segment, pinned host memory)
send queue, recv queue, free queue
queues contain descriptors: info about a message (e.g., destination address, pointer to the data in the communication segment)
Send
form message in communication segment
push descriptor onto send queue
Network interface (or U-Net)
must recognize that a message exists on the send queue
once sent, mark a bit in the descriptor
Receive
Network Interface (NI): must demultiplex to find the destination endpoint
tag
Where to place message?
receive data into the buffer named by a free-queue descriptor
move that descriptor: free queue -> recv queue
Notification
poll
event-driven (interrupt, callback)
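The send and receive steps above can be sketched as a small in-memory simulation. This is not U-Net's interface: the class names, the integer tag used as an endpoint id, the buffer sizes, and the split of the segment into a send half and a receive half are all assumptions made for illustration. A real NI works on pinned memory and hardware queues, and this sketch does not guard against reusing a send buffer before the NI has drained it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    offset: int         # start of the buffer inside the communication segment
    length: int = 0     # number of valid bytes in that buffer
    tag: int = 0        # destination tag on send, source tag on receive
    done: bool = False  # set by the NI once the message has been sent


class Endpoint:
    def __init__(self, tag, seg_size=4096, buf_size=256):
        self.tag = tag
        self.buf_size = buf_size
        self.segment = bytearray(seg_size)  # stands in for pinned host memory
        self.send_q, self.recv_q, self.free_q = deque(), deque(), deque()
        self._send_area = seg_size // 2     # lower half: outgoing messages
        self._next_send = 0
        # Upper half: empty receive buffers offered to the NI via the free queue.
        for off in range(self._send_area, seg_size, buf_size):
            self.free_q.append(Descriptor(offset=off))

    def send(self, dest_tag, payload):
        if len(payload) > self.buf_size:
            raise ValueError("payload larger than one buffer")
        off = self._next_send
        self.segment[off:off + len(payload)] = payload  # form the message
        self._next_send = (off + self.buf_size) % self._send_area
        desc = Descriptor(offset=off, length=len(payload), tag=dest_tag)
        self.send_q.append(desc)                        # push the descriptor
        return desc

    def poll(self):
        """Return (source_tag, data) or None; polling is one notification style."""
        if not self.recv_q:
            return None
        desc = self.recv_q.popleft()
        data = bytes(self.segment[desc.offset:desc.offset + desc.length])
        self.free_q.append(Descriptor(offset=desc.offset))  # recycle the buffer
        return desc.tag, data


class NetworkInterface:
    def __init__(self):
        self.endpoints = {}

    def attach(self, endpoint):
        self.endpoints[endpoint.tag] = endpoint

    def service(self):
        """One pass over all send queues, standing in for the NI's work loop."""
        for src in list(self.endpoints.values()):
            while src.send_q:
                desc = src.send_q.popleft()
                payload = bytes(src.segment[desc.offset:desc.offset + desc.length])
                self._deliver(src.tag, desc.tag, payload)
                desc.done = True                        # mark the descriptor as sent

    def _deliver(self, src_tag, dest_tag, payload):
        dst = self.endpoints.get(dest_tag)              # demultiplex by tag
        if dst is None or not dst.free_q or len(payload) > dst.buf_size:
            return                                      # no usable buffer: drop
        desc = dst.free_q.popleft()
        dst.segment[desc.offset:desc.offset + len(payload)] = payload
        desc.length, desc.tag = len(payload), src_tag
        dst.recv_q.append(desc)                         # free queue -> recv queue


ni = NetworkInterface()
a, b = Endpoint(tag=1), Endpoint(tag=2)
ni.attach(a)
ni.attach(b)
sent = a.send(2, b"hello")
ni.service()
received = b.poll()
print(sent.done, received)  # True (1, b'hello')
```

An event-driven variant would replace `poll()` with a callback invoked from `_deliver`; the queue handling stays the same.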
Implementation
Base Arch:
limited (small) communication segment
pinned in physical memory (cannot be swapped out)
but total physical memory is limited
pinning is needed so that the NI can access the memory
small message optimization
put the message data directly into the descriptor
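The small-message optimization can be sketched as a descriptor that either carries its payload inline or points into the communication segment. The 56-byte threshold, the field names, and both helper functions are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional

INLINE_MAX = 56  # hypothetical threshold, not a value taken from the notes


@dataclass
class Descriptor:
    length: int
    inline: Optional[bytes] = None  # payload stored in the descriptor itself
    offset: Optional[int] = None    # otherwise: position in the communication segment


def make_descriptor(payload, segment, offset):
    """Build a descriptor, inlining the payload when it is small enough."""
    if len(payload) <= INLINE_MAX:
        return Descriptor(length=len(payload), inline=bytes(payload))
    if offset + len(payload) > len(segment):
        raise ValueError("payload does not fit in the communication segment")
    segment[offset:offset + len(payload)] = payload
    return Descriptor(length=len(payload), offset=offset)


def read_message(desc, segment):
    """Recover the payload from either representation."""
    if desc.inline is not None:
        return desc.inline
    return bytes(segment[desc.offset:desc.offset + desc.length])


segment = bytearray(1024)
small = make_descriptor(b"ack", segment, 0)
large = make_descriptor(b"x" * 200, segment, 128)
print(small.inline is not None, large.offset)
```

The small message never touches the segment, so no buffer has to be allocated or recycled for it.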
Direct Arch:
removes the limits of the communication segment (can send/recv from any virtual address)
NI has to do DMA
Kernel-emulated endpoints
because NI resources are limited
many emulated endpoints are multiplexed onto one real endpoint owned by the kernel
makes it easier to migrate applications to U-Net
Fore SBA-200
Network interface has its own CPU, memory, and CRC hardware
Endpoint queues: where to place them?
NI memory or host memory
send, free queue in NI memory
recv queue in host memory
The U-Net paper discusses the notion of zero copy vs. true zero copy. What is the point being made here? Is this important for high performance applications and services?
Review
U-Net has two levels of sophistication
base level is "zero copy", which still requires one intermediate copy into a networking buffer (the communication segment)
direct access is "true zero copy", which has no intermediate buffering
Comment
True zero copy is useful because per-message processing overhead includes transmission time, buffer management, message copies, checksums, flow control, interrupts, network interface control, and so on. Meanwhile, in terms of speed, NIC < memory < CPU. So when there is little room to speed up the NIC itself, it is natural to reduce the number of memory accesses instead. For a large bulk transfer, say 1 GB, even a single extra copy is expensive.
However, a buffer can be beneficial when the data to send is produced incrementally, e.g., a user process that writes 1 byte at a time, 1000 times, to send 1 KB of data. Buffering those writes into one packet greatly increases throughput and reduces network congestion. Even though the authors stress that their design has minimal overhead, the buffered (base-level) zero copy is still useful for assembling a large packet when the application does not need real-time delivery.
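The buffering argument can be illustrated with a small coalescing sender. This is a sketch only: `CoalescingSender` and its `transmit` callback are hypothetical names, and the "network" is just a Python list that records each packet that would be sent.

```python
class CoalescingSender:
    """Collect small writes and hand them to `transmit` in larger packets."""

    def __init__(self, transmit, capacity=1024):
        self.transmit = transmit  # called once per outgoing packet
        self.capacity = capacity  # packet size that triggers a send
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        while len(self.buf) >= self.capacity:
            self.transmit(bytes(self.buf[:self.capacity]))
            del self.buf[:self.capacity]

    def flush(self):
        """Send whatever is left, e.g. at the end of a stream."""
        if self.buf:
            self.transmit(bytes(self.buf))
            self.buf.clear()


packets = []
sender = CoalescingSender(packets.append, capacity=1000)
for _ in range(1000):
    sender.write(b"x")  # 1000 one-byte writes
sender.flush()
print(len(packets), len(packets[0]))  # 1 1000
```

Without the buffer, the same loop would produce 1000 packets, each paying the full per-message overhead. The price is the extra copy into `buf` and the added delay before the data leaves.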
Modern high-performance servers such as Nginx have adopted a similar mechanism: sendfile(), a system call available since Linux 2.2. The main idea of sendfile() is to combine read() and write(), so that file content does not have to be copied from a file descriptor into a user buffer and then written back out. This achieves zero copy at the kernel level, which speeds up serving static files from disk to clients.
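As a rough illustration of the sendfile() idea, Python's standard library exposes it through `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and falls back to ordinary `send()` calls otherwise. The helper name `send_static_file` and the local socket pair are assumptions for the demo; a real server would use an accepted TCP connection.

```python
import os
import socket
import tempfile


def send_static_file(sock, path):
    """Send the whole file at `path` over a connected stream socket.

    No read()-into-a-user-buffer step appears here; when os.sendfile()
    is available the copy happens inside the kernel.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)  # returns the number of bytes sent


# Demo on a local socket pair with a small temporary file.
content = b"hello, static file\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)

left, right = socket.socketpair()
try:
    sent = send_static_file(left, tmp.name)
    left.close()
    received = right.recv(1024)
finally:
    left.close()
    right.close()
    os.unlink(tmp.name)

print(sent, received)
```

The file here is kept small so that it fits in the socket buffer without a concurrent reader; a large file would need the receiver to drain the socket while the sender runs.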