This post explains how epoll is used and the differences between blocking and nonblocking I/O.

In this post I want to explain exactly what happens when you use nonblocking I/O. In particular, I want to explain:

  • The semantics of setting O_NONBLOCK on a file descriptor using fcntl

  • How nonblocking I/O is different from asynchronous I/O

  • Why nonblocking I/O is frequently used in conjunction with I/O multiplexers like select, epoll, and kqueue

  • How nonblocking mode interacts with edge-triggered polling with epoll


Blocking Mode


By default, all file descriptors on Unix systems start out in “blocking mode”. That means that I/O system calls like read, write, or connect can block. A really easy way to understand this is to think about what happens when you read data on stdin from a regular TTY-based program. If you call read on stdin then your program will block until data is actually available, such as when the user actually physically types characters on their keyboard. Specifically, the kernel will put the process into the “sleeping” state until data is available on stdin. This is also the case for other types of file descriptors. For instance, if you try to read from a TCP socket then the read call will block until the other side of the connection actually sends data.

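For instance, a blocking read from standard input looks something like the minimal sketch below; the read call simply does not return until input is available (or end-of-file is reached):

#include <unistd.h>

char buf[1024];
/* Blocks here: the kernel puts the process to sleep until data is
   available on stdin, e.g. after the user presses enter. */
ssize_t nbytes = read(STDIN_FILENO, buf, sizeof(buf));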

Blocking is a problem for programs that should operate concurrently, since blocked processes are suspended. There are two different, complementary ways to solve this problem. They are:

  • Nonblocking mode

  • I/O multiplexing system calls, such as select and epoll

These two solutions are often used together, but they are independent strategies for solving this problem. In a moment we’ll see the difference between them and why they’re commonly used together.


Nonblocking Mode (O_NONBLOCK)


A file descriptor is put into “nonblocking mode” by adding O_NONBLOCK to the set of fcntl flags on the file descriptor:


/* set O_NONBLOCK on fd */
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);

From this point forward the file descriptor is considered nonblocking. When this happens, I/O system calls like read and write that would otherwise block will return -1, and errno will be set to EWOULDBLOCK.

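As a minimal sketch (assuming fd has already been put into nonblocking mode as above), handling that return value looks like this; note that on Linux EAGAIN and EWOULDBLOCK have the same value, and portable code often checks both:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

char buf[1024];
ssize_t nbytes = read(fd, buf, sizeof(buf));
if (nbytes < 0) {
    if (errno == EWOULDBLOCK || errno == EAGAIN) {
        /* no data available right now; the call returned instead of blocking */
    } else {
        perror("read");
    }
} else {
    /* nbytes bytes of data (or EOF if nbytes == 0) */
}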

This is interesting, but on its own is actually not that useful. With just this primitive there’s no efficient way to do I/O on multiple file descriptors. For instance, suppose we have two file descriptors and want to read both of them at once. This could be accomplished by having a loop that checks each file descriptor for data, and then sleeps momentarily before checking again:


struct timespec sleep_interval = {.tv_sec = 0, .tv_nsec = 1000};
ssize_t nbytes;
for (;;) {
    /* try fd1 */
    if ((nbytes = read(fd1, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd1");
        }
    } else {
        handle_data(buf, nbytes);
    }

    /* try fd2 */
    if ((nbytes = read(fd2, buf, sizeof(buf))) < 0) {
        if (errno != EWOULDBLOCK) {
            perror("read/fd2");
        }
    } else {
        handle_data(buf, nbytes);
    }

    /* sleep for a bit; real version needs error checking! */
    nanosleep(&sleep_interval, NULL);
}

This works, but has a lot of drawbacks:

  • When data is coming in very slowly the program will wake up frequently and unnecessarily, which wastes CPU resources.

  • When data does come in, the program may not read it immediately if it’s sleeping, so the latency of the program will be poor.

  • Handling a large number of file descriptors with this pattern would become cumbersome.

To fix these problems we need an I/O multiplexer.


I/O multiplexing (select, epoll, kqueue, etc.)


There are a few I/O multiplexing system calls. Examples include select (defined by POSIX), the epoll family on Linux, and the kqueue family on BSD. These all work fundamentally the same way: they let the kernel know what events (typically read events and write events) are of interest on a set of file descriptors, and then they block until something of interest happens. For instance, you might tell the kernel you are interested in just read events on file descriptor X, both read and write events on file descriptor Y, and just write events on file descriptor Z.

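As a rough sketch of what registering that interest looks like with epoll on Linux (fd_x, fd_y, and fd_z stand in for the hypothetical descriptors X, Y, and Z above, and error checking is omitted):

#include <sys/epoll.h>

int epfd = epoll_create1(0);
struct epoll_event ev;

/* just read events on fd_x */
ev.events = EPOLLIN;
ev.data.fd = fd_x;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd_x, &ev);

/* both read and write events on fd_y */
ev.events = EPOLLIN | EPOLLOUT;
ev.data.fd = fd_y;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd_y, &ev);

/* just write events on fd_z */
ev.events = EPOLLOUT;
ev.data.fd = fd_z;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd_z, &ev);

/* block until one of the registered events happens */
struct epoll_event events[16];
int nready = epoll_wait(epfd, events, 16, -1);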

These I/O multiplexing system calls typically do not care if the file descriptors are in blocking mode or nonblocking mode. You can leave all of your file descriptors in blocking mode and they’ll work just fine with select or epoll. If you only call read and write on file descriptors returned by select or epoll, the calls won’t block, even if those file descriptors are in blocking mode. There’s one important exception! The blocking or nonblocking status of a file descriptor is significant for edge-triggered polling, as explained further below.


The multiplexing approach to concurrency is what I call “asynchronous I/O”. Sometimes people will call this same approach “nonblocking I/O”, which I believe comes from a confusion about what “nonblocking” means at the systems programming level. I suggest reserving the term “nonblocking” for referring to whether or not a file descriptor is actually in nonblocking mode.


How O_NONBLOCK Interacts With I/O Multiplexing


Let’s say we’re writing a simple socket server using select with blocking file descriptors. For simplicity, in this example we just have descriptors we want to read from, which are in read_fds. The core part of the event loop will call select and then invoke read once for each file descriptor with data:


ssize_t nbytes;
for (;;) {
    /* select call happens here */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }

    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* read call happens here */
            if ((nbytes = read(i, buf, sizeof(buf))) >= 0) {
                handle_read(nbytes, buf);
            } else {
                /* real version needs to handle EINTR correctly */
                perror("read");
                exit(EXIT_FAILURE);
            }
        }
    }
}

This works and it’s perfectly fine. But, what happens if buf is small, and a lot of data comes down the line? To be concrete, suppose that buf is a 1024-byte buffer but 64KB of data comes in all at once. To handle this request we’ll invoke select followed by read 64 times. That’s 128 total system calls, which is a lot.


If the buffer size is too small, read will have to be called a lot of times; there’s no avoiding that. But perhaps we could reduce the number of times we call select? Ideally, in this example we would call select only one time.


In fact, this is possible, and it’s accomplished by putting the file descriptors into nonblocking mode. The basic idea is that you keep calling read in a loop until it returns EWOULDBLOCK. That looks like this:


ssize_t nbytes;
for (;;) {
    /* select call happens here */
    if (select(FD_SETSIZE, &read_fds, NULL, NULL, NULL) < 0) {
        perror("select");
        exit(EXIT_FAILURE);
    }

    for (int i = 0; i < FD_SETSIZE; i++) {
        if (FD_ISSET(i, &read_fds)) {
            /* NEW: loop until EWOULDBLOCK is encountered */
            for (;;) {
                /* read call happens here */
                nbytes = read(i, buf, sizeof(buf));
                if (nbytes >= 0) {
                    handle_read(nbytes, buf);
                } else {
                    if (errno != EWOULDBLOCK) {
                        /* real version needs to handle EINTR correctly */
                        perror("read");
                        exit(EXIT_FAILURE);
                    }
                    break;
                }
            }
        }
    }
}

In this example (1024-byte buffer with 64KB of data incoming) we’ll do 66 system calls: select will be called one time, read will be called without error 64 times, and read will be called and return EWOULDBLOCK one time. This is much better! This is nearly half the number from the previous example, which will improve performance and scalability considerably.


The downside of this approach is that, due to the new loop, there’s at least one extra read call, since read is invoked until it returns EWOULDBLOCK. Let’s say that typically the read buffer is large enough to read all of the incoming data in one read call. Then in the usual case through the loop there will be three system calls rather than just two: select to wait for the data, read to actually read the data, and then read again to get EWOULDBLOCK.


Edge-Triggered Polling


There’s one more important use of nonblocking I/O: with edge-triggered polling in the epoll system call. This system call has two modes: level-triggered polling and edge-triggered polling. Level-triggered polling is a simpler programming model that is similar to the classic select system call. To explain the difference we need to understand how epoll works with the kernel.


Suppose you tell the kernel you’re interested in using epoll to monitor read events on some file descriptor. The kernel maintains a list of these interests for each file descriptor. When data comes in on the file descriptor the kernel traverses the interests list and wakes up each process that was blocked in epoll_wait with that descriptor in the event list.


What I outlined above happens regardless of what triggering mode epoll is in. The difference between level-triggered and edge-triggered polling is what happens in the kernel when you call epoll_wait. In level-triggered mode the kernel will traverse each file descriptor in the interest list to see if it already matches the interest condition. For instance, if you registered a read event on file descriptor 8, when calling epoll_wait the kernel will first check: does file descriptor 8 already have data ready for reading? If any of the file descriptors match their interest condition then epoll_wait can return without blocking.


By contrast, in edge-triggered mode the kernel skips this check and immediately puts the process to sleep when it calls epoll_wait. This puts all of the responsibility on you, the programmer, to do the Right Thing and fully read and write all data for each file descriptor before waiting on them again.


This edge-triggered mode is what makes epoll an O(1) I/O multiplexer: the epoll_wait call will suspend immediately, and since a list is maintained for each file descriptor ahead of time, when new data comes in the kernel immediately knows what processes must be woken up in O(1) time.


Here’s a more worked out example of the difference between edge-triggered and level-triggered modes. Suppose your read buffer is 100 bytes, and 200 bytes of data comes in for that file descriptor. Suppose then you call read exactly one time and then call epoll_wait again. There’s still 100 bytes of data already ready to read. In level-triggered mode the kernel would notice this and notify the process that it should call read again. By contrast, in edge-triggered mode the kernel would immediately go to sleep. If the other side is expecting a response (e.g. the data it sent is some kind of RPC) then the two sides will “deadlock”, as the server will be waiting for the client to send more data, but the client will be waiting for the server to send a response.


To use edge-triggered polling you must put the file descriptors into nonblocking mode. Then you must call read or write until they return EWOULDBLOCK every time. If you fail to meet these conditions you will miss notifications from the kernel. But there’s a big upside of doing this: each call to epoll_wait will be more efficient, which can be very important on programs with extremely high levels of concurrency. If you want to learn more about the details I strongly encourage you to read the epoll man page.

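As a sketch of what that looks like in practice (assuming fd is a connected socket already in nonblocking mode, with epfd, buf, and handle_read as in the earlier sketches):

/* register fd for edge-triggered read events */
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

/* ...later, once epoll_wait reports fd as readable, drain it completely */
for (;;) {
    ssize_t nbytes = read(fd, buf, sizeof(buf));
    if (nbytes > 0) {
        handle_read(nbytes, buf);
    } else if (nbytes == 0) {
        break; /* peer closed the connection */
    } else if (errno == EWOULDBLOCK || errno == EAGAIN) {
        break; /* all available data consumed; safe to call epoll_wait again */
    } else {
        /* real version needs to handle EINTR correctly */
        perror("read");
        break;
    }
}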

Update: I’ve put up a complete example of using edge-triggered epoll on GitHub: https://github.com/eklitzke/epollet.

