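Chapter 5  Input and Output (Part 2)

5.5  How Redis Wraps epoll

Like the author of Nginx, the author of Redis prefers not to pull in third-party libraries such as libevent or libev for event handling and instead wraps epoll himself, whereas Memcached's I/O model depends on libevent. Redis provides a different I/O backend for each operating system: on Linux the backend wraps epoll, and on BSD it wraps kqueue. For the Linux backend, the core code lives in ae_epoll.c:

aeApiState holds the epoll file descriptor and the array of epoll_event structures that receives ready events: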

typedef struct aeApiState {
    int epfd;
    struct epoll_event *events;
} aeApiState;

aeApiCreate calls epoll_create to create the epoll instance:

static int aeApiCreate(aeEventLoop *eventLoop) {
    aeApiState *state = zmalloc(sizeof(aeApiState));
    …
    state->epfd = epoll_create(1024); // 1024 is only a size hint for the kernel
 …
}

aeApiAddEvent registers events with epoll:

static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee;
    …
    ee.data.fd = fd;
    if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return -1;
    return 0;
}

aeApiDelEvent removes events from epoll:

static void aeApiDelEvent(aeEventLoop *eventLoop, int fd, int delmask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee;
    int mask = eventLoop->events[fd].mask & (~delmask);
    ee.events = 0;
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.u64 = 0; /* avoid valgrind warning */
    ee.data.fd = fd;
    if (mask != AE_NONE) {
        epoll_ctl(state->epfd,EPOLL_CTL_MOD,fd,&ee);
    } else {
        // Note: kernels < 2.6.9 require a non-NULL event pointer even for EPOLL_CTL_DEL
        epoll_ctl(state->epfd,EPOLL_CTL_DEL,fd,&ee);
    }
}

aeApiPoll calls epoll_wait to wait until epoll events are ready:

static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata;
    int retval, numevents = 0;
    retval = epoll_wait(state->epfd,state->events,eventLoop->setsize,
            tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
    if (retval > 0) {
        int j;
        numevents = retval;
        for (j = 0; j < numevents; j++) {
            int mask = 0;
            struct epoll_event *e = state->events+j;
            if (e->events & EPOLLIN) mask |= AE_READABLE;
            if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
            if (e->events & EPOLLERR) mask |= AE_WRITABLE;
            if (e->events & EPOLLHUP) mask |= AE_WRITABLE;
            eventLoop->fired[j].fd = e->data.fd;
            eventLoop->fired[j].mask = mask;
        }
    }
    return numevents;
}

To summarize:

aeApiCreate: calls epoll_create to create an epoll instance.

aeApiAddEvent: calls epoll_ctl to register events with epoll.

aeApiPoll: calls epoll_wait to collect the events that have fired.

So how do these pieces fit together? Let us walk through how the server initializes epoll, step by step.

First, initServer creates the event loop, which initializes epoll:

server.el = aeCreateEventLoop(server.maxclients+REDIS_EVENTLOOP_FDSET_INCR);

Next it sets the callback that runs before each sleep:

aeSetBeforeSleepProc(server.el,beforeSleep);

Now look at aeMain, the function that runs the main loop:

void aeMain(aeEventLoop *eventLoop) {

  eventLoop->stop = 0;
  while (!eventLoop->stop) {
      if (eventLoop->beforesleep != NULL)
          eventLoop->beforesleep(eventLoop);
      aeProcessEvents(eventLoop, AE_ALL_EVENTS);
  }
}

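The loop calls aeProcessEvents over and over to handle events (see Figure 5-10).

Figure 5-10  Relationship between the Redis epoll main loop and its events

As the figure shows, the event loop handles two kinds of events: timer events and file events.

Finally, let us look at aeProcessEvents: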

int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
    int processed = 0, numevents;
    …
        aeTimeEvent *shortest = NULL;
        struct timeval tv, *tvp;
        if (flags & AE_TIME_EVENTS && !(flags & AE_DONT_WAIT))
            shortest = aeSearchNearestTimer(eventLoop);
        if (shortest) {
            long now_sec, now_ms;
            aeGetTime(&now_sec, &now_ms);
            tvp = &tv;
            tvp->tv_sec = shortest->when_sec - now_sec;
            if (shortest->when_ms < now_ms) {
                tvp->tv_usec = ((shortest->when_ms+1000) - now_ms)*1000;
                tvp->tv_sec --;
            } else {
                tvp->tv_usec = (shortest->when_ms - now_ms)*1000;
            }
            if (tvp->tv_sec < 0) tvp->tv_sec = 0;
            if (tvp->tv_usec < 0) tvp->tv_usec = 0;
        } else {
            // If AE_DONT_WAIT is set, use a zero timeout so the poll returns immediately
            if (flags & AE_DONT_WAIT) {
                tv.tv_sec = tv.tv_usec = 0;
                tvp = &tv;
            } else {
                // Otherwise block
                tvp = NULL;         // wait indefinitely
            }
        }

        numevents = aeApiPoll(eventLoop, tvp);
        for (j = 0; j < numevents; j++) {
            aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];
            int mask = eventLoop->fired[j].mask;
            int fd = eventLoop->fired[j].fd;
            int rfired = 0;
            if (fe->mask & mask & AE_READABLE) {
                rfired = 1;
                fe->rfileProc(eventLoop,fd,fe->clientData,mask);
            }
            if (fe->mask & mask & AE_WRITABLE) {
                if (!rfired || fe->wfileProc != fe->rfileProc)
                    fe->wfileProc(eventLoop,fd,fe->clientData,mask);
            }
            processed++;
        }
    }
    if (flags & AE_TIME_EVENTS)
        processed += processTimeEvents(eventLoop);
    return processed;                 // number of file/time events processed
}

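The function works in roughly the following steps:

1) aeSearchNearestTimer looks for the timer event that is due soonest. If there is one, the time remaining until it fires becomes the timeout for the poll, so the loop wakes up in time to run it.

2) aeApiPoll then waits for ready events in epoll, and their handlers are invoked. If no timer is pending and AE_DONT_WAIT is not set, the wait has no timeout and blocks indefinitely: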

    if (flags & AE_DONT_WAIT) {
        tv.tv_sec = tv.tv_usec = 0;
        tvp = &tv;
    } else {
        tvp = NULL;
    }

3) processTimeEvents runs the timer events that are due.
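
The timeout decision in these steps can be condensed into a few lines of arithmetic. The sketch below is a simplification rather than the Redis implementation: the function names are hypothetical, times are passed in as plain (seconds, milliseconds) pairs like the when_sec/when_ms fields, and the result is expressed in milliseconds, the unit that epoll_wait expects.

```c
/* Hypothetical helper: milliseconds until the nearest timer fires,
 * clamped at 0 when the timer is already overdue. */
long ms_until_timer(long now_sec, long now_ms, long when_sec, long when_ms) {
    long diff = (when_sec - now_sec) * 1000 + (when_ms - now_ms);
    return diff < 0 ? 0 : diff;
}

/* Hypothetical helper: the timeout to pass to epoll_wait.
 *   dont_wait set       -> 0  (return immediately)
 *   a timer is pending  -> time left until it fires
 *   otherwise           -> -1 (block indefinitely) */
int poll_timeout_ms(int has_timer, int dont_wait,
                    long now_sec, long now_ms, long when_sec, long when_ms) {
    if (dont_wait) return 0;
    if (has_timer)
        return (int)ms_until_timer(now_sec, now_ms, when_sec, when_ms);
    return -1;
}
```

For example, with the clock at 10 s 900 ms and the nearest timer due at 11 s 100 ms, poll_timeout_ms(1, 0, 10, 900, 11, 100) yields 200, so the poll wakes up just as the timer becomes due.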

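Finally, consider Redis's overall event-handling flow (see Figure 5-11). Redis processes commands on a single thread, so there is no lock contention. To raise throughput, it splits the work into many small steps, each driven by an epoll callback, so that no single request holds the main thread for long and overall throughput improves.

Figure 5-11  Overall event-handling flow in Redis

5.6  Asynchronous File I/O in Nginx

To respond to I/O events more promptly, Linux provides a native aio mechanism in the kernel. It is true asynchronous I/O, unlike the POSIX aio in glibc, which emulates asynchrony with threads.

Because Linux native aio does not support cached I/O (reads are asynchronous only with direct I/O), Nginx supports aio only for reading files.

Using aio involves the following steps:

1) io_setup: initializes an asynchronous I/O context, similar to epoll_create.

2) io_submit: submits the asynchronous request together with the information used to call back its handler.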
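
The steps above can be sketched as a standalone program fragment. This is an illustration under stated assumptions, not Nginx code: the names aio_read_once and aio_demo are hypothetical, the kernel interface is reached through raw syscall() because glibc provides no wrappers for it, and the file is opened without O_DIRECT for simplicity, so the kernel may complete the read synchronously inside io_submit. Completion is signalled through an eventfd and collected with io_getevents, the same pairing that Nginx registers in epoll.

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: reads len bytes at offset off with Linux native aio.
 * Returns the number of bytes read, or -1 on failure. */
long aio_read_once(int fd, void *buf, size_t len, long long off) {
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 8, &ctx) == -1) return -1;      /* step 1 */

    int efd = eventfd(0, 0);
    if (efd == -1) { syscall(SYS_io_destroy, ctx); return -1; }

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_flags = IOCB_FLAG_RESFD;       /* notify completion via eventfd */
    cb.aio_resfd = (uint32_t)efd;

    struct iocb *list[1] = { &cb };
    long res = -1;
    if (syscall(SYS_io_submit, ctx, 1, list) == 1) {          /* step 2 */
        uint64_t completed;
        struct io_event ev;
        /* Block until the eventfd counter shows a completion, then reap it. */
        if (read(efd, &completed, sizeof completed) == sizeof completed &&
            syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
            res = (long)ev.res;
    }
    close(efd);
    syscall(SYS_io_destroy, ctx);
    return res;
}

/* Hypothetical demo: writes "hello" to a temporary file and reads it back. */
int aio_demo(void) {
    char path[] = "/tmp/aio_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1) return -1;
    unlink(path);
    char buf[16] = {0};
    int ok = write(fd, "hello", 5) == 5 &&
             aio_read_once(fd, buf, 5, 0) == 5 &&
             memcmp(buf, "hello", 5) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

A real server would not block on the eventfd as this sketch does. It would add the eventfd to epoll and call io_getevents from the event handler.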

When ngx_epoll_module initializes, it first sets up aio:

ngx_epoll_aio_init(ngx_cycle_t *cycle, ngx_epoll_conf_t *epcf)
{
    int                    n;
    struct epoll_event  ee;

#if (NGX_HAVE_SYS_EVENTFD_H)
    ngx_eventfd = eventfd(0, 0);
#else
    ngx_eventfd = syscall(SYS_eventfd, 0);
#endif
…
    n = 1;

    if (ioctl(ngx_eventfd, FIONBIO, &n) == -1) {
        ngx_log_error(NGX_LOG_EMERG, cycle->log, ngx_errno,
                      "ioctl(eventfd, FIONBIO) failed");
        goto failed;
    }

    if (io_setup(epcf->aio_requests, &ngx_aio_ctx) == -1) {
        ngx_log_error(NGX_LOG_EMERG, cycle->log, ngx_errno,
                      "io_setup() failed");
        goto failed;
    }

    ngx_eventfd_event.data = &ngx_eventfd_conn;
    ngx_eventfd_event.handler = ngx_epoll_eventfd_handler;
    ngx_eventfd_event.log = cycle->log;
    ngx_eventfd_event.active = 1;
    ngx_eventfd_conn.fd = ngx_eventfd;
    ngx_eventfd_conn.read = &ngx_eventfd_event;
    ngx_eventfd_conn.log = cycle->log;
    ee.events = EPOLLIN|EPOLLET;
    ee.data.ptr = &ngx_eventfd_conn;

    if (epoll_ctl(ep, EPOLL_CTL_ADD, ngx_eventfd, &ee) != -1) {
        return;
    }
…
}

Later, when Nginx reads a file, it calls ngx_file_aio_read to read it asynchronously:

ssize_t
ngx_file_aio_read(ngx_file_t *file, u_char *buf, size_t size, off_t offset,
    ngx_pool_t *pool)
{
    ngx_err_t         err;
    struct iocb      *piocb[1];
    ngx_event_t      *ev;
    ngx_event_aio_t  *aio;
    ...
    aio = file->aio;
    ev = &aio->event;
    ...
    if (ev->complete) {
        ev->active = 0;
        ev->complete = 0;

        if (aio->res >= 0) {
            ngx_set_errno(0);
            return aio->res;
        }

       ...
    }

    ngx_memzero(&aio->aiocb, sizeof(struct iocb));

    aio->aiocb.aio_data = (uint64_t) (uintptr_t) ev;
    aio->aiocb.aio_lio_opcode = IOCB_CMD_PREAD;
    aio->aiocb.aio_fildes = file->fd;
    aio->aiocb.aio_buf = (uint64_t) (uintptr_t) buf;
    aio->aiocb.aio_nbytes = size;
    aio->aiocb.aio_offset = offset;
    aio->aiocb.aio_flags = IOCB_FLAG_RESFD;
    aio->aiocb.aio_resfd = ngx_eventfd;

    ev->handler = ngx_file_aio_event_handler;

    piocb[0] = &aio->aiocb;

    if (io_submit(ngx_aio_ctx, 1, piocb) == 1) {
        ev->active = 1;
        ev->ready = 0;
        ev->complete = 0;

        return NGX_AGAIN;
    }
...
}

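5.7  Why the tail Command Is So Effective

To collect data from a production environment, I wrote a Java agent that gathers the relevant information. After it had run for a while, a problem appeared: under a heavy collection load, the process used too much CPU, sometimes more than 100%.

First, here is the agent's pseudocode: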

public void run() {
    try {
    accesslog.seek(accesslog.length());
        int i = 0;
        while (!Thread.currentThread().isInterrupted()) {
            String line = accesslog.readLine();
            if (line != null) {
                try {
                    parseLineAndLog(line);
                } catch (Exception ex) {

                    LOGGER.error("parseLineAndLog log error:", ex);
                }
                try {
                    if (i++ % 100 == 0) {
                        Thread.sleep(100);
                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    } catch (IOException e) {
        LOGGER.error("read ngx access log error:", e);
    }
}

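First we run top -H -p ${pid} to see which thread in the process is using the most CPU. Once we have found it, we convert its thread ID to hexadecimal with the following command (1234 stands for the thread ID):

awk 'BEGIN{printf("%x\n",1234)}'

Then we dump the thread stacks with:

jstack ${pid} > stack.log

Searching the dump for the hexadecimal thread ID (it appears in the nid field) shows that the thread is always executing this line: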

String line = accesslog.readLine();

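The accesslog used in the code above is in fact:

this.accesslog = new RandomAccessFile(path, "r");

Its readLine() method does not block: at end of file it returns null immediately, so the loop keeps polling and CPU usage inevitably climbs.

How can this be fixed? I used the tail command: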

 Process p = Runtime.getRuntime().exec("tail -n 1 -F " + path);
 br = new BufferedReader(new InputStreamReader(p.getInputStream()));

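The while loop then reads lines with br.readLine(), which blocks until tail emits output.

After the change went live, something remarkable happened: CPU usage stayed below 5%.

Why does a single tail command make such a difference? Following Linus's advice, we look for the answer in the code.

First, we locate the source of tail:

http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/tail.c

Tracing through it step by step, we find that it ends up calling tail_forever_inotify.

Since the 2.6 kernel series (2.6.13, to be precise), Linux has provided inotify: the kernel watches the file system for changes and notifies user space, which removes the cost of polling. Here is how tail uses it: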

tail_forever_inotify (int wd, struct File_spec *f, size_t n_files,
                        double sleep_interval)
{
...
f[i].wd = inotify_add_watch (wd, f[i].name, inotify_wd_mask);
...
    if (pid)
        {
            if (writer_is_dead)
                exit (EXIT_SUCCESS);

            writer_is_dead = (kill (pid, 0) != 0 && errno != EPERM);

            struct timeval delay; // how long to wait for a file change
            if (writer_is_dead)
                delay.tv_sec = delay.tv_usec = 0;
            else
                {
                    delay.tv_sec = (time_t) sleep_interval;
                    delay.tv_usec = 1000000 * (sleep_interval - delay.tv_sec);
                }

                fd_set rfd;
                FD_ZERO (&rfd);
                FD_SET (wd, &rfd);

                int file_change = select (wd + 1, &rfd, NULL, NULL, &delay);

                if (file_change == 0)
                    continue;
                else if (file_change == -1)
                    die (EXIT_FAILURE, errno, _("error monitoring inotify event"));
            }
...
 len = safe_read (wd, evbuf, evlen);
...

The code above therefore boils down to three steps:

1) Register an inotify watch.

2) Use select to wait for a watch event.

3) Use safe_read to read the events that are ready.
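
The same three steps can be reproduced in a few dozen lines. The sketch below is an illustration and not the coreutils code: the names wait_for_change and inotify_demo are hypothetical, only IN_MODIFY is watched, and plain read replaces safe_read.

```c
#define _GNU_SOURCE
#include <sys/inotify.h>
#include <sys/select.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_sec for an inotify event on ifd.
 * Returns the event mask, 0 on timeout, or -1 on error. */
long wait_for_change(int ifd, int timeout_sec) {
    fd_set rfd;
    FD_ZERO(&rfd);
    FD_SET(ifd, &rfd);
    struct timeval delay = { timeout_sec, 0 };

    int n = select(ifd + 1, &rfd, NULL, NULL, &delay);   /* step 2: sleep, no polling */
    if (n <= 0) return n;

    char evbuf[4096];
    ssize_t len = read(ifd, evbuf, sizeof evbuf);        /* step 3: fetch the event */
    if (len < (ssize_t)sizeof(struct inotify_event)) return -1;

    struct inotify_event ev;
    memcpy(&ev, evbuf, sizeof ev);                       /* first event in the buffer */
    return (long)ev.mask;
}

/* Hypothetical demo: watches a temporary file, appends a line to it, and
 * checks that the kernel reports the modification. Returns 0 on success. */
int inotify_demo(void) {
    char path[] = "/tmp/inotify_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1) return -1;

    int ifd = inotify_init();
    int wd = ifd == -1 ? -1 : inotify_add_watch(ifd, path, IN_MODIFY);  /* step 1 */

    long mask = -1;
    if (wd != -1 && write(fd, "line\n", 5) == 5)
        mask = wait_for_change(ifd, 2);

    if (ifd != -1) close(ifd);
    close(fd);
    unlink(path);
    return (mask > 0 && (mask & IN_MODIFY)) ? 0 : -1;
}
```

While no one writes to the file, the process sleeps inside select and consumes no CPU, which is the behavior the Java polling loop lacked.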

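5.8  Zero-Copy Techniques in Practice

In a typical I/O path, data is moved with read followed by write (or send), as shown in Figure 5-12. The read call first switches from user mode to kernel mode, reads data from the file into a kernel buffer, then copies it from the kernel buffer into a user-space buffer and switches back to user mode.

Sending the data on to its destination with send follows the same pattern in reverse, so the whole transfer costs 4 context switches and 4 buffer copies.

Figure 5-12  A read/send round trip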
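
As a baseline for the optimizations that follow, the traditional path looks like this in code. It is a minimal sketch with hypothetical names (copy_read_write, copy_read_write_demo), and write stands in for send. Every chunk crosses the user/kernel boundary twice and passes through the user-space buffer buf.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copies everything from in_fd to out_fd through a
 * user-space buffer. Returns the number of bytes copied, or -1 on error. */
long copy_read_write(int in_fd, int out_fd) {
    char buf[4096];
    long total = 0;
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof buf)) > 0) {      /* kernel -> user copy */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off)); /* user -> kernel copy */
            if (w == -1) return -1;
            off += w;
        }
        total += n;
    }
    return n == -1 ? -1 : total;
}

/* Hypothetical demo: copies a temporary file and verifies the result. */
int copy_read_write_demo(void) {
    char in_path[] = "/tmp/rw_in_XXXXXX", out_path[] = "/tmp/rw_out_XXXXXX";
    const char msg[] = "copied through user space";
    size_t len = sizeof msg - 1;
    int in = mkstemp(in_path);
    int out = mkstemp(out_path);
    int ok = 0;
    if (in != -1 && out != -1 &&
        write(in, msg, len) == (ssize_t)len && lseek(in, 0, SEEK_SET) == 0) {
        char back[64] = {0};
        ok = copy_read_write(in, out) == (long)len &&
             pread(out, back, sizeof back, 0) == (ssize_t)len &&
             memcmp(back, msg, len) == 0;
    }
    if (in != -1) { close(in); unlink(in_path); }
    if (out != -1) { close(out); unlink(out_path); }
    return ok ? 0 : -1;
}
```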

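To reduce the number of buffer copies and context switches, Linux offers several options, described below.

5.8.1　mmap

If the goal is simply to write data to a file, Linux provides mmap, which maps the file into the process's virtual address space as shared memory. Writing to the mapped memory writes the file and reading it reads the file, which removes a buffer copy.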
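
From user space, writing a file through a shared mapping can be sketched as follows. The function name mmap_write_demo and the temporary file are hypothetical. The file is first sized with ftruncate, since a mapping cannot extend a file, and msync forces the dirty pages out before the check.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo: writes a message into a file via mmap and reads it
 * back with pread to confirm that the file contents changed.
 * Returns 0 on success, -1 on failure. */
int mmap_write_demo(void) {
    char path[] = "/tmp/mmap_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1) return -1;
    unlink(path);

    const char msg[] = "written through mmap";
    size_t len = sizeof msg;
    int ok = 0;

    if (ftruncate(fd, (off_t)len) == 0) {                 /* give the file a size */
        char *view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (view != MAP_FAILED) {
            memcpy(view, msg, len);                       /* a memory write is a file write */
            msync(view, len, MS_SYNC);                    /* flush dirty pages to the file */
            munmap(view, len);

            char back[sizeof msg] = {0};
            ok = pread(fd, back, len, 0) == (ssize_t)len &&
                 memcmp(back, msg, len) == 0;
        }
    }
    close(fd);
    return ok ? 0 : -1;
}
```

No write system call is issued for the payload itself. The memcpy lands directly in pages shared with the page cache.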

mmap is ultimately implemented by the do_mmap function:

unsigned long do_mmap(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, vm_flags_t vm_flags,
            unsigned long pgoff, unsigned long *populate)
{
    struct mm_struct *mm = current->mm;                        // mm of the current process
    …
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
                                                                // this personality makes readable mappings executable too
        if (!(file && path_noexec(&file->f_path)))
            prot |= PROT_EXEC;
    if (!(flags & MAP_FIXED))        // MAP_FIXED is not set
        addr = round_hint_to_min(addr);        // if the requested start address is below the minimum mappable address, raise it to that minimum
    len = PAGE_ALIGN(len);        // round len up to a whole number of pages
    …
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)                // check for offset overflow
        return -EOVERFLOW;
    if (mm->map_count > sysctl_max_map_count)        // exceeds the per-process limit on the number of mappings
        return -ENOMEM;
    addr = get_unmapped_area(file, addr, len, pgoff, flags);        // find an unmapped range (a free address range in mm)
    …
    // Build vm_flags from the prot and flags arguments plus the mm's own default flags
    vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    …
    if (file) {
        struct inode *inode = file_inode(file);
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
                                                                // the file must be open for writing
                return -EACCES;
            if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
                                                                // an append-only file cannot be mapped writable
                return -EACCES;
            if (locks_verify_locked(file))                // the file has mandatory locks
                return -EAGAIN;
            vm_flags |= VM_SHARED | VM_MAYSHARE;        // allow the mapping to be shared with other processes
            if (!(file->f_mode & FMODE_WRITE))                // if the file is not writable, drop shared write access
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
        …
        }
    }

    …
    addr = mmap_region(file, addr, len, vm_flags, pgoff); // establish the mapping from the file to the virtual memory range
    …
    return addr;
}

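The two most important steps above are:

1) get_unmapped_area finds a free, unmapped address range in the current process's virtual address space.

2) mmap_region establishes the mapping from the file to that virtual memory range.

Figure 5-13 shows the resulting relationship.

Figure 5-13  Relationship between a file and the virtual address space after mapping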

5.8.2　sendfile

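When data must be read from one file and written to another, the mmap approach still needs 2 system calls and 4 context switches, so Linux also provides the sendfile call, which does the job in a single system call (see Figure 5-14).

Figure 5-14  The sendfile call path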
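
From the caller's side, a file-to-file copy with sendfile can be sketched as below. The names copy_sendfile and copy_sendfile_demo are hypothetical, and the sketch assumes Linux 2.6.33 or later, where out_fd may be a regular file (earlier kernels require a socket). The loop is needed because sendfile may transfer fewer bytes than requested.

```c
#define _GNU_SOURCE
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copies all of in_fd to out_fd inside the kernel.
 * Returns the number of bytes copied, or -1 on error. */
long copy_sendfile(int in_fd, int out_fd) {
    struct stat st;
    if (fstat(in_fd, &st) == -1) return -1;

    off_t offset = 0;                       /* advanced by sendfile itself */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) return -1;              /* error or unexpected end of file */
    }
    return (long)offset;
}

/* Hypothetical demo: copies a temporary file and verifies the result. */
int copy_sendfile_demo(void) {
    char in_path[] = "/tmp/sf_in_XXXXXX", out_path[] = "/tmp/sf_out_XXXXXX";
    const char msg[] = "copied without touching user space";
    size_t len = sizeof msg - 1;
    int in = mkstemp(in_path);
    int out = mkstemp(out_path);
    int ok = 0;
    if (in != -1 && out != -1 && write(in, msg, len) == (ssize_t)len) {
        char back[64] = {0};
        ok = copy_sendfile(in, out) == (long)len &&
             pread(out, back, sizeof back, 0) == (ssize_t)len &&
             memcmp(back, msg, len) == 0;
    }
    if (in != -1) { close(in); unlink(in_path); }
    if (out != -1) { close(out); unlink(out_path); }
    return ok ? 0 : -1;
}
```

Compared with a read/write loop, there is no user-space buffer at all: the payload never leaves the kernel.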

Now let us analyze how sendfile is implemented. The sendfile system call ends up in do_sendfile:

static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
                    size_t count, loff_t max)
{
    file_start_write(out.file);
    retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
    file_end_write(out.file);
…
    return retval;
}

The key line is the call to do_splice_direct:

long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
                loff_t *opos, size_t len, unsigned int flags)
{
    struct splice_desc sd = {
            .len        = len,
        .total_len      = len,
        .flags          = flags,
        .pos            = *ppos,
        .u.file         = out,
        .opos           = opos,
    };
    long ret;

    …
    ret = splice_direct_to_actor(in, &sd, direct_splice_actor);
    if (ret > 0)
        *ppos = sd.pos;
    return ret;
}

ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
                 splice_direct_actor *actor)
{
    struct pipe_inode_info *pipe;
    long ret, bytes;
    umode_t i_mode;
    size_t len;
    int i, flags, more;
    …
    pipe = current->splice_pipe;
    if (unlikely(!pipe)) {
        pipe = alloc_pipe_info();
        …
        pipe->readers = 1;
        current->splice_pipe = pipe;
    }
    // Do the splice
    ret = 0;
    bytes = 0;
    len = sd->total_len;
    flags = sd->flags;
     // Don't block on output; we have to drain the direct pipe
    sd->flags &= ~SPLICE_F_NONBLOCK;
    more = sd->flags & SPLICE_F_MORE;

    while (len) {
        size_t read_len;
        loff_t pos = sd->pos, prev_pos = pos;

        ret = do_splice_to(in, &pos, pipe, len, flags);
        …
        ret = actor(pipe, sd);
        if (unlikely(ret <= 0)) {
            sd->pos = prev_pos;
            goto out_release;
        }

        bytes += ret;
        len -= ret;
        sd->pos = pos;

        if (ret < read_len) {
            sd->pos = prev_pos + ret;
            goto out_release;
        }
    }

done:
    pipe->nrbufs = pipe->curbuf = 0;
    file_accessed(in);
    return bytes;
…
}

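The code above comes down to three steps:

1) alloc_pipe_info allocates a pipe object; the pipe is really just a buffer.

2) do_splice_to reads data from the in file into that buffer.

3) actor writes the buffered data to the out file.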
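
The same three steps are available to user programs through the pipe and splice system calls. The sketch below mirrors the kernel's internal sequence from user space. It is an illustration with hypothetical names (copy_splice, copy_splice_demo), it assumes that the file system supports splice, and it assumes that out_fd was not opened with O_APPEND.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copies from in_fd to out_fd through a pipe, starting
 * at the current file positions. Returns bytes copied, or -1 on error. */
long copy_splice(int in_fd, int out_fd) {
    int p[2];
    if (pipe(p) == -1) return -1;                       /* step 1: the pipe is the buffer */

    long total = 0;
    while (total >= 0) {
        ssize_t n = splice(in_fd, NULL, p[1], NULL, 65536, SPLICE_F_MOVE);  /* step 2 */
        if (n == 0) break;                              /* end of input */
        if (n < 0) { total = -1; break; }
        while (n > 0) {                                 /* step 3: drain the pipe */
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)n, SPLICE_F_MOVE);
            if (w <= 0) { total = -1; break; }
            n -= w;
            total += w;
        }
    }
    close(p[0]);
    close(p[1]);
    return total;
}

/* Hypothetical demo: copies a temporary file and verifies the result. */
int copy_splice_demo(void) {
    char in_path[] = "/tmp/sp_in_XXXXXX", out_path[] = "/tmp/sp_out_XXXXXX";
    const char msg[] = "moved through a pipe";
    size_t len = sizeof msg - 1;
    int in = mkstemp(in_path);
    int out = mkstemp(out_path);
    int ok = 0;
    if (in != -1 && out != -1 &&
        write(in, msg, len) == (ssize_t)len && lseek(in, 0, SEEK_SET) == 0) {
        char back[64] = {0};
        ok = copy_splice(in, out) == (long)len &&
             pread(out, back, sizeof back, 0) == (ssize_t)len &&
             memcmp(back, msg, len) == 0;
    }
    if (in != -1) { close(in); unlink(in_path); }
    if (out != -1) { close(out); unlink(out_path); }
    return ok ? 0 : -1;
}
```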

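5.8.3  mmap and sendfile in Open-Source Software

MongoDB's memory-mapped storage engine uses the memory-mapping mechanism provided by the operating system, namely mmap. Data files are mapped into memory with mmap, and memory management (which data is paged in or out, and when) is left entirely to the OS.

MongoDB implements MemoryMappedFile differently for each operating system. Here we look at the Linux implementation to see how MongoDB maps file data into the process address space: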

void* MemoryMappedFile::map(const char *filename, unsigned long long &length, int options) {
    setFilename(filename);
    FileAllocator::get()->allocateAsap( filename, length );
    len = length;
    …
    unsigned long long filelen = lseek(fd, 0, SEEK_END);
    …
    lseek( fd, 0, SEEK_SET );
    void * view = mmap(NULL, length, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    …
    views.push_back( view );
    return view;
}

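MongoDB asks the operating system to map all of its data files into memory, and the operating system does so through the mmap() system call. From that point on, the data files, including all documents, collections, and their indexes, are paged into memory by the operating system. Given enough memory, all data files eventually end up in memory.

When memory is modified, for example by a write operation, the change is flushed to disk asynchronously, yet the write itself is still fast because it operates directly on memory. When the data set fits in memory, the ideal case is reached: disk access is kept to a minimum. When the data set exceeds memory, however, page faults creep in, the system has to access the disk frequently, and reads and writes become much slower. In the worst case the data set is far larger than memory, read and write latency becomes erratic, and performance drops sharply.

Kafka is a messaging middleware from the Apache community. Two things can make such a system inefficient: 1) too many network requests; 2) excessive byte copying. To improve efficiency, Kafka groups messages into batches, and each request delivers one batch of messages to the corresponding consumer. In addition, it uses the sendfile system call to reduce byte copying.

Kafka defines a "standard binary message format" shared by the producer, broker, and consumer. On the broker, Kafka's message log is simply a directory of files, and each log file holds MessageSets written to disk in this standard format.

Keeping this common format is especially important for optimizing one operation: transferring persisted log chunks over the network. Modern Unix operating systems offer a highly efficient path for moving data between the page cache and a socket. On Linux this is the sendfile system call (Java exposes it through the FileChannel.transferTo API).

Now let us trace the zero-copy flow in Kafka.

First, look at the processing loop in Kafka's server-side SocketServer: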

override def run() {
    startupComplete()
    while(isRunning) {
        try {
            // set up any new connections that have been queued
            configureNewConnections()
            // register any new responses for writing
            processNewResponses()

            try {
              selector.poll(300)
            } catch {
              case...
            }

The SocketServer polls the selector. As soon as a KafkaChannel is ready for writing, the channel's write method is called:

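// KafkaChannel.java
public Send write() throws IOException {
    Send result = null;
    if (send != null && send(send)) {
        result = send;   // the pending send has been written completely
        send = null;
    }
    return result;
}
// KafkaChannel.java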
private boolean send(Send send) throws IOException {
    send.writeTo(transportLayer);
    if (send.completed())
        transportLayer.removeInterestOps(SelectionKey.OP_WRITE);
    return send.completed();
}

Here write calls send, and the Send object involved is the FetchResponseSend object registered earlier.

The line that actually sends data is send.writeTo(transportLayer). The corresponding writeTo method is:

private val sends = new MultiSend(dest, JavaConversions.seqAsJavaList(fetchResponse.
    dataGroupedByTopic.toList.map {
    case(topic, data) => new TopicDataSend(dest, TopicData(topic,data.map{case
        (topicAndPartition, message) => (topicAndPartition.partition, message)}))
    }))
override def writeTo(channel: GatheringByteChannel): Long = {
    …
        written += sends.writeTo(channel)
    …
}

This ends by calling writeTo on sends, which is a MultiSend. The data inside the MultiSend comes down to two things:

topicAndPartition.partition: the partition.

message: a FetchResponsePartitionData. Recall that this is the object that holds our MessageSet.

TopicDataSend also contains a sends field, which holds PartitionDataSend objects, and each PartitionDataSend in turn wraps a FetchResponsePartitionData.

When writeTo finally runs, the call that does the work is:

// partitionData is the FetchResponsePartitionData; messages is the FileMessageSet
val bytesSent = partitionData.messages.writeTo(channel, messagesSentSize, messageSize - messagesSentSize)

FileMessageSet has its own writeTo method, the same code we saw earlier:

def writeTo(destChannel: GatheringByteChannel, writePosition: Long, size: Int): Int = {
    ...
    val bytesTransferred = (destChannel match {
        case tl: TransportLayer => tl.transferFrom(channel, position, count)
        case dc => channel.transferTo(position, count, dc)
    }).toInt
    bytesTransferred
}

The data is finally sent through tl.transferFrom(channel, position, count). transferFrom is a method that Kafka wraps itself, and internally it also ends up calling transferTo:

public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
    return fileChannel.transferTo(position, count, socketChannel);
}
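
On Linux, the system call underneath this transfer is sendfile with a socket as the destination. The following sketch shows the equivalent call in C. It is not Kafka code: the names transfer_to and sendfile_socket_demo are hypothetical, and a Unix-domain socket pair stands in for the consumer's TCP connection.

```c
#define _GNU_SOURCE
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper with the same argument order as transferTo:
 * sends count bytes of file_fd, starting at position, to sock_fd.
 * Returns the number of bytes actually sent, or -1 on error. */
long transfer_to(int file_fd, off_t position, size_t count, int sock_fd) {
    return (long)sendfile(sock_fd, file_fd, &position, count);
}

/* Hypothetical demo: sends a small "log segment" over a socket pair and
 * checks what arrives on the other end. Returns 0 on success. */
int sendfile_socket_demo(void) {
    char path[] = "/tmp/sendfile_sock_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1) return -1;
    unlink(path);

    const char msg[] = "log segment bytes";
    size_t len = sizeof msg - 1;
    int sv[2];
    int ok = 0;

    if (write(fd, msg, len) == (ssize_t)len &&
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0) {
        char got[64] = {0};
        ok = transfer_to(fd, 0, len, sv[0]) == (long)len &&   /* page cache -> socket */
             read(sv[1], got, sizeof got) == (ssize_t)len &&
             memcmp(got, msg, len) == 0;
        close(sv[0]);
        close(sv[1]);
    }
    close(fd);
    return ok ? 0 : -1;
}
```

As with transferTo, the return value may be smaller than count, so a caller that needs the whole range must loop and advance position.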

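5.9  Chapter Summary

Almost every application we write involves I/O in some form, such as reading and writing a database or the network, and I/O is a constant source of problems, for example how to write high-performance servers and clients under high concurrency. How deeply you understand I/O largely determines how well your applications perform.

In the narrow sense, I/O is just the in and out assembly instructions. In the broad sense, it covers the whole I/O model of the operating system: layer by layer, the system turns a file write into data blocks that finally land on disk. Looking deeper still, it also involves I/O multiplexing models such as epoll.

Only by understanding I/O from the operating system's point of view can we write truly high-performance applications.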
