An In-Depth Look at the Zero-Copy Mechanism

2019/11/29

Preface

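Lately I keep hearing our fellow programmers talk about zero copy in all sorts of settings, and the discussions get quite heated. So what exactly is zero copy? Being obsessed with technology, I could not resist digging through the source code once more, and this post is the result.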

Business Scenario

We need to read a large amount of data from a file and send it to another server. What is an efficient way to do this?

Traditional I/O

The traditional I/O implementation pairs two calls, file.read() and socket.send(). The interaction proceeds as follows:

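1. The application calls read(), which triggers a context switch (user mode -> kernel mode). Underneath, DMA (direct memory access) reads the file from disk and copies its contents into the read buffer in kernel address space.

2. The application cannot access data in kernel address space, so before it can work with the data, the contents must be copied from the kernel read buffer to the application buffer in user space. The return of read() triggers another context switch (kernel mode -> user mode). The data now sits in a user-space buffer and can be inspected or modified if needed.

3. The final goal is to send the file contents to another service over a socket. Calling the socket's send() method triggers a third context switch (user mode -> kernel mode), and the data is copied a third time, from the user buffer to the kernel socket buffer.

4. send() returns, which triggers the fourth context switch (kernel mode -> user mode). The fourth copy also takes place: DMA moves the data from the buffer associated with the target socket to the protocol engine for transmission.

In this whole process, the copies in steps 1 and 4 are handled by DMA and do not consume CPU. Only the copies in steps 2 and 3 need the CPU.

That is how traditional I/O works. It involves 4 context switches and 4 copies (2 CPU copies and 2 DMA copies), so its performance is limited.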
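The read/send pairing above can be sketched in Java as follows. This is a minimal illustration, not code from the original article: the class name TraditionalCopy, the method name send, and the 8 KiB buffer size are assumptions made for the example. The target is a generic OutputStream; in the scenario described it would be the stream obtained from a socket.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating the traditional read/write loop.
final class TraditionalCopy {

    // Copies the whole file to `out` through a user-space buffer
    // and returns the number of bytes written.
    static long send(Path file, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // the application (user-space) buffer
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read(): data is copied from the kernel read buffer into `buffer`
            while ((n = in.read(buffer)) != -1) {
                // write(): data is copied from `buffer` back into the kernel
                // (the socket buffer, when `out` belongs to a socket)
                out.write(buffer, 0, n);
                total += n;
            }
        }
        out.flush();
        return total;
    }
}
```

Every iteration crosses the user/kernel boundary twice, once for the read and once for the write, which is where the two CPU copies come from.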

The NIO Approach

With NIO, transferTo() can replace the file.read() and socket.send() pair. The interaction works as follows:

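With this approach, the number of context switches drops from four to two, and the number of copies drops from four to three (2 DMA copies and 1 CPU copy). That is a big improvement, but one CPU copy remains, so this is not yet complete zero copy. Is there a better trick that achieves Zero Copy in the true sense? To find out, let's first look at the source of FileChannel#transferTo() and see whether it offers any clues.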
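Before reading the source, it may help to see what the call looks like from application code. The sketch below is an illustration under stated assumptions: the class name TransferToCopy and the method name send are made up for the example, and the target is assumed to be a blocking channel. Since the Javadoc says an invocation may transfer fewer bytes than requested, the call is wrapped in a loop.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper illustrating a transferTo() loop.
final class TransferToCopy {

    // Sends the whole file to `target` and returns the number of bytes transferred.
    static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // A single call may move fewer bytes than requested, so keep going.
                long moved = source.transferTo(position, size - position, target);
                if (moved <= 0) {
                    // Assumption: with a blocking target, no progress means we should stop
                    // (for example, the file was truncated while we were sending it).
                    break;
                }
                position += moved;
            }
            return position;
        }
    }
}
```

No byte array appears in the application code at all. Whether the data actually bypasses user space depends on the target channel type and on the operating system.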

The transferTo() Source

// API declaration
package java.nio.channels;

public abstract class FileChannel
    extends AbstractInterruptibleChannel
    implements SeekableByteChannel, GatheringByteChannel, ScatteringByteChannel
{
    /**
     * Transfers bytes from this channel's file to the given writable byte
     * channel.
     *
     * <p> An attempt is made to read up to <tt>count</tt> bytes starting at
     * the given <tt>position</tt> in this channel's file and write them to the
     * target channel.  An invocation of this method may or may not transfer
     * all of the requested bytes; whether or not it does so depends upon the
     * natures and states of the channels.  Fewer than the requested number of
     * bytes are transferred if this channel's file contains fewer than
     * <tt>count</tt> bytes starting at the given <tt>position</tt>, or if the
     * target channel is non-blocking and it has fewer than <tt>count</tt>
     * bytes free in its output buffer.
     *
     * <p> This method does not modify this channel's position.  If the given
     * position is greater than the file's current size then no bytes are
     * transferred.  If the target channel has a position then bytes are
     * written starting at that position and then the position is incremented
     * by the number of bytes written.
     *
     * <p> This method is potentially much more efficient than a simple loop
     * that reads from this channel and writes to the target channel.  Many
     * operating systems can transfer bytes directly from the filesystem cache
     * to the target channel without actually copying them.  </p>
     *
     * @param  position
     *         The position within the file at which the transfer is to begin;
     *         must be non-negative
     *
     * @param  count
     *         The maximum number of bytes to be transferred; must be
     *         non-negative
     *
     * @param  target
     *         The target channel
     *
     * @return  The number of bytes, possibly zero,
     *          that were actually transferred
     *
     * @throws IllegalArgumentException
     *         If the preconditions on the parameters do not hold
     *
     * @throws  NonReadableChannelException
     *          If this channel was not opened for reading
     *
     * @throws  NonWritableChannelException
     *          If the target channel was not opened for writing
     *
     * @throws  ClosedChannelException
     *          If either this channel or the target channel is closed
     *
     * @throws  AsynchronousCloseException
     *          If another thread closes either channel
     *          while the transfer is in progress
     *
     * @throws  ClosedByInterruptException
     *          If another thread interrupts the current thread while the
     *          transfer is in progress, thereby closing both channels and
     *          setting the current thread's interrupt status
     *
     * @throws  IOException
     *          If some other I/O error occurs
     */
    public abstract long transferTo(long position, long count,
                                    WritableByteChannel target)
        throws IOException;
    
}
// Implementation
package sun.nio.ch;
public class FileChannelImpl extends FileChannel {
    public long transferTo(long var1, long var3, WritableByteChannel var5) throws IOException {
        this.ensureOpen();
        if (!var5.isOpen()) {
            throw new ClosedChannelException();
        } else if (!this.readable) {
            throw new NonReadableChannelException();
        } else if (var5 instanceof FileChannelImpl && !((FileChannelImpl)var5).writable) {
            throw new NonWritableChannelException();
        } else if (var1 >= 0L && var3 >= 0L) {
            long var6 = this.size();
            if (var1 > var6) {
                return 0L;
            } else {
                int var8 = (int)Math.min(var3, 2147483647L);
                if (var6 - var1 < (long)var8) {
                    var8 = (int)(var6 - var1);
                }

                long var9;
                if ((var9 = this.transferToDirectly(var1, var8, var5)) >= 0L) {
                    return var9;
                } else {
                    return (var9 = this.transferToTrustedChannel(var1, (long)var8, var5)) >= 0L ? var9 : this.transferToArbitraryChannel(var1, var8, var5);
                }
            }
        } else {
            throw new IllegalArgumentException();
        }
    }
}

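The FileChannelImpl class lives in the sun.nio.ch package. The listing above was obtained by decompiling, which is why the variable names look so odd. The method tries transferToDirectly() first and falls back to transferToTrustedChannel() and then transferToArbitraryChannel() only when the earlier path returns a negative value. On UNIX and the various Linux systems, the direct path hands the call over to the sendfile() system call, which ultimately moves the data from one file descriptor to another.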
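One practical detail in the decompiled listing is how the requested count is trimmed before any transfer starts. The sketch below restates that arithmetic with readable names. The class TransferClamp and the method effectiveCount are hypothetical, and the logic mirrors only the version of FileChannelImpl shown above; other JDK versions may differ.

```java
// Hypothetical restatement of the count-clamping logic in the listing above.
final class TransferClamp {

    // Returns how many bytes a single transferTo() call would attempt to move.
    static int effectiveCount(long position, long count, long fileSize) {
        if (position < 0 || count < 0) {
            throw new IllegalArgumentException();
        }
        if (position > fileSize) {
            return 0; // nothing to transfer past the end of the file
        }
        // First clamp: never more than Integer.MAX_VALUE bytes per call.
        int capped = (int) Math.min(count, Integer.MAX_VALUE);
        // Second clamp: never more than what is left in the file.
        if (fileSize - position < capped) {
            capped = (int) (fileSize - position);
        }
        return capped;
    }
}
```

Under this logic a single call moves at most about 2 GiB, which is one more reason to call transferTo() in a loop for large files.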

Here is how the operating-system-level sendfile() call is described in the Linux man page:

       SYNOPSIS
       #include <sys/sendfile.h>

       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
       DESCRIPTION
       sendfile() copies data between one file descriptor and another.
       Because this copying is done within the kernel, sendfile() is more
       efficient than the combination of read(2) and write(2), which would
       require transferring data to and from user space.
 
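In other words: sendfile() copies data from one file descriptor to another. Because the copy happens inside the kernel, it is more efficient than combining read(2) and write(2), which would require moving the data into and out of user space.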

C Source of sendfile() (the glibc Wrapper)

// Source location
https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/generic/wordsize-32/sendfile.c.html

/* Copyright (C) 2011-2019 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Contributed by Chris Metcalf <cmetcalf@tilera.com>, 2011.
   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.
   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.
   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library.  If not, see
   <http://www.gnu.org/licenses/>.  */

#include <sys/sendfile.h>
#include <errno.h>
/* Send COUNT bytes from file associated with IN_FD starting at OFFSET to
   descriptor OUT_FD.  */
ssize_t
sendfile (int out_fd, int in_fd, off_t *offset, size_t count)
{
  __off64_t off64;
  int rc;
  if (offset != NULL)
    {
      if (*offset < 0 || (off_t) (*offset + count) < 0)
        {
          __set_errno (EINVAL);
          return -1;
        }
      off64 = *offset;
    }
  rc = INLINE_SYSCALL (sendfile64, 4, out_fd, in_fd,
                       offset ? &off64 : NULL, count);
  if (offset)
    *offset = off64;
  return rc;
}

Pay special attention to this passage in the Javadoc of transferTo():

     * <p> This method is potentially much more efficient than a simple loop
     * that reads from this channel and writes to the target channel.  Many
     * operating systems can transfer bytes directly from the filesystem cache
     * to the target channel without actually copying them.  </p>

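Put plainly: this method can be much faster than a simple loop that reads from this channel and writes to the target channel, because many operating systems can move bytes from the filesystem cache straight to the target channel without actually copying them.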

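The source states it explicitly: zero copy depends on the operating system. If the underlying network interface card supports gather operations, a further optimization is possible. In Linux kernel 2.4 and later, the socket buffer descriptor was adjusted so that DMA can gather data from multiple memory locations. From the user's point of view the usage is unchanged, but internally the flow is different.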

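The steps are:
1. transferTo() uses DMA to copy the file contents into the kernel read buffer.

2. The data itself is not copied into the socket buffer. Only descriptors holding the location and length of the data are appended to the socket buffer, and the DMA engine passes the data directly from the kernel buffer to the protocol engine. This eliminates the last CPU copy and achieves zero copy in the true sense.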
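From the caller's side nothing changes between the three-copy and the gather-based flow, as the sketch below illustrates with a file sent to a socket on the loopback interface. Everything named here is an assumption for the example: the class LoopbackSendfileDemo, the method sendOverLoopback, the in-process listener, and the 8 KiB receive buffer. Whether the kernel actually takes the gather path cannot be observed from this Java code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

// Hypothetical end-to-end demo: file -> transferTo() -> loopback socket.
final class LoopbackSendfileDemo {

    // Sends the file to an in-process listener and returns the bytes it received.
    static byte[] sendOverLoopback(Path file) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));

            // Receiver: accept one connection and drain it until end of stream.
            CompletableFuture<byte[]> received = CompletableFuture.supplyAsync(() -> {
                try (SocketChannel peer = server.accept()) {
                    ByteArrayOutputStream sink = new ByteArrayOutputStream();
                    ByteBuffer buf = ByteBuffer.allocate(8192);
                    while (peer.read(buf) != -1) {
                        buf.flip();
                        sink.write(buf.array(), 0, buf.limit());
                        buf.clear();
                    }
                    return sink.toByteArray();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            // Sender: the same transferTo() loop as for any other channel.
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = source.size();
                long position = 0;
                while (position < size) {
                    long moved = source.transferTo(position, size - position, client);
                    if (moved <= 0) {
                        break; // assumption: no progress on a blocking socket means stop
                    }
                    position += moved;
                }
            } // closing the client signals end of stream to the receiver

            return received.get();
        }
    }
}
```

The sender never touches the file contents in user space. The receiver in this demo does read into a buffer, but only so that the result can be checked.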

Finally: zero copy depends entirely on the operating system. If the operating system does not support it, there is no zero copy.



(When reposting articles from this site, please credit the author and the source: 程序猿小尾巴)
