Hadoop Source Code Analysis Notes (7): HDFS Non-RPC Interfaces

HDFS Non-RPC Interfaces

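The Network File System (NFS) was developed by Sun on top of remote procedure calls (RPC). All of its file operations, both the file/directory API and the reading and writing of file data, are implemented as remote procedure calls. Clients access the file system through the system calls of the local operating system. When the virtual file system finds that a system call needs to reach NFS, for example to create a directory on a remote server or to read a file, it hands the operation to the NFS client component, which reaches the corresponding NFS server over RPC.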

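In the previous section, especially in the analysis of ClientProtocol, we saw that HDFS follows the NFS approach for file and directory operations: the file/directory API is sent to the name node for execution through Hadoop RPC. For reading and writing file data, however, HDFS takes a completely different approach from NFS.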

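HDFS does not use remote procedure calls to read and write files. The reason is simple: HDFS has to support very large files, and file I/O built on Hadoop IPC cannot reach the efficiency the system design requires. In addition, HDFS offers streaming access to data, and a TCP-based streaming data interface lends itself to batch processing and improves data throughput.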

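We know that the secondary name node merges the name node's namespace image and edit log according to some policy. NamenodeProtocol, however, has no remote method for fetching the original image and edit log, nor one for uploading the merged namespace image. Both steps go through an HTTP-based streaming interface provided by the name node. The secondary name node uses the name node's built-in HTTP server and an HTTP GET to fetch the data, namely the original namespace image and the edit log. Once the merge is done, it notifies the name node over HTTP, and the name node then uses HTTP GET to download the new namespace image from the secondary name node. This interface between the name node and the secondary name node is also a non-IPC interface.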

Non-IPC Interfaces on the Data Node

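Data nodes provide read and write access to HDFS file blocks: they write HDFS file data into files on the local Linux file system, or read HDFS file data back from those local files. The external interface for reading and writing is a TCP-based, non-IPC interface. Besides block reads and writes, the data node also offers TCP-based interfaces for replacing a block, copying a block, and retrieving block checksum information. In other words, the data node exposes five operations through its streaming interface, each with its own request frame structure and opcode.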

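The opcodes of the data node streaming interface are defined in the DataTransferProtocol interface (in the org.apache.hadoop.hdfs.protocol package). The interface is defined as follows: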

public interface DataTransferProtocol {
  // Opcodes are defined as follows
  
  /** Version for data transfers between clients and datanodes
   * This should change when serialization of DatanodeInfo, not just
   * when protocol changes. It is not very obvious. 
   */
  /*
   * Version 18:
   *    Change the block packet ack protocol to include seqno,
   *    numberOfReplies, reply0, reply1, ...
   */
  public static final int DATA_TRANSFER_VERSION = 17;

  // Processed at datanode stream-handler: opcodes for reading and writing blocks
  public static final byte OP_WRITE_BLOCK = (byte) 80;
  public static final byte OP_READ_BLOCK = (byte) 81;
  /**
   * @deprecated As of version 15, OP_READ_METADATA is no longer supported
   */
  @Deprecated public static final byte OP_READ_METADATA = (byte) 82;
  public static final byte OP_REPLACE_BLOCK = (byte) 83;
  public static final byte OP_COPY_BLOCK = (byte) 84;
  public static final byte OP_BLOCK_CHECKSUM = (byte) 85;
  
  public static final int OP_STATUS_SUCCESS = 0;  
  public static final int OP_STATUS_ERROR = 1;  
  public static final int OP_STATUS_ERROR_CHECKSUM = 2;  
  public static final int OP_STATUS_ERROR_INVALID = 3;  
  public static final int OP_STATUS_ERROR_EXISTS = 4;  
  public static final int OP_STATUS_ERROR_ACCESS_TOKEN = 5;
  public static final int OP_STATUS_CHECKSUM_OK = 6;
  ......
}
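
As a rough illustration of how a receiver might begin parsing a request on this interface, the sketch below reads the 2-byte version and the 1-byte opcode and maps the opcode to the constant names from the listing above. The class RequestPrefix and its methods nameOf and readOp are hypothetical names made up for this note; they are not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical helper, not part of Hadoop: reads the two leading fields of a request. */
final class RequestPrefix {
    static final short DATA_TRANSFER_VERSION = 17; // value taken from the listing above

    /** Maps an opcode from the listing above to its constant name. */
    static String nameOf(byte op) {
        switch (op) {
            case 80: return "OP_WRITE_BLOCK";
            case 81: return "OP_READ_BLOCK";
            case 82: return "OP_READ_METADATA";
            case 83: return "OP_REPLACE_BLOCK";
            case 84: return "OP_COPY_BLOCK";
            case 85: return "OP_BLOCK_CHECKSUM";
            default: return "UNKNOWN(" + op + ")";
        }
    }

    /** Reads the 2-byte version and the 1-byte opcode, rejecting a version mismatch. */
    static String readOp(DataInputStream in) throws IOException {
        short version = in.readShort();
        if (version != DATA_TRANSFER_VERSION) {
            throw new IOException("Version mismatch: " + version);
        }
        return nameOf(in.readByte());
    }

    public static void main(String[] args) throws IOException {
        byte[] prefix = {0, 17, 81}; // version 17 as a big-endian short, then opcode 81
        System.out.println(readOp(new DataInputStream(new ByteArrayInputStream(prefix))));
    }
}
```

Checking the version before looking at the opcode means that a peer speaking a different frame layout is rejected before any later field is misread.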

The data node's read and write flows are described below:

1. Reading data

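Reading data means reading a range of file data from one block on a data node. From the opcode definitions above, its opcode is 81 (OP_READ_BLOCK). When a client needs to read data, it sends a request over its TCP connection to the data node. Because TCP is a byte stream with no notion of message boundaries, a data frame has to be defined on top of the stream, and the two sides exchange information by reading frames. The client-side code that reads a block (org.apache.hadoop.hdfs.DFSClient.BlockReader) is shown below: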

/********************************************************
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 *
 ********************************************************/
public class DFSClient implements FSConstants, java.io.Closeable {
  // Block reader class
  /** This is a wrapper around connection to datanode
   * and understands checksum, offset etc
   */
  public static class BlockReader extends FSInputChecker {
	public static BlockReader newBlockReader( Socket sock, String file,
                                       long blockId, 
                                       Token<BlockTokenIdentifier> accessToken,
                                       long genStamp,
                                       long startOffset, long len,
                                       int bufferSize, boolean verifyChecksum,
                                       String clientName)
                                       throws IOException {
      // in and out will be closed when sock is closed (by the caller)
      DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(NetUtils.getOutputStream(sock,HdfsConstants.WRITE_TIMEOUT)));
      // Data sent for a block read request
      //write the header.
      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_READ_BLOCK );
      out.writeLong( blockId );
      out.writeLong( genStamp );
      out.writeLong( startOffset );
      out.writeLong( len );
      Text.writeString(out, clientName);
      accessToken.write(out);
      out.flush();
       DataInputStream in = new DataInputStream(
          new BufferedInputStream(NetUtils.getInputStream(sock), 
                                  bufferSize));
      
      short status = in.readShort();
      if (status != DataTransferProtocol.OP_STATUS_SUCCESS) {
        if (status == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
          throw new InvalidBlockTokenException(
              "Got access token error for OP_READ_BLOCK, self="
                  + sock.getLocalSocketAddress() + ", remote="
                  + sock.getRemoteSocketAddress() + ", for file " + file
                  + ", for block " + blockId + "_" + genStamp);
        } else {
          throw new IOException("Got error for OP_READ_BLOCK, self="
              + sock.getLocalSocketAddress() + ", remote="
              + sock.getRemoteSocketAddress() + ", for file " + file
              + ", for block " + blockId + "_" + genStamp);
        }
      }
      ......
    }
  }
  ......
}

In the code above, we focus on the part of newBlockReader that sends the block read request.

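First, the two leading fields of the request are the interface version (DataTransferProtocol.DATA_TRANSFER_VERSION) and the opcode (DataTransferProtocol.OP_READ_BLOCK). The version number ensures that both sides interpret the data frame the same way, and the one-byte opcode that follows states the purpose of the operation. These two fields appear in every request on the data node streaming interface.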

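Next in the request come the block ID (blockId) and the block's generation stamp (genStamp), which is a version number for the block rather than a creation time. With these two values the data node can identify the target block. Just as an ordinary file read has to specify where to start and how much to read, a block read must supply an offset (startOffset) and a length (len). With these four parameters, the client states exactly which data this request should return.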

The client name (clientName) that follows is a string used only in log output, and the access token (accessToken) is needed for Hadoop's security checks.
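
The field order described above can be sketched as a small encode/decode round trip. This is a simplified illustration, not the Hadoop wire format: the class ReadRequestFrame is hypothetical, the client name is written with DataOutputStream.writeUTF as a stand-in for Text.writeString, and the access token is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical, simplified read-request frame (no access token). */
final class ReadRequestFrame {
    static final short VERSION = 17;
    static final byte OP_READ_BLOCK = 81;

    final long blockId, genStamp, startOffset, len;
    final String clientName;

    ReadRequestFrame(long blockId, long genStamp, long startOffset, long len, String clientName) {
        this.blockId = blockId;
        this.genStamp = genStamp;
        this.startOffset = startOffset;
        this.len = len;
        this.clientName = clientName;
    }

    /** Writes the fields in the same order as newBlockReader above. */
    byte[] encode() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(VERSION);
        out.write(OP_READ_BLOCK);
        out.writeLong(blockId);
        out.writeLong(genStamp);
        out.writeLong(startOffset);
        out.writeLong(len);
        out.writeUTF(clientName); // assumption: stands in for Text.writeString
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order, checking version and opcode first. */
    static ReadRequestFrame decode(byte[] frame) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
        if (in.readShort() != VERSION) throw new IOException("version mismatch");
        if (in.readByte() != OP_READ_BLOCK) throw new IOException("not a read request");
        return new ReadRequestFrame(in.readLong(), in.readLong(), in.readLong(),
                                    in.readLong(), in.readUTF());
    }

    public static void main(String[] args) throws IOException {
        byte[] frame = new ReadRequestFrame(42L, 1001L, 512L, 4096L, "client-1").encode();
        ReadRequestFrame back = decode(frame);
        System.out.println(frame.length + " bytes, block " + back.blockId + "_" + back.genStamp);
    }
}
```

Because the stream has no message boundaries, the decoder depends entirely on reading the fields in the order and with the widths the encoder used.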

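The response to a read request also has a frame structure: first a reply header (for example OP_STATUS_SUCCESS), followed by a series of data packets. To guarantee data integrity, HDFS keeps corresponding checksum information for every block. Checksums are computed per chunk, with a default chunk size of 512 bytes: starting from the beginning of the block, every 512 bytes yield one 4-byte checksum. When reading or writing a block, the relationship between the stream-based data and the chunk-based checksums has to be maintained as well.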
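
The per-chunk checksum scheme can be sketched as follows. The sketch assumes CRC32 as the checksum algorithm, which fits the 4-byte size mentioned above; the class ChunkChecksums and its methods are hypothetical and only illustrate the arithmetic, not Hadoop's own checksum classes.

```java
import java.util.zip.CRC32;

/** Hypothetical illustration of one 4-byte checksum per 512-byte chunk. */
final class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512;
    static final int CHECKSUM_SIZE = 4;

    /** Number of chunks needed to cover dataLen bytes (the last chunk may be short). */
    static int chunkCount(int dataLen) {
        return (dataLen + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    }

    /** Computes one CRC32 value per chunk, in chunk order. */
    static int[] checksums(byte[] data) {
        int[] sums = new int[chunkCount(data.length)];
        CRC32 crc = new CRC32();
        for (int i = 0; i < sums.length; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = (int) crc.getValue(); // keep the low 32 bits
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum differs, or -1 if all match. */
    static int firstCorruptChunk(byte[] data, int[] expected) {
        int[] actual = checksums(data);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];
        int[] sums = checksums(block);
        System.out.println(sums.length + " chunks, " + sums.length * CHECKSUM_SIZE + " checksum bytes");
        block[600] ^= 1; // flip one bit inside the second chunk
        System.out.println("first corrupt chunk: " + firstCorruptChunk(block, sums));
    }
}
```

Keeping checksums per chunk rather than per block lets a reader verify any requested range by checking only the chunks that overlap it.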

2. Writing data

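The write operation is far more complex than the read operation. It appends data to a block on a data node, and its opcode is 80 (OP_WRITE_BLOCK). Before describing the write itself, let us look at the data pipeline HDFS uses for writes.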

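The data pipeline was already introduced by Google in its distributed file system (GFS). Its goal is to make full use of the bandwidth of every machine in the cluster when writing multiple replicas of the same data, to avoid network bottlenecks and high-latency links, and to minimize the latency of pushing all the data. The Hadoop file system implements such a data pipeline as well.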

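Suppose the file being written has a replication factor of 3, meaning that three data nodes in the HDFS cluster will store the three replicas. The client does not write to the three data nodes at the same time. Instead it sends the data to the first data node, which stores it locally while pushing it on to data node 2, and so on until the last data node in the pipeline. Acknowledgement packets are generated by the last data node and travel back upstream toward the client; each data node along the way passes the ack upstream only after confirming that its local write succeeded. Compared with a client writing to several data nodes at once, every node in the pipeline carries part of the network traffic of the write, which reduces the impact on the network of the client sending multiple copies. The client-side code for writing a block (createBlockOutputStream in org.apache.hadoop.hdfs.DFSClient.DFSOutputStream) is shown below: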

/********************************************************
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 *
 ********************************************************/
public class DFSClient implements FSConstants, java.io.Closeable {
   ......
     /****************************************************************
   * DFSOutputStream creates files from a stream of bytes.
   *
   * The client application writes data that is cached internally by
   * this stream. Data is broken up into packets, each packet is
   * typically 64K in size. A packet comprises of chunks. Each chunk
   * is typically 512 bytes and has an associated checksum with it.
   *
   * When a client application fills up the currentPacket, it is
   * enqueued into dataQueue.  The DataStreamer thread picks up
   * packets from the dataQueue, sends it to the first datanode in
   * the pipeline and moves it from the dataQueue to the ackQueue.
   * The ResponseProcessor receives acks from the datanodes. When an
   * successful ack for a packet is received from all datanodes, the
   * ResponseProcessor removes the corresponding packet from the
   * ackQueue.
   *
   * In case of error, all outstanding packets and moved from
   * ackQueue. A new pipeline is setup by eliminating the bad
   * datanode from the original pipeline. The DataStreamer now
   * starts sending packets from the dataQueue.
  ****************************************************************/
  class DFSOutputStream extends FSOutputSummer implements Syncable {
    //
    // Processes responses from the datanodes.  A packet is removed 
    // from the ackQueue when its response arrives.
    //
    private class ResponseProcessor extends Thread {
   ......
   // connects to the first datanode in the pipeline
    // Returns true if success, otherwise return failure.
    //
    private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                    boolean recoveryFlag) {
      short pipelineStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
      String firstBadLink = "";
      if (LOG.isDebugEnabled()) {
        for (int i = 0; i < nodes.length; i++) {
          LOG.debug("pipeline = " + nodes[i].getName());
        }
      }

      // persist blocks on namenode on next flush
      persistBlocks = true;

      boolean result = false;
      try {
        LOG.debug("Connecting to " + nodes[0].getName());
        InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
        s = socketFactory.createSocket();
        timeoutValue = 3000 * nodes.length + socketTimeout;
        NetUtils.connect(s, target, timeoutValue);
        s.setSoTimeout(timeoutValue);
        s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
        LOG.debug("Send buf size " + s.getSendBufferSize());
        long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                            datanodeWriteTimeout;

        //
        // Xmit header info to datanode
        //
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), 
                                     DataNode.SMALL_BUFFER_SIZE));
        blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
        // Write the request header: version, opcode, block ID, generation stamp
        out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
        out.write( DataTransferProtocol.OP_WRITE_BLOCK );
        out.writeLong( block.getBlockId() );
        out.writeLong( block.getGenerationStamp() );
        out.writeInt( nodes.length );
        out.writeBoolean( recoveryFlag );       // recovery flag
        Text.writeString( out, client );
        out.writeBoolean(false); // Not sending src node information
        out.writeInt( nodes.length - 1 );
        for (int i = 1; i < nodes.length; i++) {
          nodes[i].write(out);
        }
        accessToken.write(out);
        checksum.writeHeader( out );
        out.flush();

        // receive ack for connect
        pipelineStatus = blockReplyStream.readShort();
        firstBadLink = Text.readString(blockReplyStream);
        if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
          if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
            throw new InvalidBlockTokenException(
                "Got access token error for connect ack with firstBadLink as "
                    + firstBadLink);
          } else {
            throw new IOException("Bad connect ack with firstBadLink as "
                + firstBadLink);
          }
        }

        blockStream = out;
        result = true;     // success

      } catch (IOException ie) {

        LOG.info("Exception in createBlockOutputStream " + nodes[0].getName() +
            " " + ie);

        // find the datanode that matches
        if (firstBadLink.length() != 0) {
          for (int i = 0; i < nodes.length; i++) {
            if (nodes[i].getName().equals(firstBadLink)) {
              errorIndex = i;
              break;
            }
          }
        }
        hasError = true;
        setLastException(ie);
        blockReplyStream = null;
        result = false;
      } finally {
        if (!result) {
          IOUtils.closeSocket(s);
          s = null;
        }
      }
      return result;
    }
   ......
}}
   ......
}

The code a DataNode uses to write a block to another DataNode (org.apache.hadoop.hdfs.server.datanode.DataNode.DataTransfer) is shown below:

/**********************************************************
 * DataNode is a class (and program) that stores a set of
 * blocks for a DFS deployment.  A single deployment can
 * have one or many DataNodes.  Each DataNode communicates
 * regularly with a single NameNode.  It also communicates
 * with client code and other DataNodes from time to time.
 *
 * DataNodes store a series of named blocks.  The DataNode
 * allows client code to read these blocks, or to write new
 * block data.  The DataNode may also, in response to instructions
 * from its NameNode, delete blocks or copy blocks to/from other
 * DataNodes.
 *
 * The DataNode maintains just one critical table:
 *   block-> stream of bytes (of BLOCK_SIZE or less)
 *
 * This info is stored on a local disk.  The DataNode
 * reports the table's contents to the NameNode upon startup
 * and every so often afterwards.
 *
 * DataNodes spend their lives in an endless loop of asking
 * the NameNode for something to do.  A NameNode cannot connect
 * to a DataNode directly; a NameNode simply returns values from
 * functions invoked by a DataNode.
 *
 * DataNodes maintain an open server socket so that client code 
 * or other DataNodes can read/write data.  The host/port for
 * this server is reported to the NameNode, which then sends that
 * information to clients or other DataNodes that might be interested.
 *
 **********************************************************/
public class DataNode extends Configured 
{ 
 ......
/**
   * Used for transferring a block of data.  This class
   * sends a piece of data to another DataNode.
   */
  class DataTransfer implements Runnable {
    DatanodeInfo targets[];
    Block b;
    DataNode datanode;

     /**
     * Do the deed, write the bytes
     */
    public void run() {
      xmitsInProgress.getAndIncrement();
      Socket sock = null;
      DataOutputStream out = null;
      BlockSender blockSender = null;
      
      try {
        InetSocketAddress curTarget = 
          NetUtils.createSocketAddr(targets[0].getName());
        sock = newSocket();
        NetUtils.connect(sock, curTarget, socketTimeout);
        sock.setSoTimeout(targets.length * socketTimeout);

        long writeTimeout = socketWriteTimeout + 
                            HdfsConstants.WRITE_TIMEOUT_EXTENSION * (targets.length-1);
        OutputStream baseStream = NetUtils.getOutputStream(sock, writeTimeout);
        out = new DataOutputStream(new BufferedOutputStream(baseStream, 
                                                            SMALL_BUFFER_SIZE));

        blockSender = new BlockSender(b, 0, b.getNumBytes(), false, false, false, 
            datanode);
        DatanodeInfo srcNode = new DatanodeInfo(dnRegistration);

        //
        // Header info
        //
        out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
        out.writeByte(DataTransferProtocol.OP_WRITE_BLOCK);
        out.writeLong(b.getBlockId());
        out.writeLong(b.getGenerationStamp());
        out.writeInt(0);           // no pipelining
        out.writeBoolean(false);   // not part of recovery
        Text.writeString(out, ""); // client
        out.writeBoolean(true); // sending src node information
        srcNode.write(out); // Write src node DatanodeInfo
        // write targets
        out.writeInt(targets.length - 1);
        for (int i = 1; i < targets.length; i++) {
          targets[i].write(out);
        }
        Token<BlockTokenIdentifier> accessToken = BlockTokenSecretManager.DUMMY_TOKEN;
        if (isBlockTokenEnabled) {
          accessToken = blockTokenSecretManager.generateToken(null, b,
              EnumSet.of(BlockTokenSecretManager.AccessMode.WRITE));
        }
        accessToken.write(out);
        // send data & checksum
        blockSender.sendBlock(out, baseStream, null);

        // no response necessary
        LOG.info(dnRegistration + ":Transmitted block " + b + " to " + curTarget);

      } catch (IOException ie) {
        LOG.warn(dnRegistration + ":Failed to transfer " + b + " to " + targets[0].getName()
            + " got " + StringUtils.stringifyException(ie));
        // check if there are any disk problem
        datanode.checkDiskError();
        
      } finally {
        xmitsInProgress.getAndDecrement();
        IOUtils.closeStream(blockSender);
        IOUtils.closeStream(out);
        IOUtils.closeSocket(sock);
      }
    }

}
   ......
}

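From the source above we can see that, as with a read request, the first two fields of a write request are the interface version and the opcode, followed by the block ID and the generation stamp. Unlike a read request, a write request has no offset field, which means a user can only append data to the end of a block and cannot modify file content that has already been written. Next comes the size of the pipeline (nodes.length), that is, the number of data nodes the data must be written to, followed by a flag (isRecovery) that tells whether this write is part of an error recovery. If the data source is a client, the request then carries the client's name. If the data source is a data node, the client string is empty and the hasSrcDataNode flag is set to true; in that case the next field is srcDataNode, the serialized DatanodeInfo object of the source data node. A write request whose source is a data node means that this data node is replicating a block.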
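
The header layout just described, including the optional srcDataNode field, can be sketched as below. This is a simplified illustration under stated assumptions: the class WriteRequestHeader is hypothetical, node descriptors are plain strings written with writeUTF instead of serialized DatanodeInfo objects, and the access token and checksum header are omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified write-request header (no token, no checksum header). */
final class WriteRequestHeader {
    static final short VERSION = 17;
    static final byte OP_WRITE_BLOCK = 80;

    long blockId;
    long genStamp;
    int pipelineSize;
    boolean isRecovery;
    String client = "";      // empty when the data source is a data node
    String srcDataNode;      // null unless the data source is a data node
    final List<String> targets = new ArrayList<>();

    byte[] encode() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(VERSION);
        out.write(OP_WRITE_BLOCK);
        out.writeLong(blockId);
        out.writeLong(genStamp);
        out.writeInt(pipelineSize);
        out.writeBoolean(isRecovery);
        out.writeUTF(client);
        out.writeBoolean(srcDataNode != null);        // hasSrcDataNode
        if (srcDataNode != null) out.writeUTF(srcDataNode);
        out.writeInt(targets.size());                 // numTargets
        for (String target : targets) out.writeUTF(target);
        out.flush();
        return bytes.toByteArray();
    }

    static WriteRequestHeader decode(byte[] frame) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
        if (in.readShort() != VERSION) throw new IOException("version mismatch");
        if (in.readByte() != OP_WRITE_BLOCK) throw new IOException("not a write request");
        WriteRequestHeader h = new WriteRequestHeader();
        h.blockId = in.readLong();
        h.genStamp = in.readLong();
        h.pipelineSize = in.readInt();
        h.isRecovery = in.readBoolean();
        h.client = in.readUTF();
        if (in.readBoolean()) h.srcDataNode = in.readUTF(); // present only if hasSrcDataNode
        int numTargets = in.readInt();
        for (int i = 0; i < numTargets; i++) h.targets.add(in.readUTF());
        return h;
    }

    public static void main(String[] args) throws IOException {
        WriteRequestHeader h = new WriteRequestHeader();
        h.blockId = 7L; h.genStamp = 1001L; h.pipelineSize = 3; h.client = "client-1";
        h.targets.add("dn2"); h.targets.add("dn3");
        System.out.println(decode(h.encode()).targets);
    }
}
```

The hasSrcDataNode flag is what lets one frame layout serve both cases: the reader only consumes the srcDataNode field when the flag says it is there.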

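To implement the pipeline, the write request carries the fields numTargets and targets: targets is the list of data destinations and numTargets is its size. If numTargets is zero, the current data node is the last node in the pipeline. If numTargets is greater than zero, the first entry in the target list is the downstream node to which the current data node pushes data. Taking the earlier pipeline as an example, the client, data node 1 and data node 2 each send a write request to their downstream node. In the request sent by the client, numTargets is 2 and the target list contains data node 2 and data node 3. Following the information in numTargets and targets, the TCP connections between the nodes of the pipeline are established, and the channel is ready for the data that follows.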
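
How the target list shrinks from hop to hop can be sketched with plain lists. The class PipelineTargets and the node names are hypothetical; the sketch only models which targets each request carries, not the network connections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the numTargets/targets fields along a pipeline. */
final class PipelineTargets {
    /**
     * Lists one line per write request: who sends to whom, and which
     * downstream targets the request still carries.
     */
    static List<String> hops(String client, List<String> pipeline) {
        List<String> result = new ArrayList<>();
        String sender = client;
        List<String> remaining = new ArrayList<>(pipeline);
        while (!remaining.isEmpty()) {
            // The first remaining node receives the request; the rest travel as targets.
            String receiver = remaining.remove(0);
            result.add(sender + " -> " + receiver
                + " numTargets=" + remaining.size() + " targets=" + remaining);
            sender = receiver;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String hop : hops("client", Arrays.asList("dn1", "dn2", "dn3"))) {
            System.out.println(hop);
        }
    }
}
```

For a three-node pipeline this yields three requests, with numTargets going 2, 1, 0, which matches the rule that a node receiving numTargets of zero is the end of the pipeline.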

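The result of the write is passed back through the pipeline in reverse order, from each node to its upstream node. When every data node in the pipeline has written the data to disk successfully, the final result is DataTransferProtocol.OP_STATUS_SUCCESS; otherwise it is DataTransferProtocol.OP_STATUS_ERROR.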
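
The way the result travels upstream can be modelled in a few lines. The class PipelineAck is hypothetical and reduces each node to a single boolean for its local write; it shows only the combining rule, not the ack packet format.

```java
/** Hypothetical model of how a write result is combined on its way upstream. */
final class PipelineAck {
    static final int OP_STATUS_SUCCESS = 0;
    static final int OP_STATUS_ERROR = 1;

    /**
     * Walks from the last node back to the first. Each node forwards SUCCESS
     * only if its own write succeeded and the ack from downstream was SUCCESS.
     */
    static int upstreamAck(boolean[] localWriteOk) {
        int ack = OP_STATUS_SUCCESS; // nothing is downstream of the last node
        for (int i = localWriteOk.length - 1; i >= 0; i--) {
            if (!localWriteOk[i]) {
                ack = OP_STATUS_ERROR;
            }
        }
        return ack;
    }

    public static void main(String[] args) {
        System.out.println(upstreamAck(new boolean[] {true, true, true}));
        System.out.println(upstreamAck(new boolean[] {true, false, true}));
    }
}
```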

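The data pipeline is an optimization HDFS makes for writing massive amounts of data. Before the write starts, the nodes on the pipeline set it up according to the write requests received from their upstream nodes, and the pipeline is then used for the actual write. The data packets that carry the written data along the pipeline reuse the packet format of the read operation. Every data packet has a matching ack packet, which guarantees that the data reached every node and lets the data source learn the result of the operation on each node.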

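The read and write interfaces are the most important streaming interfaces on an HDFS data node. Besides these two, there are other TCP stream-based interfaces such as block replacement and block copy.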

Non-IPC Interfaces on the Name Node and Secondary Name Node

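The fsimage (image file) and edits (edit log file) files produced by the name node have to be transferred to and from the secondary name node. Because both files are fairly large, this transfer uses neither Hadoop IPC nor the TCP-based mechanism of the data nodes, but an HTTP-based streaming interface.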
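
A generic version of such an HTTP streaming download might look like the sketch below. It is not the Hadoop implementation: the class HttpStreamFetch and its methods are hypothetical, and the URL to fetch is left to the caller because the actual servlet path and parameters are not covered here. The point is only that a large file is copied through a fixed-size buffer instead of being held in memory or packed into one RPC response.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of downloading a large file over HTTP as a stream. */
final class HttpStreamFetch {
    /** Copies everything from in to out through a 64 KB buffer; returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Issues an HTTP GET for url and streams the response body into target. */
    static long download(URL url, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream();
             OutputStream out = Files.newOutputStream(target)) {
            return copy(in, out);
        } finally {
            conn.disconnect();
        }
    }
}
```

Memory use stays bounded by the buffer size no matter how large the image or edit log grows, which is the property that makes a stream interface a better fit here than a call that returns the whole file at once.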

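Copyright notice: parts of this article are excerpted from the book 《Hadoop技术内幕：深入解析Hadoop Common和HDFS架构设计与实现原理》 by 蔡斌 and 陈湘萍. It is kept only as study notes for technical exchange, and the commercial rights remain with the original authors. Readers are encouraged to buy the book; when reposting, please credit the original authors. Thank you!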

剑邑龙泉
