

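        HDFS does not use remote procedure calls to read and write file data. The reason is simple: HDFS has to support very large files, and file I/O built on Hadoop IPC could not reach the efficiency the system design requires. In addition, HDFS offers streaming access to data, and a TCP-based streaming data access interface suits batch processing and improves data throughput.
        As we know, the Secondary NameNode merges the NameNode's namespace image and edit log according to a configured policy. However, NamenodeProtocol has no remote method for fetching the original image data and edit log, nor one for uploading the merged namespace image. Both steps go through an HTTP-based streaming interface provided by the NameNode. The Secondary NameNode uses the NameNode's built-in HTTP server and issues HTTP GET requests to fetch the data, that is, the original namespace image and the edit log. After the merge completes, it notifies the NameNode over HTTP, and the NameNode then uses HTTP GET to download the new namespace image from the Secondary NameNode. This interface between the NameNode and the Secondary NameNode is therefore also a non-IPC interface.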


public interface DataTransferProtocol {
  /** Version for data transfers between clients and datanodes
   * This should change when serialization of DatanodeInfo, not just
   * when protocol changes. It is not very obvious. 
   */
  /*
   * Version 18:
   *    Change the block packet ack protocol to include seqno,
   *    numberOfReplies, reply0, reply1, ...
   */
  public static final int DATA_TRANSFER_VERSION = 17;

  // Processed at datanode stream-handler (opcodes for reading and writing blocks)
  public static final byte OP_WRITE_BLOCK = (byte) 80;
  public static final byte OP_READ_BLOCK = (byte) 81;
  /**
   * @deprecated As of version 15, OP_READ_METADATA is no longer supported
   */
  @Deprecated public static final byte OP_READ_METADATA = (byte) 82;
  public static final byte OP_REPLACE_BLOCK = (byte) 83;
  public static final byte OP_COPY_BLOCK = (byte) 84;
  public static final byte OP_BLOCK_CHECKSUM = (byte) 85;

  // Status codes returned in replies
  public static final int OP_STATUS_SUCCESS = 0;
  public static final int OP_STATUS_ERROR = 1;
  public static final int OP_STATUS_ERROR_CHECKSUM = 2;
  public static final int OP_STATUS_ERROR_INVALID = 3;
  public static final int OP_STATUS_ERROR_EXISTS = 4;
  public static final int OP_STATUS_ERROR_ACCESS_TOKEN = 5;
  public static final int OP_STATUS_CHECKSUM_OK = 6;

  // ... (remaining members omitted in this excerpt)
}
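
To make the role of the version number and the opcodes concrete, here is a minimal sketch of how a receiver might decode the first three bytes of a request: a 2-byte version followed by a 1-byte opcode. The class `OpcodeDispatchSketch` and the method `decodeOp` are hypothetical names for illustration and are not part of Hadoop; only the constant values are taken from the listing above.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of Hadoop.
class OpcodeDispatchSketch {
  // Values copied from the DataTransferProtocol listing.
  static final short DATA_TRANSFER_VERSION = 17;
  static final byte OP_WRITE_BLOCK = (byte) 80;
  static final byte OP_READ_BLOCK = (byte) 81;

  /** Decodes a 2-byte big-endian version and a 1-byte opcode, returning the opcode's name. */
  static String decodeOp(byte[] preamble) {
    ByteBuffer buf = ByteBuffer.wrap(preamble); // ByteBuffer is big-endian by default
    short version = buf.getShort();
    if (version != DATA_TRANSFER_VERSION) {
      throw new IllegalArgumentException("version mismatch: " + version);
    }
    byte op = buf.get();
    switch (op) {
      case OP_WRITE_BLOCK: return "OP_WRITE_BLOCK";
      case OP_READ_BLOCK:  return "OP_READ_BLOCK";
      default: throw new IllegalArgumentException("unknown opcode: " + op);
    }
  }

  public static void main(String[] args) {
    // Version 17 as a big-endian short, then opcode 81.
    System.out.println(decodeOp(new byte[] {0, 17, 81}));
  }
}
```

Checking the version before looking at the opcode means that peers built against different protocol versions fail fast instead of misreading the rest of the stream.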

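          Reading data means reading a range of file data from a block stored on a DataNode. From the opcode definitions above, the opcode for this operation is OP_READ_BLOCK (81). When a client needs to read data, it sends a request over its TCP connection to the DataNode. Because TCP is a byte stream with no notion of message boundaries, a data frame must be defined on top of the stream, and the two sides exchange information by writing and reading such frames. The client-side code that reads a block (org.apache.hadoop.hdfs.DFSClient.BlockReader) is shown below: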
/**
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 */
public class DFSClient implements FSConstants, java.io.Closeable {
  // ... (other members omitted in this excerpt)

  // Block reader class
  /** This is a wrapper around connection to datanode
   * and understands checksum, offset etc
   */
  public static class BlockReader extends FSInputChecker {
    // ... (other members omitted in this excerpt)

    public static BlockReader newBlockReader( Socket sock, String file,
                                       long blockId, 
                                       Token<BlockTokenIdentifier> accessToken,
                                       long genStamp,
                                       long startOffset, long len,
                                       int bufferSize, boolean verifyChecksum,
                                       String clientName)
                                       throws IOException {
      // in and out will be closed when sock is closed (by the caller)
      DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(NetUtils.getOutputStream(sock,HdfsConstants.WRITE_TIMEOUT)));

      //write the header.
      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_READ_BLOCK );
      out.writeLong( blockId );
      out.writeLong( genStamp );
      out.writeLong( startOffset );
      out.writeLong( len );
      Text.writeString(out, clientName);

      DataInputStream in = new DataInputStream(
          new BufferedInputStream(NetUtils.getInputStream(sock), 
                                  bufferSize));

      // Read the status from the reply before consuming any block data.
      short status = in.readShort();
      if (status != DataTransferProtocol.OP_STATUS_SUCCESS) {
        if (status == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
          throw new InvalidBlockTokenException(
              "Got access token error for OP_READ_BLOCK, self="
                  + sock.getLocalSocketAddress() + ", remote="
                  + sock.getRemoteSocketAddress() + ", for file " + file
                  + ", for block " + blockId + "_" + genStamp);
        } else {
          throw new IOException("Got error for OP_READ_BLOCK, self="
              + sock.getLocalSocketAddress() + ", remote="
              + sock.getRemoteSocketAddress() + ", for file " + file
              + ", for block " + blockId + "_" + genStamp);
        }
      }
      // ... (rest of the method omitted in this excerpt)
    }
  }
}
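
The framing used by newBlockReader can be tried out without a cluster. The sketch below encodes the fixed-width part of a read request (version, opcode, block ID, generation stamp, start offset, length) into a byte array and decodes it again. It is a simplified illustration under stated assumptions: `ReadRequestFrameSketch`, `encode` and `decode` are hypothetical names, and the client name is left out because Hadoop's `Text.writeString` encoding is not reproduced here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical illustration of the read-request frame; not Hadoop code.
class ReadRequestFrameSketch {
  static final short DATA_TRANSFER_VERSION = 17; // from DataTransferProtocol
  static final byte OP_READ_BLOCK = (byte) 81;   // from DataTransferProtocol

  /** Writes the fixed-width header fields in the same order as newBlockReader. */
  static byte[] encode(long blockId, long genStamp, long startOffset, long len) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeShort(DATA_TRANSFER_VERSION);
      out.write(OP_READ_BLOCK);
      out.writeLong(blockId);
      out.writeLong(genStamp);
      out.writeLong(startOffset);
      out.writeLong(len);
    } catch (IOException e) {
      throw new UncheckedIOException(e); // not expected for an in-memory stream
    }
    return bytes.toByteArray();
  }

  /** Reads the frame back and returns {blockId, genStamp, startOffset, len}. */
  static long[] decode(byte[] frame) {
    ByteBuffer buf = ByteBuffer.wrap(frame); // big-endian, matching DataOutputStream
    if (buf.getShort() != DATA_TRANSFER_VERSION) {
      throw new IllegalArgumentException("version mismatch");
    }
    if (buf.get() != OP_READ_BLOCK) {
      throw new IllegalArgumentException("not a read request");
    }
    return new long[] {buf.getLong(), buf.getLong(), buf.getLong(), buf.getLong()};
  }

  public static void main(String[] args) {
    byte[] frame = encode(1234L, 1001L, 0L, 65536L);
    System.out.println("frame length = " + frame.length);
    System.out.println("blockId = " + decode(frame)[0]);
  }
}
```

Because every field has a fixed width and a fixed position, the receiver can recover the message from the byte stream even though TCP itself provides no message boundaries.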
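        Suppose the replication factor of the file the client is writing is 3, meaning that three DataNodes in the HDFS cluster will each hold one replica of the data. The client does not write to all three DataNodes at once. Instead it sends the data to the first DataNode, which stores the data locally and at the same time pushes it on to DataNode 2, and so on until the last DataNode in the pipeline. The acknowledgment packet is generated by the last DataNode and travels back upstream toward the client. Each DataNode along the way passes the acknowledgment upstream only after confirming that its own local write succeeded. Compared with having the client write to several DataNodes simultaneously, every node on the pipeline carries part of the network traffic of the write, which reduces the network impact of the client sending multiple copies of the data. The client-side code for writing a block (org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.ResponseProcessor) is shown below: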
/**
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 */
public class DFSClient implements FSConstants, java.io.Closeable {
  // ... (other members omitted in this excerpt)

  /**
   * DFSOutputStream creates files from a stream of bytes.
   *
   * The client application writes data that is cached internally by
   * this stream. Data is broken up into packets, each packet is
   * typically 64K in size. A packet comprises of chunks. Each chunk
   * is typically 512 bytes and has an associated checksum with it.
   *
   * When a client application fills up the currentPacket, it is
   * enqueued into dataQueue.  The DataStreamer thread picks up
   * packets from the dataQueue, sends it to the first datanode in
   * the pipeline and moves it from the dataQueue to the ackQueue.
   * The ResponseProcessor receives acks from the datanodes. When an
   * successful ack for a packet is received from all datanodes, the
   * ResponseProcessor removes the corresponding packet from the
   * ackQueue.
   *
   * In case of error, all outstanding packets are moved from
   * ackQueue. A new pipeline is setup by eliminating the bad
   * datanode from the original pipeline. The DataStreamer now
   * starts sending packets from the dataQueue.
   */
  class DFSOutputStream extends FSOutputSummer implements Syncable {
    // ... (other members omitted in this excerpt)

    //
    // Processes responses from the datanodes.  A packet is removed 
    // from the ackQueue when its response arrives.
    //
    private class ResponseProcessor extends Thread {
      // ... (body omitted in this excerpt)
    }

    // connects to the first datanode in the pipeline
    // Returns true if success, otherwise return failure.
    //
    private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                    boolean recoveryFlag) {
      short pipelineStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
      String firstBadLink = "";
      if (LOG.isDebugEnabled()) {
        for (int i = 0; i < nodes.length; i++) {
          LOG.debug("pipeline = " + nodes[i].getName());
        }
      }

      // persist blocks on namenode on next flush
      persistBlocks = true;

      boolean result = false;
      try {
        LOG.debug("Connecting to " + nodes[0].getName());
        InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
        s = socketFactory.createSocket();
        timeoutValue = 3000 * nodes.length + socketTimeout;
        NetUtils.connect(s, target, timeoutValue);
        LOG.debug("Send buf size " + s.getSendBufferSize());
        long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                            ...; // rest of this expression is truncated in this excerpt

        //
        // Xmit header info to datanode
        //
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), 
                                     ...)); // buffer-size argument is truncated in this excerpt
        blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

        // Write the request header: version, opcode, block ID, generation stamp
        out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
        out.write( DataTransferProtocol.OP_WRITE_BLOCK );
        out.writeLong( block.getBlockId() );
        out.writeLong( block.getGenerationStamp() );
        out.writeInt( nodes.length );
        out.writeBoolean( recoveryFlag );       // recovery flag
        Text.writeString( out, client );
        out.writeBoolean(false); // Not sending src node information
        out.writeInt( nodes.length - 1 );
        for (int i = 1; i < nodes.length; i++) {
          // ... (loop body missing from this excerpt)
        }
        checksum.writeHeader( out );

        // receive ack for connect
        pipelineStatus = blockReplyStream.readShort();
        firstBadLink = Text.readString(blockReplyStream);
        if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
          if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
            throw new InvalidBlockTokenException(
                "Got access token error for connect ack with firstBadLink as "
                    + firstBadLink);
          } else {
            throw new IOException("Bad connect ack with firstBadLink as "
                + firstBadLink);
          }
        }

        blockStream = out;
        result = true;     // success

      } catch (IOException ie) {

        LOG.info("Exception in createBlockOutputStream " + nodes[0].getName() +
            " " + ie);

        // find the datanode that matches
        if (firstBadLink.length() != 0) {
          for (int i = 0; i < nodes.length; i++) {
            if (nodes[i].getName().equals(firstBadLink)) {
              errorIndex = i;
            }
          }
        }
        hasError = true;
        blockReplyStream = null;
        result = false;
      } finally {
        if (!result) {
          s = null;
        }
      }
      return result;
    }
  }
}
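
The hand-off between dataQueue and ackQueue described in the DFSOutputStream comment can be modelled in a few lines. The following is a single-threaded sketch under simplifying assumptions: packets are represented only by their sequence numbers, acks arrive in send order, and on a pipeline error the unacknowledged packets are put back at the front of dataQueue so they are resent first. The class `PacketQueuesSketch` and its method names are hypothetical; the real client uses separate DataStreamer and ResponseProcessor threads with synchronization.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-threaded model of the two queues; not Hadoop code.
class PacketQueuesSketch {
  final Deque<Long> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
  final Deque<Long> ackQueue = new ArrayDeque<>();  // packets sent, awaiting an ack

  /** The application filled a packet: queue it for sending. */
  void enqueue(long seqno) {
    dataQueue.addLast(seqno);
  }

  /** DataStreamer step: "send" the next packet and move it to ackQueue. */
  boolean sendNext() {
    Long seqno = dataQueue.pollFirst();
    if (seqno == null) {
      return false; // nothing to send
    }
    ackQueue.addLast(seqno);
    return true;
  }

  /** ResponseProcessor step: a successful ack arrived for the oldest outstanding packet. */
  void onAck(long seqno) {
    Long head = ackQueue.peekFirst();
    if (head == null || head != seqno) {
      throw new IllegalStateException("unexpected ack for seqno " + seqno);
    }
    ackQueue.removeFirst();
  }

  /** Error handling: move every unacknowledged packet back, preserving send order. */
  void onPipelineError() {
    while (!ackQueue.isEmpty()) {
      dataQueue.addFirst(ackQueue.removeLast());
    }
  }

  public static void main(String[] args) {
    PacketQueuesSketch q = new PacketQueuesSketch();
    for (long seqno = 0; seqno < 3; seqno++) {
      q.enqueue(seqno);
    }
    while (q.sendNext()) { /* send everything */ }
    q.onAck(0);
    q.onPipelineError();
    System.out.println("to resend: " + q.dataQueue);
  }
}
```

Keeping sent packets in ackQueue until they are acknowledged is what lets the client rebuild the pipeline without the bad DataNode and resend without asking the application for the data again.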

/**
 * DataNode is a class (and program) that stores a set of
 * blocks for a DFS deployment.  A single deployment can
 * have one or many DataNodes.  Each DataNode communicates
 * regularly with a single NameNode.  It also communicates
 * with client code and other DataNodes from time to time.
 *
 * DataNodes store a series of named blocks.  The DataNode
 * allows client code to read these blocks, or to write new
 * block data.  The DataNode may also, in response to instructions
 * from its NameNode, delete blocks or copy blocks to/from other
 * DataNodes.
 *
 * The DataNode maintains just one critical table:
 *   block-> stream of bytes (of BLOCK_SIZE or less)
 *
 * This info is stored on a local disk.  The DataNode
 * reports the table's contents to the NameNode upon startup
 * and every so often afterwards.
 *
 * DataNodes spend their lives in an endless loop of asking
 * the NameNode for something to do.  A NameNode cannot connect
 * to a DataNode directly; a NameNode simply returns values from
 * functions invoked by a DataNode.
 *
 * DataNodes maintain an open server socket so that client code 
 * or other DataNodes can read/write data.  The host/port for
 * this server is reported to the NameNode, which then sends that
 * information to clients or other DataNodes that might be interested.
 */
public class DataNode extends Configured
    /* implements clause is truncated in this excerpt */ {
  // ... (other members omitted in this excerpt)

  /**
   * Used for transferring a block of data.  This class
   * sends a piece of data to another DataNode.
   */
  class DataTransfer implements Runnable {
    DatanodeInfo targets[];
    Block b;
    DataNode datanode;

    /**
     * Do the deed, write the bytes
     */
    public void run() {
      Socket sock = null;
      DataOutputStream out = null;
      BlockSender blockSender = null;
      try {
        InetSocketAddress curTarget = 
          ...; // right-hand side is truncated in this excerpt
        sock = newSocket();
        NetUtils.connect(sock, curTarget, socketTimeout);
        sock.setSoTimeout(targets.length * socketTimeout);

        long writeTimeout = socketWriteTimeout + 
                            HdfsConstants.WRITE_TIMEOUT_EXTENSION * (targets.length-1);
        OutputStream baseStream = NetUtils.getOutputStream(sock, writeTimeout);
        out = new DataOutputStream(new BufferedOutputStream(baseStream, 
            ...)); // buffer-size argument is truncated in this excerpt

        blockSender = new BlockSender(b, 0, b.getNumBytes(), false, false, false, 
            ...); // remaining argument(s) are truncated in this excerpt
        DatanodeInfo srcNode = new DatanodeInfo(dnRegistration);

        //
        // Header info
        //
        out.writeInt(0);           // no pipelining
        out.writeBoolean(false);   // not part of recovery
        Text.writeString(out, ""); // client
        out.writeBoolean(true); // sending src node information
        srcNode.write(out); // Write src node DatanodeInfo
        // write targets
        out.writeInt(targets.length - 1);
        for (int i = 1; i < targets.length; i++) {
          // ... (loop body missing from this excerpt)
        }
        Token<BlockTokenIdentifier> accessToken = BlockTokenSecretManager.DUMMY_TOKEN;
        if (isBlockTokenEnabled) {
          accessToken = blockTokenSecretManager.generateToken(null, b,
              ...); // remaining argument(s) are truncated in this excerpt
        }
        // send data & checksum
        blockSender.sendBlock(out, baseStream, null);

        // no response necessary
        LOG.info(dnRegistration + ":Transmitted block " + b + " to " + curTarget);

      } catch (IOException ie) {
        LOG.warn(dnRegistration + ":Failed to transfer " + b + " to " + targets[0].getName()
            + " got " + StringUtils.stringifyException(ie));
        // check if there are any disk problem
      } finally {
        // ... (cleanup missing from this excerpt)
      }
    }
  }
}


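      In the pipeline above, each node returns the result of the write to its upstream node, in reverse order. When every DataNode on the pipeline has written the data to disk successfully, the final result is DataTransferProtocol.OP_STATUS_SUCCESS; otherwise it is DataTransferProtocol.OP_STATUS_ERROR.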
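
The rule that the client sees success only when every node in the pipeline wrote its replica can be expressed as a small recursive function. This is a deliberately simplified sketch: `PipelineAckSketch` and its methods are hypothetical, each node's local write is reduced to a boolean, and the acknowledgment is collapsed into a single status value, whereas the real acknowledgment packet carries more information. The status values mirror OP_STATUS_SUCCESS and OP_STATUS_ERROR from DataTransferProtocol.

```java
// Hypothetical model of ack propagation along the write pipeline; not Hadoop code.
class PipelineAckSketch {
  static final int OP_STATUS_SUCCESS = 0; // from DataTransferProtocol
  static final int OP_STATUS_ERROR = 1;   // from DataTransferProtocol

  /**
   * localWriteOk[i] tells whether DataNode i stored the packet locally.
   * Returns the status that finally reaches the client.
   */
  static int writeThroughPipeline(boolean[] localWriteOk) {
    if (localWriteOk.length == 0) {
      throw new IllegalArgumentException("pipeline must contain at least one node");
    }
    return ackFrom(0, localWriteOk);
  }

  // Node i forwards the packet downstream, waits for the downstream ack,
  // and reports success upstream only if both that ack and its own write are good.
  private static int ackFrom(int i, boolean[] localWriteOk) {
    boolean isLast = (i == localWriteOk.length - 1);
    int downstream = isLast ? OP_STATUS_SUCCESS : ackFrom(i + 1, localWriteOk);
    boolean ok = localWriteOk[i] && downstream == OP_STATUS_SUCCESS;
    return ok ? OP_STATUS_SUCCESS : OP_STATUS_ERROR;
  }

  public static void main(String[] args) {
    System.out.println(writeThroughPipeline(new boolean[] {true, true, true}));
    System.out.println(writeThroughPipeline(new boolean[] {true, false, true}));
  }
}
```

With three replicas, a failed local write on any one of the three DataNodes is enough to turn the status seen by the client into an error.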


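        The fsimage (namespace image) and edits (edit log) files produced by the NameNode are transferred between the NameNode and the Secondary NameNode. Because both files are fairly large, this transfer uses neither Hadoop IPC nor the DataNodes' TCP-based mechanism. It uses an HTTP-based streaming interface instead.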
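
The benefit of a streaming interface for large files is that the receiver never needs to hold the whole file in memory. A minimal sketch of that idea is a fixed-buffer copy loop. `StreamCopySketch` and `copy` are hypothetical names; in a real transfer the input stream would presumably be the body of an HTTP GET response (for example from `java.net.HttpURLConnection.getInputStream()`) and the output a local file. The text does not give the NameNode URL or request parameters, so the sketch uses in-memory streams only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration of streaming a large payload; not Hadoop code.
class StreamCopySketch {
  /** Copies in to out through a fixed-size buffer and returns the number of bytes copied. */
  static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
    byte[] buf = new byte[bufferSize];
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n); // write only the bytes actually read
      total += n;
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    byte[] image = new byte[10_000]; // stand-in for an fsimage payload
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    long copied = copy(new ByteArrayInputStream(image), sink, 4096);
    System.out.println("copied " + copied + " bytes");
  }
}
```

Memory use is bounded by the buffer size regardless of how large the image or edit log grows.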

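 Copyright notice: parts of this article are excerpted from the book 【Hadoop技术内幕 深入解析Hadoop Common和HDFS架构设计与实现原理】 by 【蔡斌、陈湘萍】 (Cai Bin and Chen Xiangping). They are kept here only as study notes for technical exchange, and the commercial rights remain with the original authors. Readers are encouraged to buy the book for further study. If you repost this article, please keep the original author credit. Thank you!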
