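Implementing the Streaming Interface
Through the DataNode storage class DataStorage and the file system dataset FSDataset, the DataNode abstracts the physical storage of data blocks into services on objects. The streaming interface is built on top of these services and is the DataNode's other basic function.
To meet the HDFS design goal of high-throughput data access, the DataNode reads and writes HDFS files through a data access interface based on TCP streams.
The DataNode's streaming interface is implemented as a typical TCP server. For basic socket functionality, the JDK provides java.net.Socket and java.net.ServerSocket; a ServerSocket accepts client connection requests on a specific port. A Java program typically constructs a ServerSocket object, binds it to a free port with bind(), and then calls accept() to wait for inbound connections on that port.
When a client connects to the server, accept() returns a Socket object, which the server uses to interact with the client until one side closes the connection. As a TCP server, the DataNode streaming interface implements these steps in the standard way.
In DataNode.startDataNode(), the DataNode creates a ServerSocket object and binds it to the listening port of the listening address. The listening address is specified by the configuration item ${dfs.datanode.address} and the listening port by ${dfs.datanode.port}. Next, the DataNode calls ServerSocket.setReceiveBufferSize() to set the socket receive buffer size to 128 KB (the default is typically 8 KB). This is an important parameter: the DataNode has to provide a high-throughput data service and therefore needs a fairly large receive buffer. The buffer size applies to every Socket object returned by accept().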
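The bind/accept sequence described above can be sketched in a few lines. This is a minimal illustration, not Hadoop code: the class name EchoServerSketch, the loopback address, and the one-int echo exchange are all assumptions made for the example. Only the 128 KB buffer size is taken from the text.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration of the ServerSocket steps; not part of Hadoop.
final class EchoServerSketch {
    static final int RECEIVE_BUFFER_SIZE = 128 * 1024; // 128 KB, as in the text

    /** Creates an unbound ServerSocket, sets the receive buffer, then binds to a free port. */
    static ServerSocket open() throws IOException {
        ServerSocket ss = new ServerSocket();          // unbound
        // The JDK documentation asks for windows larger than 64 KB to be requested before bind().
        ss.setReceiveBufferSize(RECEIVE_BUFFER_SIZE);
        ss.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0)); // port 0 = any free port
        return ss;
    }

    /** Accepts one connection and echoes back a single int. */
    static void serveOnce(ServerSocket ss) throws IOException {
        try (Socket s = ss.accept();
             DataInputStream in = new DataInputStream(s.getInputStream());
             DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
            out.writeInt(in.readInt());
            out.flush();
        }
    }

    /** Starts a one-shot server on a background thread and sends it one value. */
    static int roundTrip(int value) throws Exception {
        try (ServerSocket ss = open()) {
            Thread server = new Thread(() -> {
                try {
                    serveOnce(ss);
                } catch (IOException ignored) {
                    // the sketch simply gives up on I/O errors
                }
            });
            server.setDaemon(true);
            server.start();
            try (Socket c = new Socket(ss.getInetAddress(), ss.getLocalPort());
                 DataOutputStream out = new DataOutputStream(c.getOutputStream());
                 DataInputStream in = new DataInputStream(c.getInputStream())) {
                out.writeInt(value);
                out.flush();
                return in.readInt();
            } finally {
                server.join(2000);
            }
        }
    }
}
```

Calling EchoServerSketch.roundTrip(42) should return the same value after one accept() and one request/response exchange.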
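startDataNode() also creates a thread group for the service threads of the streaming interface, creates the DataXceiverServer server wrapped in a Daemon thread, and calls setDaemon(true) on the thread group. Two Java threading concepts are involved here: thread groups and daemon threads.
A ThreadGroup represents a set of threads; through it, Java allows a group of threads to be operated on at once. For example, ThreadGroup.interrupt() interrupts all threads in the group, and ThreadGroup.setDaemon(true) marks the group itself as a daemon group, which is destroyed automatically once it becomes empty (the threads themselves become daemon threads through the Daemon wrapper).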
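As a small illustration of operating on a whole group at once, the sketch below starts several daemon threads in one ThreadGroup and interrupts them with a single call. The class and method names are hypothetical, and the daemon flag is set per thread, which is an assumption about how a wrapper such as Hadoop's Daemon class behaves.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of ThreadGroup.interrupt(); not part of Hadoop.
final class ThreadGroupSketch {
    /** Starts n daemon workers in one group, interrupts the group, and counts the interrupted workers. */
    static int interruptAll(int n) throws InterruptedException {
        ThreadGroup group = new ThreadGroup("workers");
        CountDownLatch started = new CountDownLatch(n);
        AtomicInteger interrupted = new AtomicInteger();
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(group, () -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for blocking work
                } catch (InterruptedException e) {
                    interrupted.incrementAndGet();
                }
            });
            threads[i].setDaemon(true); // daemon threads never keep the JVM alive
            threads[i].start();
        }
        started.await();   // every worker is running
        group.interrupt(); // one call reaches all threads in the group
        for (Thread t : threads) {
            t.join(5000);
        }
        return interrupted.get();
    }
}
```

Because each worker is a daemon thread, a worker that was never interrupted would still not prevent the JVM from exiting.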
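Java threads fall into two classes, user threads and daemon threads. A daemon thread is one that “provides general-purpose support in the background”, such as the garbage collection thread. Its only difference from a user thread is that the Java virtual machine may exit once all remaining threads are daemon threads; as long as one or more user threads remain, the virtual machine does not exit. The service threads of the streaming interface are created as daemon threads inside this thread group, which simplifies the DataNode's management of them. The code is as follows: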
/**
* This method starts the data node with the specified conf.
*
* @param conf - the configuration
* if conf's CONFIG_PROPERTY_SIMULATED property is set
* then a simulated storage based data node is created.
*
* @param dataDirs - only for a non-simulated storage data node
* @throws IOException
* @throws MalformedObjectNameException
* @throws MBeanRegistrationException
* @throws InstanceAlreadyExistsException
*/
void startDataNode(Configuration conf,
AbstractList<File> dataDirs, SecureResources resources
) throws IOException {
if(UserGroupInformation.isSecurityEnabled() && resources == null)
throw new RuntimeException("Cannot start secure cluster without " +
"privileged resources.");
this.secureResources = resources;
// use configured nameserver & interface to get local hostname
if (conf.get("slave.host.name") != null) {
machineName = conf.get("slave.host.name");
}
if (machineName == null) {
machineName = DNS.getDefaultHost(
conf.get("dfs.datanode.dns.interface","default"),
conf.get("dfs.datanode.dns.nameserver","default"));
}
InetSocketAddress nameNodeAddr = NameNode.getServiceAddress(conf, true);
this.socketTimeout = conf.getInt("dfs.socket.timeout",
HdfsConstants.READ_TIMEOUT);
this.socketWriteTimeout = conf.getInt("dfs.datanode.socket.write.timeout",
HdfsConstants.WRITE_TIMEOUT);
/* Based on results on different platforms, we might need set the default
* to false on some of them. */
this.transferToAllowed = conf.getBoolean("dfs.datanode.transferTo.allowed",
true);
this.writePacketSize = conf.getInt("dfs.write.packet.size", 64*1024);
InetSocketAddress socAddr = DataNode.getStreamingAddr(conf);
int tmpPort = socAddr.getPort();
storage = new DataStorage();
// construct registration
this.dnRegistration = new DatanodeRegistration(machineName + ":" + tmpPort);
// connect to name node
this.namenode = (DatanodeProtocol)
RPC.waitForProxy(DatanodeProtocol.class,
DatanodeProtocol.versionID,
nameNodeAddr,
conf);
// get version and id info from the name-node
NamespaceInfo nsInfo = handshake();
StartupOption startOpt = getStartupOption(conf);
assert startOpt != null : "Startup option must be set.";
boolean simulatedFSDataset =
conf.getBoolean("dfs.datanode.simulateddatastorage", false);
if (simulatedFSDataset) {
setNewStorageID(dnRegistration);
dnRegistration.storageInfo.layoutVersion = FSConstants.LAYOUT_VERSION;
dnRegistration.storageInfo.namespaceID = nsInfo.namespaceID;
// it would have been better to pass storage as a parameter to
// constructor below - need to augment ReflectionUtils used below.
conf.set("StorageId", dnRegistration.getStorageID());
try {
//Equivalent of following (can't do because Simulated is in test dir)
// this.data = new SimulatedFSDataset(conf);
this.data = (FSDatasetInterface) ReflectionUtils.newInstance(
Class.forName("org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset"), conf);
} catch (ClassNotFoundException e) {
throw new IOException(StringUtils.stringifyException(e));
}
} else { // real storage
// read storage info, lock data dirs and transition fs state if necessary
storage.recoverTransitionRead(nsInfo, dataDirs, startOpt);
// adjust
this.dnRegistration.setStorageInfo(storage);
// initialize data node internal structure
this.data = new FSDataset(storage, conf);
}
// register datanode MXBean
this.registerMXBean(conf); // register the MXBean for DataNode
// Allow configuration to delay block reports to find bugs
artificialBlockReceivedDelay = conf.getInt(
"dfs.datanode.artificialBlockReceivedDelay", 0);
// find free port or use privileged port provide
ServerSocket ss;
if(secureResources == null) {
ss = (socketWriteTimeout > 0) ?
ServerSocketChannel.open().socket() : new ServerSocket();
Server.bind(ss, socAddr, 0);
} else {
ss = resources.getStreamingSocket();
}
ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);
// adjust machine name with the actual port
tmpPort = ss.getLocalPort();
selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(),
tmpPort);
this.dnRegistration.setName(machineName + ":" + tmpPort);
LOG.info("Opened info server at " + tmpPort);
this.threadGroup = new ThreadGroup("dataXceiverServer");
this.dataXceiverServer = new Daemon(threadGroup,
new DataXceiverServer(ss, conf, this));
this.threadGroup.setDaemon(true); // auto destroy when empty
this.blockReportInterval =
conf.getLong("dfs.blockreport.intervalMsec", BLOCKREPORT_INTERVAL);
this.initialBlockReportDelay = conf.getLong("dfs.blockreport.initialDelay",
BLOCKREPORT_INITIAL_DELAY)* 1000L;
if (this.initialBlockReportDelay >= blockReportInterval) {
this.initialBlockReportDelay = 0;
LOG.info("dfs.blockreport.initialDelay is greater than " +
"dfs.blockreport.intervalMsec." + " Setting initial delay to 0 msec:");
}
this.heartBeatInterval = conf.getLong("dfs.heartbeat.interval", HEARTBEAT_INTERVAL) * 1000L;
DataNode.nameNodeAddr = nameNodeAddr;
//initialize periodic block scanner
String reason = null;
if (conf.getInt("dfs.datanode.scan.period.hours", 0) < 0) {
reason = "verification is turned off by configuration";
} else if ( !(data instanceof FSDataset) ) {
reason = "verifcation is supported only with FSDataset";
}
if ( reason == null ) {
blockScanner = new DataBlockScanner(this, (FSDataset)data, conf);
} else {
LOG.info("Periodic Block Verification is disabled because " +
reason + ".");
}
//create a servlet to serve full-file content
InetSocketAddress infoSocAddr = DataNode.getInfoAddr(conf);
String infoHost = infoSocAddr.getHostName();
int tmpInfoPort = infoSocAddr.getPort();
this.infoServer = (secureResources == null)
? new HttpServer("datanode", infoHost, tmpInfoPort, tmpInfoPort == 0,
conf, SecurityUtil.getAdminAcls(conf, DFSConfigKeys.DFS_ADMIN))
: new HttpServer("datanode", infoHost, tmpInfoPort, tmpInfoPort == 0,
conf, SecurityUtil.getAdminAcls(conf, DFSConfigKeys.DFS_ADMIN),
secureResources.getListener());
if (conf.getBoolean("dfs.https.enable", false)) {
boolean needClientAuth = conf.getBoolean("dfs.https.need.client.auth", false);
InetSocketAddress secInfoSocAddr = NetUtils.createSocketAddr(conf.get(
"dfs.datanode.https.address", infoHost + ":" + 0));
Configuration sslConf = new Configuration(false);
sslConf.addResource(conf.get("dfs.https.server.keystore.resource",
"ssl-server.xml"));
this.infoServer.addSslListener(secInfoSocAddr, sslConf, needClientAuth);
}
this.infoServer.addInternalServlet(null, "/streamFile/*", StreamFile.class);
this.infoServer.addInternalServlet(null, "/getFileChecksum/*",
FileChecksumServlets.GetServlet.class);
this.infoServer.setAttribute("datanode", this);
this.infoServer.setAttribute("datanode.blockScanner", blockScanner);
this.infoServer.setAttribute(JspHelper.CURRENT_CONF, conf);
this.infoServer.addServlet(null, "/blockScannerReport",
DataBlockScanner.Servlet.class);
if (WebHdfsFileSystem.isEnabled(conf, LOG)) {
infoServer.addJerseyResourcePackage(DatanodeWebHdfsMethods.class
.getPackage().getName() + ";" + Param.class.getPackage().getName(),
WebHdfsFileSystem.PATH_PREFIX + "/*");
}
this.infoServer.start();
// adjust info port
this.dnRegistration.setInfoPort(this.infoServer.getPort());
myMetrics = DataNodeInstrumentation.create(conf,
dnRegistration.getStorageID());
// set service-level authorization security policy
if (conf.getBoolean(
ServiceAuthorizationManager.SERVICE_AUTHORIZATION_CONFIG, false)) {
ServiceAuthorizationManager.refresh(conf, new HDFSPolicyProvider());
}
// BlockTokenSecretManager is created here, but it shouldn't be
// used until it is initialized in register().
this.blockTokenSecretManager = new BlockTokenSecretManager(false,
0, 0);
//init ipc server
InetSocketAddress ipcAddr = NetUtils.createSocketAddr(
conf.get("dfs.datanode.ipc.address"));
ipcServer = RPC.getServer(this, ipcAddr.getHostName(), ipcAddr.getPort(),
conf.getInt("dfs.datanode.handler.count", 3), false, conf,
blockTokenSecretManager);
dnRegistration.setIpcPort(ipcServer.getListenerAddress().getPort());
LOG.info("dnRegistration = " + dnRegistration);
}
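The DataXceiverServer created by DataNode.startDataNode() implements the accept() loop. The key points of its implementation are as follows.
The member variable childSockets holds all sockets opened for data transfer; these sockets are used by DataXceiver objects. The member variable maxXceiverCount is the maximum number of clients that the DataNode's streaming interface can serve concurrently. It is set by the configuration item ${dfs.datanode.max.xcievers} and defaults to 256; on a busy cluster this value should be raised appropriately.
The accept() call in DataXceiverServer.run() blocks while waiting for client connections. When a new service request arrives, the server creates a new thread, that is, a DataXceiver object, to serve the client. DataXceiverServer thus uses a thread-per-client model: it creates a new thread for every connection and lets that thread interact with the client, while the main loop of DataXceiverServer simply listens for connection requests through accept(). This model suits the DataNode's streaming interface well, because it favors bulk data processing and improves data throughput. The code is as follows: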
/**
* Server used for receiving/sending a block of data.
* This is created to listen for requests from clients or
* other DataNodes. This small server does not use the
* Hadoop IPC mechanism.
*/
class DataXceiverServer implements Runnable, FSConstants {
public static final Log LOG = DataNode.LOG;
ServerSocket ss;
DataNode datanode;
// Record all sockets opend for data transfer
Map<Socket, Socket> childSockets = Collections.synchronizedMap(
new HashMap<Socket, Socket>());
/**
* Maximal number of concurrent xceivers per node.
* Enforcing the limit is required in order to avoid data-node
* running out of memory.
*/
static final int MAX_XCEIVER_COUNT = 256;
int maxXceiverCount = MAX_XCEIVER_COUNT;
/** A manager to make sure that cluster balancing does not
* take too much resources.
*
* It limits the number of block moves for balancing and
* the total amount of bandwidth they can use.
*/
static class BlockBalanceThrottler extends BlockTransferThrottler {
private int numThreads;
/**Constructor
*
* @param bandwidth Total amount of bandwidth can be used for balancing
*/
private BlockBalanceThrottler(long bandwidth) {
super(bandwidth);
LOG.info("Balancing bandwith is "+ bandwidth + " bytes/s");
}
/** Check if the block move can start.
*
* Return true if the thread quota is not exceeded and
* the counter is incremented; False otherwise.
*/
synchronized boolean acquire() {
if (numThreads >= Balancer.MAX_NUM_CONCURRENT_MOVES) {
return false;
}
numThreads++;
return true;
}
/** Mark that the move is completed. The thread counter is decremented. */
synchronized void release() {
numThreads--;
}
}
BlockBalanceThrottler balanceThrottler;
/**
* We need an estimate for block size to check if the disk partition has
* enough space. For now we set it to be the default block size set
* in the server side configuration, which is not ideal because the
* default block size should be a client-size configuration.
* A better solution is to include in the header the estimated block size,
* i.e. either the actual block size or the default block size.
*/
long estimateBlockSize;
DataXceiverServer(ServerSocket ss, Configuration conf,
DataNode datanode) {
this.ss = ss;
this.datanode = datanode;
this.maxXceiverCount = conf.getInt("dfs.datanode.max.xcievers",
MAX_XCEIVER_COUNT);
this.estimateBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
//set up parameter for cluster balancing
this.balanceThrottler = new BlockBalanceThrottler(
conf.getLong("dfs.balance.bandwidthPerSec", 1024L*1024));
}
/**
*/
public void run() {
while (datanode.shouldRun) {
try {
Socket s = ss.accept();
s.setTcpNoDelay(true);
new Daemon(datanode.threadGroup,
new DataXceiver(s, datanode, this)).start();
} catch (SocketTimeoutException ignored) {
// wake up to see if should continue to run
} catch (AsynchronousCloseException ace) {
LOG.warn(datanode.dnRegistration + ":DataXceiveServer:"
+ StringUtils.stringifyException(ace));
datanode.shouldRun = false;
} catch (IOException ie) {
LOG.warn(datanode.dnRegistration + ":DataXceiveServer: IOException due to:"
+ StringUtils.stringifyException(ie));
} catch (Throwable te) {
LOG.error(datanode.dnRegistration + ":DataXceiveServer: Exiting due to:"
+ StringUtils.stringifyException(te));
datanode.shouldRun = false;
}
}
try {
ss.close();
} catch (IOException ie) {
LOG.warn(datanode.dnRegistration + ":DataXceiveServer: Close exception due to: "
+ StringUtils.stringifyException(ie));
}
LOG.info("Exiting DataXceiveServer");
}
void kill() {
assert datanode.shouldRun == false :
"shoudRun should be set to false before killing";
try {
this.ss.close();
} catch (IOException ie) {
LOG.warn(datanode.dnRegistration + ":DataXceiveServer.kill(): "
+ StringUtils.stringifyException(ie));
}
// close all the sockets that were accepted earlier
synchronized (childSockets) {
for (Iterator<Socket> it = childSockets.values().iterator();
it.hasNext();) {
Socket thissock = it.next();
try {
thissock.close();
} catch (IOException e) {
}
}
}
}
}
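DataXceiverServer only handles client connection requests; the actual request processing and data exchange are left to DataXceiver. Each DataXceiver object has its own thread, and the object and its thread handle exactly one client request. DataXceiver implements the Runnable interface. In its run() method, DataXceiver performs the operations common to all streaming-interface requests and then calls a different DataXceiver member method depending on the operation code. The code of run() is as follows: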
/**
* Thread for processing incoming/outgoing data stream.
*/
class DataXceiver implements Runnable, FSConstants {
public static final Log LOG = DataNode.LOG;
static final Log ClientTraceLog = DataNode.ClientTraceLog;
Socket s;
final String remoteAddress; // address of remote side
final String localAddress; // local address of this daemon
DataNode datanode;
DataXceiverServer dataXceiverServer;
public DataXceiver(Socket s, DataNode datanode,
DataXceiverServer dataXceiverServer) {
this.s = s;
this.datanode = datanode;
this.dataXceiverServer = dataXceiverServer;
dataXceiverServer.childSockets.put(s, s);
remoteAddress = s.getRemoteSocketAddress().toString();
localAddress = s.getLocalSocketAddress().toString();
LOG.debug("Number of active connections is: " + datanode.getXceiverCount());
}
/**
* Read/write data from/to the DataXceiveServer.
*/
public void run() {
DataInputStream in=null;
try {
in = new DataInputStream(
new BufferedInputStream(NetUtils.getInputStream(s),
SMALL_BUFFER_SIZE));
short version = in.readShort();
if ( version != DataTransferProtocol.DATA_TRANSFER_VERSION ) {
throw new IOException( "Version Mismatch" );
}
boolean local = s.getInetAddress().equals(s.getLocalAddress());
byte op = in.readByte();
// Make sure the xciver count is not exceeded
int curXceiverCount = datanode.getXceiverCount();
if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
throw new IOException("xceiverCount " + curXceiverCount
+ " exceeds the limit of concurrent xcievers "
+ dataXceiverServer.maxXceiverCount);
}
long startTime = DataNode.now();
switch ( op ) {
case DataTransferProtocol.OP_READ_BLOCK:
readBlock( in );
datanode.myMetrics.addReadBlockOp(DataNode.now() - startTime);
if (local)
datanode.myMetrics.incrReadsFromLocalClient();
else
datanode.myMetrics.incrReadsFromRemoteClient();
break;
case DataTransferProtocol.OP_WRITE_BLOCK:
writeBlock( in );
datanode.myMetrics.addWriteBlockOp(DataNode.now() - startTime);
if (local)
datanode.myMetrics.incrWritesFromLocalClient();
else
datanode.myMetrics.incrWritesFromRemoteClient();
break;
case DataTransferProtocol.OP_REPLACE_BLOCK: // for balancing purpose; send to a destination
replaceBlock(in);
datanode.myMetrics.addReplaceBlockOp(DataNode.now() - startTime);
break;
case DataTransferProtocol.OP_COPY_BLOCK:
// for balancing purpose; send to a proxy source
copyBlock(in);
datanode.myMetrics.addCopyBlockOp(DataNode.now() - startTime);
break;
case DataTransferProtocol.OP_BLOCK_CHECKSUM: //get the checksum of a block
getBlockChecksum(in);
datanode.myMetrics.addBlockChecksumOp(DataNode.now() - startTime);
break;
default:
throw new IOException("Unknown opcode " + op + " in data stream");
}
} catch (Throwable t) {
LOG.error(datanode.dnRegistration + ":DataXceiver",t);
} finally {
LOG.debug(datanode.dnRegistration + ":Number of active connections is: "
+ datanode.getXceiverCount());
IOUtils.closeStream(in);
IOUtils.closeSocket(s);
dataXceiverServer.childSockets.remove(s);
}
}
.....
}
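As shown above, DataXceiver.run() first creates the input stream and then checks the version. In the streaming interface described earlier, the first field of every request frame is the version number, so DataXceiver can handle the version uniformly in run(). A failed version check throws an exception, and the final cleanup closes the input stream and the Socket. run() then reads the operation code and performs a second check: whether this request exceeds what the DataNode can support, which protects the DataNode's quality of service. After these checks, DataXceiver.run() calls the method that matches the operation code; a block read, for example, is handed to readBlock() for further processing.
DataXceiverServer and DataXceiver together implement the DataNode's streaming interface. Their thread-per-client approach meets the interface's particular requirements of bulk reads and writes and high throughput.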
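The order of checks in run() can be reduced to a small dispatcher. The sketch below is a simplified stand-in, not the Hadoop implementation: the VERSION value, the string results, and the frame() helper are assumptions for illustration, while the operation codes 80 and 81 are the ones named in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration of "version, op code, load check, dispatch".
final class RequestDispatchSketch {
    static final short VERSION = 17;       // assumed value, for illustration only
    static final byte OP_WRITE_BLOCK = 80; // operation codes as given in the text
    static final byte OP_READ_BLOCK = 81;

    /** Reads the common frame prefix and returns the name of the handler that would run. */
    static String dispatch(DataInputStream in, int activeCount, int maxCount) throws IOException {
        short version = in.readShort();    // first field of every request frame
        if (version != VERSION) {
            throw new IOException("Version Mismatch");
        }
        byte op = in.readByte();           // second field: the operation code
        if (activeCount > maxCount) {      // refuse work beyond the configured limit
            throw new IOException("xceiverCount " + activeCount
                    + " exceeds the limit of concurrent xcievers " + maxCount);
        }
        switch (op) {
            case OP_READ_BLOCK:
                return "readBlock";
            case OP_WRITE_BLOCK:
                return "writeBlock";
            default:
                throw new IOException("Unknown opcode " + op + " in data stream");
        }
    }

    /** Builds a two-field request prefix in memory, for trying out dispatch(). */
    static DataInputStream frame(short version, byte op) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(version);
        out.writeByte(op);
        out.flush();
        return new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
    }
}
```

A frame with the expected version and operation code 81 dispatches to the read handler; a wrong version or an exceeded connection count ends in an IOException before any handler runs.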
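Reading Data
Clients read HDFS files through the streaming interface with operation code 81. A read request contains the following fields:
blockID (block ID): identifies the data block to read; the DataNode uses it to locate the block.
generationStamp (block version number): used for version checking, to avoid reading the wrong data.
startOffset (offset): the position within the block of the data to read.
length (data length): the amount of data the client wants to read.
clientName (client name): the name of the client that issued the read request.
accessToken (access token): related to the security features.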
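To make the field order concrete, the following sketch serializes and parses a read request header in the order listed above. It is an assumption-laden simplification: writeUTF() stands in for Hadoop's Text.writeString(), the access token is left out entirely, and the class ReadRequestSketch and the version value used with it are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, simplified read request; not wire-compatible with HDFS.
final class ReadRequestSketch {
    static final byte OP_READ_BLOCK = 81; // operation code named in the text

    final long blockId;
    final long generationStamp;
    final long startOffset;
    final long length;
    final String clientName;

    ReadRequestSketch(long blockId, long generationStamp, long startOffset,
                      long length, String clientName) {
        this.blockId = blockId;
        this.generationStamp = generationStamp;
        this.startOffset = startOffset;
        this.length = length;
        this.clientName = clientName;
    }

    /** Writes the version, the op code, and then the fields in the listed order. */
    byte[] encode(short version) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(version);
        out.writeByte(OP_READ_BLOCK);
        out.writeLong(blockId);
        out.writeLong(generationStamp);
        out.writeLong(startOffset);
        out.writeLong(length);
        out.writeUTF(clientName); // stand-in for Text.writeString(); token omitted
        out.flush();
        return buf.toByteArray();
    }

    /** Parses the fields that follow the version and op code. */
    static ReadRequestSketch decodeBody(DataInputStream in) throws IOException {
        // Java evaluates arguments left to right, so the reads happen in wire order.
        return new ReadRequestSketch(in.readLong(), in.readLong(), in.readLong(),
                in.readLong(), in.readUTF());
    }
}
```

The receiving side consumes the version and the operation code first, then calls decodeBody(), which mirrors the way run() reads the common prefix before readBlock() reads the rest.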
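DataXceiver.readBlock() provides the framework of the DataNode's implementation of the read streaming interface. At the start, the method reads the request information above from the input stream of the Socket connection and constructs a block sender, a BlockSender object. It then sends the data through that object, and once the data has been sent it performs some cleanup.
The BlockSender constructor performs a series of checks, and the object is created only if all of them pass. Otherwise an exception notifies readBlock(), which returns an error code to the client and ends the request. The code of DataXceiver.readBlock() is as follows: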
/**
* Read a block from the disk.
* @param in The stream to read from
* @throws IOException
*/
private void readBlock(DataInputStream in) throws IOException {
//
// Read in the header
//
long blockId = in.readLong();
Block block = new Block( blockId, 0 , in.readLong());
long startOffset = in.readLong();
long length = in.readLong();
String clientName = Text.readString(in);
Token<BlockTokenIdentifier> accessToken = new Token<BlockTokenIdentifier>();
accessToken.readFields(in);
OutputStream baseStream = NetUtils.getOutputStream(s,
datanode.socketWriteTimeout);
DataOutputStream out = new DataOutputStream(
new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE));
if (datanode.isBlockTokenEnabled) {
try {
datanode.blockTokenSecretManager.checkAccess(accessToken, null, block,
BlockTokenSecretManager.AccessMode.READ);
} catch (InvalidToken e) {
try {
out.writeShort(DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN);
out.flush();
throw new IOException("Access token verification failed, for client "
+ remoteAddress + " for OP_READ_BLOCK for block " + block);
} finally {
IOUtils.closeStream(out);
}
}
}
// send the block
BlockSender blockSender = null;
final String clientTraceFmt =
clientName.length() > 0 && ClientTraceLog.isInfoEnabled()
? String.format(DN_CLIENTTRACE_FORMAT, localAddress, remoteAddress,
"%d", "HDFS_READ", clientName, "%d",
datanode.dnRegistration.getStorageID(), block, "%d")
: datanode.dnRegistration + " Served block " + block + " to " +
s.getInetAddress();
try {
try {
blockSender = new BlockSender(block, startOffset, length,
true, true, false, datanode, clientTraceFmt);
} catch(IOException e) {
out.writeShort(DataTransferProtocol.OP_STATUS_ERROR);
throw e;
}
out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
long read = blockSender.sendBlock(out, baseStream, null); // send data
if (blockSender.isBlockReadFully()) {
// See if client verification succeeded.
// This is an optional response from client.
try {
if (in.readShort() == DataTransferProtocol.OP_STATUS_CHECKSUM_OK &&
datanode.blockScanner != null) {
datanode.blockScanner.verifiedByClient(block);
}
} catch (IOException ignored) {}
}
datanode.myMetrics.incrBytesRead((int) read);
datanode.myMetrics.incrBlocksRead();
} catch ( SocketException ignored ) {
// Its ok for remote side to close the connection anytime.
datanode.myMetrics.incrBlocksRead();
} catch ( IOException ioe ) {
/* What exactly should we do here?
* Earlier version shutdown() datanode if there is disk error.
*/
LOG.warn(datanode.dnRegistration + ":Got exception while serving " +
block + " to " +
s.getInetAddress() + ":\n" +
StringUtils.stringifyException(ioe) );
throw ioe;
} finally {
IOUtils.closeStream(out);
IOUtils.closeStream(blockSender);
}
}
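One more point about readBlock() deserves mention. In the last part of the code above, if the client reads and verifies the data successfully, it sends an additional response code, OP_STATUS_CHECKSUM_OK, to inform the DataNode. If the DataNode has sent a complete data block, it can use this response code to notify the block scanner, which then marks the block as verified by the client. The DataNode uses the block scanner to scan data blocks periodically, so that block errors are found as early as possible and the data stored on the node stays correct. Scanning a block means reading in the block data and its checksum file and checking them, which is fairly resource-intensive; if a client has already performed this verification, the DataNode can skip the duplicate work and reduce the system load.
The block sender does most of the work of a read request: preparation, sending the read response header, sending the response data packets, and cleanup.
Preparation is done mainly in the BlockSender constructor. After assigning a series of member variables, the constructor prepares the block's checksum information: it opens the checksum file and reads the checksum method and the checksum chunk size from it (both are stored in the header of the checksum file). The relevant BlockSender constructor code is as follows: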
BlockSender(Block block, long startOffset, long length,
boolean corruptChecksumOk, boolean chunkOffsetOK,
boolean verifyChecksum, DataNode datanode, String clientTraceFmt)
throws IOException {
try {
this.block = block;
this.chunkOffsetOK = chunkOffsetOK;
this.corruptChecksumOk = corruptChecksumOk;
this.verifyChecksum = verifyChecksum;
this.blockLength = datanode.data.getVisibleLength(block);
this.transferToAllowed = datanode.transferToAllowed;
this.clientTraceFmt = clientTraceFmt;
if ( !corruptChecksumOk || datanode.data.metaFileExists(block) ) {
checksumIn = new DataInputStream(
new BufferedInputStream(datanode.data.getMetaDataInputStream(block),
BUFFER_SIZE));
// read and handle the common header here. For now just a version
BlockMetadataHeader header = BlockMetadataHeader.readHeader(checksumIn);
short version = header.getVersion();
if (version != FSDataset.METADATA_VERSION) {
LOG.warn("Wrong version (" + version + ") for metadata file for "
+ block + " ignoring ...");
}
checksum = header.getChecksum();
} else {
LOG.warn("Could not find metadata file for " + block);
// This only decides the buffer size. Use BUFFER_SIZE?
checksum = DataChecksum.newDataChecksum(DataChecksum.CHECKSUM_NULL,
16 * 1024);
}
/* If bytesPerChecksum is very large, then the metadata file
* is mostly corrupted. For now just truncate bytesPerchecksum to
* blockLength.
*/
bytesPerChecksum = checksum.getBytesPerChecksum();
if (bytesPerChecksum > 10*1024*1024 && bytesPerChecksum > blockLength){
checksum = DataChecksum.newDataChecksum(checksum.getChecksumType(),
Math.max((int)blockLength, 10*1024*1024));
bytesPerChecksum = checksum.getBytesPerChecksum();
}
checksumSize = checksum.getChecksumSize();
if (length < 0) {
length = blockLength;
}
endOffset = blockLength;
if (startOffset < 0 || startOffset > endOffset
|| (length + startOffset) > endOffset) {
String msg = " Offset " + startOffset + " and length " + length
+ " don't match block " + block + " ( blockLen " + endOffset + " )";
LOG.warn(datanode.dnRegistration + ":sendBlock() : " + msg);
throw new IOException(msg);
}
offset = (startOffset - (startOffset % bytesPerChecksum));
if (length >= 0) {
// Make sure endOffset points to end of a checksumed chunk.
long tmpLen = startOffset + length;
if (tmpLen % bytesPerChecksum != 0) {
tmpLen += (bytesPerChecksum - tmpLen % bytesPerChecksum);
}
if (tmpLen < endOffset) {
endOffset = tmpLen;
}
}
// seek to the right offsets
if (offset > 0) {
long checksumSkip = (offset / bytesPerChecksum) * checksumSize;
// note blockInStream is seeked when created below
if (checksumSkip > 0) {
// Should we use seek() for checksum file as well?
IOUtils.skipFully(checksumIn, checksumSkip);
}
}
seqno = 0;
blockIn = datanode.data.getBlockInputStream(block, offset); // seek to offset
memoizedBlock = new MemoizedBlock(blockIn, blockLength, datanode.data, block);
} catch (IOException ioe) {
IOUtils.closeStream(this);
IOUtils.closeStream(blockIn);
throw ioe;
}
}
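The offset arithmetic in the constructor above can be isolated into a small worked example. The sketch below reproduces only the alignment steps (round the start down to a checksum chunk boundary, round the end up, clip to the block length, compute how many checksum bytes to skip). The class and method names are hypothetical, and the sample numbers are chosen for illustration.

```java
// Hypothetical helper that mirrors the alignment arithmetic of the constructor.
final class ChunkAlignSketch {
    /** Returns {alignedOffset, alignedEndOffset, checksumSkip}. */
    static long[] align(long startOffset, long length, long blockLength,
                        int bytesPerChecksum, int checksumSize) {
        if (startOffset < 0 || startOffset > blockLength
                || startOffset + length > blockLength) {
            throw new IllegalArgumentException("offset and length don't match block");
        }
        // Round the start down to the beginning of its checksum chunk.
        long offset = startOffset - (startOffset % bytesPerChecksum);
        // Round the end up to the end of its checksum chunk, but not past the block.
        long end = startOffset + length;
        if (end % bytesPerChecksum != 0) {
            end += bytesPerChecksum - end % bytesPerChecksum;
        }
        if (end > blockLength) {
            end = blockLength;
        }
        // One checksum of checksumSize bytes is stored per chunk before the offset.
        long checksumSkip = (offset / bytesPerChecksum) * checksumSize;
        return new long[] {offset, end, checksumSkip};
    }
}
```

With 512-byte chunks and 4-byte checksums, a request for 100 bytes at offset 1000 in a 10000-byte block yields an aligned range of 512 to 1536 and a checksum skip of 4 bytes: 1000 lies in the second chunk, and 1100 rounds up to the next multiple of 512.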
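The BlockSender constructor code determines which data must be read from the block file and the checksum file. Take the block file as an example: the read request supplies two parameters, the offset startOffset and the data length length, but because checksums are organized by chunk, the DataNode must return every chunk that contains data the user asked for, so that the client can verify the data.
"Zero-Copy" Data Transfer
The DataNode is an I/O-intensive Java application. To take full advantage of the performance gains of Java NIO, BlockSender supports two ways of sending data: the ordinary way and the NIO way. The ordinary way uses the Java stream API to implement the DataNode's streaming interface; the NIO way uses the transferTo() method of Java NIO to implement the same interface efficiently with zero-copy data transfer.
Because BlockSender uses NIO's transferTo() for efficient “zero-copy” transfer, block data no longer passes through the DataNode's own user-space buffers. One consequence is that the DataNode loses the ability to verify the data while a client is reading it. BlockSender therefore also supports data transfer combined with checksum verification, which is used in block scanning. The other solution is to let the client verify the data and report the result: in the cleanup part of DataXceiver.readBlock(), the DataNode accepts the client's additional response code and thereby learns the client's verification result.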
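The shape of a transferTo() call can be sketched as follows. This is not BlockSender's code: the class name, the temporary file, and the in-memory sink are assumptions chosen to keep the example self-contained. With an in-memory target the JDK simply copies bytes; the zero-copy path only applies when the target is a channel the operating system can write to directly, such as a socket channel.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of FileChannel.transferTo(); not part of Hadoop.
final class TransferToSketch {
    /** Sends count bytes of the file, starting at offset, to the target channel. */
    static long send(Path file, long offset, long count, WritableByteChannel target)
            throws IOException {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo() may move fewer bytes than requested, so loop.
                long n = src.transferTo(offset + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file, or the target accepted nothing
                }
                sent += n;
            }
            return sent;
        }
    }

    /** Writes data to a temporary file and returns the slice sent through send(). */
    static byte[] demo(byte[] data, long offset, long count) throws IOException {
        Path tmp = Files.createTempFile("block", ".dat");
        try {
            Files.write(tmp, data);
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            send(tmp, offset, count, Channels.newChannel(sink));
            return sink.toByteArray();
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that the application code never sees the bytes that move from the file to the target, which is exactly why a sender using this path cannot verify checksums on the way out.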
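Writing Data
Writing data through the streaming interface is far more complex than reading. The operation code for a client writing HDFS file data is 80, and the request contains the following main fields:
blockId (block ID): identifies the data block being written; the DataNode uses it to locate the block.
generationStamp (version number): used for version checking.
pipelineSize (size of the data pipeline): the number of DataNodes taking part in the write.
isRecovery (whether this is a recovery): whether this write is part of an error recovery process.
clientName (client name): the name of the client that issued the write request.
hasSrcDataNode (source information flag): whether the write request carries source information; if true, the source information follows.
srcDataNode (source information): of type DatanodeInfo; describes the DataNode that issued the write request.
numTargets (size of the target list): how many downstream targets the current DataNode still has to push data to.
targets (target list): the list of downstream targets of the current DataNode.
accessToken (access token): related to the security features.
checksum (checksum information): of type DataChecksum; describes the checksum method of the write data packets that follow.
These fields are read at the start of writeBlock() and stored in the corresponding local variables (the checksum header itself is read by the BlockReceiver constructor). writeBlock() then constructs a block receiver, a BlockReceiver object. The BlockReceiver constructor opens output streams for the block file and the checksum file by calling FSDataset.writeToBlock(), which, after a series of checks, returns the output streams to the block file and the checksum file. The code is as follows: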
BlockReceiver(Block block, DataInputStream in, String inAddr,
String myAddr, boolean isRecovery, String clientName,
DatanodeInfo srcDataNode, DataNode datanode) throws IOException {
try{
this.block = block;
this.in = in;
this.inAddr = inAddr;
this.myAddr = myAddr;
this.isRecovery = isRecovery;
this.clientName = clientName;
this.offsetInBlock = 0;
this.srcDataNode = srcDataNode;
this.datanode = datanode;
this.checksum = DataChecksum.newDataChecksum(in);
this.bytesPerChecksum = checksum.getBytesPerChecksum();
this.checksumSize = checksum.getChecksumSize();
//
// Open local disk out
//
streams = datanode.data.writeToBlock(block, isRecovery,
clientName == null || clientName.length() == 0);
this.finalized = false;
if (streams != null) {
this.out = streams.dataOut;
this.checksumOut = new DataOutputStream(new BufferedOutputStream(
streams.checksumOut,
SMALL_BUFFER_SIZE));
// If this block is for appends, then remove it from periodic
// validation.
if (datanode.blockScanner != null && isRecovery) {
datanode.blockScanner.deleteBlock(block);
}
}
} catch (BlockAlreadyExistsException bae) {
throw bae;
} catch(IOException ioe) {
IOUtils.closeStream(this);
cleanupBlock();
// check if there is a disk error
IOException cause = FSDataset.getCauseIfDiskError(ioe);
DataNode.LOG.warn("IOException in BlockReceiver constructor. Cause is ",
cause);
if (cause != null) { // possible disk error
ioe = cause;
datanode.checkDiskError(ioe); // may throw an exception here
}
throw ioe;
}
}
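In the data pipeline, HDFS file data flows downstream while the acknowledgements of write operations flow back upstream, so two Socket objects are needed here. The object s communicates with the upstream side of the pipeline; its input and output streams are in and replyOut. The Socket toward the downstream side is mirrorSock, with the output stream mirrorOut and the input stream mirrorIn.
If the current DataNode is not the last node of the pipeline, writeBlock() uses the first entry of the target list to establish a Socket connection to the next DataNode. Once the connection is established, it sends a write request to the next DataNode through the output stream mirrorOut. Apart from the target list size and the target list itself, which change accordingly, all other fields are identical to the request read from upstream. The code of writeBlock() is as follows: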
/**
* Write a block to disk.
*
* @param in The stream to read from
* @throws IOException
*/
private void writeBlock(DataInputStream in) throws IOException {
DatanodeInfo srcDataNode = null;
LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() +
" tcp no delay " + s.getTcpNoDelay());
//
// Read in the header
//
Block block = new Block(in.readLong(),
dataXceiverServer.estimateBlockSize, in.readLong());
LOG.info("Receiving block " + block +
" src: " + remoteAddress +
" dest: " + localAddress);
int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
boolean isRecovery = in.readBoolean(); // is this part of recovery?
String client = Text.readString(in); // working on behalf of this client
boolean hasSrcDataNode = in.readBoolean(); // is src node info present
if (hasSrcDataNode) {
srcDataNode = new DatanodeInfo();
srcDataNode.readFields(in);
}
int numTargets = in.readInt();
if (numTargets < 0) {
throw new IOException("Mislabelled incoming datastream.");
}
DatanodeInfo targets[] = new DatanodeInfo[numTargets];
for (int i = 0; i < targets.length; i++) {
DatanodeInfo tmp = new DatanodeInfo();
tmp.readFields(in);
targets[i] = tmp;
}
Token<BlockTokenIdentifier> accessToken = new Token<BlockTokenIdentifier>();
accessToken.readFields(in);
DataOutputStream replyOut = null; // stream to prev target
replyOut = new DataOutputStream(
NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
if (datanode.isBlockTokenEnabled) {
try {
datanode.blockTokenSecretManager.checkAccess(accessToken, null, block,
BlockTokenSecretManager.AccessMode.WRITE);
} catch (InvalidToken e) {
try {
if (client.length() != 0) {
replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN);
Text.writeString(replyOut, datanode.dnRegistration.getName());
replyOut.flush();
}
throw new IOException("Access token verification failed, for client "
+ remoteAddress + " for OP_WRITE_BLOCK for block " + block);
} finally {
IOUtils.closeStream(replyOut);
}
}
}
DataOutputStream mirrorOut = null; // stream to next target
DataInputStream mirrorIn = null; // reply from next target
Socket mirrorSock = null; // socket to next target
BlockReceiver blockReceiver = null; // responsible for data handling
String mirrorNode = null; // the name:port of next target
String firstBadLink = ""; // first datanode that failed in connection setup
short mirrorInStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
try {
// open a block receiver and check if the block does not exist
blockReceiver = new BlockReceiver(block, in,
s.getRemoteSocketAddress().toString(),
s.getLocalSocketAddress().toString(),
isRecovery, client, srcDataNode, datanode);
//
// Open network conn to backup machine, if
// appropriate
//
if (targets.length > 0) {
InetSocketAddress mirrorTarget = null;
// Connect to backup machine
mirrorNode = targets[0].getName();
mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
mirrorSock = datanode.newSocket();
try {
int timeoutValue = datanode.socketTimeout +
(HdfsConstants.READ_TIMEOUT_EXTENSION * numTargets);
int writeTimeout = datanode.socketWriteTimeout +
(HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
mirrorSock.setSoTimeout(timeoutValue);
mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
mirrorOut = new DataOutputStream(
new BufferedOutputStream(
NetUtils.getOutputStream(mirrorSock, writeTimeout),
SMALL_BUFFER_SIZE));
mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
// Write header: Copied from DFSClient.java!
mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
mirrorOut.writeLong( block.getBlockId() );
mirrorOut.writeLong( block.getGenerationStamp() );
mirrorOut.writeInt( pipelineSize );
mirrorOut.writeBoolean( isRecovery );
Text.writeString( mirrorOut, client );
mirrorOut.writeBoolean(hasSrcDataNode);
if (hasSrcDataNode) { // pass src node information
srcDataNode.write(mirrorOut);
}
mirrorOut.writeInt( targets.length - 1 );
for ( int i = 1; i < targets.length; i++ ) {
targets[i].write( mirrorOut );
}
accessToken.write(mirrorOut);
blockReceiver.writeChecksumHeader(mirrorOut);
mirrorOut.flush();
// read connect ack (only for clients, not for replication req)
if (client.length() != 0) {
mirrorInStatus = mirrorIn.readShort();
firstBadLink = Text.readString(mirrorIn);
if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
LOG.info("Datanode " + targets.length +
" got response for connect ack " +
" from downstream datanode with firstbadlink as " +
firstBadLink);
}
}
} catch (IOException e) {
if (client.length() != 0) {
replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR);
Text.writeString(replyOut, mirrorNode);
replyOut.flush();
}
IOUtils.closeStream(mirrorOut);
mirrorOut = null;
IOUtils.closeStream(mirrorIn);
mirrorIn = null;
IOUtils.closeSocket(mirrorSock);
mirrorSock = null;
if (client.length() > 0) {
throw e;
} else {
LOG.info(datanode.dnRegistration + ":Exception transfering block " +
block + " to mirror " + mirrorNode +
". continuing without the mirror.\n" +
StringUtils.stringifyException(e));
}
}
}
// send connect ack back to source (only for clients)
if (client.length() != 0) {
if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
LOG.info("Datanode " + targets.length +
" forwarding connect ack to upstream firstbadlink is " +
firstBadLink);
}
replyOut.writeShort(mirrorInStatus);
Text.writeString(replyOut, firstBadLink);
replyOut.flush();
}
// receive the block and mirror to the next target
String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
mirrorAddr, null, targets.length);
// if this write is for a replication request (and not
// from a client), then confirm block. For client-writes,
// the block is finalized in the PacketResponder.
if (client.length() == 0) {
datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT);
LOG.info("Received block " + block +
" src: " + remoteAddress +
" dest: " + localAddress +
" of size " + block.getNumBytes());
}
if (datanode.blockScanner != null) {
datanode.blockScanner.addBlock(block);
}
} catch (IOException ioe) {
LOG.info("writeBlock " + block + " received exception " + ioe);
throw ioe;
} finally {
// close all opened streams
IOUtils.closeStream(mirrorOut);
IOUtils.closeStream(mirrorIn);
IOUtils.closeStream(replyOut);
IOUtils.closeSocket(mirrorSock);
IOUtils.closeStream(blockReceiver);
}
}
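The way the target list shrinks as the request header travels down the pipeline, as in the header-forwarding step of writeBlock() above, can be shown with a short simulation. The names PipelineForwardSketch, forwardedTargets(), and connections(), as well as the node names used with them, are hypothetical; only the rule "connect to targets[0], forward targets[1..]" is taken from the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of how each node forwards the remaining targets.
final class PipelineForwardSketch {
    /** The target list a node sends downstream: everything after the first entry. */
    static String[] forwardedTargets(String[] targets) {
        return targets.length == 0
                ? new String[0]
                : Arrays.copyOfRange(targets, 1, targets.length);
    }

    /** Lists the connections made when the first node receives the given target list. */
    static List<String> connections(String firstNode, String[] targets) {
        List<String> hops = new ArrayList<>();
        String current = firstNode;
        String[] remaining = targets;
        while (remaining.length > 0) {
            hops.add(current + "->" + remaining[0]); // connect to the first target
            current = remaining[0];
            remaining = forwardedTargets(remaining);  // the next node sees one target fewer
        }
        return hops;
    }
}
```

For a first node dn1 that receives the targets dn2 and dn3, the simulation produces the connections dn1->dn2 and dn2->dn3, and the last node receives an empty target list and opens no mirror socket.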
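DataXceiver delegates the handling of write data packets to BlockReceiver.receiveBlock(). After these packets have been processed successfully, the remaining cleanup work includes calling DataNode.notifyNamenodeReceivedBlock() to notify the NameNode (in the code above, this call is made for replication requests, which carry no client name).
The PacketResponder Thread
When BlockReceiver handles a client's write request, receiveBlock() receives the data packets, verifies the data, and saves it to the local block file and checksum file. If the node sits in the middle of the data pipeline, it also has to forward the packets to the next DataNode. At the same time, the DataNode has to receive packet acknowledgements from downstream and forward them upstream. This involves read operations on the two Socket input streams mentioned above (in and mirrorIn). For this purpose the block receiver introduces the PacketResponder thread, which works together with the thread running BlockReceiver: one receives acknowledgements from downstream and the other receives data from upstream. Why are two threads needed? A read from an input stream returns immediately if data is available and otherwise blocks. If a single thread read the two input streams in turn, it would couple them. If the client sent no data to the DataNode for a long time, the reception of downstream acknowledgements would very likely be blocked. At the other extreme, even though the client wrote a large amount of data to the DataNode, the data would not be processed while the thread was waiting for input on mirrorIn, which would hurt throughput.
The PacketResponder thread separates the handling of the two input streams. It receives acknowledgements from the downstream DataNode and sends them upstream at the appropriate time. “Appropriate” here means two conditions:
1. The current DataNode has finished processing the packet successfully.
2. (When the DataNode is in the middle of the pipeline) the current DataNode has received the downstream DataNode's acknowledgement for the packet.
When both conditions hold, the current DataNode and all subsequent DataNodes in the pipeline have finished processing that packet.
Because packets on the current node are processed by the BlockReceiver thread, that thread must pass its results through some mechanism to the PacketResponder thread, which then carries out the further processing.
With these conditions in mind, the implementation of PacketResponder is easy to follow. The code is as follows: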
/**
* Processed responses from downstream datanodes in the pipeline
* and sends back replies to the originator.
*/
class PacketResponder implements Runnable, FSConstants {
//packet waiting for ack
private LinkedList<Packet> ackQueue = new LinkedList<Packet>();
private volatile boolean running = true;
private Block block;
DataInputStream mirrorIn; // input from downstream datanode
DataOutputStream replyOut; // output to upstream datanode
private int numTargets; // number of downstream datanodes including myself
private BlockReceiver receiver; // The owner of this responder.
private Thread receiverThread; // the thread that spawns this responder
public String toString() {
return "PacketResponder " + numTargets + " for Block " + this.block;
}
PacketResponder(BlockReceiver receiver, Block b, DataInputStream in,
DataOutputStream out, int numTargets,
Thread receiverThread) {
this.receiver = receiver;
this.block = b;
mirrorIn = in;
replyOut = out;
this.numTargets = numTargets;
this.receiverThread = receiverThread;
}
/**
* enqueue the seqno that is still be to acked by the downstream datanode.
* @param seqno
* @param lastPacketInBlock
*/
synchronized void enqueue(long seqno, boolean lastPacketInBlock) {
if (running) {
LOG.debug("PacketResponder " + numTargets + " adding seqno " + seqno +
" to ack queue.");
ackQueue.addLast(new Packet(seqno, lastPacketInBlock));
notifyAll();
}
}
......
}
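The member variable ackQueue of PacketResponder holds the write-request packets that the BlockReceiver thread has already processed. Each time BlockReceiver.receivePacket() finishes a packet, it calls PacketResponder.enqueue() to put the corresponding information (an instance of the inner class BlockReceiver.Packet, with two fields: the packet's sequence number and whether it is the last packet) into the queue ackQueue. The entries in ackQueue are processed by the run() method of PacketResponder. This is a typical producer-consumer model.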
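The producer-consumer hand-off between the two threads can be sketched with a queue guarded by wait() and notifyAll(). This is a reduced model, not the PacketResponder implementation: the class AckQueueSketch is hypothetical, and the place where a real responder would read the downstream acknowledgement and write the reply upstream is only marked by a comment.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical reduction of the ackQueue hand-off; network I/O is left out.
final class AckQueueSketch implements Runnable {
    static final class Packet {
        final long seqno;
        final boolean lastPacketInBlock;

        Packet(long seqno, boolean lastPacketInBlock) {
            this.seqno = seqno;
            this.lastPacketInBlock = lastPacketInBlock;
        }
    }

    private final LinkedList<Packet> ackQueue = new LinkedList<>();
    private final List<Long> acked = new ArrayList<>();

    /** Producer side: called by the receiving thread after it has processed a packet. */
    synchronized void enqueue(long seqno, boolean lastPacketInBlock) {
        ackQueue.addLast(new Packet(seqno, lastPacketInBlock));
        notifyAll(); // wake the responder if it is waiting for work
    }

    /** Consumer side: takes packets in order until the last packet of the block. */
    @Override
    public void run() {
        while (true) {
            Packet p;
            synchronized (this) {
                while (ackQueue.isEmpty()) {
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                p = ackQueue.removeFirst();
                acked.add(p.seqno);
            }
            // A real responder would wait for the downstream ack here and reply upstream.
            if (p.lastPacketInBlock) {
                return;
            }
        }
    }

    synchronized List<Long> acked() {
        return new ArrayList<>(acked);
    }

    /** Runs one responder thread and feeds it the given number of packets. */
    static List<Long> demo(int packets) throws InterruptedException {
        AckQueueSketch responder = new AckQueueSketch();
        Thread t = new Thread(responder, "responder");
        t.setDaemon(true);
        t.start();
        for (long seqno = 0; seqno < packets; seqno++) {
            responder.enqueue(seqno, seqno == packets - 1);
        }
        t.join(5000);
        return responder.acked();
    }
}
```

The receiving thread never blocks on the responder: it only appends to the queue, while the responder sleeps in wait() whenever there is nothing to acknowledge.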
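Copyright notice: parts of this article are excerpted from the book 【Hadoop技術內幕 深入解析Hadoop Common和HDFS架構設計與實現原理】 by 【蔡斌、陳湘萍】. They are reproduced only as study notes for technical exchange; commercial rights remain with the original authors. Readers are encouraged to buy the book for further study. Please retain the original authors' attribution when reposting. Thank you!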