Why LoadIncrementalHFiles copies instead of moving

Please credit the source when reposting: http://blackwing.iteye.com/blog/1991901

In an earlier post I implemented a custom job that generates HFiles and loads them into HBase with LoadIncrementalHFiles: http://blackwing.iteye.com/blog/1991380

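However, the load turned out to be very slow and failed several times, even though the official documentation says this operation is a move: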
The completebulkload utility will move generated StoreFiles into an HBase table. This utility is often used in conjunction with output from Section 15.1.10, “ImportTsv”.


Google proved its worth once again: the project's issue tracker already explains this problem:

https://issues.apache.org/jira/browse/HBASE-9537
https://issues.apache.org/jira/browse/HBASE-8304


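The root cause: during a bulk load into HBase, if the HDFS URI is written without a port, the source and destination are treated as two different filesystems, so the files are copied. When both are recognized as the same filesystem, the files are moved instead, and a 2 GB load finished within seconds.
The two commands differ as follows:
Copy case: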
./hbase-0.96.0-hadoop1/bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://namenode/outputtsv gonghui_test


Move case:
./hbase-0.96.0-hadoop1/bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://namenode:8020/outputtsv gonghui_test
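
The difference between the two commands can be reproduced with plain java.net.URI objects. The following is a minimal sketch, not HBase code: the class FsUriCheck and its methods are hypothetical, and it assumes that the cluster's filesystem URI is hdfs://namenode:8020 and that a filesystem URI is simply the scheme and authority exactly as written in the path.

```java
import java.net.URI;

// Hypothetical helper, not part of HBase or Hadoop.
class FsUriCheck {
    // Reduce a path URI to "scheme://authority", as written by the user.
    static URI fsRoot(URI path) {
        return URI.create(path.getScheme() + "://" + path.getAuthority());
    }

    // True when the source would be treated as the same filesystem (move),
    // false when it would be treated as a different one (copy).
    static boolean wouldMove(URI src, URI destinationFs) {
        return fsRoot(src).equals(destinationFs);
    }

    public static void main(String[] args) {
        // Assumed cluster filesystem URI.
        URI clusterFs = URI.create("hdfs://namenode:8020");
        // Without the port the URIs differ, so the files would be copied.
        System.out.println(wouldMove(URI.create("hdfs://namenode/outputtsv"), clusterFs));      // false
        // With the port the URIs match, so the files would be moved.
        System.out.println(wouldMove(URI.create("hdfs://namenode:8020/outputtsv"), clusterFs)); // true
    }
}
```

The comparison is purely textual on scheme, host, and port: a missing port is not resolved to the default one, which is why two URIs that name the same NameNode can still compare as different.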


The rest calls for a careful read of the source code.
Without secure authentication, the call flow is as follows:

LoadIncrementalHFiles.doBulkLoad(Path hfofDir, final HTable table) -->
LoadIncrementalHFiles.bulkLoadPhase(final HTable table, final HConnection conn,ExecutorService pool, Deque<LoadQueueItem> queue,final Multimap<ByteBuffer, LoadQueueItem> regionGroups) -->

LoadIncrementalHFiles.tryAtomicRegionLoad(final HConnection conn,final TableName tableName, final byte[] first, Collection<LoadQueueItem> lqis) -->

ProtobufUtil.bulkLoadHFile(final ClientService.BlockingInterface client,final List<Pair<byte[], String>> familyPaths,final byte[] regionName, boolean assignSeqNum) -->


ProtobufUtil.bulkLoadHFile() looks like this:
/**
 * A helper to bulk load a list of HFiles using client protocol.
 *
 * @param client
 * @param familyPaths
 * @param regionName
 * @param assignSeqNum
 * @return true if all are loaded
 * @throws IOException
 */
public static boolean bulkLoadHFile(final ClientService.BlockingInterface client,
    final List<Pair<byte[], String>> familyPaths,
    final byte[] regionName, boolean assignSeqNum) throws IOException {
  BulkLoadHFileRequest request =
    RequestConverter.buildBulkLoadHFileRequest(familyPaths, regionName, assignSeqNum);
  try {
    BulkLoadHFileResponse response =
      client.bulkLoadHFile(null, request);
    return response.getLoaded();
  } catch (ServiceException se) {
    throw getRemoteException(se);
  }
}

Here, client is backed by an HRegionServer instance.
The call then continues as follows:
HRegionServer --> HRegion.bulkLoadHFiles(...) --> HStore.bulkLoadHFile(...) --> HRegionFileSystem.bulkLoadStoreFile(...)

HRegionFileSystem.bulkLoadStoreFile(...) is the method that checks whether the source and destination paths are on the same filesystem:
/**
 * Bulk load: Add a specified store file to the specified family.
 * If the source file is on the same different file-system is moved from the
 * source location to the destination location, otherwise is copied over.
 *
 * @param familyName Family that will gain the file
 * @param srcPath {@link Path} to the file to import
 * @param seqNum Bulk Load sequence number
 * @return The destination {@link Path} of the bulk loaded file
 * @throws IOException
 */
Path bulkLoadStoreFile(final String familyName, Path srcPath, long seqNum)
    throws IOException {
  // Copy the file if it's on another filesystem
  FileSystem srcFs = srcPath.getFileSystem(conf);
  FileSystem desFs = fs instanceof HFileSystem ? ((HFileSystem)fs).getBackingFs() : fs;

  // We can't compare FileSystem instances as equals() includes UGI instance
  // as part of the comparison and won't work when doing SecureBulkLoad
  // TODO deal with viewFS
  if (!srcFs.getUri().equals(desFs.getUri())) {
    LOG.info("Bulk-load file " + srcPath + " is on different filesystem than " +
        "the destination store. Copying file over to destination filesystem.");
    Path tmpPath = createTempName();
    // Not the same filesystem, so copy the file
    FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
    LOG.info("Copied " + srcPath + " to temporary path on destination filesystem: " + tmpPath);
    srcPath = tmpPath;
  }

  return commitStoreFile(familyName, srcPath, seqNum, true);
}


If the two are not on the same filesystem, FileUtil.copy(...) is called to copy the file.
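
Since the check compares the two URIs literally, one way to stay on the move path is to make the source URI match the cluster's URI before submitting the bulk load. The sketch below is only an illustration under stated assumptions: SourceUriQualifier and withClusterPort are hypothetical names, not HBase or Hadoop APIs, and the cluster filesystem URI hdfs://namenode:8020 is assumed. It adds the cluster's port to a source URI only when scheme and host already match and the source has no port of its own.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical pre-flight helper, not part of HBase or Hadoop.
class SourceUriQualifier {
    // Returns src with the cluster's port filled in when scheme and host match
    // and src has no explicit port; otherwise returns src unchanged.
    static URI withClusterPort(URI src, URI clusterFs) {
        boolean sameScheme = src.getScheme() != null
                && src.getScheme().equalsIgnoreCase(clusterFs.getScheme());
        boolean sameHost = src.getHost() != null
                && src.getHost().equalsIgnoreCase(clusterFs.getHost());
        if (!sameScheme || !sameHost || src.getPort() != -1 || clusterFs.getPort() == -1) {
            return src;
        }
        try {
            return new URI(src.getScheme(), src.getUserInfo(), src.getHost(),
                    clusterFs.getPort(), src.getPath(), src.getQuery(), src.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed cluster filesystem URI.
        URI clusterFs = URI.create("hdfs://namenode:8020");
        System.out.println(withClusterPort(URI.create("hdfs://namenode/outputtsv"), clusterFs));
    }
}
```

This only helps when both URIs really name the same NameNode; a source on a genuinely different cluster is left untouched and would still be copied.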

The final move is actually a rename operation:
/**
 * Move the file from a build/temp location to the main family store directory.
 * @param familyName Family that will gain the file
 * @param buildPath {@link Path} to the file to commit.
 * @param seqNum Sequence Number to append to the file name (less then 0 if no sequence number)
 * @param generateNewName False if you want to keep the buildPath name
 * @return The new {@link Path} of the committed file
 * @throws IOException
 */
private Path commitStoreFile(final String familyName, final Path buildPath,
    final long seqNum, final boolean generateNewName) throws IOException {
  Path storeDir = getStoreDir(familyName);
  if(!fs.exists(storeDir) && !createDir(storeDir))
    throw new IOException("Failed creating " + storeDir);

  String name = buildPath.getName();
  if (generateNewName) {
    name = generateUniqueName((seqNum < 0) ? null : "_SeqId_" + seqNum + "_");
  }
  Path dstPath = new Path(storeDir, name);
  if (!fs.exists(buildPath)) {
    throw new FileNotFoundException(buildPath.toString());
  }
  LOG.debug("Committing store file " + buildPath + " as " + dstPath);
  // buildPath exists, therefore not doing an exists() check.
  // The rename happens here
  if (!rename(buildPath, dstPath)) {
    throw new IOException("Failed rename of " + buildPath + " to " + dstPath);
  }
  return dstPath;
}
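
To make the naming step in commitStoreFile concrete, here is a small sketch of how the committed file name could be built. The "_SeqId_" suffix logic is taken from the code above; the body of generateUniqueName is not shown in this post, so the version below (a random UUID with the dashes removed) is an assumption, and the class StoreFileNames is hypothetical.

```java
import java.util.UUID;

// Hypothetical illustration of the naming step, not HBase code.
class StoreFileNames {
    // Assumed behavior: a dash-free random UUID, plus an optional suffix.
    static String generateUniqueName(String suffix) {
        String name = UUID.randomUUID().toString().replace("-", "");
        return suffix == null ? name : name + suffix;
    }

    // Mirrors the suffix expression used in commitStoreFile above:
    // a negative sequence number means "no suffix".
    static String committedName(long seqNum) {
        return generateUniqueName(seqNum < 0 ? null : "_SeqId_" + seqNum + "_");
    }

    public static void main(String[] args) {
        System.out.println(committedName(7));   // 32 hex characters followed by _SeqId_7_
        System.out.println(committedName(-1));  // 32 hex characters, no suffix
    }
}
```

Because bulkLoadStoreFile passes generateNewName as true, the bulk-loaded file never keeps its original name in the store directory, which avoids collisions between files loaded from different jobs.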


The rename function is as follows:
/**
 * Renames a directory. Assumes the user has already checked for this directory existence.
 * @param srcpath
 * @param dstPath
 * @return true if rename is successful.
 * @throws IOException
 */
boolean rename(Path srcpath, Path dstPath) throws IOException {
  IOException lastIOE = null;
  int i = 0;
  do {
    try {
      return fs.rename(srcpath, dstPath);
    } catch (IOException ioe) {
      lastIOE = ioe;
      if (!fs.exists(srcpath) && fs.exists(dstPath)) return true; // successful move
      // dir is not there, retry after some time.
      sleepBeforeRetry("Rename Directory", i+1);
    }
  } while (++i <= hdfsClientRetriesNumber);
  throw new IOException("Exception in rename", lastIOE);
}
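
The same retry pattern can be tried out on a local filesystem. The sketch below is an analogue written with java.nio.file rather than the Hadoop FileSystem API, so it is not what HBase runs: the class RetryingRename, the maxRetries and sleepMillis parameters, and the linear back-off are assumptions made for illustration. One difference to keep in mind is that Files.move reports failure by throwing, whereas the rename call above returns a boolean.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-filesystem analogue of the retrying rename; names are hypothetical.
class RetryingRename {
    static boolean rename(Path src, Path dst, int maxRetries, long sleepMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        // One initial attempt plus up to maxRetries retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                Files.move(src, dst);
                return true;
            } catch (IOException e) {
                last = e;
                // The move may have succeeded even though an error was reported.
                if (!Files.exists(src) && Files.exists(dst)) {
                    return true;
                }
                // Wait a little longer before each retry.
                Thread.sleep(sleepMillis * (attempt + 1));
            }
        }
        throw new IOException("Exception in rename", last);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("retrying-rename");
        Path src = Files.write(dir.resolve("hfile.tmp"), new byte[] {1, 2, 3});
        Path dst = dir.resolve("hfile");
        System.out.println(rename(src, dst, 3, 100L));
    }
}
```

The check after the exception is what makes the operation safe to retry: if the source is gone and the destination exists, the move is treated as done instead of being attempted again.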