A brief look at the HBase bulk load source code: PutSortReducer and KeyValueSortReducer

Please credit the source when reposting: http://blog.csdn.net/lonelytrooper/article/details/17040895

PutSortReducer:


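import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.StringUtils;

// Sorts the KVs contained in the Puts passed over from the map phase and writes the
// sorted KVs to the output (the write ultimately lands in the append method of
// HFileWriterV1 or HFileWriterV2).
public class PutSortReducer extends
    Reducer<ImmutableBytesWritable, Put, ImmutableBytesWritable, KeyValue> {
  @Override
  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<Put> puts,
      Reducer<ImmutableBytesWritable, Put, ImmutableBytesWritable, KeyValue>.Context context)
      throws java.io.IOException, InterruptedException {
    // although reduce() is called per-row, handle pathological case
    // A RAM threshold guards against abnormal rows. The default, 2L * (1 << 30),
    // equals Integer.MAX_VALUE + 1 (2^31 bytes, i.e. 2 GiB).
    long threshold = context.getConfiguration().getLong("putsortreducer.row.threshold",
        2L * (1 << 30));
    Iterator<Put> iter = puts.iterator();
    while (iter.hasNext()) {
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR); // KVComparator
      long curSize = 0;
      // stop at the end or the RAM threshold
      // curSize accumulates the size of the KVs read so far in this batch; another Put
      // is read only while curSize is still below the threshold.
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }
      }
      // Report how many KVs were read into map, with curSize rendered in a readable
      // KB/MB/GB form.
      context.setStatus("Read " + map.size() + " entries of " + map.getClass() + "("
          + StringUtils.humanReadableInt(curSize) + ")");
      int index = 0;
      // Write the KVs of the current batch to the output in sorted order.
      for (KeyValue kv : map) {
        context.write(row, kv);
        // Meant to report progress every 100 records. As quoted here, index is never
        // incremented, so this status update never fires.
        if (index > 0 && index % 100 == 0)
          context.setStatus("Wrote " + index);
      }
      // if we have more entries to process
      // If some Puts are still unprocessed, context.write(null, null) forces a flush.
      // That closes the current writer (StoreFile.Writer) and produces a StoreFile.
      // The next pass of the outer loop handles the remaining data and a new StoreFile
      // writer is created. In other words, in this case data for the same row key is
      // written to different StoreFiles.
      // For the details, see the write method of the RecordWriter in HFileOutputFormat.
      if (iter.hasNext()) {
        // force flush because we cannot guarantee intra-row sorted
        // order
        context.write(null, null);
      }
    }
  }
}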
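The batching logic is easier to see without the HBase types. Below is a minimal stand-alone sketch, not HBase code: the class `ChunkedSortSketch`, the method `sortInChunks` and the `FLUSH` marker are hypothetical names, strings stand in for KeyValues, string length stands in for `kv.getLength()`, and the marker stands in for `context.write(null, null)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of threshold-bounded batch sorting; not HBase code.
class ChunkedSortSketch {
    // Stands in for context.write(null, null), i.e. "roll the current writer".
    static final String FLUSH = "<flush>";

    // Sorts the input batch by batch. A batch stops growing once its accumulated
    // size reaches the threshold; a FLUSH marker separates consecutive batches.
    static List<String> sortInChunks(Iterable<String> values, long threshold) {
        List<String> out = new ArrayList<>();
        Iterator<String> iter = values.iterator();
        while (iter.hasNext()) {
            TreeSet<String> sorted = new TreeSet<>();
            long curSize = 0;
            // Stop at the end of the input or when the size budget is used up.
            while (iter.hasNext() && curSize < threshold) {
                String v = iter.next();
                sorted.add(v);
                curSize += v.length();
            }
            out.addAll(sorted);
            // More input left: order is only guaranteed within a batch, so flush.
            if (iter.hasNext()) {
                out.add(FLUSH);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The reducer's default threshold: 2 * 2^30 = 2147483648 bytes.
        System.out.println(2L * (1 << 30));
        System.out.println(sortInChunks(Arrays.asList("d", "b", "c", "a", "e"), 2));
    }
}
```

With a threshold of 2, the five one-character inputs are split into three batches, and the result is `[b, d, <flush>, a, c, <flush>, e]`: each batch is sorted, but the output as a whole is not, which is why the real reducer has to start a new file between batches.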

KeyValueSortReducer:


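import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Analogous to PutSortReducer: sorts the KVs passed over from the map phase and writes
// the sorted KVs to the output.
// All KVs of a row are held in memory at once, so a row with a very large number of
// columns risks an OutOfMemoryError.
public class KeyValueSortReducer extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
      org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
      throws java.io.IOException, InterruptedException {
    TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv: kvs) {
      map.add(kv.clone());
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (KeyValue kv: map) {
      context.write(row, kv);
      // As quoted here, index is never incremented, so this status update never fires.
      if (index > 0 && index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}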
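One detail worth noting is the `kv.clone()` call. Hadoop's reduce-side value iterator reuses the same object instance across iterations, which is most likely why each KeyValue is copied before being kept in the TreeSet. The sketch below imitates that behaviour with plain Java; `ReuseSketch`, `Cell`, `reusing` and `collect` are hypothetical names, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of why values from a reusing iterator must be copied.
class ReuseSketch {
    // Mutable holder standing in for a Writable such as KeyValue.
    static final class Cell {
        String value;

        Cell copy() {
            Cell c = new Cell();
            c.value = value;
            return c;
        }
    }

    // An Iterable that hands back one shared Cell, refilled on every next() call.
    static Iterable<Cell> reusing(final List<String> data) {
        return () -> new Iterator<Cell>() {
            private final Cell shared = new Cell();
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < data.size();
            }

            @Override
            public Cell next() {
                shared.value = data.get(i++);
                return shared;
            }
        };
    }

    // Keeps every cell seen during iteration, either as a copy or by reference,
    // and returns the values the kept cells hold afterwards.
    static List<String> collect(List<String> data, boolean copy) {
        List<Cell> kept = new ArrayList<>();
        for (Cell c : reusing(data)) {
            kept.add(copy ? c.copy() : c);
        }
        List<String> out = new ArrayList<>();
        for (Cell c : kept) {
            out.add(c.value);
        }
        return out;
    }
}
```

For the input `a, b, c`, `collect(data, false)` yields `[c, c, c]` because all three kept references point at the one shared object, while `collect(data, true)` yields `[a, b, c]`.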

A few words on TotalOrderPartitioner and SimpleTotalOrderPartitioner:

TotalOrderPartitioner:

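This is the partitioner that produces a total order. The TotalOrderPartitioner (TOP) in HBase is essentially a straight copy of the one in Hadoop, and it works by reading the partition split points from an external file. In bulk load, that external file holds the split points derived from the region startKeys obtained from the HTable, and it is written to the path Path partitionsPath = new Path(job.getWorkingDirectory(), "partitions_" + UUID.randomUUID()).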
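The lookup against the split points can be sketched as a binary search. This is a simplified illustration that assumes string keys and an in-memory array; the class `SplitPointSketch` and its `getPartition` method are hypothetical names, and the real partitioner works on the job's key type with the split points loaded from the partitions file.

```java
import java.util.Arrays;

// Hypothetical illustration of split-point based partitioning; not the Hadoop class.
class SplitPointSketch {
    // Returns the partition index for key, given split points in ascending order.
    // n split points define n + 1 partitions; a key equal to a split point
    // belongs to the partition that starts at that split point.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Hit: pos is the index of the matching split point, so the key opens
        // partition pos + 1. Miss: binarySearch returns -(insertionPoint) - 1,
        // and the insertion point is the partition index.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "n", "t"};
        for (String key : new String[] {"a", "g", "h", "n", "z"}) {
            System.out.println(key + " -> " + getPartition(key, splits));
        }
    }
}
```

With the split points `g`, `n`, `t`, the keys `a`, `g`, `h`, `n`, `z` map to partitions 0, 1, 1, 2, 3. Because every key in partition i sorts before every key in partition i + 1, concatenating the reducers' sorted outputs gives a globally sorted result, and in bulk load each partition lines up with one region.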

SimpleTotalOrderPartitioner:

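A simpler total-order partitioner. It divides the key range between the configured startkey and endkey evenly, and each resulting interval is closed on the left and open on the right.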
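The even split with left-closed, right-open intervals can be illustrated with numeric keys. This is a deliberately simplified sketch: the real partitioner operates on byte-array row keys, whereas the hypothetical `EvenSplitSketch.getPartition` below models keys as small non-negative longs so the arithmetic is easy to follow.

```java
// Hypothetical illustration of evenly dividing [start, end) among partitions.
class EvenSplitSketch {
    // Maps key to one of numPartitions equal-width intervals of [start, end).
    // Assumes start <= key < end and values small enough not to overflow a long.
    static int getPartition(long key, long start, long end, int numPartitions) {
        if (numPartitions <= 0 || key < start || key >= end) {
            throw new IllegalArgumentException("key outside [start, end) or bad partition count");
        }
        // Integer division floors the result, so a key sitting exactly on an
        // interval boundary falls into the interval that starts there.
        return (int) ((key - start) * numPartitions / (end - start));
    }

    public static void main(String[] args) {
        for (long key : new long[] {0, 24, 25, 50, 99}) {
            System.out.println(key + " -> " + getPartition(key, 0, 100, 4));
        }
    }
}
```

With the range [0, 100) and 4 partitions, keys 0 and 24 land in partition 0, key 25 in partition 1, key 50 in partition 2 and key 99 in partition 3. The boundary key 25 belongs to the interval it opens, which is what left-closed, right-open means here.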
