Splitting a TableSplit so that multiple mappers can read in parallel

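By default, each region maps to one TableSplit, which is read by a single mapper. Reading a whole region with one mapper is slow, so the idea is to break each default TableSplit into several smaller splits so that Hadoop can read the region with several mappers.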

Unlike a file-based Hadoop job, HBase cannot tune the split size, and with it the number of mappers, through the following parameters:

mapred.min.split.size
mapred.max.split.size
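
For context, the sketch below shows roughly how file-based input formats use these two settings: the block size is clamped between the minimum and the maximum split size, and the file is cut into pieces of that size. The class and method names are illustrative only, and the small slack factor Hadoop applies to the last split is ignored, so treat this as an approximation rather than the actual Hadoop code.

```java
class SplitSizeSketch {
    // Clamping rule for file splits: max(minSize, min(maxSize, blockSize)).
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (and therefore mappers) for one file.
    public static long numSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        long fileLength = 1024 * mb;
        // With no effective limits the split size equals the block size.
        long defaultSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // Lowering the maximum split size produces more, smaller splits.
        long cappedSize = computeSplitSize(blockSize, 1L, 32 * mb);
        System.out.println(numSplits(fileLength, defaultSize)); // 8
        System.out.println(numSplits(fileLength, cappedSize));  // 32
    }
}
```

A table input has no comparable knob because its splits are derived from region boundaries, not from byte offsets in a file.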


So far I can think of two approaches: modify TableInputFormatBase so that each default TableSplit is broken into several, or handle it with a Coprocessor. Here I chose to modify the TableInputFormatBase class.

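HBase: The Definitive Guide describes how to combine HBase with MapReduce. Reading from an HBase table as the data source usually goes through the following helper class: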
TableMapReduceUtil.initTableMapperJob(table[0].getBytes(), scan,
UserViewHisMapper2.class, Text.class, Text.class,
genRecommendations);

That method ultimately calls the following overload to do the initial setup:
 public static void initTableMapperJob(byte[] table, Scan scan,
Class<? extends TableMapper> mapper,
Class<? extends WritableComparable> outputKeyClass,
Class<? extends Writable> outputValueClass, Job job,
boolean addDependencyJars)
throws IOException {
initTableMapperJob(Bytes.toString(table), scan, mapper, outputKeyClass,
outputValueClass, job, addDependencyJars, TableInputFormat.class);
}


So the class to change would seem to be TableInputFormat. Its core methods, however, are inherited from TableInputFormatBase:

public class TableInputFormat extends TableInputFormatBase
implements Configurable


What actually has to change, then, is TableInputFormatBase, specifically the following method:

public List<InputSplit> getSplits(JobContext context) throws IOException {}


The core of this method is to fetch the start and end row keys of every region of the table and turn each region into one TableSplit:
  public List<InputSplit> getSplits(JobContext context) throws IOException {
if (table == null) {
throw new IOException("No table was provided.");
}
Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
if (keys == null || keys.getFirst() == null ||
keys.getFirst().length == 0) {
throw new IOException("Expecting at least one region.");
}
int count = 0;
List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
for (int i = 0; i < keys.getFirst().length; i++) {
if ( !includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
continue;
}
String regionLocation = table.getRegionLocation(keys.getFirst()[i]).
getHostname();
byte[] startRow = scan.getStartRow();
byte[] stopRow = scan.getStopRow();
// determine if the given start an stop key fall into the region
if ((startRow.length == 0 || keys.getSecond()[i].length == 0 ||
Bytes.compareTo(startRow, keys.getSecond()[i]) < 0) &&
(stopRow.length == 0 ||
Bytes.compareTo(stopRow, keys.getFirst()[i]) > 0)) {
byte[] splitStart = startRow.length == 0 ||
Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ?
keys.getFirst()[i] : startRow;
byte[] splitStop = (stopRow.length == 0 ||
Bytes.compareTo(keys.getSecond()[i], stopRow) <= 0) &&
keys.getSecond()[i].length > 0 ?
keys.getSecond()[i] : stopRow;
InputSplit split = new TableSplit(table.getTableName(),
splitStart, splitStop, regionLocation);
splits.add(split);
if (LOG.isDebugEnabled())
LOG.debug("getSplits: split -> " + (count++) + " -> " + split);
}
}
return splits;
}
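
The range logic in the middle of that method is easier to follow in isolation. The sketch below repeats the same conditions on plain byte arrays, with no HBase dependency: it checks whether the scan range touches a region and, if so, clips the scan range to the region boundaries. The class and method names are made up for illustration; an empty array means "unbounded" on that side, as it does in HBase.

```java
import java.nio.charset.StandardCharsets;

class RegionClipSketch {
    // Lexicographic comparison of unsigned bytes, the ordering HBase uses for row keys.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Returns {splitStart, splitStop}, or null when the scan range misses the region.
    public static byte[][] clip(byte[] regionStart, byte[] regionEnd,
                                byte[] scanStart, byte[] scanStop) {
        boolean startsBeforeRegionEnd = scanStart.length == 0
                || regionEnd.length == 0 || compare(scanStart, regionEnd) < 0;
        boolean stopsAfterRegionStart = scanStop.length == 0
                || compare(scanStop, regionStart) > 0;
        if (!(startsBeforeRegionEnd && stopsAfterRegionStart)) {
            return null;
        }
        // Start at whichever is later: the region start or the scan start.
        byte[] splitStart = (scanStart.length == 0
                || compare(regionStart, scanStart) >= 0) ? regionStart : scanStart;
        // Stop at the region end unless the scan stops earlier or the region is open-ended.
        byte[] splitStop = ((scanStop.length == 0 || compare(regionEnd, scanStop) <= 0)
                && regionEnd.length > 0) ? regionEnd : scanStop;
        return new byte[][] { splitStart, splitStop };
    }

    public static void main(String[] args) {
        byte[][] r = clip(bytes("b"), bytes("m"), bytes("d"), bytes("z"));
        // Scan [d, z) clipped to region [b, m) gives [d, m).
        System.out.println(new String(r[0], StandardCharsets.UTF_8) + " .. "
                + new String(r[1], StandardCharsets.UTF_8));
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```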


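What needs to be done here is to subdivide the rows that would belong to a single TableSplit into as many smaller splits as desired. I did not find a lightweight way to do this; the only option was to iterate over every row of a TableSplit and then cut the rows up, which is rather brute force.
Here is my implementation: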


public List<InputSplit> getSplits(JobContext context) throws IOException {
if (table == null) {
throw new IOException("No table was provided.");
}
Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
if (keys == null || keys.getFirst() == null
|| keys.getFirst().length == 0) {
throw new IOException("Expecting at least one region.");
}
int count = 0;
List<InputSplit> splits = new ArrayList<InputSplit>(
keys.getFirst().length);
for (int i = 0; i < keys.getFirst().length; i++) {
if (!includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
continue;
}
String regionLocation = table.getRegionLocation(keys.getFirst()[i],true)
.getHostname();
byte[] startRow = scan.getStartRow();
byte[] stopRow = scan.getStopRow();
// determine if the given start an stop key fall into the region
if ((startRow.length == 0 || keys.getSecond()[i].length == 0 || Bytes
.compareTo(startRow, keys.getSecond()[i]) < 0)
&& (stopRow.length == 0 || Bytes.compareTo(stopRow,
keys.getFirst()[i]) > 0)) {
byte[] splitStart = startRow.length == 0
|| Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ? keys
.getFirst()[i] : startRow;
byte[] splitStop = (stopRow.length == 0 || Bytes.compareTo(
keys.getSecond()[i], stopRow) <= 0)
&& keys.getSecond()[i].length > 0 ? keys.getSecond()[i]
: stopRow;

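// Scan only the row keys that fall into this region's part of the scan range.
Scan scan1 = new Scan();
scan1.setStartRow(splitStart);
scan1.setStopRow(splitStop);
scan1.setFilter(new KeyOnlyFilter());
// setCaching sets how many rows are fetched per RPC. setBatch would instead cap
// the columns per Result, so one wide row could come back as several Results.
scan1.setCaching(500);

// Holds every row key of this region, in scan order. Keys stay as byte[]
// because a round trip through String can corrupt binary row keys.
List<byte[]> rows = new ArrayList<byte[]>();
ResultScanner resultscanner = table.getScanner(scan1);
try {
  for (Result rs : resultscanner) {
    if (rs.isEmpty())
      continue;
    rows.add(rs.getRow());
  }
} finally {
  resultscanner.close();
}

// Never create more splits than there are rows; an empty region keeps one split.
int numSplits = Math.max(1, Math.min(mappersPerSplit, rows.size()));
for (int j = 0; j < numSplits; j++) {
  // The first sub-split starts at the region boundary, the others at a sampled row key.
  byte[] subStart = (j == 0) ? splitStart
      : rows.get((int) ((long) j * rows.size() / numSplits));
  // Stop rows are exclusive, so the last sub-split must end at splitStop;
  // ending at the last row key would silently drop that row.
  byte[] subStop = (j == numSplits - 1) ? splitStop
      : rows.get((int) ((long) (j + 1) * rows.size() / numSplits));
  TableSplit tablesplit = new TableSplit(table.getTableName(),
      subStart, subStop, regionLocation);
  splits.add(tablesplit);
  // Log with count; incrementing the region index i here would skip regions.
  if (LOG.isDebugEnabled())
    LOG.debug("getSplits: split -> " + (count++) + " -> " + tablesplit);
}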
}
}
return splits;
}


The number of splits per region (mappersPerSplit in the code above) is set through configuration.
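
The boundary arithmetic is the part most likely to go wrong, so it helps to isolate it where it can be tested without a cluster. The sketch below is one possible way to do that, not part of HBase: given the sorted row keys of a range and the number of pieces requested, it returns start/stop pairs that cover the whole range with no gaps. The class and method names are hypothetical. It assumes, as TableSplit does, that a stop row is exclusive, which is why each piece ends exactly where the next one begins and the last piece ends at the range's own stop key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SubSplitSketch {
    // Returns {start, stop} pairs that together cover [rangeStart, rangeStop).
    public static List<byte[][]> subRanges(List<byte[]> sortedRows,
                                           byte[] rangeStart, byte[] rangeStop,
                                           int requested) {
        // At least one piece, and never more pieces than rows.
        int n = Math.max(1, Math.min(requested, sortedRows.size()));
        List<byte[][]> ranges = new ArrayList<>();
        byte[] start = rangeStart;
        for (int j = 0; j < n; j++) {
            // Interior boundaries are sampled row keys; the last stop is the range stop.
            byte[] stop = (j == n - 1) ? rangeStop
                    : sortedRows.get((int) ((long) (j + 1) * sortedRows.size() / n));
            ranges.add(new byte[][] { start, stop });
            start = stop; // exclusive stop of one piece is the inclusive start of the next
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<byte[]> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(("r" + i).getBytes(StandardCharsets.UTF_8));
        }
        // Empty arrays stand for an unbounded start and stop, as in HBase.
        for (byte[][] r : subRanges(rows, new byte[0], new byte[0], 3)) {
            System.out.println("[" + new String(r[0], StandardCharsets.UTF_8) + ", "
                    + new String(r[1], StandardCharsets.UTF_8) + ")");
        }
    }
}
```

With ten rows r0 to r9 and three pieces, the boundaries fall at r3 and r6, so the pieces hold three, three, and four rows.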