Unlike a plain Hadoop job, an HBase-backed job cannot be split across multiple mappers simply by tuning the following parameters:
mapred.min.split.size
mapred.max.split.size
Two approaches come to mind: one is to modify TableInputFormatBase so that the single default TableSplit per region is broken into several smaller ones; the other is to handle it with a Coprocessor. Here I take the first route and modify TableInputFormatBase.
"HBase: The Definitive Guide" describes how to combine HBase with MapReduce: the following helper class is used to make an HBase table the data source of a job and read from it:
TableMapReduceUtil.initTableMapperJob(table[0].getBytes(), scan,
    UserViewHisMapper2.class, Text.class, Text.class,
    genRecommendations);
This helper ultimately delegates to the following overload for its initialization:
public static void initTableMapperJob(byte[] table, Scan scan,
    Class<? extends TableMapper> mapper,
    Class<? extends WritableComparable> outputKeyClass,
    Class<? extends Writable> outputValueClass, Job job,
    boolean addDependencyJars)
    throws IOException {
  initTableMapperJob(Bytes.toString(table), scan, mapper, outputKeyClass,
      outputValueClass, job, addDependencyJars, TableInputFormat.class);
}
So the class to modify is TableInputFormat. Its core methods, however, are inherited from TableInputFormatBase:
public class TableInputFormat extends TableInputFormatBase
    implements Configurable
What ultimately needs changing is therefore TableInputFormatBase, specifically this method:
public List<InputSplit> getSplits(JobContext context) throws IOException {}
The core of this method is to fetch the start and end rows of every region of the table and turn each region into one TableSplit:
public List<InputSplit> getSplits(JobContext context) throws IOException {
  if (table == null) {
    throw new IOException("No table was provided.");
  }
  Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
  if (keys == null || keys.getFirst() == null ||
      keys.getFirst().length == 0) {
    throw new IOException("Expecting at least one region.");
  }
  int count = 0;
  List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
  for (int i = 0; i < keys.getFirst().length; i++) {
    if (!includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
      continue;
    }
    String regionLocation = table.getRegionLocation(keys.getFirst()[i])
        .getHostname();
    byte[] startRow = scan.getStartRow();
    byte[] stopRow = scan.getStopRow();
    // determine if the given start and stop keys fall into the region
    if ((startRow.length == 0 || keys.getSecond()[i].length == 0 ||
        Bytes.compareTo(startRow, keys.getSecond()[i]) < 0) &&
        (stopRow.length == 0 ||
         Bytes.compareTo(stopRow, keys.getFirst()[i]) > 0)) {
      byte[] splitStart = startRow.length == 0 ||
          Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ?
          keys.getFirst()[i] : startRow;
      byte[] splitStop = (stopRow.length == 0 ||
          Bytes.compareTo(keys.getSecond()[i], stopRow) <= 0) &&
          keys.getSecond()[i].length > 0 ?
          keys.getSecond()[i] : stopRow;
      InputSplit split = new TableSplit(table.getTableName(),
          splitStart, splitStop, regionLocation);
      splits.add(split);
      if (LOG.isDebugEnabled())
        LOG.debug("getSplits: split -> " + (count++) + " -> " + split);
    }
  }
  return splits;
}
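The containment test in the middle of this method can be expressed as a standalone predicate. The sketch below is my own illustration, not HBase code: it uses java.util.Arrays.compareUnsigned (Java 9+) as a stand-in for HBase's unsigned-lexicographic Bytes.compareTo, with empty arrays standing for open-ended scan or region boundaries:

```java
import java.util.Arrays;

public class RangeCheck {
    // True when the scan range [startRow, stopRow) overlaps the region
    // [regionStart, regionEnd); an empty array means "unbounded".
    static boolean overlaps(byte[] startRow, byte[] stopRow,
                            byte[] regionStart, byte[] regionEnd) {
        return (startRow.length == 0 || regionEnd.length == 0
                || Arrays.compareUnsigned(startRow, regionEnd) < 0)
            && (stopRow.length == 0
                || Arrays.compareUnsigned(stopRow, regionStart) > 0);
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        // an unbounded scan overlaps any region
        System.out.println(overlaps(empty, empty, "b".getBytes(), "d".getBytes()));
        // scan [a, b) ends exactly where region [b, d) begins: no overlap
        System.out.println(overlaps("a".getBytes(), "b".getBytes(),
                                    "b".getBytes(), "d".getBytes()));
    }
}
```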
What needs to be done here is to subdivide the rows that would normally form one TableSplit into as many smaller splits as desired. I found no lightweight way to do this; the only option was to iterate, pull out every row key belonging to the TableSplit, and then partition them, which is admittedly brute force.
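The partitioning arithmetic itself can be sketched in plain Java. This is a standalone illustration (method and class names are mine): given the row keys of one region and a desired mapper count, it computes the start/stop row of each sub-split, with the last sub-split absorbing the remainder rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    // Returns {start, stop} row pairs; the last range absorbs the remainder.
    static List<String[]> partition(List<String> rows, int mappersPerSplit) {
        int splitSize = rows.size() / mappersPerSplit;
        List<String[]> ranges = new ArrayList<String[]>();
        for (int j = 0; j < mappersPerSplit; j++) {
            String start = rows.get(j * splitSize);
            String stop = (j == mappersPerSplit - 1)
                    ? rows.get(rows.size() - 1)            // last range: up to the final key
                    : rows.get(j * splitSize + splitSize); // others: next boundary key
            ranges.add(new String[] { start, stop });
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r01", "r02", "r03", "r04", "r05", "r06", "r07");
        for (String[] r : partition(rows, 3)) {
            System.out.println(r[0] + " -> " + r[1]);
        }
        // prints: r01 -> r03, r03 -> r05, r05 -> r07
    }
}
```

Note that adjacent ranges share a boundary key, mirroring the TableSplit start/stop semantics in the implementation below.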
Below is my implementation:
public List<InputSplit> getSplits(JobContext context) throws IOException {
  if (table == null) {
    throw new IOException("No table was provided.");
  }
  Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
  if (keys == null || keys.getFirst() == null
      || keys.getFirst().length == 0) {
    throw new IOException("Expecting at least one region.");
  }
  int count = 0;
  List<InputSplit> splits = new ArrayList<InputSplit>(
      keys.getFirst().length);
  for (int i = 0; i < keys.getFirst().length; i++) {
    if (!includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
      continue;
    }
    String regionLocation = table.getRegionLocation(keys.getFirst()[i], true)
        .getHostname();
    byte[] startRow = scan.getStartRow();
    byte[] stopRow = scan.getStopRow();
    // determine if the given start and stop keys fall into the region
    if ((startRow.length == 0 || keys.getSecond()[i].length == 0 || Bytes
        .compareTo(startRow, keys.getSecond()[i]) < 0)
        && (stopRow.length == 0 || Bytes.compareTo(stopRow,
            keys.getFirst()[i]) > 0)) {
      byte[] splitStart = startRow.length == 0
          || Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ?
          keys.getFirst()[i] : startRow;
      byte[] splitStop = (stopRow.length == 0 || Bytes.compareTo(
          keys.getSecond()[i], stopRow) <= 0)
          && keys.getSecond()[i].length > 0 ?
          keys.getSecond()[i] : stopRow;
      // scan only the keys of this region; KeyOnlyFilter strips the values
      Scan scan1 = new Scan();
      scan1.setStartRow(splitStart);
      scan1.setStopRow(splitStop);
      scan1.setFilter(new KeyOnlyFilter());
      scan1.setBatch(500);
      ResultScanner resultscanner = table.getScanner(scan1);
      // collect every row key of this region
      List<String> rows = new ArrayList<String>();
      for (Result rs : resultscanner) {
        if (rs.isEmpty())
          continue;
        rows.add(new String(rs.getRow()));
      }
      // mappersPerSplit is a field read from the job configuration;
      // each region is assumed to hold at least mappersPerSplit rows
      int splitSize = rows.size() / mappersPerSplit;
      for (int j = 0; j < mappersPerSplit; j++) {
        TableSplit tablesplit = null;
        if (j == mappersPerSplit - 1)
          // the last sub-split absorbs the remainder rows
          tablesplit = new TableSplit(table.getTableName(),
              rows.get(j * splitSize).getBytes(),
              rows.get(rows.size() - 1).getBytes(),
              regionLocation);
        else
          tablesplit = new TableSplit(table.getTableName(),
              rows.get(j * splitSize).getBytes(),
              rows.get(j * splitSize + splitSize).getBytes(),
              regionLocation);
        splits.add(tablesplit);
        if (LOG.isDebugEnabled())
          LOG.debug((new StringBuilder())
              .append("getSplits: split -> ").append(count++)
              .append(" -> ").append(tablesplit).toString());
      }
      resultscanner.close();
    }
  }
  return splits;
}
The number of sub-splits per region (mappersPerSplit) is read from the job configuration.
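One edge case the implementation above does not guard against: if a region holds fewer rows than the requested mapper count, splitSize becomes 0 and the sub-split boundaries degenerate. A minimal sketch of clamping the configured value, assuming a custom property name of my own invention (not a standard HBase key), read once in setConf():

```java
public class MapperCountClamp {
    // Clamp the configured per-region mapper count so that splitSize >= 1.
    // In the real InputFormat the requested value would come from something
    // like conf.getInt("custom.mappers.per.region", 1) inside setConf().
    static int effectiveMappers(int requested, int rowCount) {
        if (requested < 1) return 1;
        return Math.min(requested, Math.max(rowCount, 1));
    }

    public static void main(String[] args) {
        System.out.println(effectiveMappers(4, 100)); // plenty of rows: keep 4
        System.out.println(effectiveMappers(4, 2));   // only 2 rows: clamp to 2
    }
}
```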