Splitting a TableSplit so that multiple mappers can read it concurrently

Original post

2020-02-21 06:56

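By default, one region maps to one TableSplit, and each TableSplit is read by a single mapper. Reading a large region with a single mapper is slow, so the idea is to break each default TableSplit into several smaller splits so that Hadoop can read it with multiple mappers.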

Unlike file-based Hadoop input, HBase table input does not let you get more mappers by tuning the split size through the following parameters:


mapred.min.split.size
mapred.max.split.size
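
For comparison, file-based input formats derive the split size from the block size, clamped by these two settings. To my knowledge the rule used by FileInputFormat is max(minSize, min(maxSize, blockSize)). The sketch below only restates that rule; the class name is hypothetical and nothing here touches HBase, where the split boundary is the region rather than a byte count.

```java
public class SplitSizeSketch {
    // Clamp the block size between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A max split size below the block size yields smaller splits,
        // and therefore more mappers for the same file.
        System.out.println(computeSplitSize(128 * mb, 1, 32 * mb) / mb); // prints 32
    }
}
```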

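That leaves two approaches I can think of so far: modify TableInputFormatBase so that each default TableSplit is broken into several, or handle it with a Coprocessor. Here I chose to modify the TableInputFormatBase class.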

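HBase: The Definitive Guide describes how to combine HBase with MapReduce. Using an HBase table as the data source usually goes through the following helper class: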

TableMapReduceUtil.initTableMapperJob(table[0].getBytes(), scan,
					UserViewHisMapper2.class, Text.class, Text.class,
					genRecommendations);

That call ultimately delegates to the following method for its initialization:

 public static void initTableMapperJob(byte[] table, Scan scan,
      Class<? extends TableMapper> mapper,
      Class<? extends WritableComparable> outputKeyClass,
      Class<? extends Writable> outputValueClass, Job job,
      boolean addDependencyJars)
  throws IOException {
      initTableMapperJob(Bytes.toString(table), scan, mapper, outputKeyClass,
              outputValueClass, job, addDependencyJars, TableInputFormat.class);
  }

So the place to change is the TableInputFormat class, whose core methods are inherited from TableInputFormatBase:

public class TableInputFormat extends TableInputFormatBase
implements Configurable

The class that ultimately needs modifying is therefore TableInputFormatBase, specifically this method:

public List<InputSplit> getSplits(JobContext context) throws IOException {}

At its core, this method fetches the start and end row keys of every region of the table and turns each region into one TableSplit:

  public List<InputSplit> getSplits(JobContext context) throws IOException {
	if (table == null) {
	    throw new IOException("No table was provided.");
	}
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    if (keys == null || keys.getFirst() == null ||
        keys.getFirst().length == 0) {
      throw new IOException("Expecting at least one region.");
    }
    int count = 0;
    List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
    for (int i = 0; i < keys.getFirst().length; i++) {
      if ( !includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
        continue;
      }
      String regionLocation = table.getRegionLocation(keys.getFirst()[i]).
        getHostname();
      byte[] startRow = scan.getStartRow();
      byte[] stopRow = scan.getStopRow();
      // determine if the given start an stop key fall into the region
      if ((startRow.length == 0 || keys.getSecond()[i].length == 0 ||
           Bytes.compareTo(startRow, keys.getSecond()[i]) < 0) &&
          (stopRow.length == 0 ||
           Bytes.compareTo(stopRow, keys.getFirst()[i]) > 0)) {
        byte[] splitStart = startRow.length == 0 ||
          Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ?
            keys.getFirst()[i] : startRow;
        byte[] splitStop = (stopRow.length == 0 ||
          Bytes.compareTo(keys.getSecond()[i], stopRow) <= 0) &&
          keys.getSecond()[i].length > 0 ?
            keys.getSecond()[i] : stopRow;
        InputSplit split = new TableSplit(table.getTableName(),
          splitStart, splitStop, regionLocation);
        splits.add(split);
        if (LOG.isDebugEnabled())
          LOG.debug("getSplits: split -> " + (count++) + " -> " + split);
      }
    }
    return splits;
  }
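
The two nested conditions in that method are easier to follow in isolation: the first decides whether the scan range overlaps the region at all, the second clamps the split to the overlap. A minimal sketch of the same logic on plain byte arrays, with hypothetical class and method names and a hand-written comparison standing in for Bytes.compareTo:

```java
public class RegionClampSketch {
    // Lexicographic comparison of byte arrays, treating bytes as unsigned
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Returns {start, stop} of the overlap between the region and the scan,
    // or null if they do not overlap. An empty array means "unbounded".
    static byte[][] clamp(byte[] regionStart, byte[] regionEnd,
                          byte[] scanStart, byte[] scanStop) {
        boolean startsBeforeRegionEnd = scanStart.length == 0
                || regionEnd.length == 0 || compare(scanStart, regionEnd) < 0;
        boolean stopsAfterRegionStart = scanStop.length == 0
                || compare(scanStop, regionStart) > 0;
        if (!startsBeforeRegionEnd || !stopsAfterRegionStart) return null;
        // Take the later of the two start keys and the earlier of the two stop keys
        byte[] start = (scanStart.length == 0
                || compare(regionStart, scanStart) >= 0) ? regionStart : scanStart;
        byte[] stop = ((scanStop.length == 0 || compare(regionEnd, scanStop) <= 0)
                && regionEnd.length > 0) ? regionEnd : scanStop;
        return new byte[][] { start, stop };
    }

    public static void main(String[] args) {
        // Region [b, m) scanned with [a, f) gives the overlap [b, f)
        byte[][] r = clamp("b".getBytes(), "m".getBytes(),
                           "a".getBytes(), "f".getBytes());
        System.out.println(new String(r[0]) + " .. " + new String(r[1]));
    }
}
```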

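What needs to happen here is to subdivide the rows that belong to one TableSplit into as many smaller splits as you want. I did not find a lightweight way to do this, so the approach is to iterate over every row key in the TableSplit and then cut the list into pieces, which is admittedly brute force.
Here is my implementation: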


	public List<InputSplit> getSplits(JobContext context) throws IOException {
		if (table == null) {
			throw new IOException("No table was provided.");
		}
		Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
		if (keys == null || keys.getFirst() == null
				|| keys.getFirst().length == 0) {
			throw new IOException("Expecting at least one region.");
		}
		int count = 0;
		List<InputSplit> splits = new ArrayList<InputSplit>(
				keys.getFirst().length);
		for (int i = 0; i < keys.getFirst().length; i++) {
			if (!includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
				continue;
			}
			String regionLocation = table.getRegionLocation(keys.getFirst()[i],true)
					.getHostname();
			byte[] startRow = scan.getStartRow();
			byte[] stopRow = scan.getStopRow();
			// determine if the given start an stop key fall into the region
			if ((startRow.length == 0 || keys.getSecond()[i].length == 0 || Bytes
					.compareTo(startRow, keys.getSecond()[i]) < 0)
					&& (stopRow.length == 0 || Bytes.compareTo(stopRow,
							keys.getFirst()[i]) > 0)) {
				byte[] splitStart = startRow.length == 0
						|| Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ? keys
						.getFirst()[i] : startRow;
				byte[] splitStop = (stopRow.length == 0 || Bytes.compareTo(
						keys.getSecond()[i], stopRow) <= 0)
						&& keys.getSecond()[i].length > 0 ? keys.getSecond()[i]
						: stopRow;

				// Key-only scan over this region's slice, used only to collect row keys
				Scan scan1 = new Scan();
				scan1.setStartRow(splitStart);
				scan1.setStopRow(splitStop);
				scan1.setFilter(new KeyOnlyFilter());
				// setCaching sets rows fetched per RPC; setBatch would limit columns per Result
				scan1.setCaching(500);

				// Holds every row key in this region's slice, kept as raw bytes
				// so that binary keys survive unchanged
				List<byte[]> rows = new ArrayList<byte[]>();
				ResultScanner resultscanner = table.getScanner(scan1);
				try {
					for (Result rs : resultscanner) {
						if (rs.isEmpty())
							continue;
						rows.add(rs.getRow());
					}
				} finally {
					resultscanner.close();
				}

				// Never create more splits than there are rows; always create at least one
				int numSplits = Math.max(1, Math.min(mappersPerSplit, rows.size()));
				int splitSize = rows.size() / numSplits;

				for (int j = 0; j < numSplits; j++) {
					// The first split starts at the region boundary and the last one
					// ends at it, because a split's stop row is exclusive
					byte[] subStart = (j == 0) ? splitStart : rows.get(j * splitSize);
					byte[] subStop = (j == numSplits - 1) ? splitStop
							: rows.get((j + 1) * splitSize);
					TableSplit tablesplit = new TableSplit(table.getTableName(),
							subStart, subStop, regionLocation);
					splits.add(tablesplit);
					// Log with the split counter, not the region loop index i
					if (LOG.isDebugEnabled())
						LOG.debug("getSplits: split -> " + (count++) + " -> " + tablesplit);
				}
			}
		}
		return splits;
	}
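
The boundary arithmetic for cutting one region into sub-splits can be separated from HBase and checked on its own. The sketch below is a minimal version under a few assumptions: the row keys are already collected in sorted order, the class and method names are hypothetical, and each range is half-open, so the last range has to end at the region's stop key rather than at the last row key, otherwise the final row would be dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class SubSplitSketch {
    // Cuts the half-open range [regionStart, regionStop) into at most n
    // sub-ranges whose interior boundaries are taken from the sorted row keys.
    static List<byte[][]> partition(List<byte[]> rows, byte[] regionStart,
                                    byte[] regionStop, int n) {
        // Guard against empty regions and against n larger than the row count
        int parts = Math.max(1, Math.min(n, rows.size()));
        int size = rows.size() / parts;
        List<byte[][]> ranges = new ArrayList<byte[][]>();
        for (int j = 0; j < parts; j++) {
            byte[] start = (j == 0) ? regionStart : rows.get(j * size);
            byte[] stop = (j == parts - 1) ? regionStop : rows.get((j + 1) * size);
            ranges.add(new byte[][] { start, stop });
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<byte[]> rows = new ArrayList<byte[]>();
        for (String k : new String[] { "r1", "r2", "r3", "r4", "r5", "r6", "r7" }) {
            rows.add(k.getBytes());
        }
        // Seven rows into three ranges: r0..r3, r3..r5, r5..r9
        for (byte[][] r : partition(rows, "r0".getBytes(), "r9".getBytes(), 3)) {
            System.out.println(new String(r[0]) + " .. " + new String(r[1]));
        }
    }
}
```

Any remainder from the integer division ends up in the last range, so the final mapper may read slightly more rows than the others.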

The number of sub-splits per region (mappersPerSplit in the code above) is set through configuration.
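
If scanning every row key turns out to be too expensive, a different technique (not the one used above) is to compute the boundaries arithmetically from the region's start and stop keys. The following is only a sketch under strong assumptions: both keys are non-empty, keys are treated as unsigned big-endian numbers, and the class and method names are hypothetical. It gives evenly spaced key ranges, not evenly sized row counts, so skewed key distributions will still produce unbalanced mappers.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class KeyRangeSketch {
    // Right-pads a key with zero bytes so both ends have the same length
    static byte[] pad(byte[] key, int len) {
        byte[] out = new byte[len];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    // Returns n - 1 interior boundaries evenly spaced between start and stop
    static List<byte[]> boundaries(byte[] start, byte[] stop, int n) {
        int len = Math.max(start.length, stop.length);
        BigInteger lo = new BigInteger(1, pad(start, len));
        BigInteger hi = new BigInteger(1, pad(stop, len));
        BigInteger span = hi.subtract(lo);
        List<byte[]> result = new ArrayList<byte[]>();
        for (int j = 1; j < n; j++) {
            BigInteger point = lo.add(span.multiply(BigInteger.valueOf(j))
                    .divide(BigInteger.valueOf(n)));
            byte[] raw = point.toByteArray();
            // Normalize to exactly len bytes: toByteArray may add a sign byte
            // or drop leading zero bytes
            byte[] key = new byte[len];
            int copy = Math.min(raw.length, len);
            System.arraycopy(raw, raw.length - copy, key, len - copy, copy);
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        // Four ranges between 0x00 and 0x40 need three interior boundaries
        for (byte[] b : boundaries(new byte[] { 0x00 }, new byte[] { 0x40 }, 4)) {
            System.out.println(String.format("0x%02x", b[0] & 0xff));
        }
    }
}
```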

iteye_5062
