Understanding How Single-Node Hadoop Runs MapReduce, Starting from a Custom Sort

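When processing data, sorting and aggregation are among the most common operations. With large data volumes in particular, efficient sorting is an important part of building business systems, and extracting useful data from Hadoop efficiently is a key part of the job. While writing a custom sort class I ran into a small problem. When something goes wrong at runtime, Hadoop usually prints a log entry and then throws a wrapped exception whose message is generic rather than specific, so without reading the logs it is hard to pinpoint the cause. In that situation, diving into the source code is also a good way to get familiar with the whole MapReduce execution flow. Below, a simple example illustrates one small detail to watch for when writing a custom sort.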

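The secondary sort from Hadoop: The Definitive Guide is a good way to show how a custom sort works. The book's example finds the highest temperature for each year in weather station data. If the mapper output is sorted by year only, the reducer has to iterate over all temperatures of a year, which yields the result in O(n). There is a better way (and of course Mahout already ships more capable implementation classes): partition by year and sort the temperatures in descending order. The reducer then only has to take the first record, which is the highest temperature, so it obtains the result in O(1).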
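The idea can be tried without Hadoop at all. The following is a minimal sketch in plain Java, not the job from this article: the class name `MaxTempSketch` and the `int[] {year, temperature}` record layout are assumptions made for illustration. It sorts by year ascending and temperature descending, then keeps the first record it sees for each year.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone illustration; no Hadoop classes are involved.
final class MaxTempSketch {

    // Returns year -> highest temperature by relying on sort order
    // instead of scanning every temperature of a year.
    static Map<Integer, Integer> maxPerYear(List<int[]> records) {
        List<int[]> sorted = new ArrayList<>(records);
        // Year ascending, then temperature descending.
        sorted.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(b[1], a[1]));
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int[] record : sorted) {
            // The first record seen for a year is already its maximum.
            result.putIfAbsent(record[0], record[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> records = List.of(
                new int[] {1990, 22}, new int[] {1990, 33}, new int[] {1991, 24},
                new int[] {1992, 23}, new int[] {1992, 26}, new int[] {1991, 27});
        System.out.println(maxPerYear(records)); // prints {1990=33, 1991=27, 1992=26}
    }
}
```

In the real job the sorting is done by the framework between map and reduce, so the reducer only performs the "take the first record" part.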

Suppose we have the following temperature data:

File a:

1990 22

1990 33

1991 24

File b:

1992 23

1992 26

1991 27

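The first field of each line is the year, and the number after it is the highest temperature of some month in that year. We do not care about the month, so it is omitted. Given this data, it is easy to write the following job initialization code (if this is unfamiliar, review MapReduce first; only the important parts of the code are explained here, and the rest is in the attachment):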

Configuration conf = getConf();
Job job = new Job(conf);
job.setJobName("SecondarySort");
job.setJarByClass(SecondarySort.class);
job.setMapperClass(SecondaryMapper.class);
job.setReducerClass(SecodaryRecuder.class);
job.setOutputKeyClass(MyPairComparable.class);
job.setOutputValueClass(NullWritable.class);
job.setPartitionerClass(SecondaryPartitioner.class);
job.setSortComparatorClass(SecondaryComparator.class);
job.setGroupingComparatorClass(SecondaryGroupCompator.class);
String input = args[0];
String output = args[1];
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
return job.waitForCompletion(true) ? 0 : 1;

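The input and output setup above is straightforward, so I will not go into it. Let's look at the custom sortable key class, MyPairComparable. The official API documentation gives the following implementation example: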

public class MyWritable implements Writable {
       // Some data     
       private int counter;
       private long timestamp;
       
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }
       
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }
       
       public static MyWritable read(DataInput in) throws IOException {
         MyWritable w = new MyWritable();
         w.readFields(in);
         return w;
       }
     }
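The contract in the official example is that `write` and `readFields` handle the same fields in the same order, and that `read` fills a freshly created empty instance. The sketch below shows that round trip using only `java.io`. `CounterRecordSketch` is a hypothetical stand-in that mirrors the two fields of `MyWritable`; it deliberately does not implement the Hadoop `Writable` interface so that it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable; it does not implement the Hadoop interface.
final class CounterRecordSketch {
    private int counter;
    private long timestamp;

    // Empty instance that readFields can fill in later.
    CounterRecordSketch() { }

    CounterRecordSketch(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Must read the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    int getCounter() { return counter; }

    long getTimestamp() { return timestamp; }

    // Serializes the record to bytes and reads it back into a fresh instance.
    static CounterRecordSketch roundTrip(CounterRecordSketch source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            CounterRecordSketch copy = new CounterRecordSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CounterRecordSketch copy = roundTrip(new CounterRecordSketch(7, 1234L));
        System.out.println(copy.getCounter() + " " + copy.getTimestamp()); // prints "7 1234"
    }
}
```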

Following the official documentation above, it is easy to fall into the first pitfall and write something like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.mahout.common.IntPairWritable;

public class MyPairComparable implements WritableComparable<MyPairComparable>,
		Cloneable {

	private int first;

	private int second;

	@Override
	public boolean equals(Object arg0) {
		// TODO Auto-generated method stub
		return super.equals(arg0);
	}

	@Override
	public int hashCode() {
		// TODO Auto-generated method stub
		return super.hashCode();
	}

	@Override
	public String toString() {
		return first + "\t" + second;
	}

	public MyPairComparable(int first, int second) {
		super();
		this.first = first;
		this.second = second;
	}

	public int getFirst() {
		return first;
	}

	public void setFirst(int first) {
		this.first = first;
	}

	public int getSecond() {
		return second;
	}

	public void setSecond(int second) {
		this.second = second;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(first);
		out.writeInt(second);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		first = in.readInt();
		second = in.readInt();
	}

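	@Override
	public int compareTo(MyPairComparable other) {
		// Year ascending, then temperature descending (same order as SecondaryComparator).
		// The original body called this.compareTo(arg0), which recurses forever.
		if (first != other.first) {
			return Integer.compare(first, other.first);
		}
		return Integer.compare(other.second, second);
	}

	static {
		// Register the comparator for this key class rather than for Mahout's IntPairWritable
		WritableComparator.define(MyPairComparable.class,
				new SecondaryComparator());
	}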

}

Likewise, following the official documentation, it is easy to write the comparator like this:

public class SecondaryComparator extends WritableComparator {

		@Override
		public int compare(WritableComparable a, WritableComparable b) {
			MyPairComparable a1 = (MyPairComparable) a;
			MyPairComparable a2 = (MyPairComparable) b;
			if (a1.getFirst() != a2.getFirst()) {
				return a1.getFirst() - a2.getFirst();
			} else {
				return -(a1.getSecond() - a2.getSecond());
			}
		}
}
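A side note on the comparison itself: subtracting two ints works for years and temperatures, but for arbitrary int values the subtraction can overflow and flip the sign of the result. A minimal sketch of the same ordering (first field ascending, second field descending) using `Integer.compare` follows. `PairOrderSketch` and its four-argument `compare` method are hypothetical names chosen for illustration and are not part of the article's code.

```java
// Hypothetical helper showing the same ordering without int subtraction.
final class PairOrderSketch {

    // Orders by first ascending, then by second descending.
    static int compare(int firstA, int secondA, int firstB, int secondB) {
        int byFirst = Integer.compare(firstA, firstB);
        if (byFirst != 0) {
            return byFirst;
        }
        // Arguments are swapped to get descending order on the second field.
        return Integer.compare(secondB, secondA);
    }

    public static void main(String[] args) {
        // Subtraction overflows: MIN_VALUE - 1 wraps around to a positive number.
        System.out.println(Integer.MIN_VALUE - 1 > 0);                 // prints true
        // Integer.compare still reports MIN_VALUE as the smaller value.
        System.out.println(compare(Integer.MIN_VALUE, 0, 1, 0) < 0);   // prints true
        // Same year: the higher temperature sorts first.
        System.out.println(compare(1990, 33, 1990, 22) < 0);           // prints true
    }
}
```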

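Running the code then fails with a NullPointerException. A problem like this is a good excuse to read the source and understand the execution flow, so let's follow a MapReduce run to see what is wrong with the code above. The overall flow is roughly: FileInputFormat uses a RecordReader to read the data -> mapper processing -> sorting within partitions (shuffle) -> reducer processing -> output.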

Start with the launch: the job calls waitForCompletion, shown below:

public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
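    if (state == JobState.DEFINE) {
      // Submit the job to the queue: generate the job ID, validate file paths, copy files to the file system, compute split information, and so on
      submit();
    }
    if (verbose) {
      // Monitor the job: run mode, progress, per-task status (succeeded, failed, or killed), and related information; not our concern here
      monitorAndPrintJob();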
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }
  public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    connect();
    final JobSubmitter submitter = // the local file system is used here
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException, 
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

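This method mainly does two things: it monitors the job and it submits the job. Submission is done by the submit function above. The logic creates a job, namely LocalJobRunner.Job, which builds up the job's information: it reads the job configuration, determines the local job's working directory, initializes the distributed cache, and so on. It then reads the split information and creates MapTaskRunnable instances to run the mapper tasks. MapTaskRunnable is the one to care about, because every mapper starts from this runnable. Its logic is simple: the key method it leads to is runNewMapper, which is what actually runs our overridden Mapper methods. The class names from here on are more common and familiar. Its code is as follows: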

// Build the task context from the job and the task ID
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
   // Create a mapper instance. The mapper class returned by taskContext is the one set when the job was configured.
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // Create an InputFormat that reads the data in the configured format; we did not set one, so the default TextInputFormat is created
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
   // The RecordReader that the InputFormat uses to read the data is also created here
    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
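   // With zero reducers the map output is written directly; otherwise NewOutputCollector creates the partitioner and the sort comparator. This is the most important place for finding the problem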
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }
    .......
    // This run method is the closest to our own code; it is what calls the map method
    mapper.run(mapperContext);

In the code above, the place that may touch our sort classes is the new NewOutputCollector constructor. Stepping into it, the first statement is:

collector = createSortingCollector(job, reporter);

As the name suggests, this creates the sorting collector. Its initialization contains the following line:

        collector.init(context);

Inside collector.init we then see:

comparator = job.getOutputKeyComparator();

Stepping into that call, we find:

ReflectionUtils.newInstance(theClass, this);

Stepping in once more:

    try {
      Constructor<T> meth = (Constructor<T>) CONSTRUCTOR_CACHE.get(theClass);
      if (meth == null) {
        meth = theClass.getDeclaredConstructor(EMPTY_ARRAY);
        meth.setAccessible(true);
        CONSTRUCTOR_CACHE.put(theClass, meth);
      }
      result = meth.newInstance();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
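The reflective lookup above can be tried in isolation. The following is a small JDK-only sketch, with hypothetical class names, of what `getDeclaredConstructor()` with no parameter types does for a class that has a no-argument constructor and for one that only declares a two-argument constructor.

```java
import java.lang.reflect.Constructor;

// Hypothetical classes used only to show the reflective lookup.
final class ReflectionSketch {

    // Declares no constructor, so the compiler supplies a no-argument one.
    static final class WithDefault { }

    // Declares a two-argument constructor, so no no-argument one is generated.
    static final class OnlyTwoArgs {
        OnlyTwoArgs(int first, int second) { }
    }

    // Mirrors the lookup-then-instantiate steps of the snippet above.
    static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            constructor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // NoSuchMethodException lands here when there is no no-argument constructor.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(WithDefault.class)); // prints true
        System.out.println(canInstantiate(OnlyTwoArgs.class)); // prints false
    }
}
```

In the Hadoop snippet the failure is not swallowed like this; it is rethrown wrapped in a RuntimeException, which is part of why the original error is hard to read.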

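This code fetches the comparator's constructor and creates an instance, and the instance is always created through the no-argument constructor. This hides a requirement that the official documentation example does not call out: the comparator SecondaryComparator must be usable when constructed with no arguments. The SecondaryComparator written above declares no constructor of its own, so it is created through the implicit default constructor, which never tells the parent WritableComparator which key class to instantiate. An explicit no-argument constructor has to be added (typically one that calls super(MyPairComparable.class, true)), otherwise the program cannot run correctly. Reading further, the keys used during sort comparison are also created through this same reflective code, so MyPairComparable needs a no-argument constructor as well; because it declares a two-argument constructor, Java does not generate one for it. With both added, the program runs correctly. Let's continue. Once the execution environment and the objects needed later have been created, the mapper executes its run method, shown below: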

 public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

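Here context.getCurrentKey and context.getCurrentValue return the data that the RecordReader read from the input split, and our overridden map method is then called on it. The map output is placed in a circular buffer; when the buffer reaches its threshold, the data is written out, organized by partition. After the mapper tasks finish, the job enters the next phase, the per-partition sort mentioned above: the data is sorted and written out by partition, using the comparator constructed earlier. Once that is done, the reduce phase begins: one runnable task is created per configured reducer and submitted to a thread pool to run the reducer tasks: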

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
        // If a back up store is used, reset it
        Iterator<VALUEIN> iter = context.getValues().iterator();
        if(iter instanceof ReduceContext.ValueIterator) {
          ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();        
        }
      }
    } finally {
      cleanup(context);
    }
  }
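Before the reducer loop above receives any keys, the map output has already been split into partitions and sorted within each partition. As a rough, Hadoop-free sketch of that step: the class `PartitionSketch`, the `int[] {year, temperature}` layout, and the "hash of the year modulo the number of partitions" rule are all assumptions for illustration. The article's SecondaryPartitioner is not shown in the text, so this is not a copy of it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "partition by year, then sort inside each partition".
final class PartitionSketch {

    // Assumed rule: non-negative hash of the year, modulo the partition count.
    static int partitionFor(int year, int numPartitions) {
        return (Integer.hashCode(year) & Integer.MAX_VALUE) % numPartitions;
    }

    static List<List<int[]>> partitionAndSort(List<int[]> records, int numPartitions) {
        List<List<int[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        // All records of one year land in the same partition.
        for (int[] record : records) {
            partitions.get(partitionFor(record[0], numPartitions)).add(record);
        }
        // Inside each partition: year ascending, temperature descending.
        for (List<int[]> partition : partitions) {
            partition.sort((a, b) -> a[0] != b[0]
                    ? Integer.compare(a[0], b[0])
                    : Integer.compare(b[1], a[1]));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<int[]> records = List.of(
                new int[] {1990, 22}, new int[] {1990, 33}, new int[] {1991, 24},
                new int[] {1992, 23}, new int[] {1992, 26}, new int[] {1991, 27});
        List<List<int[]>> partitions = partitionAndSort(records, 2);
        for (int i = 0; i < partitions.size(); i++) {
            for (int[] record : partitions.get(i)) {
                System.out.println("partition " + i + ": " + record[0] + "\t" + record[1]);
            }
        }
    }
}
```

Each partition corresponds to the input of one reducer, which is why partitioning on the year alone (and not on the temperature) keeps all records of a year together.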

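Here nextKey uses the groupingComparator to compare the keys coming from the map output. With our custom groupingComparator, it is easy to get the first record of each group, which is the record with the highest temperature for that year.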
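The effect of grouping can be sketched in plain Java as well. The example below is an illustration under stated assumptions, not Hadoop's implementation: `GroupingSketch` and `firstOfEachGroup` are hypothetical names, and the grouping comparator is assumed to compare the year only. It walks an already sorted list and keeps the first element of every run that the grouping comparator treats as equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of what a grouping comparator achieves.
final class GroupingSketch {

    // Keeps the first element of every run of elements the comparator treats as equal.
    static <T> List<T> firstOfEachGroup(List<T> sorted, Comparator<T> grouping) {
        List<T> firsts = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            // A new group starts at index 0 or wherever the grouping key changes.
            if (i == 0 || grouping.compare(sorted.get(i - 1), sorted.get(i)) != 0) {
                firsts.add(sorted.get(i));
            }
        }
        return firsts;
    }

    public static void main(String[] args) {
        // Already sorted: year ascending, temperature descending.
        List<int[]> sorted = List.of(
                new int[] {1990, 33}, new int[] {1990, 22},
                new int[] {1991, 27}, new int[] {1991, 24},
                new int[] {1992, 26}, new int[] {1992, 23});
        // Assumed grouping rule: two keys belong together when the years match.
        for (int[] record : firstOfEachGroup(sorted, (a, b) -> Integer.compare(a[0], b[0]))) {
            System.out.println(record[0] + "\t" + record[1]);
        }
    }
}
```

Because the sort comparator looks at both fields while the grouping comparator looks at the year only, the first key of each group carries that year's maximum.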

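That concludes the reducer's reduce method. What follows is housekeeping such as writing the output files, closing streams, and updating the job status. We have not gone very deep, but this completes a rough walk through the MapReduce process.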
