Hadoop MR (In English)

Reference documents
https://netjs.blogspot.com/2018/02/how-mapreduce-works-in-hadoop.html

Map and Reduce tasks in Hadoop

Input splits in Hadoop (processed by the map tasks in a completely parallel manner)

Map task

<K1, V1> -> map -> <K2, V2>

Reduce task

<K2, list(V2)> -> reduce -> <K3, V3>
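In the Hadoop Java API these type pairs correspond to the generic parameters of org.apache.hadoop.mapreduce.Mapper and Reducer. A minimal sketch for the word count example used below (the class names WordCountMapper and WordCountReducer are illustrative; in a real job each would typically be a public class in its own file or a static nested class of the driver):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// <K1, V1> = <LongWritable, Text>; <K2, V2> = <Text, IntWritable>
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() implementation shown later in this post
}

// <K2, list(V2)> = <Text, Iterable<IntWritable>>; <K3, V3> = <Text, IntWritable>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce() implementation shown later in this post
}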

How map task works in Hadoop

eg: https://netjs.blogspot.com/2018/02/word-count-mapreduce-program-in-hadoop.html

  • wordcount.txt
Hello wordcount MapReduce Hadoop program.
This is my first MapReduce program.

Each line will be passed to the map function as a <byte offset of the line, line content> pair, in the following format.

<0, Hello wordcount MapReduce Hadoop program.>
<41, This is my first MapReduce program.>
  • map function
// Text object reused for each output key (declared as a field of the Mapper class)
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException {
    // Split the line on whitespace
    String[] stringArr = value.toString().split("\\s+");
    for (String str : stringArr) {
        word.set(str);
        // Emit <word, 1> for every token in the line
        context.write(word, new IntWritable(1));
    }
}
  • map outputs
Line 1 <key, value> output

(Hello, 1) 
(wordcount, 1) 
(MapReduce, 1)
(Hadoop, 1)
(program., 1)
Line 2 <key, value> output

(This, 1) 
(is, 1) 
(my, 1) 
(first, 1) 
(MapReduce, 1) 
(program., 1)

Shuffling and sorting by Hadoop Framework

The output of the map function doesn't become the input of the reduce function directly. It first goes through shuffling and sorting by the Hadoop framework, during which the data is sorted and grouped by keys.

After this internal processing the data will be in the following format, which is the input to the reduce function.

<Hadoop, (1)>
<Hello, (1)>
<MapReduce, (1, 1)>
<This, (1)>
<first, (1)>
<is, (1)>
<my, (1)>
<program., (1, 1)>
<wordcount, (1)>

https://netjs.blogspot.com/2018/07/shuffle-and-sort-phases-hadoop-mapreduce.html

Shuffle And Sort Phases

The Hadoop framework also guarantees that the map output is sorted by keys. This whole internal processing of sorting the map output and transferring it to the reducers is known as the shuffle phase in the Hadoop framework.

  1. Data from mappers is partitioned according to the number of reducers (see the partitioner sketch after this list).
  2. Data is also sorted by keys within a partition.
  3. Output from the maps is written to disk as multiple temporary files.
  4. Once the map task is finished, all the files written to disk are merged to create a single file.
  5. Data from a particular partition (from all mappers) is transferred to the reducer that is supposed to process that particular partition.
  6. If the data transferred to a reducer exceeds the memory limit, it is copied to disk.
  7. Once the reducer has got its portion of data from all the mappers, the data is merged again, while still maintaining the sort order of keys, to create the reduce task input.
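Step 1 in the list above is done by the Partitioner. The default is HashPartitioner, whose logic is essentially the following (shown here as an illustrative re-implementation for the word count types, not the framework source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same logic as Hadoop's default HashPartitioner: every (key, value) pair from a
// mapper is assigned to exactly one of the numReduceTasks partitions.
public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then bucket by reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

With job.setNumReduceTasks(2), every mapper's output is split into 2 partitions, and partition i from all mappers is fetched by reducer i.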

Shuffle phase process at the mapper side

When the map task starts producing output, it is not written directly to disk; instead there is a memory buffer (100 MB by default) where the map output is kept. This size is configurable through the parameter mapreduce.task.io.sort.mb.

When that in-memory data is spilled to disk is controlled by the configuration parameter mapreduce.map.sort.spill.percent (default 80% of the memory buffer). Once this threshold is reached, a thread begins to spill the contents to disk in the background.
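Both parameters can also be set per job through the Configuration object; a minimal sketch (the values below are only examples, the defaults are 100 MB and 0.80):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapSpillConfig {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // In-memory buffer for map output, in MB (default 100)
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fraction of the buffer at which a background thread starts spilling to disk (default 0.80)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        return Job.getInstance(conf, "word count");
    }
}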

eg: 4 mappers and 2 reducers for a MapReduce job
[Figure: map outputs from 4 mappers partitioned for 2 reducers]

If there is a Combiner, it is also executed at this point in order to reduce the size of the data written to disk (see the sketch below).

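For word count the reduce function just sums counts, so the reducer class itself can be used as the Combiner. A minimal sketch of how it is plugged into the job (assuming the WordCountMapper/WordCountReducer classes sketched earlier):

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    public static void configure(Job job) {
        job.setMapperClass(WordCountMapper.class);
        // The Combiner pre-aggregates each mapper's local output before it is written to disk
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
    }
}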

Shuffle phase process at the reducer side

By this time the map output is ready and stored on the local disk of the node where the map task was executed. Now the relevant partition of the output of all the mappers has to be transferred to the nodes where the reducers are running.

Reducers don’t wait for all the map tasks to finish before starting to copy data; as soon as a map task finishes, the data transfer from that node starts.

Data copied from the mappers is kept in a memory buffer at the reducer side too. The size of the buffer is configured using the following parameter:
mapreduce.reduce.shuffle.input.buffer.percent

This merging of the map outputs at the reducer is known as the sort phase. During this phase the framework groups the reducer inputs by key, since different mappers may have produced the same key as output.

The threshold for triggering the merge to disk is configured using the following parameter:
mapreduce.reduce.merge.inmem.threshold

The merged file, which is the combination of the data written to disk as well as the data still kept in memory, constitutes the input for the reduce task.
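As on the map side, these reducer-side parameters can be set per job; a sketch using the parameters mentioned above (the values shown are the defaults):

import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleConfig {
    public static Configuration createConf() {
        Configuration conf = new Configuration();
        // Fraction of the reducer's heap used to buffer map outputs copied during the shuffle (default 0.70)
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Number of in-memory map outputs that triggers an in-memory merge and spill to disk (default 1000)
        conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);
        return conf;
    }
}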

How reduce task works in Hadoop

  • reduce function
// IntWritable reused for the output value (declared as a field of the Reducer class)
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context) 
        throws IOException, InterruptedException {
    // Sum up all the 1s emitted for this word
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    // Emit <word, total count>
    context.write(key, result);
}
  • reduce output
Hadoop      1
Hello       1
MapReduce   2
This        1
first       1
is          1
my          1
program.    2
wordcount   1
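For completeness, a minimal driver that wires the map and reduce functions above into a job could look like this (the class names and the command-line argument handling are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // map:    <LongWritable, Text> -> <Text, IntWritable>
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation, as in the Combiner section
        job.setReducerClass(WordCountReducer.class);    // reduce: <Text, list(IntWritable)> -> <Text, IntWritable>
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory with wordcount.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}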