HDPCD-Java Review Notes (2)

2. Writing MapReduce Applications


A MapReduce program consists of two main phases:

Map phase -- Data is input into the Mapper, where it is transformed and prepared for the Reducer.

Reduce phase -- Retrieves the data from the Mapper and performs the desired computations or analyses.


To write a MapReduce program, you define a Mapper class to handle the map phase and a Reducer class to handle the reduce phase.


After all of the Mappers finish executing, the intermediate <key, value> pairs go through a shuffle and sort phase, where all the values that share a key are grouped together and sent to the same Reducer.


The number of Mappers is determined by the InputFormat: the number of map tasks in a MapReduce job equals the number of InputSplits it produces.

The number of Reducers is determined by the MapReduce job configuration, specifically the mapreduce.job.reduces property.
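
In the driver this is typically set on the Job object (a minimal sketch; the value 4 is arbitrary):

// Equivalent to setting the mapreduce.job.reduces property.
job.setNumReduceTasks(4);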


A Partitioner is used to determine which <key, value> pairs are sent to which Reducer.

A Combiner can optionally be configured to combine the output of the Mapper, which can improve performance by reducing the network traffic of the shuffle and sort phase. A minimal sketch of both follows.
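
The WordPartitioner class below is illustrative, not part of the exam material (it mimics what Hadoop's default HashPartitioner already does), and a Reducer whose operation is associative and commutative, like word count's sum, can usually be reused as the Combiner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: routes each key by its hash, which is exactly what
// Hadoop's default HashPartitioner does.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the partition index is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver, both are wired onto the Job:

job.setPartitionerClass(WordPartitioner.class);
job.setCombinerClass(WordCountReducer.class); // safe here because summing is associative and commutative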


The Key/Value Pairs of MapReduce
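
Every step of a MapReduce job consumes and produces key/value pairs: the Mapper turns each input (K1, V1) pair into zero or more (K2, V2) pairs, and the Reducer receives each distinct K2 together with an Iterable of its V2 values and emits (K3, V3) pairs. In the Java API this flow shows up directly in the generic class signatures; a sketch (the class names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
class TypeFlowMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map(KEYIN key, VALUEIN value, Context context) emits (KEYOUT, VALUEOUT)
    // pairs via context.write(...).
}

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>; its input types must match the
// Mapper's output types.
class TypeFlowReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce(KEYIN key, Iterable<VALUEIN> values, Context context) is called
    // once per distinct key.
}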



The Word Count Example

The “Hello, World” of MapReduce jobs is word count: a job that takes a text file as input and outputs every word in the file, along with the number of occurrences of each word.

// Note: in Java each public class must live in its own .java file; the three
// classes below (package my, per the run command later) are shown together for brevity.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Splits each input line on spaces and emits (word, 1) for every word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String currentLine = value.toString();
        String[] words = currentLine.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

// Sums the counts for each word and emits (word, total).
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Driver: configures the job and submits it to the cluster.
public class WordCountJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "WordCountJob");
        job.setJarByClass(getClass());

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int result = ToolRunner.run(new Configuration(), new WordCountJob(), args);
            System.exit(result);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1); // exit non-zero so failures are visible to callers
        }
    }
}
Running a MapReduce Job:

1. Put the input files into HDFS.

2. If the output directory exists, delete it.

3. Use hadoop to execute the job.

4. View the output files.


hadoop jar wordcount.jar my.WordCountJob input/file.txt result
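
End to end, the four steps above might look like this (a sketch assuming the jar and paths from the command above; part-r-00000 is the default file name for the first reduce task's output):

hdfs dfs -put file.txt input/file.txt                             # 1. put the input file into HDFS
hdfs dfs -rm -r result                                            # 2. delete the output directory if it exists
hadoop jar wordcount.jar my.WordCountJob input/file.txt result    # 3. execute the job
hdfs dfs -cat result/part-r-00000                                 # 4. view the output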


The Map Phase


Output Memory Buffer (Mapper): a circular in-memory buffer that holds the Mapper's output records in serialized form.

The size of the Mapper's output memory buffer is configurable with the mapreduce.task.io.sort.mb property. A spill occurs when the buffer reaches a certain capacity, configured by the mapreduce.map.sort.spill.percent property; the buffer's contents are then written to disk.
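
Both properties can be tuned per job in the driver; a minimal sketch, assuming the Job object from the WordCountJob driver above (the values shown are the usual defaults, 100 MB and 80%):

Configuration conf = job.getConfiguration();
// Size of the in-memory sort buffer, in megabytes (default 100).
conf.setInt("mapreduce.task.io.sort.mb", 100);
// Fraction of the buffer that can fill before a background spill to disk starts (default 0.80).
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);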


The Reduce Phase


The reduce phase can actually be broken down into three phases:

Shuffle -- Also referred to as the fetch phase, this is when the Reducers retrieve the output of the Mappers using Netty. All records with the same key are combined and sent to the same Reducer.

Sort -- This phase happens simultaneously with the shuffle phase: as the records are fetched and merged, they are sorted by key.

Reduce -- The reduce method is invoked once for each key, with all the records sharing that key combined into an iterable collection.


A few open-source projects that are currently being ported onto YARN for use in Hadoop 2.x:

Tez -- Improves the execution of MapReduce jobs.

Slider -- For deploying existing distributed applications onto YARN.

Storm -- For real-time computing.

Spark -- A MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter.

OpenMPI -- A high-performance message-passing library that implements MPI-2.

Apache Giraph -- A graph processing platform.


YARN Components



YARN consists of the following main components:

ResourceManager (Scheduler, ApplicationsManager (AsM))

NodeManager

ApplicationMaster

Containers


Lifecycle of a YARN Application:




A Cluster View Example



Hadoop Streaming



Running a Hadoop Streaming Job

hadoop jar $HADOOP_HOME/lib/hadoop-streaming.jar \
    -input input_directories \
    -output output_directories \
    -mapper mapper_script \
    -reducer reducer_script
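
Streaming runs the Mapper and Reducer as external executables that read records from stdin and write key/value pairs to stdout. For a quick smoke test, standard Unix tools can stand in for custom scripts (an example along the lines of the Hadoop documentation; the directory names are placeholders):

hadoop jar $HADOOP_HOME/lib/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc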



