Hadoop Version Issues

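I have been reading Hadoop in Action lately. The book is good overall, but Hadoop releases differ a lot from one another, so make sure you study with one consistent version; otherwise you will put in twice the effort for half the result.

Chapter 4, Section 4 of the book covers the differences between versions. Here is a brief summary.

The Hadoop website lists the following: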

1.0.X - current stable version, 1.0 release
1.1.X - current beta version, 1.1 release
2.X.X - current alpha version
0.23.X - similar to 2.X.X but missing NN HA.
0.22.X - does not include security
0.20.203.X - old legacy stable version
0.20.X - old legacy version

http://hadoop.apache.org/releases.html (written: 2013-3-10 16:25)

-------------------------------update 2014-7-22---------------------------------------------------------

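Unfortunately, my earlier summary of Hadoop versions was not very clear, so here is another attempt. First, three figures about the version history:

From the figures we can see:

The 0.20 branch eventually became the 1.X branch. It contains several important releases; 0.20.2, for example, added true Append support. After 0.20.205 the line was renamed straight to 1.0. The two are essentially identical; the change is just a rename.
The 0.23 branch eventually became the 2.X branch, which is what people now call Hadoop 2.0. This line changed a lot, introducing YARN and HDFS Federation.
Hadoop 1.0 refers to 1.x (0.20.x), 0.21 and 0.22
Hadoop 2.0 refers to 2.x and 0.23.x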
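
To make the mapping above concrete, here is a minimal sketch in plain Java. The class HadoopGeneration and its method of() are hypothetical names of my own, not part of any Hadoop API, and the rules encode only the two lines above: 2.x and 0.23.x count as "Hadoop 2.0", while 1.x, 0.20.x, 0.21 and 0.22 count as "Hadoop 1.0". Anything else is reported as "unknown".

```java
// Hypothetical helper, not part of Hadoop: classifies a release number
// into the generation names used in the list above.
class HadoopGeneration {

    static String of(String version) {
        // Only the first two numeric components matter for this mapping.
        String[] parts = version.trim().split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;

        // 2.x and 0.23.x: the line that introduced YARN and HDFS Federation.
        if (major == 2 || (major == 0 && minor == 23)) {
            return "Hadoop 2.0";
        }
        // 1.x (renamed from 0.20.x), plus 0.21 and 0.22.
        if (major == 1 || (major == 0 && minor >= 20 && minor <= 22)) {
            return "Hadoop 1.0";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String[] samples = { "1.1.0", "0.20.205.0", "0.23.7", "2.0.3-alpha", "0.18.3" };
        for (String v : samples) {
            System.out.println(v + " -> " + of(v));
        }
    }
}
```

The helper assumes a concrete release number such as 0.20.205.0; a placeholder like 2.X.X would fail to parse.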

Reference links:

http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/

http://blog.cloudera.com/blog/2012/04/apache-hadoop-versions-looking-ahead-3/

http://elephantscale.com/hadoop2_handbook/Hadoop_Versions.html

--------------------end---------------------

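This shows that Hadoop is developing quickly, with alpha, beta and stable releases all available at once. It also suggests that open-source Hadoop is a popular choice among programmers for processing big data.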

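The book says the most stable version is 0.18.3, but since it was written in 2009 that advice is of limited value now. Version 0.20, however, bridges old and new: it still supports the entire old API, merely marking it deprecated, and it also supports the new API well, so it is a good version to learn on. The book expected releases after 0.20 to drop the old API outright.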

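Before 0.20, the MapReduce classes lived in the org.apache.hadoop.mapred package. The new API moves them into the new package org.apache.hadoop.mapreduce, with many supporting classes under org.apache.hadoop.mapreduce.lib. According to the book, code written for releases after 0.20 should no longer reference classes in org.apache.hadoop.mapred.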

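In the new API, the most significant change is the introduction of the Context class, which replaces both the OutputCollector and Reporter objects.

In the new API, the map() and reduce() functions live in the abstract classes Mapper and Reducer. These two abstract classes replace the interfaces org.apache.hadoop.mapred.Mapper and org.apache.hadoop.mapred.Reducer, and they also replace the MapReduceBase class.

In the new API, JobConf and JobClient are gone. Their functionality moved into the Configuration class and the new Job class (Configuration was previously the parent class of JobConf). Configuration only configures a job, while Job defines and controls its execution.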

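Below is the same job written against the old API and the new API. You can use these as templates when writing your own Hadoop programs.

First, the old-API code: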

package com.ytu.old;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyOldJob extends Configured implements Tool {

	public static class MapClass extends MapReduceBase implements
			Mapper<Text, Text, Text, Text> {

		@Override
		public void map(Text key, Text value, OutputCollector<Text, Text> output,
				Reporter reporter) throws IOException {
			// Swap key and value so each record is emitted as (value, key).
			output.collect(value, key);
		}

	}
	public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

		@Override
		public void reduce(Text key, Iterator<Text> values,
				OutputCollector<Text, Text> output, Reporter reporter)
				throws IOException {
			// Join all values for this key into one comma-separated string.
			StringBuilder csv = new StringBuilder();
			while (values.hasNext()) {
				if (csv.length() > 0) {
					csv.append(",");
				}
				csv.append(values.next().toString());
			}
			output.collect(key, new Text(csv.toString()));
		}

	}
	@Override
	public int run(String[] args) throws Exception {
		// Build the job from the configuration that ToolRunner passed in.
		Configuration conf = this.getConf();
		JobConf job = new JobConf(conf, MyOldJob.class);

		// args[0] is the input path, args[1] the output path.
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		job.setJobName("MyOldJob");
		job.setMapperClass(MapClass.class);
		job.setReducerClass(Reduce.class);

		job.setInputFormat(KeyValueTextInputFormat.class);
		job.setOutputFormat(TextOutputFormat.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		// Split each input line into key and value at the first comma.
		job.set("key.value.separator.in.input.line", ",");

		// Submit the job and block until it finishes; runJob throws if it fails.
		JobClient.runJob(job);

		return 0;
	}

	public static void main(String[] args) throws Exception {
		int res = ToolRunner.run(new Configuration(), new MyOldJob(), args);
		System.exit(res);
	}

}

Then the new-API code:

package com.ytu.new1;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyNewJob extends Configured implements Tool {

	public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Each line holds two comma-separated fields; emit (second, first).
			String[] citation = value.toString().split(",");
			context.write(new Text(citation[1]), new Text(citation[0]));
		}
	}

	public static class Reduce extends Reducer<Text, Text, Text, Text> {
		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Join all values for this key into one comma-separated string.
			StringBuilder csv = new StringBuilder();
			for (Text val : values) {
				if (csv.length() > 0) {
					csv.append(",");
				}
				csv.append(val.toString());
			}
			context.write(key, new Text(csv.toString()));
		}
	}

	@Override
	public int run(String[] args) throws Exception {
		// Build the job from the configuration that ToolRunner passed in.
		Configuration conf = this.getConf();

		// Hadoop 2.x prefers Job.getInstance(conf, "MyNewJob"); the constructor works on 1.x.
		Job job = new Job(conf, "MyNewJob");
		job.setJarByClass(MyNewJob.class);

		// args[0] is the input path, args[1] the output path.
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		job.setMapperClass(MapClass.class);
		job.setReducerClass(Reduce.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		// Return the status to ToolRunner instead of calling System.exit() here.
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		int res = ToolRunner.run(new Configuration(), new MyNewJob(), args);
		System.exit(res);
	}
}
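
To see what the job above computes without a cluster, here is a rough plain-Java stand-in. It uses no Hadoop classes. The class name InvertCitationsSketch and the sample records are made up for illustration, and a TreeMap plays the role of the shuffle that groups map output by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the MapReduce job above; no Hadoop classes involved.
class InvertCitationsSketch {

    static Map<String, String> run(List<String> lines) {
        // "Map" plus shuffle: emit (second field, first field), grouped by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] citation = line.split(",");
            grouped.computeIfAbsent(citation[1], k -> new ArrayList<>())
                   .add(citation[0]);
        }

        // "Reduce": join the values of each key with commas.
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), String.join(",", entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input records in "first,second" form.
        List<String> input = Arrays.asList("p1,p9", "p2,p8", "p3,p9");
        // Prints one "key<TAB>csv" line per key, similar to TextOutputFormat.
        run(input).forEach((key, csv) -> System.out.println(key + "\t" + csv));
    }
}
```

For the sample input this yields p8 mapped to p2 and p9 mapped to p1,p3. On a real cluster the order of values within one key is not guaranteed, so the joined string may come out in a different order.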

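One last point:

The book says the KeyValueTextInputFormat class was removed in 0.20, but I am on version 1.1.0 and the class still works. The way to set the separator differs, though:

Hadoop 1.1.0 uses mapreduce.input.keyvaluelinerecordreader.key.value.separator

Hadoop 0.20 uses key.value.separator.in.input.line

Everything else works the same way.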
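
Whichever property name applies, the separator has the same effect. As I understand KeyValueTextInputFormat, each line is cut at the first occurrence of the separator, and a line without the separator becomes the key with an empty value. The sketch below imitates that rule in plain Java; KeyValueSplitSketch and split() are hypothetical names of my own, not Hadoop classes.

```java
// Plain-Java imitation of how a key/value line is divided; not Hadoop code.
class KeyValueSplitSketch {

    // Returns {key, value} for one input line.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        // Cut at the first separator only; later ones stay in the value.
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("a,b,c", ',');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

With "," as the separator, the line a,b,c gives the key a and the value b,c.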
