This post is a quick walkthrough of how to run the MapReduce examples that ship with Hadoop.
It targets Hadoop 2.6.5, whose examples jar is named hadoop-mapreduce-examples-2.6.5.jar and lives under /share/hadoop/mapreduce.
In short, to run an example, simply execute:
hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount /input /output
Here, the hadoop command is assumed to be on the PATH (i.e., the environment variables are already configured).
So how does this execution actually work? Put another way: what if I passed some other name, or capitalized it as WordCount (after all, unpacking the jar shows that a WordCount.class file does exist), would it still succeed?
The answer is: no.
Why not?
Let's analyze the whole execution flow carefully. The goal is to understand it well enough that we could add our own classes to the mapreduce-examples jar and wire in our own run logic.
The command itself is simple. For the entry point, look at the hadoop launcher script:
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
So the class to pay attention to next is RunJar, and the script tells us exactly where it lives: org.apache.hadoop.util.RunJar. The `hadoop jar` command ultimately executes RunJar's main method. Let's take a look:
/**
 * Run a Hadoop job jar. If the main class is not in the jar's manifest,
 * then it must be provided on the command line.
 */
public static void main(String[] args) throws Throwable {
  new RunJar().run(args);
}
The javadoc makes this clearer: when we run some other program, even an arbitrary jar we packaged ourselves, we can simply append the class name after the jar on the command line and it will run. Note that the class name must be fully qualified.
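Why fully qualified? Because the class is ultimately resolved with Class.forName, which only understands fully qualified names; a bare class name is never found. A minimal sketch (class names chosen arbitrarily for illustration):

```java
public class FqcnDemo {
  // Returns true when the given name can be resolved by the class loader.
  static boolean loadable(String name) {
    try {
      Class.forName(name);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The fully qualified name resolves; the bare name does not,
    // which is why RunJar must be given the full name.
    System.out.println(loadable("java.util.ArrayList")); // true
    System.out.println(loadable("ArrayList"));           // false
  }
}
```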
Next, let's look at the run method in RunJar:
public void run(String[] args) throws Throwable {
  String usage = "RunJar jarFile [mainClass] args...";

  if (args.length < 1) {
    System.err.println(usage);
    System.exit(-1);
  }

  int firstArg = 0;
  String fileName = args[firstArg++];
  File file = new File(fileName);
  if (!file.exists() || !file.isFile()) {
    System.err.println("Not a valid JAR: " + file.getCanonicalPath());
    System.exit(-1);
  }
  String mainClassName = null;

  JarFile jarFile;
  try {
    jarFile = new JarFile(fileName);
  } catch (IOException io) {
    throw new IOException("Error opening job jar: " + fileName)
        .initCause(io);
  }

  Manifest manifest = jarFile.getManifest();
  if (manifest != null) {
    mainClassName = manifest.getMainAttributes().getValue("Main-Class");
  }
  jarFile.close();

  if (mainClassName == null) {
    if (args.length < 2) {
      System.err.println(usage);
      System.exit(-1);
    }
    mainClassName = args[firstArg++];
  }
  mainClassName = mainClassName.replaceAll("/", ".");

  File tmpDir = new File(System.getProperty("java.io.tmpdir"));
  ensureDirectory(tmpDir);

  final File workDir;
  try {
    workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
  } catch (IOException ioe) {
    // If user has insufficient perms to write to tmpDir, default
    // "Permission denied" message doesn't specify a filename.
    System.err.println("Error creating temp dir in java.io.tmpdir "
        + tmpDir + " due to " + ioe.getMessage());
    System.exit(-1);
    return;
  }

  if (!workDir.delete()) {
    System.err.println("Delete failed for " + workDir);
    System.exit(-1);
  }
  ensureDirectory(workDir);

  ShutdownHookManager.get().addShutdownHook(new Runnable() {
    @Override
    public void run() {
      FileUtil.fullyDelete(workDir);
    }
  }, SHUTDOWN_HOOK_PRIORITY);

  unJar(file, workDir);

  ClassLoader loader = createClassLoader(file, workDir);

  Thread.currentThread().setContextClassLoader(loader);
  Class<?> mainClass = Class.forName(mainClassName, true, loader);
  Method main = mainClass.getMethod("main", new Class[] {
      Array.newInstance(String.class, 0).getClass() });
  String[] newArgs = Arrays.asList(args)
      .subList(firstArg, args.length).toArray(new String[0]);
  try {
    main.invoke(null, new Object[] { newArgs });
  } catch (InvocationTargetException e) {
    throw e.getTargetException();
  }
}
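One small detail worth noting in the method above: before loading, RunJar normalizes JVM-internal slash-separated class names into the dotted form that Class.forName expects. That one line in isolation:

```java
public class SlashToDot {
  // RunJar accepts JVM-internal names (with slashes) and normalizes
  // them to the dotted form Class.forName expects.
  static String normalize(String mainClassName) {
    return mainClassName.replaceAll("/", ".");
  }

  public static void main(String[] args) {
    System.out.println(normalize("org/apache/hadoop/examples/WordCount"));
    // -> org.apache.hadoop.examples.WordCount
  }
}
```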
Taken as a whole, the code is easy to follow: it parses our arguments and unpacks the submitted jar into a temp directory. The lines that deserve attention are these:
Manifest manifest = jarFile.getManifest();
if (manifest != null) {
  mainClassName = manifest.getMainAttributes().getValue("Main-Class");
}
This matches what we saw at the start: if the packaged jar has a MANIFEST file that includes a Main-Class attribute, then that class's main method is the one that gets run.
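This manifest lookup can be reproduced outside Hadoop with plain java.util.jar. The sketch below builds a throwaway jar carrying a Main-Class attribute (a stand-in for the real examples jar) and reads it back the same way RunJar does:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ManifestLookup {
  // Build a tiny jar whose manifest carries a Main-Class attribute.
  static File buildJar(String mainClass) throws Exception {
    Manifest mf = new Manifest();
    mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
    mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
    File jar = File.createTempFile("demo", ".jar");
    jar.deleteOnExit();
    // Closing the stream finalizes the jar; no class entries are needed here.
    new JarOutputStream(new FileOutputStream(jar), mf).close();
    return jar;
  }

  // The same lookup RunJar performs.
  static String mainClassOf(File jar) throws Exception {
    try (JarFile jf = new JarFile(jar)) {
      return jf.getManifest().getMainAttributes().getValue("Main-Class");
    }
  }

  public static void main(String[] args) throws Exception {
    File jar = buildJar("org.apache.hadoop.examples.ExampleDriver");
    System.out.println(mainClassOf(jar));
  }
}
```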
The class named there is then loaded through the class loader built over the unpacked jar, and its main method is looked up:
Class<?> mainClass = Class.forName(mainClassName, true, loader);
Method main = mainClass.getMethod("main", new Class[] {
    Array.newInstance(String.class, 0).getClass() });
The final invocation logic is:
try {
  main.invoke(null, new Object[] { newArgs });
} catch (InvocationTargetException e) {
  throw e.getTargetException();
}
As we can see, the program is run through reflection. Setting that aside for now: according to this code, what runs is the Main-Class defined in the MANIFEST. So, does the examples jar actually contain a MANIFEST file?
It does. For Hadoop 2.6.5, its contents are:
Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: YYZYHC
Build-Jdk: 1.8.0_131
Main-Class: org.apache.hadoop.examples.ExampleDriver
Crystal clear. Next we need to see what this ExampleDriver actually is:
public class ExampleDriver {

  public static void main(String argv[]) {
    int exitCode = -1;
    ProgramDriver pgd = new ProgramDriver();
    try {
      pgd.addClass("wordcount", WordCount.class,
          "A map/reduce program that counts the words in the input files.");
      pgd.addClass("wordmean", WordMean.class,
          "A map/reduce program that counts the average length of the words in the input files.");
      pgd.addClass("wordmedian", WordMedian.class,
          "A map/reduce program that counts the median length of the words in the input files.");
      pgd.addClass("wordstandarddeviation", WordStandardDeviation.class,
          "A map/reduce program that counts the standard deviation of the length of the words in the input files.");
      pgd.addClass("aggregatewordcount", AggregateWordCount.class,
          "An Aggregate based map/reduce program that counts the words in the input files.");
      pgd.addClass("aggregatewordhist", AggregateWordHistogram.class,
          "An Aggregate based map/reduce program that computes the histogram of the words in the input files.");
      pgd.addClass("grep", Grep.class,
          "A map/reduce program that counts the matches of a regex in the input.");
      pgd.addClass("randomwriter", RandomWriter.class,
          "A map/reduce program that writes 10GB of random data per node.");
      pgd.addClass("randomtextwriter", RandomTextWriter.class,
          "A map/reduce program that writes 10GB of random textual data per node.");
      pgd.addClass("sort", Sort.class,
          "A map/reduce program that sorts the data written by the random writer.");
      pgd.addClass("pi", QuasiMonteCarlo.class, QuasiMonteCarlo.DESCRIPTION);
      pgd.addClass("bbp", BaileyBorweinPlouffe.class,
          BaileyBorweinPlouffe.DESCRIPTION);
      pgd.addClass("distbbp", DistBbp.class, DistBbp.DESCRIPTION);
      pgd.addClass("pentomino", DistributedPentomino.class,
          "A map/reduce tile laying program to find solutions to pentomino problems.");
      pgd.addClass("secondarysort", SecondarySort.class,
          "An example defining a secondary sort to the reduce.");
      pgd.addClass("sudoku", Sudoku.class, "A sudoku solver.");
      pgd.addClass("join", Join.class,
          "A job that effects a join over sorted, equally partitioned datasets");
      pgd.addClass("multifilewc", MultiFileWordCount.class,
          "A job that counts words from several files.");
      pgd.addClass("dbcount", DBCountPageView.class,
          "An example job that count the pageview counts from a database.");
      pgd.addClass("teragen", TeraGen.class, "Generate data for the terasort");
      pgd.addClass("terasort", TeraSort.class, "Run the terasort");
      pgd.addClass("teravalidate", TeraValidate.class,
          "Checking results of terasort");
      exitCode = pgd.run(argv);
    } catch (Throwable e) {
      e.printStackTrace();
    }
    System.exit(exitCode);
  }
}
This code is also easy to follow. It builds a ProgramDriver, a class that lives in the hadoop-common module; I won't paste it in full, but here are the parts of it that ExampleDriver actually uses:
/**
 * A description of a program based on its class and a human-readable
 * description.
 */
Map<String, ProgramDescription> programs;

public ProgramDriver() {
  programs = new TreeMap<String, ProgramDescription>();
}
There is only one constructor, and it creates a TreeMap whose keys are Strings and whose values are ProgramDescription objects. ProgramDescription itself holds no surprises: essentially a main Method plus a String description.
In ExampleDriver, many classes are registered with the ProgramDriver, and then its run method is executed. So next, let's look at ProgramDriver's run method:
/**
 * This is a driver for the example programs. It looks at the first command
 * line argument and tries to find an example program with that name. If it
 * is found, it calls the main method in that class with the rest of the
 * command line arguments.
 *
 * @param args The argument from the user. args[0] is the command to run.
 * @return -1 on error, 0 on success
 * @throws NoSuchMethodException
 * @throws SecurityException
 * @throws IllegalAccessException
 * @throws IllegalArgumentException
 * @throws Throwable Anything thrown by the example program's main
 */
public int run(String[] args) throws Throwable {
  // Make sure they gave us a program name.
  if (args.length == 0) {
    System.out.println("An example program must be given as the"
        + " first argument.");
    printUsage(programs);
    return -1;
  }

  // And that it is good.
  ProgramDescription pgm = programs.get(args[0]);
  if (pgm == null) {
    System.out.println("Unknown program '" + args[0] + "' chosen.");
    printUsage(programs);
    return -1;
  }

  // Remove the leading argument and call main
  String[] new_args = new String[args.length - 1];
  for (int i = 1; i < args.length; ++i) {
    new_args[i - 1] = args[i];
  }
  pgm.invoke(new_args);
  return 0;
}
Here it is quite clear: the first argument is used to look up the corresponding registered class, which is then invoked with the remaining arguments.
Likewise, the wordcount we type on the command line is in fact a key of the TreeMap above. It must match exactly, character for character, for the program to run, which is why WordCount with a capital W fails.
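The whole dispatch can be condensed into a self-contained sketch (MiniDriver and its nested WordCount are invented names for illustration, not Hadoop classes). It shows why the key must match exactly, including case:

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

public class MiniDriver {
  // Program registry: the key is the exact name typed on the command line,
  // the value is the class whose static main is invoked. A simplified
  // stand-in for ProgramDriver's TreeMap<String, ProgramDescription>.
  private final Map<String, Class<?>> programs = new TreeMap<>();

  public void addClass(String name, Class<?> mainClass) {
    programs.put(name, mainClass);
  }

  // Mirrors ProgramDriver.run: exact lookup, strip the first argument,
  // hand the rest to the selected program's main.
  public int run(String[] args) throws Throwable {
    if (args.length == 0 || !programs.containsKey(args[0])) {
      System.out.println("Unknown program chosen.");
      return -1;
    }
    Method main = programs.get(args[0]).getMethod("main", String[].class);
    String[] rest = new String[args.length - 1];
    System.arraycopy(args, 1, rest, 0, rest.length);
    main.invoke(null, (Object) rest);
    return 0;
  }

  // A toy program standing in for the real WordCount.
  public static class WordCount {
    static String lastInput;
    public static void main(String[] args) {
      lastInput = args[0];
      System.out.println("wordcount on " + args[0]);
    }
  }

  public static void main(String[] args) throws Throwable {
    MiniDriver d = new MiniDriver();
    d.addClass("wordcount", WordCount.class);
    d.run(new String[] { "wordcount", "/input" }); // runs the toy program
    d.run(new String[] { "WordCount", "/input" }); // fails: keys are case-sensitive
  }
}
```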
To sum up, there are a couple of ways to run (or extend) Hadoop's bundled mapreduce-examples with your own classes:
- Modify the packaging: remove the Main-Class entry from the MANIFEST, or change Main-Class to the class you want.
- Pass your own fully qualified class name directly on the command line; quick and convenient (this works once the jar's MANIFEST no longer declares a Main-Class, since RunJar only reads a class name from the arguments in that case).