This post is a quick walkthrough of how to run the MapReduce examples that ship with Hadoop.
It targets Hadoop 2.6.5, whose examples jar is named hadoop-mapreduce-examples-2.6.5.jar and lives under /share/hadoop/mapreduce.
In short, to run an example, simply execute:
hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount /input /output
Here, the hadoop command is assumed to be on the PATH (i.e., the environment variables are already configured).
So how does this execution actually work? Put another way: what if I passed some other name, or capitalized it as WordCount (after all, unpacking the jar shows that a WordCount.class file does exist), would it still succeed?
The answer is: no.
Why not?
Let's analyze the whole execution flow carefully. The goal is to understand it well enough that we could add our own classes to the mapreduce-examples jar and wire in our own run logic.
The command itself is simple. For the entry point, look at the hadoop launcher script:
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
So the class to pay attention to next is RunJar, and the script tells us exactly where it lives: org.apache.hadoop.util.RunJar. The `hadoop jar` command ultimately executes RunJar's main method. Let's take a look:
/**
 * Run a Hadoop job jar. If the main class is not in the jar's manifest,
 * then it must be provided on the command line.
 */
public static void main(String[] args) throws Throwable {
  new RunJar().run(args);
}
The javadoc makes this clearer: when we run some other program, even an arbitrary jar we packaged ourselves, we can simply append the class name after the jar on the command line and it will run. Note that the class name must be fully qualified.
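Why fully qualified? Because the class is ultimately resolved with Class.forName, which only understands fully qualified names; a bare class name is never found. A minimal sketch (class names chosen arbitrarily for illustration):

```java
public class FqcnDemo {
  // Returns true when the given name can be resolved by the class loader.
  static boolean loadable(String name) {
    try {
      Class.forName(name);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The fully qualified name resolves; the bare name does not,
    // which is why RunJar must be given the full name.
    System.out.println(loadable("java.util.ArrayList")); // true
    System.out.println(loadable("ArrayList"));           // false
  }
}
```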
Next, let's look at the run method in RunJar:
public void run(String[] args) throws Throwable {
  String usage = "RunJar jarFile [mainClass] args...";

  if (args.length < 1) {
    System.err.println(usage);
    System.exit(-1);
  }

  int firstArg = 0;
  String fileName = args[firstArg++];
  File file = new File(fileName);
  if (!file.exists() || !file.isFile()) {
    System.err.println("Not a valid JAR: " + file.getCanonicalPath());
    System.exit(-1);
  }
  String mainClassName = null;

  JarFile jarFile;
  try {
    jarFile = new JarFile(fileName);
  } catch (IOException io) {
    throw new IOException("Error opening job jar: " + fileName)
        .initCause(io);
  }

  Manifest manifest = jarFile.getManifest();
  if (manifest != null) {
    mainClassName = manifest.getMainAttributes().getValue("Main-Class");
  }
  jarFile.close();

  if (mainClassName == null) {
    if (args.length < 2) {
      System.err.println(usage);
      System.exit(-1);
    }
    mainClassName = args[firstArg++];
  }
  mainClassName = mainClassName.replaceAll("/", ".");

  File tmpDir = new File(System.getProperty("java.io.tmpdir"));
  ensureDirectory(tmpDir);

  final File workDir;
  try {
    workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
  } catch (IOException ioe) {
    // If user has insufficient perms to write to tmpDir, default
    // "Permission denied" message doesn't specify a filename.
    System.err.println("Error creating temp dir in java.io.tmpdir "
        + tmpDir + " due to " + ioe.getMessage());
    System.exit(-1);
    return;
  }

  if (!workDir.delete()) {
    System.err.println("Delete failed for " + workDir);
    System.exit(-1);
  }
  ensureDirectory(workDir);

  ShutdownHookManager.get().addShutdownHook(new Runnable() {
    @Override
    public void run() {
      FileUtil.fullyDelete(workDir);
    }
  }, SHUTDOWN_HOOK_PRIORITY);

  unJar(file, workDir);

  ClassLoader loader = createClassLoader(file, workDir);

  Thread.currentThread().setContextClassLoader(loader);
  Class<?> mainClass = Class.forName(mainClassName, true, loader);
  Method main = mainClass.getMethod("main", new Class[] {
      Array.newInstance(String.class, 0).getClass() });
  String[] newArgs = Arrays.asList(args)
      .subList(firstArg, args.length).toArray(new String[0]);
  try {
    main.invoke(null, new Object[] { newArgs });
  } catch (InvocationTargetException e) {
    throw e.getTargetException();
  }
}
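One small detail worth noting in the method above: before loading, RunJar normalizes JVM-internal slash-separated class names into the dotted form that Class.forName expects. That one line in isolation:

```java
public class SlashToDot {
  // RunJar accepts JVM-internal names (with slashes) and normalizes
  // them to the dotted form Class.forName expects.
  static String normalize(String mainClassName) {
    return mainClassName.replaceAll("/", ".");
  }

  public static void main(String[] args) {
    System.out.println(normalize("org/apache/hadoop/examples/WordCount"));
    // -> org.apache.hadoop.examples.WordCount
  }
}
```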
Taken as a whole, the code is easy to follow: it parses our arguments and unpacks the submitted jar into a temp directory. The lines that deserve attention are these:
Manifest manifest = jarFile.getManifest();
if (manifest != null) {
  mainClassName = manifest.getMainAttributes().getValue("Main-Class");
}
This matches what we saw at the start: if the packaged jar has a MANIFEST file that includes a Main-Class attribute, then that class's main method is the one that gets run.
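This manifest lookup can be reproduced outside Hadoop with plain java.util.jar. The sketch below builds a throwaway jar carrying a Main-Class attribute (a stand-in for the real examples jar) and reads it back the same way RunJar does:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ManifestLookup {
  // Build a tiny jar whose manifest carries a Main-Class attribute.
  static File buildJar(String mainClass) throws Exception {
    Manifest mf = new Manifest();
    mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
    mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
    File jar = File.createTempFile("demo", ".jar");
    jar.deleteOnExit();
    // Closing the stream finalizes the jar; no class entries are needed here.
    new JarOutputStream(new FileOutputStream(jar), mf).close();
    return jar;
  }

  // The same lookup RunJar performs.
  static String mainClassOf(File jar) throws Exception {
    try (JarFile jf = new JarFile(jar)) {
      return jf.getManifest().getMainAttributes().getValue("Main-Class");
    }
  }

  public static void main(String[] args) throws Exception {
    File jar = buildJar("org.apache.hadoop.examples.ExampleDriver");
    System.out.println(mainClassOf(jar));
  }
}
```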
The class named there is then loaded through the class loader built over the unpacked jar, and its main method is looked up:
Class<?> mainClass = Class.forName(mainClassName, true, loader);
Method main = mainClass.getMethod("main", new Class[] {
    Array.newInstance(String.class, 0).getClass() });
The final invocation logic is:
try {
  main.invoke(null, new Object[] { newArgs });
} catch (InvocationTargetException e) {
  throw e.getTargetException();
}
As we can see, the program is run through reflection. Setting that aside for now: according to this code, what runs is the Main-Class defined in the MANIFEST. So, does the examples jar actually contain a MANIFEST file?
It does. For Hadoop 2.6.5, its contents are:
Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: YYZYHC
Build-Jdk: 1.8.0_131
Main-Class: org.apache.hadoop.examples.ExampleDriver
Crystal clear. Next we need to see what this ExampleDriver actually is:
public class ExampleDriver {

  public static void main(String argv[]) {
    int exitCode = -1;
    ProgramDriver pgd = new ProgramDriver();
    try {
      pgd.addClass("wordcount", WordCount.class,
          "A map/reduce program that counts the words in the input files.");
      pgd.addClass("wordmean", WordMean.class,
          "A map/reduce program that counts the average length of the words in the input files.");
      pgd.addClass("wordmedian", WordMedian.class,
          "A map/reduce program that counts the median length of the words in the input files.");
      pgd.addClass("wordstandarddeviation", WordStandardDeviation.class,
          "A map/reduce program that counts the standard deviation of the length of the words in the input files.");
      pgd.addClass("aggregatewordcount", AggregateWordCount.class,
          "An Aggregate based map/reduce program that counts the words in the input files.");
      pgd.addClass("aggregatewordhist", AggregateWordHistogram.class,
          "An Aggregate based map/reduce program that computes the histogram of the words in the input files.");
      pgd.addClass("grep", Grep.class,
          "A map/reduce program that counts the matches of a regex in the input.");
      pgd.addClass("randomwriter", RandomWriter.class,
          "A map/reduce program that writes 10GB of random data per node.");
      pgd.addClass("randomtextwriter", RandomTextWriter.class,
          "A map/reduce program that writes 10GB of random textual data per node.");
      pgd.addClass("sort", Sort.class,
          "A map/reduce program that sorts the data written by the random writer.");
      pgd.addClass("pi", QuasiMonteCarlo.class, QuasiMonteCarlo.DESCRIPTION);
      pgd.addClass("bbp", BaileyBorweinPlouffe.class,
          BaileyBorweinPlouffe.DESCRIPTION);
      pgd.addClass("distbbp", DistBbp.class, DistBbp.DESCRIPTION);
      pgd.addClass("pentomino", DistributedPentomino.class,
          "A map/reduce tile laying program to find solutions to pentomino problems.");
      pgd.addClass("secondarysort", SecondarySort.class,
          "An example defining a secondary sort to the reduce.");
      pgd.addClass("sudoku", Sudoku.class, "A sudoku solver.");
      pgd.addClass("join", Join.class,
          "A job that effects a join over sorted, equally partitioned datasets");
      pgd.addClass("multifilewc", MultiFileWordCount.class,
          "A job that counts words from several files.");
      pgd.addClass("dbcount", DBCountPageView.class,
          "An example job that count the pageview counts from a database.");
      pgd.addClass("teragen", TeraGen.class, "Generate data for the terasort");
      pgd.addClass("terasort", TeraSort.class, "Run the terasort");
      pgd.addClass("teravalidate", TeraValidate.class,
          "Checking results of terasort");
      exitCode = pgd.run(argv);
    } catch (Throwable e) {
      e.printStackTrace();
    }
    System.exit(exitCode);
  }
}
This code is also easy to follow. It builds a ProgramDriver, a class that lives in the hadoop-common module; I won't paste it in full, but here are the parts of it that ExampleDriver actually uses:
/**
 * A description of a program based on its class and a human-readable
 * description.
 */
Map<String, ProgramDescription> programs;

public ProgramDriver() {
  programs = new TreeMap<String, ProgramDescription>();
}
There is only one constructor, and it creates a TreeMap whose keys are Strings and whose values are ProgramDescription objects. ProgramDescription itself holds no surprises: essentially a main Method plus a String description.
In ExampleDriver, many classes are registered with the ProgramDriver, and then its run method is executed. So next, let's look at ProgramDriver's run method:
/**
 * This is a driver for the example programs. It looks at the first command
 * line argument and tries to find an example program with that name. If it
 * is found, it calls the main method in that class with the rest of the
 * command line arguments.
 *
 * @param args The argument from the user. args[0] is the command to run.
 * @return -1 on error, 0 on success
 * @throws NoSuchMethodException
 * @throws SecurityException
 * @throws IllegalAccessException
 * @throws IllegalArgumentException
 * @throws Throwable Anything thrown by the example program's main
 */
public int run(String[] args) throws Throwable {
  // Make sure they gave us a program name.
  if (args.length == 0) {
    System.out.println("An example program must be given as the"
        + " first argument.");
    printUsage(programs);
    return -1;
  }

  // And that it is good.
  ProgramDescription pgm = programs.get(args[0]);
  if (pgm == null) {
    System.out.println("Unknown program '" + args[0] + "' chosen.");
    printUsage(programs);
    return -1;
  }

  // Remove the leading argument and call main
  String[] new_args = new String[args.length - 1];
  for (int i = 1; i < args.length; ++i) {
    new_args[i - 1] = args[i];
  }
  pgm.invoke(new_args);
  return 0;
}
Here it is quite clear: the first argument is used to look up the corresponding registered class, which is then invoked with the remaining arguments.
Likewise, the wordcount we type on the command line is in fact a key of the TreeMap above. It must match exactly, character for character, for the program to run, which is why WordCount with a capital W fails.
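The whole dispatch can be condensed into a self-contained sketch (MiniDriver and its nested WordCount are invented names for illustration, not Hadoop classes). It shows why the key must match exactly, including case:

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

public class MiniDriver {
  // Program registry: the key is the exact name typed on the command line,
  // the value is the class whose static main is invoked. A simplified
  // stand-in for ProgramDriver's TreeMap<String, ProgramDescription>.
  private final Map<String, Class<?>> programs = new TreeMap<>();

  public void addClass(String name, Class<?> mainClass) {
    programs.put(name, mainClass);
  }

  // Mirrors ProgramDriver.run: exact lookup, strip the first argument,
  // hand the rest to the selected program's main.
  public int run(String[] args) throws Throwable {
    if (args.length == 0 || !programs.containsKey(args[0])) {
      System.out.println("Unknown program chosen.");
      return -1;
    }
    Method main = programs.get(args[0]).getMethod("main", String[].class);
    String[] rest = new String[args.length - 1];
    System.arraycopy(args, 1, rest, 0, rest.length);
    main.invoke(null, (Object) rest);
    return 0;
  }

  // A toy program standing in for the real WordCount.
  public static class WordCount {
    static String lastInput;
    public static void main(String[] args) {
      lastInput = args[0];
      System.out.println("wordcount on " + args[0]);
    }
  }

  public static void main(String[] args) throws Throwable {
    MiniDriver d = new MiniDriver();
    d.addClass("wordcount", WordCount.class);
    d.run(new String[] { "wordcount", "/input" }); // runs the toy program
    d.run(new String[] { "WordCount", "/input" }); // fails: keys are case-sensitive
  }
}
```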
To sum up, there are a couple of ways to run (or extend) Hadoop's bundled mapreduce-examples with your own classes:
- Modify the packaging: remove the Main-Class entry from the MANIFEST, or change Main-Class to the class you want.
- Pass your own fully qualified class name directly on the command line; quick and convenient (this works once the jar's MANIFEST no longer declares a Main-Class, since RunJar only reads a class name from the arguments in that case).