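I had put all of our jars into hadoop-env.sh as well as into mapreduce.application.classpath and yarn.application.classpath, so running the hadoop jar command locally no longer failed with missing-dependency errors. I was not clear on how these settings actually work, though, so I am taking this opportunity to analyze how Hadoop runs jobs. This first post covers how hadoop jar submits a task.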
I. The hadoop script invoked by the command-line hadoop jar *** command is /usr/local/hadoop
#!/bin/bash
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
SOURCE="$(readlink "$SOURCE")"
[[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
LIB_DIR=$BIN_DIR/../lib
# Autodetect JAVA_HOME if not defined
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome
export HADOOP_LIBEXEC_DIR=//$LIB_DIR/hadoop/libexec
exec $LIB_DIR/hadoop/bin/hadoop "$@"
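The symlink-resolution idiom at the top of this wrapper can be tried on its own. The following is a minimal sketch rather than the CDH code itself: the function name resolve_script_dir is made up for illustration, and it tracks the link's directory in a single local variable.

```shell
#!/usr/bin/env bash
# resolve_script_dir (hypothetical helper): follow symlinks until a real
# file is reached, then print the physical directory that contains it.
resolve_script_dir() {
  local src="$1" dir
  while [ -h "$src" ]; do
    # Directory that holds the current link, with symlinks resolved.
    dir="$(cd -P "$(dirname "$src")" && pwd)"
    src="$(readlink "$src")"
    # A relative link target is relative to the directory holding the link.
    [[ $src != /* ]] && src="$dir/$src"
  done
  dir="$(cd -P "$(dirname "$src")" && pwd)"
  printf '%s\n' "$dir"
}
```

Calling resolve_script_dir "${BASH_SOURCE[0]}" from a script reached through a symlink prints the directory of the real file, which is what the wrapper needs in order to locate ../lib.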
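This wrapper detects JAVA_HOME, sets the HADOOP_LIBEXEC_DIR variable, and then executes the real hadoop script, which lives under /opt/cloudera/parcels/CDH/lib/hadoop/bin/ (the value of $LIB_DIR is /opt/cloudera/parcels/CDH/lib).
II. Execution flow of the /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop script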
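1. Source the hadoop-config.sh script to set up the environment (for a detailed walkthrough of hadoop-config.sh, see this other post: http://blog.csdn.net/a822631129/article/details/50038883)
(1) Set the directory variables for hadoop, hdfs, yarn, and mapred.
(2) Determine the configuration directory.
(3) Source hadoop-env.sh. This is the script where our project adds its dependency jars. (hadoop-env.sh sets the variables used while running a user-written MapReduce program. The CLASSPATH it sets lets the user program find its dependencies on the submitting node. The default dependencies used inside containers have to be handled through other configuration.)
(4) Set JAVA_HOME, JAVA_HEAP_MAX, and related variables.
(5) Build the classpath to load at run time (the CLASSPATH variable, to which the jars from COMMON, HDFS, YARN, MAPREDUCE, and HADOOP_CLASSPATH are all added) and the HADOOP_OPTS parameters (hadoop.log.dir, hadoop.log.file, hadoop.home.dir, hadoop.root.logger, java.library.path, hadoop.policy.file).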
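Steps (3) and (5) are where a project's own jars reach the submit-side classpath. As a rough sketch of what such an addition to hadoop-env.sh could look like: the helper name add_jars_to_hadoop_classpath and the directory /opt/myapp/lib are assumptions made up for illustration, not something taken from the project described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append every *.jar in a directory to HADOOP_CLASSPATH.
add_jars_to_hadoop_classpath() {
  local dir="$1" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue   # the glob matched nothing
    # Only insert a ':' separator when HADOOP_CLASSPATH is already non-empty.
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
  done
  export HADOOP_CLASSPATH
}

# Example call (assumed path):
# add_jars_to_hadoop_classpath /opt/myapp/lib
```

Because hadoop-config.sh folds HADOOP_CLASSPATH into CLASSPATH, jars added this way are visible to the JVM started on the submitting node, but not automatically to containers.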
2. Read the user command COMMAND and classify it. In this example it is jar.
The CLASS to run is therefore org.apache.hadoop.util.RunJar.
3. export CLASSPATH=$CLASSPATH
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
Export CLASSPATH, pass the remaining arguments through, and invoke java to run the class (RunJar).
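The dispatch in steps 2 and 3 can be condensed into a small sketch. This is a simplified illustration covering only a few commands, not the real script. The function names class_for_command and show_java_command are made up, the literal "java" stands in for $JAVA, and the heap and HADOOP_OPTS flags are left out.

```shell
#!/usr/bin/env bash
# Map a hadoop sub-command to the Java class that implements it.
class_for_command() {
  case "$1" in
    fs)      echo org.apache.hadoop.fs.FsShell ;;
    version) echo org.apache.hadoop.util.VersionInfo ;;
    jar)     echo org.apache.hadoop.util.RunJar ;;
    -*)      echo "Error: no command named '$1'" >&2; return 1 ;;
    *)       echo "$1" ;;  # anything else is treated as a class name
  esac
}

# Print (rather than exec) the java command line for an invocation.
show_java_command() {
  local class
  class="$(class_for_command "$1")" || return 1
  shift   # drop the sub-command; the rest goes to the Java class
  echo "java $class $*"
}
```

For example, show_java_command jar job.jar in out prints the RunJar invocation with job.jar, in, and out as its arguments, which mirrors what the final exec line does with "$@".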
The /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop script:
# This script runs the hadoop core commands.
bin=`which $0`
bin=`dirname ${bin}`
bin=`cd "$bin"; pwd`
DEFAULT_LIBEXEC_DIR="$bin"/../libexec
HADOOP_LIBEXEC_DIR=${HADOOP_LIBEXEC_DIR:-$DEFAULT_LIBEXEC_DIR}
. $HADOOP_LIBEXEC_DIR/hadoop-config.sh
function print_usage(){
echo "Usage: hadoop [--config confdir] COMMAND"
echo " where COMMAND is one of:"
echo " fs run a generic filesystem user client"
echo " version print the version"
echo " jar <jar> run a jar file"
echo " checknative [-a|-h] check native hadoop and compression libraries availability"
echo " distcp <srcurl> <desturl> copy file or directories recursively"
echo " archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive"
echo " classpath prints the class path needed to get the"
echo " credential interact with credential providers"
echo " Hadoop jar and the required libraries"
echo " daemonlog get/set the log level for each daemon"
echo " trace view and modify Hadoop tracing settings"
echo " or"
echo " CLASSNAME run the class named CLASSNAME"
echo ""
echo "Most commands print help when invoked w/o parameters."
}
if [ $# = 0 ]; then
print_usage
exit
fi
COMMAND=$1
case $COMMAND in
# usage flags
--help|-help|-h)
print_usage
exit
;;
#hdfs commands
namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups|portmap|nfs3)
echo "DEPRECATED: Use of this script to execute hdfs command is deprecated." 1>&2
echo "Instead use the hdfs command for it." 1>&2
echo "" 1>&2
#try to locate hdfs and if present, delegate to it.
shift
if [ -f "${HADOOP_HDFS_HOME}"/bin/hdfs ]; then
exec "${HADOOP_HDFS_HOME}"/bin/hdfs ${COMMAND/dfsgroups/groups} "$@"
elif [ -f "${HADOOP_PREFIX}"/bin/hdfs ]; then
exec "${HADOOP_PREFIX}"/bin/hdfs ${COMMAND/dfsgroups/groups} "$@"
else
echo "HADOOP_HDFS_HOME not found!"
exit 1
fi
;;
#mapred commands for backwards compatibility
pipes|job|queue|mrgroups|mradmin|jobtracker|tasktracker|mrhaadmin|mrzkfc|jobtrackerha)
echo "DEPRECATED: Use of this script to execute mapred command is deprecated." 1>&2
echo "Instead use the mapred command for it." 1>&2
echo "" 1>&2
#try to locate mapred and if present, delegate to it.
shift
if [ -f "${HADOOP_MAPRED_HOME}"/bin/mapred ]; then
exec "${HADOOP_MAPRED_HOME}"/bin/mapred ${COMMAND/mrgroups/groups} "$@"
elif [ -f "${HADOOP_PREFIX}"/bin/mapred ]; then
exec "${HADOOP_PREFIX}"/bin/mapred ${COMMAND/mrgroups/groups} "$@"
else
echo "HADOOP_MAPRED_HOME not found!"
exit 1
fi
;;
#core commands
*)
# the core commands
if [ "$COMMAND" = "fs" ] ; then
CLASS=org.apache.hadoop.fs.FsShell
elif [ "$COMMAND" = "version" ] ; then
CLASS=org.apache.hadoop.util.VersionInfo
elif [ "$COMMAND" = "jar" ] ; then
CLASS=org.apache.hadoop.util.RunJar
elif [ "$COMMAND" = "key" ] ; then
CLASS=org.apache.hadoop.crypto.key.KeyShell
elif [ "$COMMAND" = "checknative" ] ; then
CLASS=org.apache.hadoop.util.NativeLibraryChecker
elif [ "$COMMAND" = "distcp" ] ; then
CLASS=org.apache.hadoop.tools.DistCp
CLASSPATH=${CLASSPATH}:${TOOL_PATH}
elif [ "$COMMAND" = "daemonlog" ] ; then
CLASS=org.apache.hadoop.log.LogLevel
elif [ "$COMMAND" = "archive" ] ; then
CLASS=org.apache.hadoop.tools.HadoopArchives
CLASSPATH=${CLASSPATH}:${TOOL_PATH}
elif [ "$COMMAND" = "credential" ] ; then
CLASS=org.apache.hadoop.security.alias.CredentialShell
elif [ "$COMMAND" = "trace" ] ; then
CLASS=org.apache.hadoop.tracing.TraceAdmin
elif [ "$COMMAND" = "classpath" ] ; then
if [ "$#" -eq 1 ]; then
# No need to bother starting up a JVM for this simple case.
echo $CLASSPATH
exit
else
CLASS=org.apache.hadoop.util.Classpath
fi
elif [[ "$COMMAND" = -* ]] ; then
# class and package names cannot begin with a -
echo "Error: No command named \`$COMMAND' was found. Perhaps you meant \`hadoop ${COMMAND#-}'"
exit 1
else
CLASS=$COMMAND
fi
shift
# Always respect HADOOP_OPTS and HADOOP_CLIENT_OPTS
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
#make sure security appender is turned off
HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,NullAppender}"
export CLASSPATH=$CLASSPATH
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
;;
esac
III. Execution flow of org.apache.hadoop.util.RunJar:
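1. Take the job jar's file name from the command-line arguments and read the main class name from the jar's manifest. If the jar has no manifest, or the manifest has no Main-Class entry, the second argument is used as the main class name.
2. Prepare the runtime environment: create a hadoop-unjar* directory under hadoop.tmp.dir as the working directory for the run, then call unJar to unpack the jar file into it.
3. Build a URLClassLoader whose search path covers the unpacked working directory, the jar itself, classes/, and every file under lib/, and set it as the thread's context class loader.
4. Use Java reflection to invoke the main method of the jar's main class.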
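Two details of the steps above are easy to miss: how the main class name is chosen, and the order of the entries handed to the class loader. The sketch below mimics both in shell, purely as an illustration. The function names are made up, and the lib/ entries come out in sorted glob order here, whereas Java's listFiles() does not guarantee any order.

```shell
#!/usr/bin/env bash
# Pick the main class: the manifest value wins, otherwise the CLI argument.
# Slashes are turned into dots, as RunJar does with replaceAll("/", ".").
resolve_main_class() {
  local manifest_main="$1" cli_main="$2" name
  name="${manifest_main:-$cli_main}"
  if [ -z "$name" ]; then
    echo "RunJar jarFile [mainClass] args..." >&2
    return 1
  fi
  printf '%s\n' "${name//\//.}"
}

# List class loader entries in the order RunJar adds them:
# workDir/, the jar itself, workDir/classes/, then each file in workDir/lib.
runjar_classpath() {
  local work_dir="$1" jar_file="$2" lib
  printf '%s\n' "$work_dir/" "$jar_file" "$work_dir/classes/"
  for lib in "$work_dir"/lib/*; do
    [ -e "$lib" ] || continue   # no lib directory, or it is empty
    printf '%s\n' "$lib"
  done
}
```

One consequence of the first function: when the manifest already names a Main-Class, the second command-line argument is not consumed as a class name and is passed to the program as its first argument.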
/** Run a Hadoop job jar. If the main class is not in the jar's manifest,
* then it must be provided on the command line. */
public static void main(String[] args) throws Throwable {
// Get the jar file name, then the main class name from the jar's manifest; if the manifest does not name one, read it from the second argument
String usage = "RunJar jarFile [mainClass] args...";
if (args.length < 1) {
System.err.println(usage);
System.exit(-1);
}
int firstArg = 0;
String fileName = args[firstArg++];
File file = new File(fileName);
if (!file.exists() || !file.isFile()) {
System.err.println("Not a valid JAR: " + file.getCanonicalPath());
System.exit(-1);
}
String mainClassName = null;
JarFile jarFile;
try {
jarFile = new JarFile(fileName);
} catch(IOException io) {
throw new IOException("Error opening job jar: " + fileName)
.initCause(io);
}
Manifest manifest = jarFile.getManifest();
if (manifest != null) {
mainClassName = manifest.getMainAttributes().getValue("Main-Class");
}
jarFile.close();
if (mainClassName == null) {
if (args.length < 2) {
System.err.println(usage);
System.exit(-1);
}
mainClassName = args[firstArg++];
}
mainClassName = mainClassName.replaceAll("/", ".");
File tmpDir = new File(new Configuration().get("hadoop.tmp.dir"));
ensureDirectory(tmpDir);
final File workDir;
try {
workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
} catch (IOException ioe) {
// If user has insufficient perms to write to tmpDir, default
// "Permission denied" message doesn't specify a filename.
System.err.println("Error creating temp dir in hadoop.tmp.dir " + tmpDir + " due to " + ioe.getMessage());
System.exit(-1);
return;
}
if (!workDir.delete()) {
System.err.println("Delete failed for " + workDir);
System.exit(-1);
}
ensureDirectory(workDir);
ShutdownHookManager.get().addShutdownHook(
new Runnable() {
@Override
public void run() {
FileUtil.fullyDelete(workDir);
}
}, SHUTDOWN_HOOK_PRIORITY);
unJar(file, workDir);
ArrayList<URL> classPath = new ArrayList<URL>();
classPath.add(new File(workDir+"/").toURI().toURL());
classPath.add(file.toURI().toURL());
classPath.add(new File(workDir, "classes/").toURI().toURL());
File[] libs = new File(workDir, "lib").listFiles();
if (libs != null) {
for (int i = 0; i < libs.length; i++) {
classPath.add(libs[i].toURI().toURL());
}
}
ClassLoader loader = new URLClassLoader(classPath.toArray(new URL[0]));
Thread.currentThread().setContextClassLoader(loader);
Class<?> mainClass = Class.forName(mainClassName, true, loader);
Method main = mainClass.getMethod("main", new Class[] {Array.newInstance(String.class, 0).getClass()});
String[] newArgs = Arrays.asList(args).subList(firstArg, args.length).toArray(new String[0]);
try {
main.invoke(null, new Object[] { newArgs });
} catch (InvocationTargetException e) {
throw e.getTargetException();
}
}
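IV. From the analysis above, sourcing hadoop-config.sh also sources hadoop-env.sh, so the CLASSPATH set in hadoop-env.sh ends up in the environment used when the jar runs. The values of mapreduce.application.classpath and yarn.application.classpath, by contrast, are not loaded at this stage. They are presumably used by Hadoop's container tasks, which I will analyze separately later.
The $HADOOP_HOME/bin/hadoop script implements many hadoop commands, but many others are run through $HADOOP_HOME/bin/mapred or $HADOOP_HOME/bin/hdfs. Those scripts are worth reading if you are interested; they are not analyzed here.
This post covered how hadoop runs a jar file: it reads the main class from the jar, then uses Java reflection to call that class's main method, which in turn runs the program. How hadoop submits a job is covered in the next post.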
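To confirm that a given jar is visible on the submit side, one option is to inspect the output of hadoop classpath, which the script above prints directly from $CLASSPATH. The helper below is a hedged sketch: the name classpath_contains and the example jar path are made up, and it only understands exact entries and Java-style dir/* wildcard entries.

```shell
#!/usr/bin/env bash
# Return 0 if the colon-separated classpath in $1 covers the jar path in $2.
classpath_contains() {
  local cp="$1" wanted="$2" entry entries
  # read -a splits on ':' without expanding any '*' in the entries.
  IFS=':' read -r -a entries <<< "$cp"
  for entry in "${entries[@]}"; do
    [ "$entry" = "$wanted" ] && return 0
    # A Java wildcard entry "dir/*" covers every jar directly inside dir.
    if [[ $entry == */\* && $wanted == *.jar && "${entry%/\*}" == "$(dirname "$wanted")" ]]; then
      return 0
    fi
  done
  return 1
}

# Example call (assumed jar path):
# classpath_contains "$(hadoop classpath)" /opt/myapp/lib/dep.jar && echo visible
```

A jar that passes this check is available to RunJar on the submitting node. Whether containers can see it is a separate question that depends on mapreduce.application.classpath and yarn.application.classpath.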