Preface
Whether a query arrives through the Hive CLI or HiveServer2, every HQL statement is parsed and executed by the Driver. The Driver's processing flow is roughly as follows:
HQL parsing (produces the AST) =>
Semantic analysis (produces a QueryBlock) =>
Logical plan generation (Operator tree) =>
Logical optimization (Logical Optimizer Operator) =>
Physical plan generation (Task Plan) =>
Physical optimization (Task Tree) =>
QueryPlan construction =>
Authorization of tables and operations =>
Execution by the execution engine
The flow covers three broad phases: HQL parsing, HQL compilation (semantic analysis, logical and physical planning, authorization), and execution. Across this lifecycle, the following hooks fire, in execution order:
preDriverRun: before Driver.run()
This hook is controlled by the configuration property hive.exec.driver.run.hooks, with multiple hook implementation classes separated by commas. A hook must implement the org.apache.hadoop.hive.ql.HiveDriverRunHook interface, which is declared as follows:
public interface HiveDriverRunHook extends Hook {
  /**
   * Invoked before Hive begins any processing of a command in the Driver,
   * notably before compilation and any customizable performance logging.
   */
  public void preDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;

  /**
   * Invoked after Hive performs any processing of a command, just before a
   * response is returned to the entity calling the Driver.
   */
  public void postDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;
}
As you can see, the hook also provides a postDriverRun method, invoked after the HQL has finished executing but before results are returned to the caller; more on that later.
The context argument Hive passes in is org.apache.hadoop.hive.ql.HiveDriverRunHookContextImpl, the default implementation of HiveDriverRunHookContext. It exposes two useful pieces of state: the HiveConf and the command being executed. The call site looks like this:
HiveDriverRunHookContext hookContext = new HiveDriverRunHookContextImpl(conf, command);
// Get all the driver run hooks and pre-execute them.
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  errorMessage = "FAILED: Hive Internal Error: " + Utilities.getNameMessage(e);
  SQLState = ErrorMsg.findSQLState(e.getMessage());
  downstreamError = e;
  console.printError(errorMessage + "\n"
      + org.apache.hadoop.util.StringUtils.stringifyException(e));
  return createProcessorResponse(12);
}
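The call site above resolves the comma-separated class names from the configuration via reflection. As a minimal, self-contained sketch of that pattern (with the Hive `Hook` type and the `AuditHook` class stubbed locally, since they are illustrative stand-ins rather than Hive's real classes):

```java
import java.util.ArrayList;
import java.util.List;

public class HookLoaderSketch {

  // Stand-in for org.apache.hadoop.hive.ql.hooks.Hook
  public interface Hook {}

  // A hypothetical user-supplied hook implementation.
  public static class AuditHook implements Hook {}

  // Split the comma-separated configuration value and instantiate
  // each class reflectively, as Hive's Driver.getHooks() does.
  public static <T extends Hook> List<T> getHooks(String csv, Class<T> clazz)
      throws Exception {
    List<T> hooks = new ArrayList<>();
    if (csv == null || csv.trim().isEmpty()) {
      return hooks;
    }
    for (String name : csv.split(",")) {
      hooks.add(clazz.cast(
          Class.forName(name.trim()).getDeclaredConstructor().newInstance()));
    }
    return hooks;
  }

  public static void main(String[] args) throws Exception {
    List<Hook> hooks = getHooks(AuditHook.class.getName(), Hook.class);
    System.out.println(hooks.size() + " hook(s) loaded: "
        + hooks.get(0).getClass().getSimpleName());
  }
}
```

In real deployments the hook classes are set in hive-site.xml or via a SET command, and their jars must be on Hive's classpath so this reflective lookup can find them.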
preAnalyze: before semantic analysis
After Driver.run() begins, the parsed HQL enters the semantic-analysis stage of compilation. Just before semantic analysis, Hive invokes the preAnalyze method of the HiveSemanticAnalyzerHook hook. This hook is configured via hive.semantic.analyzer.hook, and implementations must implement the org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook interface, declared as follows:
public interface HiveSemanticAnalyzerHook extends Hook {
  public ASTNode preAnalyze(
      HiveSemanticAnalyzerHookContext context,
      ASTNode ast) throws SemanticException;

  public void postAnalyze(
      HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}
The hook class also provides a postAnalyze method, called after semantic analysis completes; more on that below.
The context argument Hive uses is org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContextImpl, the default implementation of HiveSemanticAnalyzerHookContext. It exposes the HQL's inputs and outputs, the submitting user, the HiveConf, the client IP address, and so on. Note that the input and output tables and partitions are only known after semantic analysis, so they are not available inside preAnalyze. The call site looks like this:
List<HiveSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
        HiveSemanticAnalyzerHook.class);
// Do semantic analysis and plan generation
if (saHooks != null) {
  HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
  hookCtx.setConf(conf);
  hookCtx.setUserName(userName);
  hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
  hookCtx.setCommand(command);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    tree = hook.preAnalyze(hookCtx, tree);
  }
  // Semantic analysis starts here; it covers generation and optimization
  // of the logical and physical execution plans
  sem.analyze(tree, ctx);
  // Update the context with the analyzer so the postAnalyze hooks can use it
  hookCtx.update(sem);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    hook.postAnalyze(hookCtx, sem.getRootTasks());
  }
} else {
  sem.analyze(tree, ctx);
}
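A common use of preAnalyze is to veto or rewrite a query before Hive analyzes it. The sketch below shows that guard pattern in a runnable form; the real hook receives an ASTNode and a HiveSemanticAnalyzerHookContext, but here those Hive types are stubbed with a plain command string, and both the stub SemanticException and the DROP TABLE policy are purely illustrative:

```java
public class PreAnalyzeGuardSketch {

  // Stand-in for org.apache.hadoop.hive.ql.parse.SemanticException
  public static class SemanticException extends Exception {
    public SemanticException(String msg) { super(msg); }
  }

  // Hypothetical guard: refuse DROP TABLE statements from this client.
  // The real preAnalyze would inspect the ASTNode instead of a string.
  public static String preAnalyze(String command) throws SemanticException {
    if (command.trim().toUpperCase().startsWith("DROP TABLE")) {
      throw new SemanticException("DROP TABLE is not allowed from this client");
    }
    return command; // the real hook returns the (possibly rewritten) ASTNode
  }

  public static void main(String[] args) throws SemanticException {
    System.out.println(preAnalyze("SELECT * FROM t"));
  }
}
```

Throwing a SemanticException from preAnalyze aborts compilation, which is how such hooks enforce policy before any plan is built.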
postAnalyze: after semantic analysis
As noted in the preAnalyze discussion, postAnalyze belongs to the same hook class, so it shares the same configuration. The difference is that it runs after Hive's semantic analysis, so it can access the HQL's input and output tables and partitions, as well as the Tasks produced by analysis. From those you can determine whether the query requires distributed execution and which execution engine will run it. See the preAnalyze section above for the code and configuration.
The redactor hook: before QueryPlan generation
This hook runs after semantic analysis and before the QueryPlan is generated, so by the time it fires, analysis is complete and the tasks to run are fixed. Its purpose is to rewrite the query string: if the query contains sensitive table or column information, it can be masked here, so that the redacted HQL is what appears in the YARN ResourceManager UI or anywhere else the query is displayed.
The hook is configured via hive.exec.query.redactor.hooks, with multiple implementation classes separated by commas. A hook must extend the org.apache.hadoop.hive.ql.hooks.Redactor abstract class and override its redactQuery method. The class is declared as follows:
public abstract class Redactor implements Hook, Configurable {

  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  /**
   * Implementations may modify the query so that when placed in the job.xml
   * and thus potentially exposed to admin users, the query does not expose
   * sensitive information.
   */
  public String redactQuery(String query) {
    return query;
  }
}
The call site looks like this:
public static String redactLogString(HiveConf conf, String logString)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  String redactedString = logString;
  if (conf != null && logString != null) {
    List<Redactor> queryRedactors = getHooks(conf, ConfVars.QUERYREDACTORHOOKS, Redactor.class);
    for (Redactor redactor : queryRedactors) {
      redactor.setConf(conf);
      redactedString = redactor.redactQuery(redactedString);
    }
  }
  return redactedString;
}
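A concrete redactQuery body might mask literals that match a sensitive pattern. The sketch below masks 11-digit numbers (for example, phone-number literals); the pattern, the replacement token, and the class name are all illustrative choices, not anything prescribed by Hive:

```java
import java.util.regex.Pattern;

public class PhoneRedactorSketch {

  // Hypothetical policy: any standalone 11-digit run is treated as a
  // phone number and masked before the query string is logged.
  private static final Pattern PHONE = Pattern.compile("\\b\\d{11}\\b");

  public static String redactQuery(String query) {
    return PHONE.matcher(query).replaceAll("<REDACTED>");
  }

  public static void main(String[] args) {
    System.out.println(redactQuery(
        "SELECT * FROM users WHERE phone = 13812345678"));
  }
}
```

In a real hook this method would live in a subclass of Redactor, and the redacted string is what ends up in job.xml and the RM UI, as the redactLogString call site above shows.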
preExecutionHook: before Task execution
Once the QueryPlan has been generated and authorization has passed, the individual Tasks are executed. Before Task execution, a pre-execution hook fires. It is configured via hive.exec.pre.hooks, with multiple hook implementation classes separated by commas, and can be implemented in either of two ways:
1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface
This interface receives an org.apache.hadoop.hive.ql.hooks.HookContext instance, which exposes the query plan, the HiveConf, lineage information, the UGI, the submitting user name, the input and output tables and partitions, and more, which makes it very useful for building custom functionality.
The interface is declared as follows:
public interface ExecuteWithHookContext extends Hook {
  void run(HookContext hookContext) throws Exception;
}
2. Implement the org.apache.hadoop.hive.ql.hooks.PreExecute interface
This interface receives the SessionState, the UGI, and the HQL's input and output tables and partitions. It is now marked as deprecated, and compared with ExecuteWithHookContext above, the information it provides may not cover every need.
It is declared as follows:
public interface PreExecute extends Hook {
  /**
   * The run command that is called just before the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  public void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, UserGroupInformation ugi)
      throws Exception;
}
The hook is invoked like this:
SessionState ss = SessionState.get();
HookContext hookContext = new HookContext(plan, conf, ctx.getPathToCS(), ss.getUserName(), ss.getUserIpAddress());
hookContext.setHookType(HookContext.HookType.PRE_EXEC_HOOK);
for (Hook peh : getHooks(HiveConf.ConfVars.PREEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
    ((ExecuteWithHookContext) peh).run(hookContext);
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  } else if (peh instanceof PreExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
    ((PreExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        Utils.getUGI());
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  }
}
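A typical pre-execution hook use case is auditing: recording who ran which query against which tables, using the user, query id, inputs, and outputs carried by the HookContext. The sketch below shows only the formatting core of such a hook, with Hive's HookContext, ReadEntity, and WriteEntity stubbed as plain strings; the method and field names are illustrative:

```java
import java.util.Arrays;
import java.util.List;

public class PreExecAuditSketch {

  // Build one audit line from the pieces a real hook would pull out of
  // HookContext (getUserName, getQueryPlan().getQueryId(), getInputs(),
  // getOutputs()).
  public static String formatAuditLine(String user, String queryId,
      List<String> inputs, List<String> outputs) {
    return String.format("user=%s queryId=%s inputs=%s outputs=%s",
        user, queryId, String.join(",", inputs), String.join(",", outputs));
  }

  public static void main(String[] args) {
    System.out.println(formatAuditLine("alice", "q-001",
        Arrays.asList("db.src"), Arrays.asList("db.dst")));
  }
}
```

A real implementation would implement ExecuteWithHookContext, build this line inside run(HookContext), and write it to a log or an audit store.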
ON_FAILURE_HOOKS: when a Task fails
If a Task fails during execution, Hive invokes the failure hooks. They are configured via hive.exec.failure.hooks, with multiple hook implementation classes separated by commas, and must implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface described above. These hooks are typically used to take some action on failure, such as collecting statistics.
They are invoked like this:
hookContext.setHookType(HookContext.HookType.ON_FAILURE_HOOK);
// Get all the failure execution hooks and execute them.
for (Hook ofh : getHooks(HiveConf.ConfVars.ONFAILUREHOOKS)) {
  perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
  ((ExecuteWithHookContext) ofh).run(hookContext);
  perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
}
postExecutionHook: after Task execution
This hook runs after the Tasks finish. If a Task failed, ON_FAILURE_HOOKS runs first, followed by postExecutionHook. It is configured via hive.exec.post.hooks, with multiple hook implementation classes separated by commas, and again there are two ways to implement it:
1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface
This is the same as for preExecutionHook.
2. Implement the org.apache.hadoop.hive.ql.hooks.PostExecute interface
This interface receives the SessionState, the UGI, column-level LineageInfo, and the HQL's input and output tables and partitions. It is now marked as deprecated, and compared with ExecuteWithHookContext above, the information it provides may not cover every need.
It is declared as follows:
public interface PostExecute extends Hook {
  /**
   * The run command that is called just after the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param lInfo
   *          The column level lineage information.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, LineageInfo lInfo,
      UserGroupInformation ugi) throws Exception;
}
It is invoked like this:
hookContext.setHookType(HookContext.HookType.POST_EXEC_HOOK);
// Get all the post execution hooks and execute them.
for (Hook peh : getHooks(HiveConf.ConfVars.POSTEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
    ((ExecuteWithHookContext) peh).run(hookContext);
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  } else if (peh instanceof PostExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
    ((PostExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        (SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo()
            : null), Utils.getUGI());
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  }
}
postDriverRun: after Tasks finish, before results are returned
This hook runs after all Tasks have completed but before results are returned to the caller, mirroring preDriverRun. Since it belongs to the same interface described earlier, it is not detailed again here.
Finally
That covers every hook in the HQL execution lifecycle. The overall order and flow can be summarized as:
Driver.run()
=> HiveDriverRunHook.preDriverRun() (hive.exec.driver.run.hooks)
=> Driver.compile()
=> HiveSemanticAnalyzerHook.preAnalyze() (hive.semantic.analyzer.hook)
=> SemanticAnalyze (QueryBlock, LogicalPlan, PhyPlan, TaskTree)
=> HiveSemanticAnalyzerHook.postAnalyze() (hive.semantic.analyzer.hook)
=> QueryString redactor (hive.exec.query.redactor.hooks)
=> QueryPlan Generation
=> Authorization
=> Driver.execute()
=> ExecuteWithHookContext.run() || PreExecute.run() (hive.exec.pre.hooks)
=> TaskRunner
=> if failed, ExecuteWithHookContext.run() (hive.exec.failure.hooks)
=> ExecuteWithHookContext.run() || PostExecute.run() (hive.exec.post.hooks)
=> HiveDriverRunHook.postDriverRun() (hive.exec.driver.run.hooks)
Reprints are welcome; please credit the original source: https://my.oschina.net/u/2539801/blog/1514648