The Hive Query Lifecycle You Want to Understand: Hook Functions!

Preface

Whether you go through the Hive CLI or HiveServer2, every HQL statement is parsed and executed by the Driver, roughly as pictured below:

[Figure: Hive architecture]

The Driver's processing flow is as follows:

HQL parsing (produces the AST) => semantic analysis (produces QueryBlocks) => logical plan generation (Operator tree) => logical optimization (Logical Optimizer) => physical plan generation (Task plan) => physical optimization (Task tree) => query plan construction (QueryPlan) => table and operation authorization => execution by the engine

The flow spans three broad areas: HQL parsing, HQL compilation (semantic analysis, logical and physical planning, authorization), and execution by the executor. Across this lifecycle, the following hook functions fire, listed in execution order:

preDriverRun, before the Driver starts processing

This hook is controlled by the hive.exec.driver.run.hooks property, with multiple hook implementation classes separated by commas. A hook must implement the org.apache.hadoop.hive.ql.HiveDriverRunHook interface, which is described as follows:

public interface HiveDriverRunHook extends Hook {
  /**
   * Invoked before Hive begins any processing of a command in the Driver,
   * notably before compilation and any customizable performance logging.
   */
  public void preDriverRun(
    HiveDriverRunHookContext hookContext) throws Exception;

  /**
   * Invoked after Hive performs any processing of a command, just before a
   * response is returned to the entity calling the Driver.
   */
  public void postDriverRun(
    HiveDriverRunHookContext hookContext) throws Exception;
}

As you can see, the hook also provides a postDriverRun method, invoked after the HQL has finished executing but before results are returned; more on that later.

Inside Hive, the argument passed in is the default HiveDriverRunHookContext implementation, org.apache.hadoop.hive.ql.HiveDriverRunHookContextImpl, which exposes two useful pieces of state: the HiveConf and the command being executed. The call site looks like this:

HiveDriverRunHookContext hookContext = new HiveDriverRunHookContextImpl(conf, command);
// Get all the driver run hooks and pre-execute them.
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
      driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  errorMessage = "FAILED: Hive Internal Error: " + Utilities.getNameMessage(e);
  SQLState = ErrorMsg.findSQLState(e.getMessage());
  downstreamError = e;
  console.printError(errorMessage + "\n"
      + org.apache.hadoop.util.StringUtils.stringifyException(e));
  return createProcessorResponse(12);
}
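
To make this concrete, here is a minimal sketch of such a hook; the class name and the plain stdout logging are illustrative assumptions, not anything Hive ships. It logs each command on the way into and out of the Driver, and would be registered via hive.exec.driver.run.hooks:

import org.apache.hadoop.hive.ql.HiveDriverRunHook;
import org.apache.hadoop.hive.ql.HiveDriverRunHookContext;

// Hypothetical example: log every command entering and leaving the Driver.
public class LoggingDriverRunHook implements HiveDriverRunHook {

  @Override
  public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception {
    // getCommand() returns the HQL string handed to Driver.run()
    System.out.println("preDriverRun: " + hookContext.getCommand());
  }

  @Override
  public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception {
    // Runs after execution, just before the response goes back to the caller
    System.out.println("postDriverRun: " + hookContext.getCommand());
  }
}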

preAnalyze, before semantic analysis

After Driver.run() starts, the parsed HQL moves into the semantic-analysis step of compilation, and just before semantic analysis the preAnalyze method of the HiveSemanticAnalyzerHook hook is invoked. This hook is configured via hive.semantic.analyzer.hook, and must implement the org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook interface, described as follows:

public interface HiveSemanticAnalyzerHook extends Hook {
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException;

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}

As you can see, the hook class also provides a postAnalyze method, invoked once semantic analysis has finished; more on that below.

Inside Hive, the argument passed in is the default HiveSemanticAnalyzerHookContext implementation, org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContextImpl, which exposes the HQL's inputs and outputs, the submitting user, the HiveConf, the client IP, and so on. Note that the input and output tables and partitions are only known after semantic analysis, so they are not yet available in preAnalyze. The call site looks like this:

List<HiveSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
        HiveSemanticAnalyzerHook.class);

// Do semantic analysis and plan generation
if (saHooks != null) {
  HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
  hookCtx.setConf(conf);
  hookCtx.setUserName(userName);
  hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
  hookCtx.setCommand(command);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    tree = hook.preAnalyze(hookCtx, tree);
  }
  // Semantic analysis starts here, covering generation and optimization of the logical and physical plans
  sem.analyze(tree, ctx);
  // Update the context from the analyzer so the postAnalyze hooks below can use it
  hookCtx.update(sem);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    hook.postAnalyze(hookCtx, sem.getRootTasks());
  }
} else {
  sem.analyze(tree, ctx);
}

postAnalyze, after semantic analysis

As the preAnalyze discussion shows, postAnalyze belongs to the same hook class, so its configuration is identical. The difference is that it runs after Hive's semantic analysis, so it can see the HQL's input and output tables and partitions as well as the Tasks produced by analysis, from which you can tell whether the job needs distributed execution and which execution engine it will use. The relevant code and configuration are in the preAnalyze discussion above; a minimal hook sketch follows.
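
Here is a minimal sketch of a hook implementing both methods; the class name and the logging are illustrative assumptions. preAnalyze passes the AST through untouched, while postAnalyze reads the inputs, outputs, and root Tasks that only exist once analysis is done:

import java.io.Serializable;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;

// Hypothetical example registered via hive.semantic.analyzer.hook
public class AuditSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    // Inputs/outputs are still empty here; only the command and the raw AST are usable.
    // Returning the AST (possibly rewritten) is mandatory.
    return ast;
  }

  @Override
  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
    // After sem.analyze() and hookCtx.update(sem), the entities are populated.
    System.out.println("inputs=" + context.getInputs()
        + ", outputs=" + context.getOutputs()
        + ", rootTasks=" + rootTasks.size());
  }
}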

The redactor hook, before the QueryPlan is generated

This hook runs after semantic analysis and before the QueryPlan is generated, so by the time it fires, analysis is complete and the tasks to run are already fixed. Its purpose is to rewrite the QueryString: if the QueryString contains sensitive table or column data, it can be replaced here, so that when the job is looked up in the Yarn RM UI or elsewhere, the redacted HQL is what gets displayed.

This hook is configured via hive.exec.query.redactor.hooks, with multiple implementation classes separated by commas. A hook must extend the org.apache.hadoop.hive.ql.hooks.Redactor abstract class and override its redactQuery method. The class is described as follows:

public abstract class Redactor implements Hook, Configurable {

  private Configuration conf;
  
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  /**
   * Implementations may modify the query so that when placed in the job.xml
   * and thus potentially exposed to admin users, the query does not expose
   * sensitive information.
   */
  public String redactQuery(String query) {
    return query;
  }
}

Its call site looks like this:

public static String redactLogString(HiveConf conf, String logString)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {

  String redactedString = logString;

  if (conf != null && logString != null) {
    List<Redactor> queryRedactors = getHooks(conf, ConfVars.QUERYREDACTORHOOKS, Redactor.class);
    for (Redactor redactor : queryRedactors) {
      redactor.setConf(conf);
      redactedString = redactor.redactQuery(redactedString);
    }
  }

  return redactedString;
}
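
As a sketch of what a redactor might look like (the class name and pattern are illustrative assumptions), here is one that masks anything resembling a 16-digit card number before the query string is logged:

import java.util.regex.Pattern;
import org.apache.hadoop.hive.ql.hooks.Redactor;

// Hypothetical example registered via hive.exec.query.redactor.hooks
public class CardNumberRedactor extends Redactor {

  private static final Pattern SIXTEEN_DIGITS = Pattern.compile("\\b\\d{16}\\b");

  @Override
  public String redactQuery(String query) {
    // Whatever this returns is what ends up in job.xml and the RM UI
    return SIXTEEN_DIGITS.matcher(query).replaceAll("################");
  }
}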

preExecutionHook, before Task execution

Once the QueryPlan has been generated and authorization has passed, the concrete Tasks are executed. Before a Task runs, a hook fires, configured via hive.exec.pre.hooks with multiple hook implementation classes separated by commas. This hook can be implemented in two ways:

1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface

This interface receives an instance of org.apache.hadoop.hive.ql.hooks.HookContext as its argument. The context carries the query plan, the HiveConf, lineage information, the UGI, the submitting user name, the input and output tables and partitions, and more, which gives us plenty to build our own functionality on.

The interface is described as follows:

public interface ExecuteWithHookContext extends Hook {

  void run(HookContext hookContext) throws Exception;
}

2. Implement the org.apache.hadoop.hive.ql.hooks.PreExecute interface

This interface receives the SessionState, the UGI, and the HQL's input and output tables and partitions. It is now marked deprecated, and compared with ExecuteWithHookContext above, the information it provides may not fully cover our needs.

Its description is as follows:

public interface PreExecute extends Hook {

  /**
   * The run command that is called just before the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  public void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, UserGroupInformation ugi)
    throws Exception;
}

The hook's call site looks like this:

SessionState ss = SessionState.get();
HookContext hookContext = new HookContext(plan, conf, ctx.getPathToCS(), ss.getUserName(), ss.getUserIpAddress());
hookContext.setHookType(HookContext.HookType.PRE_EXEC_HOOK);

for (Hook peh : getHooks(HiveConf.ConfVars.PREEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

    ((ExecuteWithHookContext) peh).run(hookContext);

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  } else if (peh instanceof PreExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

    ((PreExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        Utils.getUGI());

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  }
}
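
For illustration, here is a minimal sketch of an ExecuteWithHookContext pre-execution hook; the class name and the audit line it prints are assumptions, while the getters are the standard HookContext and QueryPlan accessors:

import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical example registered via hive.exec.pre.hooks
public class AuditPreExecHook implements ExecuteWithHookContext {

  @Override
  public void run(HookContext hookContext) throws Exception {
    QueryPlan plan = hookContext.getQueryPlan();
    // One audit line per query, before any Task starts
    System.out.println("queryId=" + plan.getQueryId()
        + ", user=" + hookContext.getUserName()
        + ", inputs=" + hookContext.getInputs()
        + ", outputs=" + hookContext.getOutputs());
  }
}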

ON_FAILURE_HOOKS, when a Task fails

If a Task fails during execution, Hive invokes this failure hook. It is configured via the hive.exec.failure.hooks property, with multiple hook implementation classes separated by commas, and must implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface described above. It is mainly used to react when a job fails, for example to collect statistics.

The hook's call site looks like this:

hookContext.setHookType(HookContext.HookType.ON_FAILURE_HOOK);
// Get all the failure execution hooks and execute them.
for (Hook ofh : getHooks(HiveConf.ConfVars.ONFAILUREHOOKS)) {
  perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());

  ((ExecuteWithHookContext) ofh).run(hookContext);

  perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
}
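
A failure hook is just another ExecuteWithHookContext implementation; a sketch follows, with the class name and the metric-style logging as assumptions. Checking getHookType() lets the same class double as a pre- or post-execution hook if registered under several properties:

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical example registered via hive.exec.failure.hooks
public class FailureCounterHook implements ExecuteWithHookContext {

  @Override
  public void run(HookContext hookContext) throws Exception {
    // getHookType() is ON_FAILURE_HOOK when invoked through hive.exec.failure.hooks
    System.err.println("query failed: " + hookContext.getQueryPlan().getQueryId()
        + ", hookType=" + hookContext.getHookType());
  }
}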

postExecutionHook, after Task execution

This hook runs after the Tasks have finished. If a Task failed, the ON_FAILURE_HOOKS hook runs first, then postExecutionHook. It is configured via the hive.exec.post.hooks property, with multiple hook implementation classes separated by commas, and it too can be implemented in two ways:

1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface

This works exactly as for preExecutionHook.

2. Implement the org.apache.hadoop.hive.ql.hooks.PostExecute interface

This interface receives the SessionState, the UGI, column-level LineageInfo, and the HQL's input and output tables and partitions. It is now marked deprecated, and compared with ExecuteWithHookContext above, the information it provides may not fully cover our needs.

Its description is as follows:

public interface PostExecute extends Hook {

  /**
   * The run command that is called just after the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param lInfo
   *           The column level lineage information.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, LineageInfo lInfo,
      UserGroupInformation ugi) throws Exception;
}

The hook's call site looks like this:

hookContext.setHookType(HookContext.HookType.POST_EXEC_HOOK);
// Get all the post execution hooks and execute them.
for (Hook peh : getHooks(HiveConf.ConfVars.POSTEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

    ((ExecuteWithHookContext) peh).run(hookContext);

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  } else if (peh instanceof PostExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

    ((PostExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        (SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo()
            : null), Utils.getUGI());

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  }
}

postDriverRun, after Tasks finish and before results are returned

This hook runs after the Tasks have finished but before results are returned, mirroring preDriverRun. Since it belongs to the same interface, it is not described in detail here.

Finally

That covers every hook function in the HQL execution lifecycle. The execution order and flow can be summarized as follows:

Driver.run()

=> HiveDriverRunHook.preDriverRun()(hive.exec.driver.run.hooks)

=> Driver.compile()

=> HiveSemanticAnalyzerHook.preAnalyze()(hive.semantic.analyzer.hook)

=> SemanticAnalyze(QueryBlock, LogicalPlan, PhyPlan, TaskTree)

=> HiveSemanticAnalyzerHook.postAnalyze()(hive.semantic.analyzer.hook)

=> QueryString redactor(hive.exec.query.redactor.hooks)

=> QueryPlan Generation

=> Authorization

=> Driver.execute()

=> ExecuteWithHookContext.run() || PreExecute.run() (hive.exec.pre.hooks)

=> TaskRunner

=> if failed, ExecuteWithHookContext.run()(hive.exec.failure.hooks)

=> ExecuteWithHookContext.run() || PostExecute.run() (hive.exec.post.hooks)

=> HiveDriverRunHook.postDriverRun()(hive.exec.driver.run.hooks)

Feel free to read and repost this article; when reposting, please credit the source: https://my.oschina.net/u/2539801/blog/1514648
