The Hive Query Lifecycle You Want to Understand: Hook Functions!

Preface

Whether you go through the Hive CLI or HiveServer2, every HQL statement is parsed and executed by the Driver, roughly as pictured below:

[Figure: Hive architecture]

The Driver's processing flow is as follows:

HQL parsing (produces the AST) => semantic analysis (produces QueryBlocks) => logical plan generation (Operator tree) => logical optimization (Logical Optimizer) => physical plan generation (Task plan) => physical optimization (Task tree) => query plan construction (QueryPlan) => table and operation authorization => execution by the engine

The flow spans three broad areas: HQL parsing, HQL compilation (semantic analysis, logical and physical planning, authorization), and execution by the executor. Across this lifecycle, the following hook functions fire, listed in execution order:

preDriverRun, before the Driver starts processing

This hook is controlled by the hive.exec.driver.run.hooks property, with multiple hook implementation classes separated by commas. A hook must implement the org.apache.hadoop.hive.ql.HiveDriverRunHook interface, which is described as follows:

public interface HiveDriverRunHook extends Hook {
  /**
   * Invoked before Hive begins any processing of a command in the Driver,
   * notably before compilation and any customizable performance logging.
   */
  public void preDriverRun(
    HiveDriverRunHookContext hookContext) throws Exception;

  /**
   * Invoked after Hive performs any processing of a command, just before a
   * response is returned to the entity calling the Driver.
   */
  public void postDriverRun(
    HiveDriverRunHookContext hookContext) throws Exception;
}

As you can see, the hook also provides a postDriverRun method, invoked after the HQL has finished executing but before results are returned; more on that later.

Inside Hive, the argument passed in is the default HiveDriverRunHookContext implementation, org.apache.hadoop.hive.ql.HiveDriverRunHookContextImpl, which exposes two useful pieces of state: the HiveConf and the command being executed. The call site looks like this:

HiveDriverRunHookContext hookContext = new HiveDriverRunHookContextImpl(conf, command);
// Get all the driver run hooks and pre-execute them.
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
      driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  errorMessage = "FAILED: Hive Internal Error: " + Utilities.getNameMessage(e);
  SQLState = ErrorMsg.findSQLState(e.getMessage());
  downstreamError = e;
  console.printError(errorMessage + "\n"
      + org.apache.hadoop.util.StringUtils.stringifyException(e));
  return createProcessorResponse(12);
}
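
To make this concrete, here is a minimal sketch of such a hook; the class name and the plain stdout logging are illustrative assumptions, not anything Hive ships. It logs each command on the way into and out of the Driver, and would be registered via hive.exec.driver.run.hooks:

import org.apache.hadoop.hive.ql.HiveDriverRunHook;
import org.apache.hadoop.hive.ql.HiveDriverRunHookContext;

// Hypothetical example: log every command entering and leaving the Driver.
public class LoggingDriverRunHook implements HiveDriverRunHook {

  @Override
  public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception {
    // getCommand() returns the HQL string handed to Driver.run()
    System.out.println("preDriverRun: " + hookContext.getCommand());
  }

  @Override
  public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception {
    // Runs after execution, just before the response goes back to the caller
    System.out.println("postDriverRun: " + hookContext.getCommand());
  }
}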

preAnalyze, before semantic analysis

After Driver.run() starts, the parsed HQL moves into the semantic-analysis step of compilation, and just before semantic analysis the preAnalyze method of the HiveSemanticAnalyzerHook hook is invoked. This hook is configured via hive.semantic.analyzer.hook, and must implement the org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook interface, described as follows:

public interface HiveSemanticAnalyzerHook extends Hook {
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException;

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}

As you can see, the hook class also provides a postAnalyze method, invoked once semantic analysis has finished; more on that below.

Inside Hive, the argument passed in is the default HiveSemanticAnalyzerHookContext implementation, org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContextImpl, which exposes the HQL's inputs and outputs, the submitting user, the HiveConf, the client IP, and so on. Note that the input and output tables and partitions are only known after semantic analysis, so they are not yet available in preAnalyze. The call site looks like this:

List<HiveSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
        HiveSemanticAnalyzerHook.class);

// Do semantic analysis and plan generation
if (saHooks != null) {
  HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
  hookCtx.setConf(conf);
  hookCtx.setUserName(userName);
  hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
  hookCtx.setCommand(command);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    tree = hook.preAnalyze(hookCtx, tree);
  }
  // Semantic analysis starts here, covering generation and optimization of the logical and physical plans
  sem.analyze(tree, ctx);
  // Update the context from the analyzer so the postAnalyze hooks below can use it
  hookCtx.update(sem);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    hook.postAnalyze(hookCtx, sem.getRootTasks());
  }
} else {
  sem.analyze(tree, ctx);
}

postAnalyze, after semantic analysis

As the preAnalyze discussion shows, postAnalyze belongs to the same hook class, so its configuration is identical. The difference is that it runs after Hive's semantic analysis, so it can see the HQL's input and output tables and partitions as well as the Tasks produced by analysis, from which you can tell whether the job needs distributed execution and which execution engine it will use. The relevant code and configuration are in the preAnalyze discussion above; a minimal hook sketch follows.
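
Here is a minimal sketch of a hook implementing both methods; the class name and the logging are illustrative assumptions. preAnalyze passes the AST through untouched, while postAnalyze reads the inputs, outputs, and root Tasks that only exist once analysis is done:

import java.io.Serializable;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;

// Hypothetical example registered via hive.semantic.analyzer.hook
public class AuditSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    // Inputs/outputs are still empty here; only the command and the raw AST are usable.
    // Returning the AST (possibly rewritten) is mandatory.
    return ast;
  }

  @Override
  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
    // After sem.analyze() and hookCtx.update(sem), the entities are populated.
    System.out.println("inputs=" + context.getInputs()
        + ", outputs=" + context.getOutputs()
        + ", rootTasks=" + rootTasks.size());
  }
}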

The redactor hook, before the QueryPlan is generated

This hook runs after semantic analysis and before the QueryPlan is generated, so by the time it fires, analysis is complete and the tasks to run are already fixed. Its purpose is to rewrite the QueryString: if the QueryString contains sensitive table or column data, it can be replaced here, so that when the job is looked up in the Yarn RM UI or elsewhere, the redacted HQL is what gets displayed.

This hook is configured via hive.exec.query.redactor.hooks, with multiple implementation classes separated by commas. A hook must extend the org.apache.hadoop.hive.ql.hooks.Redactor abstract class and override its redactQuery method. The class is described as follows:

public abstract class Redactor implements Hook, Configurable {

  private Configuration conf;
  
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  /**
   * Implementations may modify the query so that when placed in the job.xml
   * and thus potentially exposed to admin users, the query does not expose
   * sensitive information.
   */
  public String redactQuery(String query) {
    return query;
  }
}

Its call site looks like this:

public static String redactLogString(HiveConf conf, String logString)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {

  String redactedString = logString;

  if (conf != null && logString != null) {
    List<Redactor> queryRedactors = getHooks(conf, ConfVars.QUERYREDACTORHOOKS, Redactor.class);
    for (Redactor redactor : queryRedactors) {
      redactor.setConf(conf);
      redactedString = redactor.redactQuery(redactedString);
    }
  }

  return redactedString;
}
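
As a sketch of what a redactor might look like (the class name and pattern are illustrative assumptions), here is one that masks anything resembling a 16-digit card number before the query string is logged:

import java.util.regex.Pattern;
import org.apache.hadoop.hive.ql.hooks.Redactor;

// Hypothetical example registered via hive.exec.query.redactor.hooks
public class CardNumberRedactor extends Redactor {

  private static final Pattern SIXTEEN_DIGITS = Pattern.compile("\\b\\d{16}\\b");

  @Override
  public String redactQuery(String query) {
    // Whatever this returns is what ends up in job.xml and the RM UI
    return SIXTEEN_DIGITS.matcher(query).replaceAll("################");
  }
}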

preExecutionHook, before Task execution

Once the QueryPlan has been generated and authorization has passed, the concrete Tasks are executed. Before a Task runs, a hook fires, configured via hive.exec.pre.hooks with multiple hook implementation classes separated by commas. This hook can be implemented in two ways:

1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface

This interface receives an instance of org.apache.hadoop.hive.ql.hooks.HookContext as its argument. The context carries the query plan, the HiveConf, lineage information, the UGI, the submitting user name, the input and output tables and partitions, and more, which gives us plenty to build our own functionality on.

The interface is described as follows:

public interface ExecuteWithHookContext extends Hook {

  void run(HookContext hookContext) throws Exception;
}

2. Implement the org.apache.hadoop.hive.ql.hooks.PreExecute interface

This interface receives the SessionState, the UGI, and the HQL's input and output tables and partitions. It is now marked deprecated, and compared with ExecuteWithHookContext above, the information it provides may not fully cover our needs.

Its description is as follows:

public interface PreExecute extends Hook {

  /**
   * The run command that is called just before the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  public void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, UserGroupInformation ugi)
    throws Exception;
}

The hook's call site looks like this:

SessionState ss = SessionState.get();
HookContext hookContext = new HookContext(plan, conf, ctx.getPathToCS(), ss.getUserName(), ss.getUserIpAddress());
hookContext.setHookType(HookContext.HookType.PRE_EXEC_HOOK);

for (Hook peh : getHooks(HiveConf.ConfVars.PREEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

    ((ExecuteWithHookContext) peh).run(hookContext);

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  } else if (peh instanceof PreExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

    ((PreExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        Utils.getUGI());

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  }
}
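
For illustration, here is a minimal sketch of an ExecuteWithHookContext pre-execution hook; the class name and the audit line it prints are assumptions, while the getters are the standard HookContext and QueryPlan accessors:

import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical example registered via hive.exec.pre.hooks
public class AuditPreExecHook implements ExecuteWithHookContext {

  @Override
  public void run(HookContext hookContext) throws Exception {
    QueryPlan plan = hookContext.getQueryPlan();
    // One audit line per query, before any Task starts
    System.out.println("queryId=" + plan.getQueryId()
        + ", user=" + hookContext.getUserName()
        + ", inputs=" + hookContext.getInputs()
        + ", outputs=" + hookContext.getOutputs());
  }
}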

ON_FAILURE_HOOKS, when a Task fails

If a Task fails during execution, Hive invokes this failure hook. It is configured via the hive.exec.failure.hooks property, with multiple hook implementation classes separated by commas, and must implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface described above. It is mainly used to react when a job fails, for example to collect statistics.

The hook's call site looks like this:

hookContext.setHookType(HookContext.HookType.ON_FAILURE_HOOK);
// Get all the failure execution hooks and execute them.
for (Hook ofh : getHooks(HiveConf.ConfVars.ONFAILUREHOOKS)) {
  perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());

  ((ExecuteWithHookContext) ofh).run(hookContext);

  perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
}
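
A failure hook is just another ExecuteWithHookContext implementation; a sketch follows, with the class name and the metric-style logging as assumptions. Checking getHookType() lets the same class double as a pre- or post-execution hook if registered under several properties:

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical example registered via hive.exec.failure.hooks
public class FailureCounterHook implements ExecuteWithHookContext {

  @Override
  public void run(HookContext hookContext) throws Exception {
    // getHookType() is ON_FAILURE_HOOK when invoked through hive.exec.failure.hooks
    System.err.println("query failed: " + hookContext.getQueryPlan().getQueryId()
        + ", hookType=" + hookContext.getHookType());
  }
}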

postExecutionHook, after Task execution

This hook runs after the Tasks have finished. If a Task failed, the ON_FAILURE_HOOKS hook runs first, then postExecutionHook. It is configured via the hive.exec.post.hooks property, with multiple hook implementation classes separated by commas, and it too can be implemented in two ways:

1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface

This works exactly as for preExecutionHook.

2. Implement the org.apache.hadoop.hive.ql.hooks.PostExecute interface

This interface receives the SessionState, the UGI, column-level LineageInfo, and the HQL's input and output tables and partitions. It is now marked deprecated, and compared with ExecuteWithHookContext above, the information it provides may not fully cover our needs.

Its description is as follows:

public interface PostExecute extends Hook {

  /**
   * The run command that is called just after the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param lInfo
   *           The column level lineage information.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, LineageInfo lInfo,
      UserGroupInformation ugi) throws Exception;
}

The hook's call site looks like this:

hookContext.setHookType(HookContext.HookType.POST_EXEC_HOOK);
// Get all the post execution hooks and execute them.
for (Hook peh : getHooks(HiveConf.ConfVars.POSTEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

    ((ExecuteWithHookContext) peh).run(hookContext);

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  } else if (peh instanceof PostExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

    ((PostExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        (SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo()
            : null), Utils.getUGI());

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  }
}

postDriverRun, after Tasks finish and before results are returned

This hook runs after the Tasks have finished but before results are returned, mirroring preDriverRun. Since it belongs to the same interface, it is not described in detail here.

Finally

That covers every hook function in the HQL execution lifecycle. The execution order and flow can be summarized as follows:

Driver.run()

=> HiveDriverRunHook.preDriverRun()(hive.exec.driver.run.hooks)

=> Driver.compile()

=> HiveSemanticAnalyzerHook.preAnalyze()(hive.semantic.analyzer.hook)

=> SemanticAnalyze(QueryBlock, LogicalPlan, PhyPlan, TaskTree)

=> HiveSemanticAnalyzerHook.postAnalyze()(hive.semantic.analyzer.hook)

=> QueryString redactor(hive.exec.query.redactor.hooks)

=> QueryPlan Generation

=> Authorization

=> Driver.execute()

=> ExecuteWithHookContext.run() || PreExecute.run() (hive.exec.pre.hooks)

=> TaskRunner

=> if failed, ExecuteWithHookContext.run()(hive.exec.failure.hooks)

=> ExecuteWithHookContext.run() || PostExecute.run() (hive.exec.post.hooks)

=> HiveDriverRunHook.postDriverRun()(hive.exec.driver.run.hooks)

Feel free to read and repost this article; when reposting, please credit the source: https://my.oschina.net/u/2539801/blog/1514648
