Configuring Apache Tez to Run MR Jobs in Hive

Preface

  • Hive:2.3.0
  • Hadoop:2.7.7
  • JDK:1.8.0_221
  • Tez:0.9.1
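  • This setup uses Apache Tez only for the MR jobs that Hive runs, not as a cluster-wide Hadoop setting, and it uses the prebuilt binary tarball.
  • Hadoop-Tez compatibility: Apache Tez 0.9.0 is built against some Hadoop 2.7.0 artifacts, so for Hadoop 2.7.x, Tez 0.9.0 or later is recommended to avoid compatibility problems. For Hadoop 2.6.x, the official recommendation is Tez 0.8.3 or later.
  • Hive-Tez version compatibility: Hive-Tez Compatibility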
  • Install/Deploy Instructions for Tez
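
The Hadoop-Tez compatibility note above can be turned into a quick lookup. This is only a minimal sketch: the function name `min_tez_for_hadoop` is hypothetical, and it encodes nothing beyond the two rules stated in this post.

```shell
# Print the minimum Tez version this post suggests for a given Hadoop version.
# Only the two rules quoted above are encoded; anything else prints "unknown".
min_tez_for_hadoop() {
    case "$1" in
        2.7.*) echo "0.9.0" ;;
        2.6.*) echo "0.8.3" ;;
        *)     echo "unknown" ;;
    esac
}

# For the Hadoop version used in this post:
min_tez_for_hadoop 2.7.7   # prints 0.9.0
```

For any other Hadoop line, check the compatibility page rather than relying on this lookup.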

Configuration Steps

1) Download the prebuilt Tez tarball and extract it

Download locations:

Apache Tez releases: Apache TEZ Releases

Mirror: Apache Tez

Extract and rename:

tar -xzvf apache-tez-0.9.1-bin.tar.gz -C /opt/module/
mv /opt/module/apache-tez-0.9.1-bin /opt/module/tez-0.9.1

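Note: Tez (client) must be installed on the same node as Hive (client).

2) Replace the Hadoop jars under tez/lib

This step avoids jar version conflicts. The jars under tez/lib are later added to HADOOP_CLASSPATH, and if the bundled Hadoop 2.7.0 jars are left in place, jobs that Hive runs with the MR engine fail with version-conflict errors.

Delete the Hadoop jars under tez-0.9.1/lib: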

rm /opt/module/tez-0.9.1/lib/hadoop-mapreduce-client-core-2.7.0.jar
rm /opt/module/tez-0.9.1/lib/hadoop-mapreduce-client-common-2.7.0.jar

Copy the matching jars from the cluster's Hadoop installation into tez-0.9.1/lib (in my testing, skipping this copy also worked):

cp /opt/module/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.7.jar /opt/module/tez-0.9.1/lib
cp /opt/module/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.7.7.jar /opt/module/tez-0.9.1/lib
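
To confirm that no mismatched Hadoop jars remain after the swap, a check along the following lines may help. It is a sketch under the assumption that the jars follow the `hadoop-<module>-<version>.jar` naming seen above; the function name `find_mismatched_hadoop_jars` is hypothetical.

```shell
# List hadoop-*.jar files in a lib directory whose version suffix differs
# from the expected Hadoop version. Prints one file name per line.
find_mismatched_hadoop_jars() {
    lib_dir=$1
    expected=$2
    for jar in "$lib_dir"/hadoop-*.jar; do
        [ -e "$jar" ] || continue           # the glob matched nothing
        name=$(basename "$jar")
        case "$name" in
            *-"$expected".jar) ;;            # version matches, nothing to report
            *) echo "$name" ;;
        esac
    done
}

# Example call with the paths used in this post:
# find_mismatched_hadoop_jars /opt/module/tez-0.9.1/lib 2.7.7
```

Any file name it prints is a Hadoop jar in tez/lib that does not match the cluster version; whether it needs replacing depends on your setup.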

3) Upload the tarball tez/share/tez.tar.gz to HDFS and set its permissions

hadoop fs -rm -r -f /apps/tez-0.9.1
hadoop fs -mkdir -p /apps/tez-0.9.1
hadoop fs -put -f /opt/module/tez-0.9.1/share/tez.tar.gz /apps/tez-0.9.1
hadoop fs -chmod -R 777 /apps

PS: If you built Tez from the Maven source instead, upload the tarball tez-dist/target/tez-x.y.z-SNAPSHOT.tar.gz (under the source root) to HDFS.
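
The upload can also be wrapped so that it checks the local tarball first and verifies the result afterwards. The following is a sketch, not a definitive script: the helper names `run` and `upload_tez` and the `DRY_RUN` switch are hypothetical, while `hadoop fs -mkdir -p`, `-put -f` and `-test -e` are standard HDFS shell commands.

```shell
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Upload a local Tez tarball to an HDFS directory and confirm it exists there.
upload_tez() {
    tarball=$1      # local path to tez.tar.gz
    hdfs_dir=$2     # target HDFS directory
    if [ ! -f "$tarball" ]; then
        echo "local tarball not found: $tarball" >&2
        return 1
    fi
    run hadoop fs -mkdir -p "$hdfs_dir" &&
    run hadoop fs -put -f "$tarball" "$hdfs_dir" &&
    run hadoop fs -test -e "$hdfs_dir/$(basename "$tarball")"
}

# Example (prints the commands without touching HDFS):
# DRY_RUN=1 upload_tez /opt/module/tez-0.9.1/share/tez.tar.gz /apps/tez-0.9.1
```

The function returns a non-zero status if the local file is missing or any HDFS command fails, so it can be chained with `&&` in a larger setup script.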

4) Create tez-site.xml under hive/conf

Create the file tez-site.xml in the hive/conf directory and set the following parameters:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Location of the jars Tez depends on: the HDFS path of the uploaded Tez tarball -->
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/apps/tez-0.9.1/tez.tar.gz</value>
        <description>
            String value to a file path.
            The location of the Tez libraries which will be localized for DAGs.
        </description>
        <type>string</type>
    </property>

    <!-- Whether to use the Hadoop libraries deployed on the cluster. If false, the Hadoop dependencies bundled in tez.lib.uris are used -->
    <property>
        <name>tez.use.cluster.hadoop-libs</name>
        <value>false</value>
        <description>
            Boolean value.
            Specify whether hadoop libraries required to run Tez should be the ones deployed on the cluster.
            This is disabled by default - with the expectation being that tez.lib.uris has a complete
            tez-deployment which contains the hadoop libraries.
        </description>
        <type>boolean</type>
    </property>

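    <!-- Applies only when no Xmx/Xms is given in the launch command opts (e.g. tez.am.launch.cmd-opts).
    Sets the fraction of each container's memory that Tez hands to the JVM heap.
    Lower it if YARN containers have little memory; raise it if they have plenty. -->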
    <property>
        <name>tez.container.max.java.heap.fraction</name>
        <value>0.2</value>
        <description>
            Double value. Tez automatically determines the Xmx for the JVMs used to run
            Tez tasks and app masters. This feature is enabled if the user has not
            specified Xmx or Xms values in the launch command opts. Doing automatic Xmx
            calculation is preferred because Tez can determine the best value based on
            actual allocation of memory to tasks in the cluster. The value is used as a
            fraction of the container memory size when sizing Xmx.
            Value should be greater than 0 and less than 1.

            Set this value to -1 to allow Tez to use different default max heap fraction
            for different container memory size. Current policy is to use 0.7 for container
            smaller than 4GB and use 0.8 for larger container.
        </description>
        <type>float</type>
    </property>

    <!-- Memory used by the Tez ApplicationMaster, in MB -->
    <!-- Default: 1024; kept at the default here even though the host has only 1.5 GB available -->
    <property>
        <name>tez.am.resource.memory.mb</name>
        <value>1024</value>
        <description>
            Int value. The amount of memory in MB to be used by the AppMaster
        </description>
        <type>integer</type>
    </property>

    <!-- Memory used by each Tez task, in MB -->
    <!-- Reduced because the host has only 1.5 GB of memory available -->
    <!-- Default: 1024 -->
    <property>
        <name>tez.task.resource.memory.mb</name>
        <value>512</value>
        <description>
            Int value. The amount of memory in MB to be used by tasks. This applies to 
            all tasks across all vertices. Setting it to the same value for all tasks 
            is helpful for container reuse and thus good for performance typically.
        </description>
        <type>integer</type>
    </property>

</configuration>
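
To see what these settings mean in practice, the values can be read back from the file and the resulting heap size estimated as container memory multiplied by the heap fraction. This is a rough sketch: the helper names `get_prop` and `heap_mb` are hypothetical, and the parsing assumes each `<name>` and `<value>` sits on its own line, as in the file above.

```shell
# Print the <value> of a named property from a Hadoop-style XML config.
# Assumes <name> and <value> each sit on a single line.
get_prop() {
    awk -v want="$2" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$1"
}

# Approximate JVM heap in MB: container memory (MB) * heap fraction, rounded down.
heap_mb() {
    awk -v mem="$1" -v frac="$2" 'BEGIN { printf "%d\n", mem * frac }'
}

# With the values configured above:
heap_mb 512 0.2    # task container: prints 102
heap_mb 1024 0.2   # AM container:   prints 204
```

So with a fraction of 0.2, a 512 MB task container ends up with a heap of roughly 102 MB. If queries fail with out-of-memory errors, raising the fraction or the container size is the first thing to try.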

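5) Set the Tez environment variables on the Hive client node and extend HADOOP_CLASSPATH

Append the variables to the end of conf/hive-env.sh under the Hive installation directory, so that they are loaded automatically every time Hive starts.

  • TEZ_CONF_DIR: the directory that contains the Tez configuration file tez-site.xml
  • TEZ_JARS: the directory where the Tez tarball was extracted
  • HADOOP_CLASSPATH: the classpath used by Hadoop at runtime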
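
# Tez classpath
# TEZ_CONF_DIR must be a directory, not a file: classpath entries are directories or jars.
# Step 4 placed tez-site.xml in hive/conf; this assumes bin/hive has already set HIVE_CONF_DIR to it.
TEZ_CONF_DIR=${HIVE_CONF_DIR}
TEZ_JARS=/opt/module/tez-0.9.1
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
# Extra jars (hadoop-lzo and similar) can be supplied through HIVE_AUX_JARS_PATH.
# Here all extra dependencies live under /opt/libs/
export HIVE_AUX_JARS_PATH=/opt/libs/*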
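
One detail in the export above: when HADOOP_CLASSPATH is unset, the resulting value starts with a colon, which leaves an empty classpath entry that the JVM typically treats as the current directory. A small helper can avoid that. This is a sketch, and the function name `append_classpath` is hypothetical.

```shell
# Append entries to a colon-separated classpath without creating an empty
# leading element when the existing value is unset or empty.
append_classpath() {
    current=$1
    shift
    for entry in "$@"; do
        current="${current:+${current}:}${entry}"
    done
    echo "$current"
}

# Possible use in hive-env.sh (the quoted wildcards are passed through
# literally and expanded later by the JVM):
# export HADOOP_CLASSPATH=$(append_classpath "${HADOOP_CLASSPATH:-}" \
#     "${TEZ_CONF_DIR}" "${TEZ_JARS}/*" "${TEZ_JARS}/lib/*")
```

The `${current:+...}` expansion inserts the separator only when there is already something to separate.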

6) Start or restart Hadoop

7) Start the Hive CLI (or start the hiveserver service and beeline) and set Hive's execution engine

hive> set hive.execution.engine=tez;

8) Run a test query and check the output

hive> SELECT deptno, avg(sal) as avg_sal FROM emp group by deptno;
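
The same test can be run non-interactively, which is handy for comparing engines from a script. The sketch below only builds the HiveQL text; the function name `with_engine` is hypothetical, and the accepted engine names are the ones listed in the description of hive.execution.engine (mr, tez, spark).

```shell
# Build a HiveQL script that selects the execution engine, then runs a query.
with_engine() {
    engine=$1
    query=$2
    case "$engine" in
        mr|tez|spark) ;;
        *) echo "unsupported engine: $engine" >&2; return 1 ;;
    esac
    printf 'set hive.execution.engine=%s;\n%s\n' "$engine" "$query"
}

# Example (requires a working Hive CLI on PATH):
# hive -e "$(with_engine tez 'SELECT deptno, avg(sal) AS avg_sal FROM emp GROUP BY deptno;')"
```

Running the same query once with `tez` and once with `mr` gives a quick side-by-side check that both engines still work.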

9) Make Tez the default engine for Hive's MR jobs (optional)

Set hive.execution.engine to tez in hive/conf/hive-site.xml so that Tez is used by default:

    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
        <description>
            Expects one of [mr, tez, spark].
            Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
            remains the default engine for historical reasons, it is itself a historical engine
            and is deprecated in Hive 2 line. It may be removed without further warning.
        </description>
    </property>

10) Switch back to MR for job execution (optional)

hive> set hive.execution.engine=mr;

End~
