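In this post I cover table optimization, one part of Hive performance tuning.

I. Small Table and Large Table Joins

Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join; this effectively lowers the chance of out-of-memory errors. Going a step further, you can use a map join to load small dimension tables (fewer than 1000 records) into memory first and finish the join on the map side.
Note: testing in practice shows that newer versions of Hive already optimize both small table JOIN large table and large table JOIN small table, so placing the small table on the left or on the right no longer makes a noticeable difference.

Hands-on example

1. Requirement
Compare the efficiency of large table JOIN small table with that of small table JOIN large table.
2. Statements that create the large table, the small table, and the table holding the join result

-- Create the large table
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the small table
create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the table that holds the join result
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

3. Load data into the large table and the small table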

hive (default)> load data local inpath '/opt/module/datas/bigtable' into table bigtable;
hive (default)> load data local inpath '/opt/module/datas/smalltable' into table smalltable;

4. Disable the automatic map join feature (it is enabled by default)

set hive.auto.convert.join = false;

5. Run the small table JOIN large table statement

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable  b
on b.id = s.id;

6. Run the large table JOIN small table statement

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable  b
left join smalltable  s
on s.id = b.id;

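As we can see, the two run times are about the same, which confirms the conclusion above.

II. Large Table Join Large Table

2.1 Null Key Filtering

Sometimes a join times out because certain keys have too much data: rows with the same key are all sent to the same reducer, which then runs out of memory. In that case we should analyze these abnormal keys carefully. Very often the data behind them is abnormal data that should be filtered out in the SQL statement, for example rows whose key column is null. The steps are as follows.
Hands-on example:

1. Configure the history server

Configure mapred-site.xml:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop001:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop001:19888</value>
</property>

Start the history server:
sbin/mr-jobhistory-daemon.sh start historyserver

View the job history at http://hadoop001:19888/jobhistory

2. Create the original data table, the null-id table, and the table holding the join result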

-- Create the original table
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the null-id table
create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
-- Create the table that holds the join result (a no-op if it already exists from Part I)
create table if not exists jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

3. Load the original data and the null-id data into their tables

hive (default)> load data local inpath '/opt/module/datas/ori' into table ori;
hive (default)> load data local inpath '/opt/module/datas/nullid' into table nullidtable;

4. Test without filtering null ids

hive (default)> insert overwrite table jointable select n.* from nullidtable n
left join ori o on n.id = o.id;

5. Test with null ids filtered out

hive (default)> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null ) n  left join ori o on n.id = o.id;

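2.2 Null Key Conversion

Sometimes a null key has many rows, but those rows are not abnormal data and must appear in the join result. In that case we can assign a random value to the null key column in table a, so that the rows are spread randomly and evenly across different reducers.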
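To make the effect concrete, here is a minimal Python sketch (not Hive code) of how rows are assigned to reducers by join key. The `reducer_for` and `salt_null` helpers, the sample data, and the rule that null keys land on reducer 0 are assumptions made for illustration; Hive's real partitioner differs in detail, but identical keys ending up on one reducer is the behavior being modeled.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 5  # mirrors `set mapreduce.job.reduces = 5;`


def reducer_for(key):
    """Toy partitioner: every null key lands on the same reducer (illustrative only)."""
    if key is None:
        return 0
    return zlib.crc32(str(key).encode("utf-8")) % NUM_REDUCERS


def salt_null(key, rng):
    """Replace a null key with a random string, in the spirit of concat('hive', rand())."""
    return "hive" + str(rng.random()) if key is None else key


rng = random.Random(42)
# Hypothetical data: 1000 rows with a null id plus 1000 rows with real ids.
ids = [None] * 1000 + list(range(1000))

plain_load = Counter(reducer_for(k) for k in ids)
salted_load = Counter(reducer_for(salt_null(k, rng)) for k in ids)

print("rows per reducer without salting:", sorted(plain_load.items()))
print("rows per reducer with salting:   ", sorted(salted_load.items()))
```

Because the salted keys never match a real id, they change only where the rows are processed, not which rows match.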

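Hands-on example:

1. Without randomizing the null keys:

(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN the two tables

insert overwrite table jointable
select n.* from nullidtable n left join ori b on n.id = b.id;

Result: as the figure below shows, data skew appears, and some reducers consume far more resources than the others.

2. With randomized null keys

(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN the two tables

-- Both sides are cast to string so that the random 'hive...' key is compared as a string
insert overwrite table jointable
select n.* from nullidtable n full join ori o on
case when n.id is null then concat('hive', rand()) else cast(n.id as string) end = cast(o.id as string);

Result: as the figure below shows, the data skew is gone and resource consumption is balanced across the reducers.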

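III. MapJoin (Small Table Join Large Table)

If MapJoin is not specified, or the conditions for MapJoin are not met, Hive's compiler turns the join into a Common Join, that is, the join is completed in the reduce phase, where data skew occurs easily. With MapJoin, the small table is loaded entirely into memory and the join is done on the map side, which avoids reducer processing.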
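The idea behind a map-side join can be sketched in a few lines of Python. This is only an illustration of the technique (an in-memory hash join), not Hive's implementation; the `map_join` function and the sample rows are hypothetical.

```python
def map_join(small_rows, big_rows, key):
    """Join two lists of dict rows on `key` without any shuffle step."""
    # Build phase: the whole small table is loaded into an in-memory hash table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: each mapper streams its slice of the big table and
    # finds matches locally, so no reducer is involved.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield small, big


# Hypothetical sample rows.
small = [{"id": 1, "keyword": "hive"}, {"id": 2, "keyword": "spark"}]
big = [{"id": 1, "click_num": 10}, {"id": 1, "click_num": 7}, {"id": 3, "click_num": 4}]

joined = list(map_join(small, big, "id"))
for small_row, big_row in joined:
    print(small_row, big_row)
```

The approach only works while the small table fits in memory, which is why Hive guards it with a size threshold.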

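3.1 Enabling MapJoin Settings

1. Let Hive choose MapJoin automatically

set hive.auto.convert.join = true; -- default is true

2. Threshold between small and large tables (by default a table under about 25 MB counts as small):

set hive.mapjoin.smalltable.filesize=25000000;

3.2 How MapJoin Works

Hands-on example:

1. Enable the MapJoin feature

set hive.auto.convert.join = true; -- default is true

2. Run the small table JOIN large table statement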

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable  b
on s.id = b.id;

3. Run the large table JOIN small table statement

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable  b
join smalltable  s
on s.id = b.id;

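IV. Group By

By default, all rows with the same key in the map phase are sent to a single reducer, so when one key has too much data the job becomes skewed.

Not every aggregation has to be completed on the reduce side. Many aggregations can be partially done on the map side first, with the final result produced on the reduce side.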
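As a rough illustration of partial aggregation, the Python sketch below counts keys per input split before merging. The function names and the three sample splits are hypothetical, and the sketch models only the idea, not Hive's operators.

```python
from collections import Counter


def map_side_aggregate(split):
    """Partial aggregation inside one mapper: one output record per distinct key."""
    return Counter(split)


def reduce_merge(partials):
    """Final aggregation on the reduce side: merge the partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input splits, one list of group-by keys per mapper.
splits = [["a", "a", "a", "b"], ["a", "a", "c"], ["a", "b", "b"]]

partials = [map_side_aggregate(split) for split in splits]
shuffled_without = sum(len(split) for split in splits)    # records sent with no map-side aggregation
shuffled_with = sum(len(partial) for partial in partials)  # records sent after map-side aggregation
final = reduce_merge(partials)

print(shuffled_without, shuffled_with, dict(final))
```

Fewer records cross the network, and a hot key arrives at its reducer as a handful of partial counts instead of every raw row.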

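Settings for map-side aggregation

1. Whether to aggregate on the map side (default is true)

set hive.map.aggr = true;

2. Row interval at which map-side aggregation is checked

set hive.groupby.mapaggr.checkinterval = 100000;

3. Balance the load when the data is skewed (default is false)

set hive.groupby.skewindata = true;

When this option is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly among the reducers, and each reducer performs a partial aggregation and emits its result. Because rows with the same Group By key may go to different reducers, the load is balanced. The second job then distributes the pre-aggregated results to reducers by Group By key (which guarantees that the same Group By key goes to the same reducer) and completes the final aggregation.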
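The two-job plan described above can be mimicked with a small Python sketch. It is an assumption-laden model rather than Hive's code: `stage_one`, `stage_two`, the 4-reducer setup, and the skewed sample keys are all made up for illustration.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 4


def stage_one(keys, rng):
    """Job 1: scatter rows randomly, then each reducer aggregates what it received."""
    buckets = [Counter() for _ in range(NUM_REDUCERS)]
    for key in keys:
        buckets[rng.randrange(NUM_REDUCERS)][key] += 1
    return buckets


def stage_two(partials):
    """Job 2: route partial results by group-by key and finish the aggregation."""
    buckets = [Counter() for _ in range(NUM_REDUCERS)]
    for partial in partials:
        for key, count in partial.items():
            target = zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS
            buckets[target][key] += count
    return buckets


rng = random.Random(0)
# Hypothetical skewed input: one hot key plus many rare keys.
keys = ["hot"] * 9000 + ["cold_%d" % i for i in range(1000)]

stage1 = stage_one(keys, rng)
stage2 = stage_two(stage1)

final = Counter()
for bucket in stage2:
    final.update(bucket)

print("rows per reducer in job 1:", [sum(b.values()) for b in stage1])
print("count for the hot key:", final["hot"])
```

In job 1 no reducer sees all 9000 rows of the hot key; in job 2 the reducer that owns the hot key receives at most one partial count per job 1 reducer.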

Before optimization

hive (default)> select student.name from student group by student.name;

After optimization

hive (default)> set hive.groupby.skewindata = true;
hive (default)> select student.name from student group by student.name;

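V. Count(Distinct) Deduplicated Counts

With small data volumes this does not matter. With large data volumes, however, COUNT DISTINCT is a full aggregation, so even if you set the number of reduce tasks (for example set mapred.reduce.tasks=100;), Hive starts only one reducer. That single reducer then has too much data to process, and the whole job struggles to finish. COUNT DISTINCT is therefore usually replaced by GROUP BY followed by COUNT:

Hands-on example:

1. Create a large table (skip steps 1 and 2 if bigtable already exists and is loaded from Part I)

hive (default)> create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

2. Load the data

hive (default)> load data local inpath '/opt/module/datas/bigtable' into table bigtable;

3. Set the number of reducers to 5

set mapreduce.job.reduces = 5;

4. Run the distinct-id query

hive (default)> select count(distinct id) from bigtable;

5. Deduplicate ids with GROUP BY

hive (default)> select count(id) from (select id from bigtable group by id) a;

Although this takes one extra job, it is well worth it when the data volume is large.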
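Why the rewrite gives the same answer can be seen in a short Python sketch. The two functions and the sample ids are hypothetical; the sketch only models how the work is split, with a toy hash standing in for Hive's partitioner.

```python
import zlib

NUM_REDUCERS = 5  # mirrors `set mapreduce.job.reduces = 5;`


def count_distinct_single_reducer(ids):
    """count(distinct id): one reducer has to see and deduplicate every row."""
    return len(set(ids))


def count_distinct_via_group_by(ids):
    """Job 1: group by id across several reducers. Job 2: count the groups."""
    groups = [set() for _ in range(NUM_REDUCERS)]
    for value in ids:
        groups[zlib.crc32(str(value).encode("utf-8")) % NUM_REDUCERS].add(value)
    # Each id ends up on exactly one reducer, so the per-reducer counts add up.
    return sum(len(group) for group in groups)


# Hypothetical data: 100000 rows containing 1000 distinct ids.
ids = [i % 1000 for i in range(100_000)]

single = count_distinct_single_reducer(ids)
grouped = count_distinct_via_group_by(ids)
print(single, grouped)
```

The deduplication work is shared by all reducers in the first job, and the second job only has to count rows that are already unique.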

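VI. Cartesian Product

Avoid Cartesian products as far as possible. When a join has no on condition, or an invalid one, Hive can use only one reducer to compute the Cartesian product.

VII. Row and Column Filtering

Column handling: in SELECT, take only the columns you need; if the table is partitioned, filter on the partition whenever possible, and avoid SELECT *.
Row handling: for partition pruning with an outer join, if the filter condition on the secondary table is written after where, the full tables are joined first and the filter is applied afterwards.

Hands-on example:

1. Join the two tables first, then filter with a where condition

hive (default)> select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;

Time taken: 34.406 seconds, Fetched: 100 row(s)

2. Filter in a subquery first, then join

hive (default)> select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;

Time taken: 30.058 seconds, Fetched: 100 row(s)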
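The difference between the two query shapes is how many rows enter the join, which the following Python sketch makes visible. The `inner_join` helper and the sample ids are assumptions for illustration; the sketch also assumes ids are unique in the right-hand table. For an inner join like this one both shapes return the same rows, and Hive's optimizer may push such a predicate down on its own, which could be why the two timings above are close.

```python
def inner_join(left_ids, right_ids):
    """Toy inner join on id; assumes ids in the right table are unique."""
    right = set(right_ids)
    return [i for i in left_ids if i in right]


# Hypothetical tables that contain only an id column.
bigtable_ids = list(range(1, 10001))
ori_ids = list(range(1, 10001))

# Shape 1: join everything first, then filter.
joined = inner_join(bigtable_ids, ori_ids)
late = [i for i in joined if i <= 10]

# Shape 2: filter ori first, then join.
ori_filtered = [i for i in ori_ids if i <= 10]
early = inner_join(bigtable_ids, ori_filtered)

print("rows from ori entering the join:", len(ori_ids), "vs", len(ori_filtered))
print("same result:", late == early)
```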

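VIII. Dynamic Partition Tuning

In a relational database, when you insert into a partitioned table, the database automatically places each row into the right partition according to the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be configured before use.

8.1 Settings for Enabling Dynamic Partitioning

1. Enable dynamic partitioning (default is true, enabled)

hive.exec.dynamic.partition=true

2. Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic)

hive.exec.dynamic.partition.mode=nonstrict

3. Maximum number of dynamic partitions that can be created in total across all nodes running the MR job. Default 1000

hive.exec.max.dynamic.partitions=1000

4. Maximum number of dynamic partitions that can be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers one year, the day column has 365 distinct values, so this parameter must be set above 365; with the default of 100 the job fails with an error.

hive.exec.max.dynamic.partitions.pernode=100

5. Maximum number of HDFS files that can be created by the whole MR job. Default 100000

hive.exec.max.created.files=100000

6. Whether to throw an exception when an empty partition is produced. Usually there is no need to set this. Default false

hive.error.on.empty.partition=false

8.2 Hands-on Example

Requirement: insert the rows of the dept table into the matching partitions of the target table dept_partition according to region (the loc column).

1. Create the target partitioned table

hive (default)> create table dept_partition(id int, name string) partitioned
by (location int) row format delimited fields terminated by '\t';

2. Configure dynamic partitioning and insert the data

set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition partition(location) select deptno, dname, loc from dept;

3. Check the partitions of the target table

hive (default)> show partitions dept_partition;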
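What the dynamic-partition insert does with each row can be modeled in a few lines of Python. This is a sketch under assumptions: the `dynamic_partition_insert` function, the sample dept rows, and the error raised are hypothetical, and the limit only imitates the role of hive.exec.max.dynamic.partitions.pernode.

```python
from collections import defaultdict

MAX_DYNAMIC_PARTITIONS_PERNODE = 100  # imitates hive.exec.max.dynamic.partitions.pernode


def dynamic_partition_insert(rows, limit=MAX_DYNAMIC_PARTITIONS_PERNODE):
    """Route each (deptno, dname, loc) row to a partition named after its last column."""
    partitions = defaultdict(list)
    for deptno, dname, loc in rows:
        partitions[f"location={loc}"].append((deptno, dname))
        if len(partitions) > limit:
            raise RuntimeError("too many dynamic partitions")
    return dict(partitions)


# Hypothetical contents of the dept table.
dept = [
    (10, "ACCOUNTING", 1700),
    (20, "RESEARCH", 1800),
    (30, "SALES", 1900),
    (40, "OPERATIONS", 1700),
]

result = dynamic_partition_insert(dept)
for name in sorted(result):
    print(name, result[name])
```

The partition value comes from the last column of the select list (loc here), and a too-small limit makes the insert fail instead of silently dropping rows.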

That's all for this post.


Hive Quick Start Series (15) | Hive Performance Tuning [2]: Table Optimization
