大數據實戰-Hive-技巧實戰_2LgaeiFwLs7mCTwG5T3c9M

原創

2023-09-04 13:27

大數據實戰-Hive-技巧實戰_2LgaeiFwLs7mCTwG5T3c9M

大數據實戰-Hive-技巧實戰

1.union 和 union all

前者可以去重

select sex,address from test where dt='20210218' union all select sex,address from test where dt='20210218';
+------+----------+--+
| sex  | address  |
+------+----------+--+
| m    | A        |
| m    | A        |
| m    | B        |
| m    | B        |
| m    | B        |
| m    | B        |
+------+----------+--+

後者不會去重

select sex,address from test where dt='20210218' union select sex,address from test where dt='20210218';
+------+----------+--+
| sex  | address  |
+------+----------+--+
| m    | A        |
| m    | B        |
+------+----------+--+

2.sql後面的distribute by , sort by的作用

3.分桶表

clustered by (sno) sorted by (age desc) into 4 buckets

傳入數據只能用insert into /overwrite

2.1.1版本設置了強制分桶操作，因此人爲的修改reduce的個數不會影響最終文件的個數(文件個數由桶數決定)
–1. 在2.1.1版本里，底層實現了強制分桶，強制排序策略
– 即：正規寫法要帶上distribute by(分桶字段)[sort by 排序字段]，如果沒有帶上，也會分桶和排序。
–2. 使用insert into時可以不加關鍵字table. 使用insert overwrite時必須帶關鍵字table.
–3. 因爲底層實行了強制分桶策略，所以修改mapreduce.job.reduces的個數，不會影響桶文件數據。但是會影響真正執行時reduceTask的數量。是真正的reduceTask的數量是最接近mapreduce.job.reduces的數量的因子。如果是素數，就使用本身

4.動態分區小文件和OOM優化

INSERT OVERWRITE TABLE ris_relation_result_prod partition(rel_id)
SELECT get_json_object(relation, '$.relationHashcode') AS relation_hashcode,
get_json_object(relation, '$.targetVariableValue') AS target_variable_value,
get_json_object(relation, '$.relId') AS rel_id
FROM ris_relation_old_prod733 where get_json_object(relation, '$.relId') in (**********)

set hive.optimize.sort.dynamic.partition=true;

https://blog.csdn.net/lzw2016/article/details/97818080

5.hive 需要開闢很多內存的問題解決

https://blog.csdn.net/qq26442553/article/details/89343579

問題1: Hive/MR 任務報內存溢出

running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.9 GB of 4.2 GB virtual memory used. Killing container。

內存調優參數：https://blog.csdn.net/snzzy/article/details/43115681

6.hive的一些sql優化

https://blog.csdn.net/kwuganymede/article/details/51365002

map join優化

7.Hive插入小文件被kill現象

在hive 插入數據動態分區時候會產生很多小文件，被kill, 如下圖1, 另外在GC overhead limit execded

8.Hive處於Block狀態，超時

mapreduce.task.timeout

如果一個task在一定時間內沒有任何進入，即不會讀取新的數據，也沒有輸出數據，則認爲該 task 處於 block 狀態，可能是臨時卡住，也許永遠會卡住。爲了防止因爲用戶程序永遠 block 不退出，則強制設置了一個超時時間（單位毫秒），默認是600000，值爲 0 將禁用超時

9.Hive窗口函數不能大小寫混亂

max( ) over ( partition by prcid order by b.occurtime desc ) 不能大小寫混亂

10.hive客戶端日誌

hive --verbos=true
hive --hiveconf hive.root.logger=DEBUG,console

11.~/.beeline/history && ~/.hivehistory在2.1版本下，會oom

導致客戶端執行命令時候直接卡住，解決方式刪除或者移動備份這個文件

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

大數據實戰-Hive-技巧實戰_2LgaeiFwLs7mCTwG5T3c9M

大數據實戰-Hive-技巧實戰_2LgaeiFwLs7mCTwG5T3c9M

大數據實戰-Hive-技巧實戰

1.union 和 union all

2.sql後面的distribute by , sort by的作用

3.分桶表

4.動態分區小文件和OOM優化

5.hive 需要開闢很多內存的問題解決

6.hive的一些sql優化

7.Hive插入小文件被kill現象

8.Hive處於Block狀態，超時

9.Hive窗口函數不能大小寫混亂

10.hive客戶端日誌

11.~/.beeline/history && ~/.hivehistory在2.1版本下，會oom

使用Nginx做頁面採集, Kafka收集到對應Topic_6XwWe5qWHGM2PojVPUSejM

大數據開發-從Scala到Akka併發編程_jDW32G3c87fjEBtYNE7Z7f

大數據實戰-Hive-技巧實戰_2LgaeiFwLs7mCTwG5T3c9M

大數據開發-Go-新手常遇問題

大數據開發-Go-數組，切片

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結