Dynamic partitioning and the small-files problem when writing a Spark DataFrame to Hive

Original

2020-06-27 12:33

Five points to note

Hive dynamic partitioning requires non-strict mode

set hive.exec.dynamic.partition.mode=nonstrict
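When the write is issued from Spark rather than from the Hive CLI, the same setting can be passed through the SparkSession. The following is a minimal sketch, assuming Spark 2.x or later with Hive support on the classpath; the application name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DynamicPartitionConfig {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-partition-sketch") // hypothetical name, for illustration only
      .enableHiveSupport()
      // Allow dynamic partitions and lift the requirement for a static partition key.
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()

    // The same property can also be set at runtime with a SQL statement.
    spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

    spark.stop()
  }
}
```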

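The insertInto approach cannot be used for this kind of partitioned load: insertInto() cannot be combined with partitionBy(), and it requires the target table to exist already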

The difference between saveAsTable and insertInto
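The contrast between the two writers can be sketched as follows. This is an illustration rather than the author's code: the object and method names are hypothetical, and the DataFrame is assumed to contain a bdp_day column.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object WriteModes {
  // saveAsTable: creates the table when it is missing, and the partition
  // layout comes from partitionBy on the writer.
  def viaSaveAsTable(df: DataFrame, table: String): Unit =
    df.write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .partitionBy("bdp_day")
      .saveAsTable(table)

  // insertInto: the table must already exist, columns are matched by
  // position (not by name), and partitionBy must not be called because
  // the partition columns are already defined on the table.
  def viaInsertInto(df: DataFrame, table: String): Unit =
    df.write
      .mode(SaveMode.Append)
      .insertInto(table)
}
```

Because insertInto matches columns by position, the DataFrame's column order has to follow the table definition, with the partition column last.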

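The partition column must be a column of the table being written

For example, if a subject table should keep the ct field as a Long and also be partitioned by bdp_day, a String derived from ct,
then the table must contain both the ct column and the bdp_day column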
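Deriving bdp_day from ct can be sketched in plain Scala as below. It assumes ct holds epoch milliseconds and formats the day in UTC; both are assumptions, and the object and method names are hypothetical.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object BdpDay {
  // Formatter with a fixed zone so the result does not depend on the JVM default.
  private val dayFormat: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // ct is assumed to be epoch milliseconds.
  def fromMillis(ct: Long): String =
    dayFormat.format(Instant.ofEpochMilli(ct))
}
```

Each output row then carries both values, for example (ct, BdpDay.fromMillis(ct)), so the original Long is kept alongside the String partition key.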

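Each partition written to HDFS may contain many small files

Before writing, call DF.repartition(xx)      // choose xx from the actual data size; fewer DataFrame partitions mean fewer files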
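One way to pick the repartition count is to divide the estimated output size by a target file size. The helper below is a hypothetical sketch; the 128 MB target mentioned afterwards is an assumption, not a value from the original post.

```scala
object PartitionSizing {
  // Number of partitions needed so that each output file is at most
  // targetFileBytes, rounded up and never less than 1.
  def targetPartitions(totalBytes: Long, targetFileBytes: Long): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetFileBytes > 0, "targetFileBytes must be positive")
    val n = (totalBytes + targetFileBytes - 1) / targetFileBytes // ceiling division
    math.max(1L, math.min(n, Int.MaxValue.toLong)).toInt
  }
}
```

With a 128 MB target, roughly 300 MB of data gives 3, which would then be passed as DF.repartition(3). When the write also uses partitionBy, each task writes one file per partition value it holds, so this count is an upper bound on the files per partition directory.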

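The table does not need to exist in Hive beforehand

If you do want to create the target table manually in Hive, the partition column must be included in the table definition.
In addition, the table does not have to be declared as a partitioned table; it is enough to list the partition column among the columns in the CREATE TABLE statement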

An example:


import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.asc
import org.joda.time.DateTime
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

df.map(perline => {
    val release_session: String = perline.getAs[String]("release_session")
    val release_status: String = perline.getAs[String]("release_status")
    val device_num: String = perline.getAs[String]("device_num")
    val device_type: String = perline.getAs[String]("device_type")
    val ct: Long = perline.getAs[Long]("ct")
    // Assumes ct is epoch milliseconds; the ISO-8601 string starts with yyyy-MM-dd
    val bdp_day: String = new DateTime(ct).toString().substring(0, 10)
    (release_session, release_status, device_num, device_type, ct, bdp_day)
  }).toDF("release_session", "release_status", "device_num", "device_type", "ct", "bdp_day")
  .sort(asc("ct")).cache()
  .repartition(2) // at most 2 files per bdp_day partition
  .write.mode(SaveMode.Overwrite).format("parquet")
  .partitionBy("bdp_day")
  .saveAsTable("dw_release.dw_release_customer")
