Problems Encountered When Migrating Hive SQL to Spark SQL

Merging small files

Spark SQL defaults to 200 shuffle partitions (spark.sql.shuffle.partitions), so jobs can easily produce a large number of small files.
Ways to merge small files (a combined sketch follows the list):

  • 1. Set the adaptive-execution parameters so Spark merges small post-shuffle partitions
    • spark.sql.adaptive.enabled=true
    • spark.sql.adaptive.shuffle.targetPostShuffleInputSize
  • 2. Use a hint (Spark 2.4+)
    • INSERT … SELECT /*+ COALESCE(numPartitions) */ …
    • INSERT … SELECT /*+ REPARTITION(numPartitions) */ …
  • 3. DISTRIBUTE BY
    • distribute by partition_column
    • distribute by partition_column, cast(rand() * N as int)
    • distribute by cast(rand() * N as int)
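A minimal sketch combining the three approaches (the table names, the dt column, and the values 134217728 and 10 are hypothetical):

-- 1. Adaptive execution: coalesce post-shuffle partitions up to ~128 MB each
set spark.sql.adaptive.enabled=true;
set spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728;

-- 2. Hint (Spark 2.4+): write the result with a fixed number of partitions
INSERT OVERWRITE TABLE db_name.target_table
SELECT /*+ REPARTITION(10) */ * FROM db_name.source_table;

-- 3. DISTRIBUTE BY: shuffle rows by the partition column plus a bounded random key,
--    so each value of dt ends up in at most 10 output files
INSERT OVERWRITE TABLE db_name.target_table
SELECT * FROM db_name.source_table
DISTRIBUTE BY dt, cast(rand() * 10 as int);
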
1. Different DDL for ES external tables

Hive

-- hive
ADD JAR elasticsearch-hadoop-6.0.0.jar;

create external table db_name.table_name (
    ...
) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.resource' = '',
    'es.nodes'='',
    'es.port'='',
    'es.mapping.id' = ''
);

Spark SQL

-- spark sql
ADD JAR elasticsearch-spark-20_2.11-6.4.2.jar;

create table db_name.table_name
using org.elasticsearch.spark.sql
options (resource '', nodes '');
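Once created, the ES-backed table can be queried like any other Spark SQL table, e.g.:

-- read from Elasticsearch through the datasource table created above
SELECT count(*) FROM db_name.table_name;
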
2. Table metadata has changed: REFRESH TABLE
Error message:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

Solution: spark-sql> REFRESH TABLE table_name;

3. Missing jodd-core-3.5.2.jar
Error message:
Caused by: java.lang.NoClassDefFoundError: jodd/datetime/JDateTime

Solution: locate jodd-core-3.5.2.jar and put it on HDFS so Spark jobs can use it.
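A minimal sketch, assuming the jar has been uploaded to a hypothetical HDFS path:

-- make the missing jodd classes available to the current spark-sql session
ADD JAR hdfs:///user/spark/jars/jodd-core-3.5.2.jar;
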

4. Using rand() in an ON condition
Error message:
Error in query: nondeterministic expressions are only allowed in ..

Example: in Hive, a join key with many NULL values causes serious data skew.
Hive workaround: nvl(t1.column_name, rand() * -9999) = t2.column_name
In Spark SQL this raises the error above; the straightforward fix is to join on t1.column_name = t2.column_name directly (a skew-preserving alternative is sketched below).
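If the skew mitigation is still needed, one possible workaround (not from the original post; it assumes column_name is a string column and that real keys never start with 'skew_') is to compute the randomized surrogate key in a subquery, so the ON condition itself stays deterministic:

SELECT t1.*, t2.*
FROM (
    -- rand() is allowed here: nondeterministic expressions are fine in a projection
    SELECT *,
           nvl(column_name, concat('skew_', cast(cast(rand() * -9999 as int) as string))) AS join_key
    FROM db_name.t1
) t1
LEFT JOIN db_name.t2
    ON t1.join_key = t2.column_name;
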

5. Column references without a table alias
Error message:
Error in query: Reference 'device_type' is ambiguous, could be: t1.device_type, t2.device_type..
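Solution: qualify (or alias) the ambiguous column explicitly. A minimal sketch (table names and the device_id join key are hypothetical):

-- qualify device_type with a table alias so the reference is no longer ambiguous
SELECT t1.device_type AS device_type
FROM db_name.t1 t1
JOIN db_name.t2 t2
    ON t1.device_id = t2.device_id;
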
6. Out of memory
Error message:
Exception in thread "IPC Client (1723498053) connection to localhost:8032 from pcsjob" java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedReader.<init>(BufferedReader.java:105)
7. Filtering on a column with the wrong type

For example, user_id is a STRING column.
Filtering with user_id > 0 drops all rows. The correct approach is cast(user_id as bigint) > 0.
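A minimal before/after sketch (the table name is hypothetical):

-- problematic: user_id is a STRING, so per the note above this filters out every row
SELECT * FROM db_name.t WHERE user_id > 0;

-- correct: cast to a numeric type first, then compare
SELECT * FROM db_name.t WHERE cast(user_id as bigint) > 0;
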

8. CREATE EXTERNAL TABLE statements
Error message:
CREATE EXTERNAL TABLE must be accompanied by LOCATION(line 1, pos 0)

In the Hive client an external table can be created without specifying a LOCATION (HDFS directory), but in the spark-sql client the LOCATION must be specified.
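A minimal sketch for spark-sql (the schema and HDFS path are hypothetical):

CREATE EXTERNAL TABLE db_name.table_name (
    user_id string
)
STORED AS parquet
LOCATION 'hdfs:///apps/hive/warehouse/db_name.db/table_name';
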

9. Spark and Hive handle decimal precision differently in Parquet

Hive uses int32.
Spark follows the standard Parquet layout: 1 <= precision <= 9 is stored as int32, 10 <= precision <= 18 as int64.

Error message:
Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://tesla-cluster/apps/hive/warehouse/db_name.db/table_name/part-00161-cb38abd4-cc8b-4086-bce0-f1b7ee80f80c-c000.snappy.parquet
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)

Solution:
set spark.sql.parquet.writeLegacyFormat=true;
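The flag changes how Spark writes decimals, so set it before (re)writing the data; a minimal sketch with hypothetical table names:

-- write Parquet decimals in the legacy, Hive-compatible layout
set spark.sql.parquet.writeLegacyFormat=true;
INSERT OVERWRITE TABLE db_name.table_name
SELECT * FROM db_name.source_table;
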
10. Too many Spark applications started at once
Error message:
java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries (starting from 4040)! Consider explicitly setting the appropriate port for the service 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or incr

Solution:
Add spark.port.maxRetries 100 to spark-defaults.conf
11. Cartesian product
Error message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT OUTER join between logical plans LocalLimit

Solution:
set spark.sql.crossJoin.enabled=true
12. Broadcast join
Error message:
java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value

Solution:
set spark.sql.autoBroadcastJoinThreshold=-1