Test data
name | course | score |
---|---|---|
Darren | Chinese | 71 |
Darren | Math | 81 |
Darren | English | 91 |
Jonathan | Chinese | 72 |
Jonathan | Math | 82 |
Jonathan | English | 92 |
Tom | Chinese | 73 |
Rows to columns (PIVOT)
Syntax:
SELECT
xxx
FROM
table_test
PIVOT(
aggregate_function(value_column) FOR pivot_column IN (<column_list>)
)
Example:
SELECT
*
FROM row_table
PIVOT(
MAX(score) FOR course in ('Chinese', 'Math', 'English')
)
Result:
name | Chinese | Math | English |
---|---|---|---|
Darren | 71 | 81 | 91 |
Jonathan | 72 | 82 | 92 |
Tom | 73 | null | null |
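To make the semantics concrete, here is a minimal plain-Python sketch (no Spark involved) of what the PIVOT query above computes on the test data. The function name `pivot_max` and the tuple layout are illustrative assumptions, not part of any Spark API: rows are grouped by the remaining column (`name`), `MAX(score)` is taken for each listed course, and a missing combination becomes `None`, which corresponds to `null` in the result table.

```python
from collections import defaultdict

# Test data from the top of this post: (name, course, score)
rows = [
    ("Darren", "Chinese", 71),
    ("Darren", "Math", 81),
    ("Darren", "English", 91),
    ("Jonathan", "Chinese", 72),
    ("Jonathan", "Math", 82),
    ("Jonathan", "English", 92),
    ("Tom", "Chinese", 73),
]


def pivot_max(rows, pivot_values):
    """Hypothetical helper: group by name, keep MAX(score) per listed course."""
    grouped = defaultdict(dict)
    for name, course, score in rows:
        cells = grouped[name]  # every name gets an output row
        if course in pivot_values and (course not in cells or score > cells[course]):
            cells[course] = score
    # A course with no matching row is reported as None (null in Spark)
    return [
        (name, *[cells.get(course) for course in pivot_values])
        for name, cells in grouped.items()
    ]


pivoted = pivot_max(rows, ["Chinese", "Math", "English"])
for row in pivoted:
    print(row)
```

Tom has only a Chinese score, so his Math and English cells come out as `None`, matching the `null` cells in the result table.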
Columns to rows (unpivot)
Spark SQL before version 3.4 has no UNPIVOT clause, so stack() is used to turn columns into rows.
Syntax:
SELECT
STACK
(
row_number,
'column1_value', column1_name,
...,
'columnn_value', columnn_name
) as (new_column1_name, new_column2_name)
Example:
SELECT
name
, STACK
(
3,
'Chinese', Chinese,
'Math', Math,
'English', English
) as (course, score)
FROM col_table
Result:
name | course | score |
---|---|---|
Darren | Chinese | 71 |
Darren | Math | 81 |
Darren | English | 91 |
Jonathan | Chinese | 72 |
Jonathan | Math | 82 |
Jonathan | English | 92 |
Tom | Chinese | 73 |
Tom | Math | null |
Tom | English | null |
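Note: compared with the original table, this result has two extra rows for Tom whose score is null. Filtering out the null scores gives back the original table: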
spark.sql("""
SELECT
name
, STACK
(
3,
'Chinese', Chinese,
'Math', Math,
'English', English
) as (course, score)
FROM col_table
""").where("score is not null")
This yields the same result as the original table.
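The stack-then-filter round trip can be checked with a short plain-Python sketch (again without Spark). The helper name `stack_courses` and the tuple layout are assumptions made for illustration only: each wide row is expanded into one row per course label, and rows whose score is `None` are then dropped, mirroring `.where("score is not null")`.

```python
# Pivoted table from the PIVOT example: (name, Chinese, Math, English)
col_table = [
    ("Darren", 71, 81, 91),
    ("Jonathan", 72, 82, 92),
    ("Tom", 73, None, None),
]


def stack_courses(table, labels):
    """Hypothetical helper: emit one (name, course, score) row per label."""
    out = []
    for name, *scores in table:
        for course, score in zip(labels, scores):
            out.append((name, course, score))
    return out


unpivoted = stack_courses(col_table, ["Chinese", "Math", "English"])

# Equivalent of .where("score is not null")
filtered = [row for row in unpivoted if row[2] is not None]

print(len(unpivoted), len(filtered))
```

Before filtering there is one row per name and course, including Tom's two empty scores; after filtering only the rows present in the original test data remain.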
References:
https://queirozf.com/entries/spark-dataframe-examples-pivot-and-unpivot-data
https://sparkbyexamples.com/spark/how-to-pivot-table-and-unpivot-a-spark-dataframe/