Strictly speaking, lateral view and the explode function should not exist in a database at all, since they violate first normal form (every attribute must be atomic). In practice, however, big-data systems often store low-frequency but non-discardable data as JSON, and in those scenarios you need to parse the JSON and flatten its arrays and multi-key maps.
Initialize some sample data
data = [
    {
        "id": 1,
        "name": "XiaoHua",
        "age": 12,
        "interests": "game,read,tv",
        "interests_socre": {"game": 8, "read": 7, "tv": 8},
        "scores": {
            "scores": [
                {"subject": "math", "score": 80},
                {"subject": "language", "score": 90},
                {"subject": "sports", "score": 70},
            ],
            "count": 3,
        },
        "scores_str": '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}, {"subject": "sports", "score": 70}]',
    },
    {
        "id": 2,
        "name": "QiangQiang",
        "age": 13,
        "interests": "game,read,fishing,pingpong",
        "interests_socre": {"game": 8, "read": 7, "fishing": 8, "pingpong": 9},
        "scores": {
            "scores": [
                {"subject": "math", "score": 85},
                {"subject": "language", "score": 92},
                {"subject": "sports", "score": 73},
            ],
            "count": 3,
        },
        "scores_str": '[{"subject": "math", "score": 85}, {"subject": "language", "score": 92}, {"subject": "sports", "score": 73}]',
    },
    {
        "id": 3,
        "name": "YuanYuan",
        "age": 12,
        "interests": "read,dance",
        "interests_socre": {"read": 7, "dance": 9},
        "scores": {
            "scores": [
                {"subject": "math", "score": 82},
                {"subject": "language", "score": 94},
                {"subject": "sports", "score": 78},
            ],
            "count": 3,
        },
        "scores_str": '[{"subject": "math", "score": 82}, {"subject": "language", "score": 94}, {"subject": "sports", "score": 78}]',
    },
]
df = spark.createDataFrame(data)
df.createOrReplaceTempView('df')
df.cache()
/usr/lib/spark/python/pyspark/sql/session.py:346: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
warnings.warn("inferring schema from dict is deprecated,"
DataFrame[age: bigint, id: bigint, interests: string, interests_socre: map<string,bigint>, name: string, scores: map<string,array<map<string,string>>>, scores_str: string]
print(df.schema)
StructType(List(StructField(age,LongType,true),StructField(id,LongType,true),StructField(interests,StringType,true),StructField(interests_socre,MapType(StringType,LongType,true),true),StructField(name,StringType,true),StructField(scores,MapType(StringType,ArrayType(MapType(StringType,StringType,true),true),true),true),StructField(scores_str,StringType,true)))
df.toPandas().head()
|   | age | id | interests | interests_socre | name | scores | scores_str |
|---|-----|----|-----------|-----------------|------|--------|------------|
| 0 | 12 | 1 | game,read,tv | {'tv': 8, 'game': 8, 'read': 7} | XiaoHua | {'count': None, 'scores': [{'score': '80', 'su... | [{"subject": "math", "score": 80}, {"subject":... |
| 1 | 13 | 2 | game,read,fishing,pingpong | {'game': 8, 'read': 7, 'pingpong': 9, 'fishing... | QiangQiang | {'count': None, 'scores': [{'score': '85', 'su... | [{"subject": "math", "score": 85}, {"subject":... |
| 2 | 12 | 3 | read,dance | {'dance': 9, 'read': 7} | YuanYuan | {'count': None, 'scores': [{'score': '82', 'su... | [{"subject": "math", "score": 82}, {"subject":... |
Using explode
spark.sql("""
select
    id
    ,name
    ,explode(split(interests, ',')) as interest
from df
order by id
""").toPandas()
|   | id | name | interest |
|---|----|------|----------|
| 0 | 1 | XiaoHua | game |
| 1 | 1 | XiaoHua | read |
| 2 | 1 | XiaoHua | tv |
| 3 | 2 | QiangQiang | game |
| 4 | 2 | QiangQiang | read |
| 5 | 2 | QiangQiang | fishing |
| 6 | 2 | QiangQiang | pingpong |
| 7 | 3 | YuanYuan | read |
| 8 | 3 | YuanYuan | dance |
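Conceptually, `explode(split(interests, ','))` is a flat map: each input row is duplicated once per element of the split array. A plain-Python sketch of the same semantics, on a hypothetical two-row subset of the sample data:

```python
# Two of the sample rows, as plain dicts (hypothetical mini-dataset).
rows = [
    {"id": 1, "name": "XiaoHua", "interests": "game,read,tv"},
    {"id": 3, "name": "YuanYuan", "interests": "read,dance"},
]

# explode(split(...)) semantics: one output row per array element,
# with the other columns repeated.
exploded = [
    {"id": r["id"], "name": r["name"], "interest": i}
    for r in rows
    for i in r["interests"].split(",")
]
print(exploded)  # 3 rows for XiaoHua + 2 rows for YuanYuan
```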
spark.sql("""
select
    id
    ,name
    ,explode(interests_socre) as (key, value)
from df
order by id
""").toPandas()
|   | id | name | key | value |
|---|----|------|-----|-------|
| 0 | 1 | XiaoHua | tv | 8 |
| 1 | 1 | XiaoHua | game | 8 |
| 2 | 1 | XiaoHua | read | 7 |
| 3 | 2 | QiangQiang | game | 8 |
| 4 | 2 | QiangQiang | read | 7 |
| 5 | 2 | QiangQiang | pingpong | 9 |
| 6 | 2 | QiangQiang | fishing | 8 |
| 7 | 3 | YuanYuan | dance | 9 |
| 8 | 3 | YuanYuan | read | 7 |
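Exploding a map column works the same way, except each map entry yields one (key, value) row. A plain-Python sketch using one of the sample rows:

```python
# One sample row as a plain dict.
row = {"id": 3, "name": "YuanYuan", "interests_socre": {"read": 7, "dance": 9}}

# explode(map) semantics: one output row per map entry.
exploded = [
    {"id": row["id"], "name": row["name"], "key": k, "value": v}
    for k, v in row["interests_socre"].items()
]
print(exploded)
```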
spark.sql("""
SELECT
    id
    ,name
    ,score.subject
    ,score.score
FROM(
    select
        id
        ,name
        ,explode(scores.scores) as score
    from df
) as base
""").toPandas()
|   | id | name | subject | score |
|---|----|------|---------|-------|
| 0 | 1 | XiaoHua | math | 80 |
| 1 | 1 | XiaoHua | language | 90 |
| 2 | 1 | XiaoHua | sports | 70 |
| 3 | 2 | QiangQiang | math | 85 |
| 4 | 2 | QiangQiang | language | 92 |
| 5 | 2 | QiangQiang | sports | 73 |
| 6 | 3 | YuanYuan | math | 82 |
| 7 | 3 | YuanYuan | language | 94 |
| 8 | 3 | YuanYuan | sports | 78 |
lateral view
Combining explode with lateral view
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
udtf
explode(ARRAY<T> a)
explode(MAP<Tkey, Tvalue> m)
posexplode(ARRAY<T> a)
inline(ARRAY<STRUCT<f1:T1, ..., fn:Tn>> a)
stack(int r, T1 V1, ..., Tn/r Vn)
json_tuple(string jsonStr, string k1, ..., string kn)
parse_url_tuple(string urlStr, string p1, ..., string pn)
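Of these, posexplode is simply explode plus the element's position in the array; in plain Python it behaves like `enumerate`:

```python
# posexplode yields (position, element) pairs, like Python's enumerate.
interests = "game,read,tv".split(",")
pos_exploded = list(enumerate(interests))
print(pos_exploded)  # [(0, 'game'), (1, 'read'), (2, 'tv')]
```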
spark.sql("""
select
    id
    ,name
    ,sc.subject
    ,sc.score
from df
lateral view explode(scores.scores) t as sc
""").toPandas()
|   | id | name | subject | score |
|---|----|------|---------|-------|
| 0 | 1 | XiaoHua | math | 80 |
| 1 | 1 | XiaoHua | language | 90 |
| 2 | 1 | XiaoHua | sports | 70 |
| 3 | 2 | QiangQiang | math | 85 |
| 4 | 2 | QiangQiang | language | 92 |
| 5 | 2 | QiangQiang | sports | 73 |
| 6 | 3 | YuanYuan | math | 82 |
| 7 | 3 | YuanYuan | language | 94 |
| 8 | 3 | YuanYuan | sports | 78 |
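One detail worth knowing: a plain LATERAL VIEW drops rows whose array is empty or NULL, while LATERAL VIEW OUTER keeps them and pads the generated columns with NULL. A plain-Python sketch of the two behaviors (the `lateral_view_explode` helper is hypothetical):

```python
def lateral_view_explode(rows, col, outer=False):
    """Simulate LATERAL VIEW [OUTER] explode over a list-valued column."""
    out = []
    for r in rows:
        arr = r.get(col) or []
        if not arr:
            if outer:  # OUTER keeps the row, padding with None
                out.append({**r, "elem": None})
            continue   # plain LATERAL VIEW silently drops the row
        for e in arr:
            out.append({**r, "elem": e})
    return out

rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]
print(lateral_view_explode(rows, "tags"))              # id 2 disappears
print(lateral_view_explode(rows, "tags", outer=True))  # id 2 kept, elem=None
```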
json_tuple can parse several fields in one call, whereas get_json_object parses only one field at a time.
1st: regexp_replace(scores_str, ' ', '')
remove the spaces from the string
2nd: regexp_extract(1st, '^\\\\[(.+)\\\\]$', 1)
strip the enclosing brackets '[]'
3rd: regexp_replace(2nd, '\\\\}\\\\,\\\\{', '\\\\}\\\\|\\\\|\\\\{')
rewrite "},{" as "}||{"
4th: split(3rd, '\\\\|\\\\|')
split the string into one dict per array element
5th: extract the fields from each dict
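The first four steps of this string surgery can be replicated in plain Python, which makes the intermediate values easy to inspect (the input below is a shortened two-element version of scores_str):

```python
import re

# A shortened, two-element version of the sample scores_str.
scores_str = '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}]'

step1 = scores_str.replace(" ", "")                # 1st: drop the spaces
step2 = re.match(r"^\[(.+)\]$", step1).group(1)    # 2nd: strip '[' and ']'
step3 = step2.replace("},{", "}||{")               # 3rd: "},{" -> "}||{"
parts = step3.split("||")                          # 4th: one JSON object string each
print(parts)
```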
spark.sql("""
select
    id
    ,name
    -- json_tuple
    ,v2.subject
    ,v2.score
    -- get_json_object
    ,get_json_object(sc, '$.subject') as subject_2
    ,get_json_object(sc, '$.score') as score_2
    -- json_tuple
    ,json_tuple(t.sc, 'subject', 'score') as (subject_3, score_3)
from(
    select
        id
        ,name
        ,split(
            regexp_replace(
                regexp_extract(regexp_replace(scores_str, ' ', ''), '^\\\\[(.+)\\\\]$', 1),
                '\\\\}\\\\,\\\\{',
                '\\\\}\\\\|\\\\|\\\\{'
            ), '\\\\|\\\\|') as scores
    from df
) as base
lateral view explode(base.scores) t as sc
lateral view json_tuple(t.sc, 'subject', 'score') v2 as subject, score
""").toPandas()
|   | id | name | subject | score | subject_2 | score_2 | subject_3 | score_3 |
|---|----|------|---------|-------|-----------|---------|-----------|---------|
| 0 | 1 | XiaoHua | math | 80 | math | 80 | math | 80 |
| 1 | 1 | XiaoHua | language | 90 | language | 90 | language | 90 |
| 2 | 1 | XiaoHua | sports | 70 | sports | 70 | sports | 70 |
| 3 | 2 | QiangQiang | math | 85 | math | 85 | math | 85 |
| 4 | 2 | QiangQiang | language | 92 | language | 92 | language | 92 |
| 5 | 2 | QiangQiang | sports | 73 | sports | 73 | sports | 73 |
| 6 | 3 | YuanYuan | math | 82 | math | 82 | math | 82 |
| 7 | 3 | YuanYuan | language | 94 | language | 94 | language | 94 |
| 8 | 3 | YuanYuan | sports | 78 | sports | 78 | sports | 78 |
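As an aside: on Spark 2.1+ the from_json function with an explicit schema can parse scores_str directly, avoiding the regexp surgery above entirely. Conceptually it is a real JSON parse, which in plain Python is just json.loads (the input below is a shortened two-element version of the sample scores_str):

```python
import json

# A shortened, two-element version of the sample scores_str.
scores_str = '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}]'

scores = json.loads(scores_str)  # a real JSON parse, no regex needed
subjects = [s["subject"] for s in scores]
print(subjects)  # ['math', 'language']
```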