SparkSQL | Table-Generating Functions

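Strictly speaking, lateral view and explode should have no place in a database, because the data they operate on violates first normal form (every attribute must be atomic). In practice, though, and especially in big-data workloads, rarely used data that must not be lost is often stored as JSON. In that situation you need to parse the JSON and expand the arrays and multi-key values it contains.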

Initializing a dataset

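from pyspark.sql import SparkSession

# Reuse the active session (notebooks and the pyspark shell already provide `spark`)
spark = SparkSession.builder.getOrCreate()

# Arbitrary made-up data; the values carry no meaning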
data = [
    {
        "id": 1,
        "name": "XiaoHua",
        "age": 12,
        "interests": "game,read,tv",
        "interests_socre": {'game': 8, 'read': 7, 'tv': 8},
        "scores": {
             "scores": [{
                    "subject": "math",
                    "score": 80
                }, {
                    "subject": "language",
                    "score": 90
                }, {
                    "subject": "sports",
                    "score": 70
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}, {"subject": "sports", "score": 70}]'
    },

    {
        "id": 2,
        "name": "QiangQiang",
        "age": 13,
        "interests": "game,read,fishing,pingpong",
        "interests_socre": {'game': 8, 'read': 7, 'fishing': 8, 'pingpong': 9},
        "scores": {
             "scores": [{
                    "subject": "math",
                    "score": 85
                }, {
                    "subject": "language",
                    "score": 92
                }, {
                    "subject": "sports",
                    "score": 73
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 85}, {"subject": "language", "score": 92}, {"subject": "sports", "score": 73}]'
    },    

    {
        "id": 3,
        "name": "YuanYuan",
        "age": 12,
        "interests": "read,dance",
        "interests_socre": {'read': 7, 'dance': 9},
        "scores": {
             "scores": [{
                    "subject": "math",
                    "score": 82
                }, {
                    "subject": "language",
                    "score": 94
                }, {
                    "subject": "sports",
                    "score": 78
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 82}, {"subject": "language", "score": 94}, {"subject": "sports", "score": 78}]'
    }      
]
df = spark.createDataFrame(data)
df.createOrReplaceTempView('df')
df.cache()
/usr/lib/spark/python/pyspark/sql/session.py:346: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"





DataFrame[age: bigint, id: bigint, interests: string, interests_socre: map<string,bigint>, name: string, scores: map<string,array<map<string,string>>>, scores_str: string]
print(df.schema)
StructType(List(StructField(age,LongType,true),StructField(id,LongType,true),StructField(interests,StringType,true),StructField(interests_socre,MapType(StringType,LongType,true),true),StructField(name,StringType,true),StructField(scores,MapType(StringType,ArrayType(MapType(StringType,StringType,true),true),true),true),StructField(scores_str,StringType,true)))
df.toPandas().head()
   age  id  interests                   interests_socre                                    name        scores                                             scores_str
0   12   1  game,read,tv                {'tv': 8, 'game': 8, 'read': 7}                    XiaoHua     {'count': None, 'scores': [{'score': '80', 'su...  [{"subject": "math", "score": 80}, {"subject":...
1   13   2  game,read,fishing,pingpong  {'game': 8, 'read': 7, 'pingpong': 9, 'fishing...  QiangQiang  {'count': None, 'scores': [{'score': '85', 'su...  [{"subject": "math", "score": 85}, {"subject":...
2   12   3  read,dance                  {'dance': 9, 'read': 7}                            YuanYuan    {'count': None, 'scores': [{'score': '82', 'su...  [{"subject": "math", "score": 82}, {"subject":...

Using explode

# Array
spark.sql("""
select 
    id
    ,name
    ,explode(split(interests, ',')) as interest
from df
order by id
""").toPandas()
   id  name        interest
0   1  XiaoHua     game
1   1  XiaoHua     read
2   1  XiaoHua     tv
3   2  QiangQiang  game
4   2  QiangQiang  read
5   2  QiangQiang  fishing
6   2  QiangQiang  pingpong
7   3  YuanYuan    read
8   3  YuanYuan    dance
# Map
spark.sql("""
select 
    id
    ,name    
    ,explode(interests_socre) as (key, value)
from df
order by id
""").toPandas()
   id  name        key       value
0   1  XiaoHua     tv        8
1   1  XiaoHua     game      8
2   1  XiaoHua     read      7
3   2  QiangQiang  game      8
4   2  QiangQiang  read      7
5   2  QiangQiang  pingpong  9
6   2  QiangQiang  fishing   8
7   3  YuanYuan    dance     9
8   3  YuanYuan    read      7
# struct
spark.sql("""
SELECT
    id
    ,name
    ,score.subject
    ,score.score
FROM(
    select 
        id
        ,name
        ,explode(scores.scores) as score
    from df
) as base
""").toPandas()
   id  name        subject   score
0   1  XiaoHua     math      80
1   1  XiaoHua     language  90
2   1  XiaoHua     sports    70
3   2  QiangQiang  math      85
4   2  QiangQiang  language  92
5   2  QiangQiang  sports    73
6   3  YuanYuan    math      82
7   3  YuanYuan    language  94
8   3  YuanYuan    sports    78
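
To make the row-multiplying behaviour of explode concrete, here is a minimal plain-Python sketch of what it does to an array column and to a map column. It is only a mental model, not Spark's implementation: rows are modelled as dicts, and the helper names `explode_array` and `explode_map` are hypothetical.

```python
def explode_array(rows, column, alias):
    """Emit one output row per element of row[column] (a list)."""
    out = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != column}
        # A null or empty array yields no output rows, as with explode
        for element in row[column] or []:
            out.append({**base, alias: element})
    return out


def explode_map(rows, column, key_alias="key", value_alias="value"):
    """Emit one output row per (key, value) pair of row[column] (a dict)."""
    out = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != column}
        for key, value in (row[column] or {}).items():
            out.append({**base, key_alias: key, value_alias: value})
    return out


rows = [
    {"id": 1, "name": "XiaoHua", "interests": "game,read,tv".split(",")},
    {"id": 3, "name": "YuanYuan", "interests": "read,dance".split(",")},
]
exploded = explode_array(rows, "interests", "interest")
for row in exploded:
    print(row)
```

The two input rows become five output rows, one per interest, with `id` and `name` repeated on each, which is the same shape as the first query result above.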

lateral view

Combining explode with lateral view

  • lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
  • fromClause: FROM baseTable (lateralView)*

Built-in table-generating functions (UDTF)

  • explode(ARRAY a)
  • explode(MAP<Tkey,Tvalue> m)
  • posexplode(ARRAY a)
  • inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a)
  • stack(int r,T1 V1,…,Tn/r Vn)
  • json_tuple(string jsonStr,string k1,…,string kn)
  • parse_url_tuple(string urlStr,string p1,…,string pn)
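
Three of the less familiar generators in this list can be modelled in a few lines of plain Python. This is a sketch of their output shape only, under the assumption that arrays are lists and structs are dicts; the function names mirror the SQL names for readability but are ordinary Python helpers, not Spark APIs.

```python
import math


def posexplode(array):
    """Like explode, but each element comes with its position (starting at 0)."""
    return list(enumerate(array or []))


def inline(array_of_structs):
    """One output row per struct, one output column per struct field."""
    return [dict(struct) for struct in (array_of_structs or [])]


def stack(n, *values):
    """Spread the values over n rows, padding the last row with None."""
    width = math.ceil(len(values) / n)
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]


print(posexplode(["game", "read", "tv"]))
print(inline([{"subject": "math", "score": 80}, {"subject": "sports", "score": 70}]))
print(stack(2, 1, 2, 3))
```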
# The struct example above can use lateral view to avoid the nested subquery
spark.sql("""
select 
    id
    ,name
    ,sc.subject
    ,sc.score
from df
lateral view explode(scores.scores) t as sc
""").toPandas()
   id  name        subject   score
0   1  XiaoHua     math      80
1   1  XiaoHua     language  90
2   1  XiaoHua     sports    70
3   2  QiangQiang  math      85
4   2  QiangQiang  language  92
5   2  QiangQiang  sports    73
6   3  YuanYuan    math      82
7   3  YuanYuan    language  94
8   3  YuanYuan    sports    78
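
Conceptually, a lateral view joins each input row with the rows that the generator produces for that row, and a later lateral view can refer to columns introduced by an earlier one. The sketch below models this in plain Python. `lateral_view` is a hypothetical helper, and its `outer` flag imitates `LATERAL VIEW OUTER`, which keeps input rows whose generator output is empty.

```python
def lateral_view(rows, generator, aliases, outer=False):
    """Join every input row with the rows that generator(row) produces."""
    out = []
    for row in rows:
        generated = list(generator(row))
        if not generated and outer:
            # OUTER keeps the input row and fills the new columns with None
            generated = [(None,) * len(aliases)]
        for values in generated:
            out.append({**row, **dict(zip(aliases, values))})
    return out


base = [
    {"id": 1, "scores": [{"subject": "math", "score": 80},
                         {"subject": "sports", "score": 70}]},
    {"id": 2, "scores": []},
]

# First view: one row per array element, exposed as column "sc"
step1 = lateral_view(base, lambda r: [(s,) for s in r["scores"]], ["sc"])
# Second view: reads the "sc" column created by the first view
step2 = lateral_view(
    step1, lambda r: [(r["sc"]["subject"], r["sc"]["score"])], ["subject", "score"]
)
# With outer=True the row for id 2 survives even though its array is empty
kept = lateral_view(base, lambda r: [(s,) for s in r["scores"]], ["sc"], outer=True)

for row in step2:
    print(row["id"], row["subject"], row["score"])
```

Without `outer`, the row with `id` 2 disappears from the result because its generator produced nothing, which is worth keeping in mind when row counts must be preserved.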

json_tuple can parse several fields in a single call, whereas get_json_object extracts only one field per call.
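
The difference can be illustrated with a small plain-Python model built on the standard `json` module. The helpers `get_one_field` and `get_many_fields` are hypothetical stand-ins for the two Spark functions, limited to top-level keys; like the Spark functions, they return values as strings and yield None for missing keys or invalid JSON.

```python
import json


def _as_text(value):
    """Render a parsed JSON value as a string, keeping None as None."""
    if value is None:
        return None
    return value if isinstance(value, str) else json.dumps(value)


def get_one_field(json_str, key):
    """Model of get_json_object(json_str, '$.key'): one parse, one field."""
    try:
        return _as_text(json.loads(json_str).get(key))
    except (ValueError, AttributeError):
        return None


def get_many_fields(json_str, *keys):
    """Model of json_tuple(json_str, k1, ..., kn): one parse, many fields."""
    try:
        obj = json.loads(json_str)
    except ValueError:
        obj = {}
    if not isinstance(obj, dict):
        obj = {}
    return tuple(_as_text(obj.get(key)) for key in keys)


sc = '{"subject":"math","score":80}'
print(get_one_field(sc, "subject"), get_one_field(sc, "score"))  # two parses
print(get_many_fields(sc, "subject", "score"))                   # one parse
```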

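scores_str holds a JSON array as a plain string, so the query below first breaks it into individual JSON objects:

  • Step 1: regexp_replace(scores_str, ' ', '') removes the spaces from the string
  • Step 2: regexp_extract(step 1, '^\\\\[(.+)\\\\]$', 1) strips the enclosing brackets '[]'
  • Step 3: regexp_replace(step 2, '\\\\}\\\\,\\\\{', '\\\\}\\\\|\\\\|\\\\{') turns "},{" into "}||{"
  • Step 4: split(step 3, '\\\\|\\\\|') splits the string into one JSON object per array element
  • Step 5: extract the fields from each JSON object

The patterns are shown as they are written inside the Python string passed to spark.sql: Python reduces each \\\\ to \\, and Spark SQL's string-literal parsing reduces that to the single \ that the regex engine sees. Note that step 1 removes every space, including any inside string values.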
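
The same five steps can be traced outside Spark with Python's `re` and `json` modules, which makes the intermediate strings easy to inspect. This is a sketch of the string manipulation only; the variable names `step1` to `step5` are illustrative, and it assumes the input really is a bracketed array so that the step 2 match succeeds.

```python
import json
import re

scores_str = (
    '[{"subject": "math", "score": 80}, '
    '{"subject": "language", "score": 90}, '
    '{"subject": "sports", "score": 70}]'
)

# Step 1: remove all spaces
step1 = re.sub(r" ", "", scores_str)
# Step 2: strip the enclosing square brackets
step2 = re.search(r"^\[(.+)\]$", step1).group(1)
# Step 3: mark the boundaries between objects, "},{" becomes "}||{"
step3 = re.sub(r"\},\{", "}||{", step2)
# Step 4: split on the marker to get one JSON object string per element
step4 = re.split(r"\|\|", step3)
# Step 5: pull the fields out of each object
step5 = [json.loads(item) for item in step4]

for item in step5:
    print(item["subject"], item["score"])
```

In plain Python a single `json.loads(scores_str)` would of course do the whole job; the regex route is shown only because it mirrors what the SQL string functions do.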
# Parsing the JSON string
spark.sql("""
select
    id
    ,name
    -- json_tuple via lateral view (alias v2)
    ,v2.subject
    ,v2.score
    -- get_json_object
    ,get_json_object(sc, '$.subject') as subject_2
    ,get_json_object(sc, '$.score') as score_2
    -- json_tuple called directly in the select list
    ,json_tuple(t.sc,'subject','score') as (subject_3, score_3)
from(
    select 
        id
        ,name
        ,split(
            regexp_replace(
                regexp_extract(regexp_replace(scores_str, ' ', ''),'^\\\\[(.+)\\\\]$', 1),
                '\\\\}\\\\,\\\\{',
                '\\\\}\\\\|\\\\|\\\\{'
            ), '\\\\|\\\\|') as scores
    from df
) as base
lateral view explode(base.scores) t as sc
lateral view json_tuple(t.sc,'subject','score') v2 as subject,score
""").toPandas()
   id  name        subject   score  subject_2  score_2  subject_3  score_3
0   1  XiaoHua     math      80     math       80       math       80
1   1  XiaoHua     language  90     language   90       language   90
2   1  XiaoHua     sports    70     sports     70       sports     70
3   2  QiangQiang  math      85     math       85       math       85
4   2  QiangQiang  language  92     language   92       language   92
5   2  QiangQiang  sports    73     sports     73       sports     73
6   3  YuanYuan    math      82     math       82       math       82
7   3  YuanYuan    language  94     language   94       language   94
8   3  YuanYuan    sports    78     sports     78       sports     78
