PySpark SQL: flagging and extracting outliers with approxQuantile


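Spark SQL: flagging outliers and extracting outliers
Use the .approxQuantile(…) method to compute the first and third quartiles (Q1 and Q3); any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is treated as an outlier.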
 

from pyspark.sql import SparkSession

# Reuse the running session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

df_outliers = spark.createDataFrame([(1,143.5,5.3,28),
                                    (2,154.2,5.5,45),
                                    (3,342.3,5.1,99),
                                    (4,144.5,5.5,33),
                                    (5,133.2,5.4,54),
                                    (6,124.1,5.1,21),
                                    (7,129.2,5.3,42)],["id","weight","height","age"])
cols = ["weight","height","age"]
# bounds stores the lower and upper limit computed for each column
bounds = {}
for c in cols:
    # Approximate quartiles Q1 and Q3; the third argument is the relative error
    quantiles = df_outliers.approxQuantile(c, [0.25,0.75], 0.05)
    # Interquartile range
    IQR = quantiles[1] - quantiles[0]
    # Inner fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    bounds[c] = [quantiles[0] - 1.5*IQR, quantiles[1] + 1.5*IQR]

print("bounds:", bounds)
# Flag outliers: a value outside the inner fences is an outlier
outliers = df_outliers.select(
    "id",
    *[((df_outliers[c] < bounds[c][0]) | (df_outliers[c] > bounds[c][1])).alias(c + "_o")
      for c in cols])
outliers.show()
bounds: {'age': [-11.0, 93.0], 'height': [4.499999999999999, 6.1000000000000005], 'weight': [91.69999999999999, 191.7]}
+---+--------+--------+-----+
| id|weight_o|height_o|age_o|
+---+--------+--------+-----+
|  1|   false|   false|false|
|  2|   false|   false|false|
|  3|    true|   false| true|
|  4|   false|   false|false|
|  5|   false|   false|false|
|  6|   false|   false|false|
|  7|   false|   false|false|
+---+--------+--------+-----+
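
As a sanity check on the printed bounds, the fence arithmetic for the age column can be redone by hand in plain Python, without Spark. This is only a sketch: the quartile values 28 and 54 are an assumption inferred from the printed bounds [-11.0, 93.0], since the original output never shows the quantiles themselves.

```python
# Quartiles for "age": assumed values, back-derived from the printed bounds
q1, q3 = 28.0, 54.0

iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # lower inner fence
upper = q3 + 1.5 * iqr     # upper inner fence

# id -> age, copied from the sample DataFrame
ages = {1: 28, 2: 45, 3: 99, 4: 33, 5: 54, 6: 21, 7: 42}

# Same test as the Spark expression: outside the fences means outlier
age_flags = {row_id: (age < lower or age > upper) for row_id, age in ages.items()}

print(lower, upper)
print(age_flags)
```

With these assumed quartiles the fences come out as -11.0 and 93.0, and only id 3 (age 99) falls outside them, which agrees with the age_o column above.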

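# Join the flags back onto the original data so the outliers can be looked up
df_outliers = df_outliers.join(outliers, on='id')
# Do not write the join above as df_outliers.join(outliers, df_outliers.id == outliers.id); otherwise the
# new df_outliers contains two id columns, and a later select fails with AnalysisException: "Reference 'id' is ambiguous"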
df_outliers.show()
+---+------+------+---+--------+--------+-----+
| id|weight|height|age|weight_o|height_o|age_o|
+---+------+------+---+--------+--------+-----+
|  7| 129.2|   5.3| 42|   false|   false|false|
|  6| 124.1|   5.1| 21|   false|   false|false|
|  5| 133.2|   5.4| 54|   false|   false|false|
|  1| 143.5|   5.3| 28|   false|   false|false|
|  3| 342.3|   5.1| 99|    true|   false| true|
|  2| 154.2|   5.5| 45|   false|   false|false|
|  4| 144.5|   5.5| 33|   false|   false|false|
+---+------+------+---+--------+--------+-----+

df_outliers.filter('weight_o').select('id','weight').show()
+---+------+
| id|weight|
+---+------+
|  3| 342.3|
+---+------+

df_outliers.filter("age_o").select("id","age").show()
+---+---+
| id|age|
+---+---+
|  3| 99|
+---+---+
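
The flag-then-filter steps can also be cross-checked without Spark. The sketch below applies the printed bounds (rounded here for readability) to the same seven rows and collects every out-of-range cell; find_outliers is a hypothetical helper written for this illustration, not part of PySpark or of the original post.

```python
# Sample rows copied from the DataFrame: (id, weight, height, age)
rows = [
    (1, 143.5, 5.3, 28),
    (2, 154.2, 5.5, 45),
    (3, 342.3, 5.1, 99),
    (4, 144.5, 5.5, 33),
    (5, 133.2, 5.4, 54),
    (6, 124.1, 5.1, 21),
    (7, 129.2, 5.3, 42),
]
columns = ["weight", "height", "age"]

# Bounds taken from the printed output, rounded (assumption: rounding does not
# move any sample value across a fence)
bounds = {"weight": (91.7, 191.7), "height": (4.5, 6.1), "age": (-11.0, 93.0)}


def find_outliers(rows, columns, bounds):
    """Return (id, column, value) for every cell outside its column's fences."""
    found = []
    for row in rows:
        row_id, values = row[0], row[1:]
        for name, value in zip(columns, values):
            low, high = bounds[name]
            if value < low or value > high:
                found.append((row_id, name, value))
    return found


outlier_cells = find_outliers(rows, columns, bounds)
print(outlier_cells)
```

Only the weight and age of id 3 are reported, matching the two filter results above.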

 

[Reference]: https://www.jianshu.com/p/56cff9f6e0be
