Handling outliers in a Spark DataFrame

Original post

2020-02-23 17:11

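Outliers (anomalous data) are observations that deviate markedly from the distribution of the rest of the sample.
Definitions of "markedly" vary, but in the most common form you can consider the data free of outliers if every value falls roughly within the range Q1 - 1.5 × IQR to Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range.
For background on these terms, see the article "Understanding Box Plots" (《理解箱線圖》).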
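
As a quick illustration of the rule above, the following minimal sketch computes the two fences in plain Python, without Spark. The helper name `iqr_bounds` and the use of `statistics.quantiles` with `method="inclusive"` are assumptions made for this example. Different quartile conventions give slightly different fences, so these numbers need not match the ones Spark's `approxQuantile` produces later in this post.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# The weight column from the sample data used in this post
weights = [143.5, 154.2, 342.3, 144.5, 133.2, 124.1, 129.2]

low, high = iqr_bounds(weights)
# Anything outside the fences is treated as an outlier
flagged = [w for w in weights if w < low or w > high]

print(low, high)
print(flagged)
```

With this convention only the value 342.3 lies outside the fences, which agrees with the Spark result shown below.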

1. Run the following code first

from pyspark.sql import SparkSession

# Configure the Spark session for the current environment
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("outliers.py") \
    .getOrCreate()

sc = spark.sparkContext

df_outliers = spark.createDataFrame([
    (1, 143.5, 5.3, 28),
    (2, 154.2, 5.5, 45),
    (3, 342.3, 5.1, 99),
    (4, 144.5, 5.5, 33),
    (5, 133.2, 5.4, 54),
    (6, 124.1, 5.1, 21),
    (7, 129.2, 5.3, 42),
], ['id', 'weight', 'height', 'age'])

cols = ['weight', 'height', 'age']

bounds = {}

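for col in cols:
    # Compute the lower and upper cut-off points for each feature
    quantiles = df_outliers.approxQuantile(
        col=col,                        # column name
        probabilities=[0.25, 0.75],     # list of quantile probabilities, each in [0, 1]
        relativeError=0.05              # acceptable relative error; 0 gives exact quantiles at higher cost
    )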

    IQR = quantiles[1] - quantiles[0]

    bounds[col] = [
        quantiles[0] - 1.5 * IQR,
        quantiles[1] + 1.5 * IQR
    ]

print("Bounds:")
print(bounds, end="\n\n")

# Flag each value that falls outside the bounds of its column
outliers = df_outliers.select(*['id'] + [
    (
        (df_outliers[c] < bounds[c][0]) |
        (df_outliers[c] > bounds[c][1])
    ).alias(c + '_o') for c in cols
])
print("Outlier flags:")
outliers.show()

The output is:

Bounds:
{'weight': [91.69999999999999, 191.7], 'height': [4.499999999999999, 6.1000000000000005], 'age': [-11.0, 93.0]}

Outlier flags:
+---+--------+--------+-----+
| id|weight_o|height_o|age_o|
+---+--------+--------+-----+
|  1|   false|   false|false|
|  2|   false|   false|false|
|  3|    true|   false| true|
|  4|   false|   false|false|
|  5|   false|   false|false|
|  6|   false|   false|false|
|  7|   false|   false|false|
+---+--------+--------+-----+

The weight column contains one outlier and the age column contains one outlier, both in the row with id 3.
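
The weight bounds printed above, [91.7, 191.7], correspond to Q1 = 129.2 and Q3 = 154.2 (an IQR of 25). Both are actual sample values rather than interpolated ones. The sketch below illustrates why that is plausible: Spark's documentation describes `approxQuantile` as returning a sample x whose rank satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper `rank_window` and the 1-based rank convention are assumptions made for this illustration; it does not reproduce Spark's internal algorithm.

```python
import math


def rank_window(n, p, err):
    """1-based rank range allowed for quantile p with relative error err."""
    low = max(1, math.floor((p - err) * n))
    high = min(n, math.ceil((p + err) * n))
    return low, high


weights = sorted([143.5, 154.2, 342.3, 144.5, 133.2, 124.1, 129.2])
n = len(weights)

# Same probabilities and relativeError as in the approxQuantile call
q1_low, q1_high = rank_window(n, 0.25, 0.05)
q3_low, q3_high = rank_window(n, 0.75, 0.05)

# Values whose rank falls inside each window
q1_candidates = weights[q1_low - 1:q1_high]
q3_candidates = weights[q3_low - 1:q3_high]

print(q1_candidates)
print(q3_candidates)
```

Under these assumptions, 129.2 and 154.2 each fall inside their window, so a larger `relativeError` trades precision of the quartiles for speed.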

2. Then explore interactively in the Python console

If you are unfamiliar with joining and merging DataFrames, see the article "PANDAS Data Merging and Reshaping (join/merge)" (《PANDAS 數據合併與重塑（join/merge篇》).

In[1]:# Join df_outliers and outliers on the id column
In[2]:df_outliers = df_outliers.join(outliers, on='id')
In[3]:# Use the boolean weight_o column to select weights that differ markedly from the rest of the distribution
In[4]:df_outliers.filter('weight_o').select('id', 'weight').show()
In[5]:# Use the boolean age_o column to select ages that differ markedly from the rest of the distribution
In[6]:df_outliers.filter('age_o').select('id', 'age').show()
In[7]:df_outliers.show()
+---+------+------+---+--------+--------+-----+
| id|weight|height|age|weight_o|height_o|age_o|
+---+------+------+---+--------+--------+-----+
|  7| 129.2|   5.3| 42|   false|   false|false|
|  6| 124.1|   5.1| 21|   false|   false|false|
|  5| 133.2|   5.4| 54|   false|   false|false|
|  1| 143.5|   5.3| 28|   false|   false|false|
|  3| 342.3|   5.1| 99|    true|   false| true|
|  2| 154.2|   5.5| 45|   false|   false|false|
|  4| 144.5|   5.5| 33|   false|   false|false|
+---+------+------+---+--------+--------+-----+

In[8]: df_outliers.filter('weight_o').show()
+---+------+------+---+--------+--------+-----+
| id|weight|height|age|weight_o|height_o|age_o|
+---+------+------+---+--------+--------+-----+
|  3| 342.3|   5.1| 99|    true|   false| true|
+---+------+------+---+--------+--------+-----+

In[9]: df_outliers.filter('weight_o').select('id','weight').show()
+---+------+
| id|weight|
+---+------+
|  3| 342.3|
+---+------+

In[10]: df_outliers.filter('age_o').select('id','age').show()
+---+---+
| id|age|
+---+---+
|  3| 99|
+---+---+
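
Once the outliers are identified, a common next step is to keep only the rows that lie inside every column's bounds. The sketch below builds a SQL expression string for that purpose; `DataFrame.filter` accepts such strings, as the console session above already does with `'weight_o'`. The helper `inlier_condition` is a hypothetical name introduced here, and the `bounds` dictionary is copied from the output in section 1. The Spark call itself is left as a comment because it needs the live session.

```python
def inlier_condition(bounds):
    """Build a SQL expression that is true for rows inside every column's bounds."""
    clauses = [
        f"(`{name}` BETWEEN {low} AND {high})"   # BETWEEN is inclusive on both ends
        for name, (low, high) in bounds.items()
    ]
    return " AND ".join(clauses)


# Bounds as printed in section 1
bounds = {
    'weight': [91.69999999999999, 191.7],
    'height': [4.499999999999999, 6.1000000000000005],
    'age': [-11.0, 93.0],
}

condition = inlier_condition(bounds)
print(condition)

# With the Spark session from section 1 active, the rows without outliers
# could then be selected like this (not executed here):
# df_clean = df_outliers.filter(condition)
```

Building the condition from `bounds` keeps the filter in step with whatever columns and cut-off points were computed, rather than hard-coding thresholds.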
