一、賽題數據
賽題以預測二手車的交易價格爲任務,數據集報名後可見並可下載,該數據來自某交易平臺的二手車交易記錄,總數據量超過40w,包含31列變量信息,其中15列爲匿名變量。爲了保證比賽的公平性,將會從中抽取15萬條作爲訓練集,5萬條作爲測試集A,5萬條作爲測試集B,同時會對name、model、brand和regionCode等信息進行脫敏
二、評測標準
評價標準爲MAE(Mean Absolute Error)。
MAE越小,說明模型預測得越準確。
二、 EDA-數據探索性分析
Tip:此部分爲零基礎入門數據挖掘的 Task2 EDA-數據探索性分析 部分,帶你來了解數據,熟悉數據,和數據做朋友,歡迎大家後續多多交流。
賽題:零基礎入門數據挖掘 - 二手車交易價格預測
2.1 EDA目標
-
EDA的價值主要在於熟悉數據集,瞭解數據集,對數據集進行驗證來確定所獲得數據集可以用於接下來的機器學習或者深度學習使用。
-
當了解了數據集之後我們下一步就是要去了解變量間的相互關係以及變量與預測值之間的存在關係。
-
引導數據科學從業者進行數據處理以及特徵工程的步驟,使數據集的結構和特徵集讓接下來的預測問題更加可靠。
-
完成對於數據的探索性分析,並對於數據進行一些圖表或者文字總結並打卡。
2.2 內容介紹
- 載入各種數據科學以及可視化庫:
- 數據科學庫 pandas、numpy、scipy;
- 可視化庫 matplotlib、seabon;
- 其他;
- 載入數據:
- 載入訓練集和測試集;
- 簡略觀察數據(head()+shape);
- 數據總覽:
- 通過describe()來熟悉數據的相關統計量
- 通過info()來熟悉數據類型
- 判斷數據缺失和異常
- 查看每列的存在nan情況
- 異常值檢測
- 瞭解預測值的分佈
- 總體分佈概況(無界約翰遜分佈等)
- 查看skewness and kurtosis
- 查看預測值的具體頻數
- 特徵分爲類別特徵和數字特徵,並對類別特徵查看unique分佈
- 數字特徵分析
- 相關性分析
- 查看幾個特徵得 偏度和峯值
- 每個數字特徵得分佈可視化
- 數字特徵相互之間的關係可視化
- 多變量互相迴歸關係可視化
- 類型特徵分析
- unique分佈
- 類別特徵箱形圖可視化
- 類別特徵的小提琴圖可視化
- 類別特徵的柱形圖可視化類別
- 特徵的每個類別頻數可視化(count_plot)
- 用pandas_profiling生成數據報告
2.3 代碼示例
2.3.1 載入各種數據科學以及可視化庫
以下庫都是pip install 安裝, 有特殊情況我會單獨說明 例如 pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple
1
#coding:utf-8
2
#導入warnings包,利用過濾器來實現忽略警告語句。
3
import warnings
4
warnings.filterwarnings('ignore')
5
6
import pandas as pd
7
import numpy as np
8
import matplotlib.pyplot as plt
9
import seaborn as sns
10
import missingno as msno
2.3.2 載入數據
1
## 1) 載入訓練集和測試集;
2
path = './datalab/231784/'
3
Train_data = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
4
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')
所有特徵集均脫敏處理(方便大家觀看)
- name - 汽車編碼
- regDate - 汽車註冊時間
- model - 車型編碼
- brand - 品牌
- bodyType - 車身類型
- fuelType - 燃油類型
- gearbox - 變速箱
- power - 汽車功率
- kilometer - 汽車行駛公里
- notRepairedDamage - 汽車有尚未修復的損壞
- regionCode - 看車地區編碼
- seller - 銷售方
- offerType - 報價類型
- creatDate - 廣告發布時間
- price - 汽車價格
- v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' 【匿名特徵,包含v0-14在內15個匿名特徵】
1
## 2) 簡略觀察數據(head()+shape)
2
Train_data.head().append(Train_data.tail())
[3]:
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
149995 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
149996 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
149997 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
149998 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
149999 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
10 rows × 31 columns
1
Train_data.shape
[4]:
(150000, 31)
1
Test_data.head().append(Test_data.tail())
[5]:
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 150000 | 66932 | 20111212 | 222.0 | 4 | 5.0 | 1.0 | 1.0 | 313 | 15.0 | ... | 0.264405 | 0.121800 | 0.070899 | 0.106558 | 0.078867 | -7.050969 | -0.854626 | 4.800151 | 0.620011 | -3.664654 |
1 | 150001 | 174960 | 19990211 | 19.0 | 21 | 0.0 | 0.0 | 0.0 | 75 | 12.5 | ... | 0.261745 | 0.000000 | 0.096733 | 0.013705 | 0.052383 | 3.679418 | -0.729039 | -3.796107 | -1.541230 | -0.757055 |
2 | 150002 | 5356 | 20090304 | 82.0 | 21 | 0.0 | 0.0 | 0.0 | 109 | 7.0 | ... | 0.260216 | 0.112081 | 0.078082 | 0.062078 | 0.050540 | -4.926690 | 1.001106 | 0.826562 | 0.138226 | 0.754033 |
3 | 150003 | 50688 | 20100405 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | 160 | 7.0 | ... | 0.260466 | 0.106727 | 0.081146 | 0.075971 | 0.048268 | -4.864637 | 0.505493 | 1.870379 | 0.366038 | 1.312775 |
4 | 150004 | 161428 | 19970703 | 26.0 | 14 | 2.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.250999 | 0.000000 | 0.077806 | 0.028600 | 0.081709 | 3.616475 | -0.673236 | -3.197685 | -0.025678 | -0.101290 |
49995 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 |
49996 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 |
49997 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 |
49998 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 |
49999 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 |
10 rows × 30 columns
1
Test_data.shape
[6]:
(50000, 30)
要養成看數據集的head()以及shape的習慣,這會讓你每一步更放心,導致接下里的連串的錯誤, 如果對自己的pandas等操作不放心,建議執行一步看一下,這樣會有效的方便你進行理解函數並進行操作
2.3.3 總覽數據概況
- describe種有每列的統計量,個數count、平均值mean、方差std、最小值min、中位數25% 50% 75% 、以及最大值 看這個信息主要是瞬間掌握數據的大概的範圍以及每個值的異常值的判斷,比如有的時候會發現999 9999 -1 等值這些其實都是nan的另外一種表達方式,有的時候需要注意下
- info 通過info來了解數據每列的type,有助於瞭解是否存在除了nan以外的特殊符號異常
1
## 1) 通過describe()來熟悉數據的相關統計量
2
Train_data.describe()
[7]:
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 149999.000000 | 150000.000000 | 145494.000000 | 141320.000000 | 144019.000000 | 150000.000000 | 150000.000000 | ... | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.129021 | 8.052733 | 1.792369 | 0.375842 | 0.224943 | 119.316547 | 12.597160 | ... | 0.248204 | 0.044923 | 0.124692 | 0.058144 | 0.061996 | -0.001000 | 0.009035 | 0.004813 | 0.000313 | -0.000688 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.536040 | 7.864956 | 1.760640 | 0.548677 | 0.417546 | 177.168419 | 3.919576 | ... | 0.045804 | 0.051743 | 0.201410 | 0.029186 | 0.035692 | 3.772386 | 3.286071 | 2.517478 | 1.288988 | 1.038685 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.168192 | -5.558207 | -9.639552 | -4.153899 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243615 | 0.000038 | 0.062474 | 0.035334 | 0.033930 | -3.722303 | -1.951543 | -1.871846 | -1.057789 | -0.437034 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | ... | 0.257798 | 0.000812 | 0.095866 | 0.057014 | 0.058484 | 1.624076 | -0.358053 | -0.130753 | -0.036245 | 0.141246 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265297 | 0.102009 | 0.125243 | 0.079382 | 0.087491 | 2.844357 | 1.255022 | 1.776933 | 0.942813 | 0.680378 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19312.000000 | 15.000000 | ... | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
8 rows × 30 columns
1
Test_data.describe()
[8]:
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 | 48587.000000 | 47107.000000 | 48090.000000 | 50000.000000 | 50000.000000 | ... | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 |
mean | 174999.500000 | 68542.223280 | 2.003393e+07 | 46.844520 | 8.056240 | 1.782185 | 0.373405 | 0.224350 | 119.883620 | 12.595580 | ... | 0.248669 | 0.045021 | 0.122744 | 0.057997 | 0.062000 | -0.017855 | -0.013742 | -0.013554 | -0.003147 | 0.001516 |
std | 14433.901067 | 61052.808133 | 5.368870e+04 | 49.469548 | 7.819477 | 1.760736 | 0.546442 | 0.417158 | 185.097387 | 3.908979 | ... | 0.044601 | 0.051766 | 0.195972 | 0.029211 | 0.035653 | 3.747985 | 3.231258 | 2.515962 | 1.286597 | 1.027360 |
min | 150000.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.160049 | -5.411964 | -8.916949 | -4.123333 | -6.112667 |
25% | 162499.750000 | 11203.500000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | ... | 0.243762 | 0.000044 | 0.062644 | 0.035084 | 0.033714 | -3.700121 | -1.971325 | -1.876703 | -1.060428 | -0.437920 |
50% | 174999.500000 | 52248.500000 | 2.003091e+07 | 29.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 109.000000 | 15.000000 | ... | 0.257877 | 0.000815 | 0.095828 | 0.057084 | 0.058764 | 1.613212 | -0.355843 | -0.142779 | -0.035956 | 0.138799 |
75% | 187499.250000 | 118856.500000 | 2.007110e+07 | 65.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | ... | 0.265328 | 0.102025 | 0.125438 | 0.079077 | 0.087489 | 2.832708 | 1.262914 | 1.764335 | 0.941469 | 0.681163 |
max | 199999.000000 | 196805.000000 | 2.015121e+07 | 246.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 20000.000000 | 15.000000 | ... | 0.291618 | 0.153265 | 1.358813 | 0.156355 | 0.214775 | 12.338872 | 18.856218 | 12.950498 | 5.913273 | 2.624622 |
8 rows × 29 columns
1
## 2) 通過info()來熟悉數據類型
2
Train_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 150000 entries, 0 to 149999 Data columns (total 31 columns): SaleID 150000 non-null int64 name 150000 non-null int64 regDate 150000 non-null int64 model 149999 non-null float64 brand 150000 non-null int64 bodyType 145494 non-null float64 fuelType 141320 non-null float64 gearbox 144019 non-null float64 power 150000 non-null int64 kilometer 150000 non-null float64 notRepairedDamage 150000 non-null object regionCode 150000 non-null int64 seller 150000 non-null int64 offerType 150000 non-null int64 creatDate 150000 non-null int64 price 150000 non-null int64 v_0 150000 non-null float64 v_1 150000 non-null float64 v_2 150000 non-null float64 v_3 150000 non-null float64 v_4 150000 non-null float64 v_5 150000 non-null float64 v_6 150000 non-null float64 v_7 150000 non-null float64 v_8 150000 non-null float64 v_9 150000 non-null float64 v_10 150000 non-null float64 v_11 150000 non-null float64 v_12 150000 non-null float64 v_13 150000 non-null float64 v_14 150000 non-null float64 dtypes: float64(20), int64(10), object(1) memory usage: 35.5+ MB
1
Test_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 30 columns): SaleID 50000 non-null int64 name 50000 non-null int64 regDate 50000 non-null int64 model 50000 non-null float64 brand 50000 non-null int64 bodyType 48587 non-null float64 fuelType 47107 non-null float64 gearbox 48090 non-null float64 power 50000 non-null int64 kilometer 50000 non-null float64 notRepairedDamage 50000 non-null object regionCode 50000 non-null int64 seller 50000 non-null int64 offerType 50000 non-null int64 creatDate 50000 non-null int64 v_0 50000 non-null float64 v_1 50000 non-null float64 v_2 50000 non-null float64 v_3 50000 non-null float64 v_4 50000 non-null float64 v_5 50000 non-null float64 v_6 50000 non-null float64 v_7 50000 non-null float64 v_8 50000 non-null float64 v_9 50000 non-null float64 v_10 50000 non-null float64 v_11 50000 non-null float64 v_12 50000 non-null float64 v_13 50000 non-null float64 v_14 50000 non-null float64 dtypes: float64(20), int64(9), object(1) memory usage: 11.4+ MB
2.3.4 判斷數據缺失和異常
1
## 1) 查看每列的存在nan情況
2
Train_data.isnull().sum()
[11]:
SaleID 0 name 0 regDate 0 model 1 brand 0 bodyType 4506 fuelType 8680 gearbox 5981 power 0 kilometer 0 notRepairedDamage 0 regionCode 0 seller 0 offerType 0 creatDate 0 price 0 v_0 0 v_1 0 v_2 0 v_3 0 v_4 0 v_5 0 v_6 0 v_7 0 v_8 0 v_9 0 v_10 0 v_11 0 v_12 0 v_13 0 v_14 0 dtype: int64
1
Test_data.isnull().sum()
[12]:
SaleID 0 name 0 regDate 0 model 0 brand 0 bodyType 1413 fuelType 2893 gearbox 1910 power 0 kilometer 0 notRepairedDamage 0 regionCode 0 seller 0 offerType 0 creatDate 0 v_0 0 v_1 0 v_2 0 v_3 0 v_4 0 v_5 0 v_6 0 v_7 0 v_8 0 v_9 0 v_10 0 v_11 0 v_12 0 v_13 0 v_14 0 dtype: int64
1
# nan可視化
2
missing = Train_data.isnull().sum()
3
missing = missing[missing > 0]
4
missing.sort_values(inplace=True)
5
missing.plot.bar()