Kaggle Project: PUBG Finish Placement Prediction (Part 1), Exploratory Analysis

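The data comes from Kaggle; a copy is also available here (extraction code: wymx). The competition ended a month ago, so this is just for practice.
Python code included, and plenty of figures ahead!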

0. Problem background

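In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId), which are ranked at the end of the game (winPlacePerc) according to how many other teams are still alive when they are eliminated. During a game, players can pick up different munitions, revive teammates who have been knocked down but not yet killed, drive vehicles, swim, run, shoot, and experience all of the consequences, such as falling too far or running themselves over and eliminating themselves.
You are given a large number of anonymized PUBG game statistics, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom games. There is no guarantee that a match has 100 players, nor that a group has at most 4 players.
You must build a model that predicts each player's finishing placement from their final stats, on a scale from 1 (first place) to 0 (last place).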

1. Variable definitions

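DBNOs - number of enemy players knocked down
assists - number of enemy players this player damaged that were then killed by teammates
boosts - number of boost items used
damageDealt - total damage dealt, with self-inflicted damage subtracted
headshotKills - number of enemy players killed with headshots
heals - number of healing items used
Id - player ID
killPlace - ranking in the match by number of enemy players killed
killPoints - kills-based external ranking of the player. Think of it as an Elo ranking where only kills matter. If rankPoints holds a value other than -1, any 0 in killPoints should be treated as "None".
killStreaks - maximum number of enemy players killed in a short amount of time
kills - number of enemy players killed
longestKill - longest distance between the player and a player they killed, measured at the time of death. This can be misleading, since downing a player and then driving away can produce a large longestKill value.
matchDuration - duration of the match in seconds
matchId - ID of the match (one ID per game)
matchType - game mode (solo/duo/squad). The standard modes are "solo", "duo", "squad", "solo-fpp", "duo-fpp", and "squad-fpp"; other modes come from events or custom matches.
rankPoints - Elo-like ranking of the player. This ranking is inconsistent and is deprecated in the next version of the API, so use it with caution. A value of -1 stands for "None".
revives - number of times the player revived teammates
rideDistance - distance traveled in vehicles, in meters
roadKills - number of kills made while in a vehicle
swimDistance - distance traveled by swimming, in meters
teamKills - number of times the player killed a teammate
vehicleDestroys - number of vehicles destroyed
walkDistance - distance traveled on foot, in meters
weaponsAcquired - number of weapons picked up
winPoints - win-based external ranking of the player. Think of it as an Elo ranking where only winning matters. If rankPoints holds a value other than -1, any 0 in winPoints should be treated as "None".
groupId - ID of the group within a match. If the same group of players plays in different matches, they get a different groupId each time.
numGroups - number of groups in the match for which player data is available
maxPlace - worst placement in the match for which data is available (this may not match numGroups, because the data sometimes skips placements)
winPlacePerc - the prediction target. It is a percentile placement between 0 and 1, where 1 corresponds to first place and 0 to last place. It is computed from maxPlace rather than numGroups, so a match may be missing some groups.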
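The "None" sentinels described for rankPoints, killPoints, and winPoints can be turned into real missing values before any analysis. The following is a minimal sketch under stated assumptions: the toy DataFrame holds made-up values (not rows from train_V2.csv), and the helper name mask_missing_points is hypothetical.

```python
import pandas as pd

def mask_missing_points(df):
    """Return a copy in which the 'None' sentinels of the rating columns are NaN."""
    out = df.copy()
    # rankPoints uses -1 to mean "None"
    has_rank = out['rankPoints'] != -1
    # When rankPoints is present, a 0 in killPoints / winPoints means "None"
    for col in ['killPoints', 'winPoints']:
        out[col] = out[col].astype(float).mask(has_rank & (out[col] == 0))
    out['rankPoints'] = out['rankPoints'].astype(float).mask(~has_rank)
    return out

# Made-up example rows
toy = pd.DataFrame({
    'rankPoints': [-1, 1500, 1480],
    'killPoints': [1200, 0, 0],
    'winPoints': [1450, 0, 0],
})
cleaned = mask_missing_points(toy)
print(cleaned)
```

With NaN in place of the sentinels, means and correlations no longer treat -1 or 0 as real ratings.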

import pandas as pd
# Load the training set
origin_data = pd.read_csv('train_V2.csv')
print(origin_data.shape)
#origin_data.head()

(4446966, 29)

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns  # data visualization
import numpy as np

The data covers 47,965 games (47,965 distinct matchId values).

# Count the distinct matches
print(origin_data['matchId'].nunique())
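Beyond the total number of matches, the same grouping gives the number of players and groups in each match, which is a quick check on the "up to 100 players" statement. A small sketch on a made-up table (the IDs below are invented for illustration):

```python
import pandas as pd

# Made-up rows: one row per player, as in the competition data
toy = pd.DataFrame({
    'Id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'groupId': ['g1', 'g1', 'g2', 'g3', 'g3'],
    'matchId': ['m1', 'm1', 'm1', 'm2', 'm2'],
})

n_matches = toy['matchId'].nunique()                      # distinct matches
players_per_match = toy.groupby('matchId')['Id'].count()  # rows (players) per match
groups_per_match = toy.groupby('matchId')['groupId'].nunique()

print(n_matches)
print(players_per_match)
print(groups_per_match)
```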

2. Exploratory analysis: single variables

(1) Distribution of integer (int) variables

def feature_barplot(feature, df_train = origin_data, figsize=(15,6), rot = 90, saveimg = False):
    # Count how often each distinct value of the feature occurs
    feat_train = df_train[feature].value_counts()
    fig_feature, axis1 = plt.subplots(1, 1, figsize = figsize)
    # Pass x and y by keyword; recent seaborn releases do not accept them positionally
    sns.barplot(x = feat_train.index.values, y = feat_train.values, ax = axis1)
    axis1.tick_params(axis = 'x', labelrotation = rot)
    axis1.set_title(feature + ' of training dataset')
    axis1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg:
        figname = feature + ".png"
        fig_feature.savefig(figname, dpi = 75)

feature_barplot('DBNOs') # distribution of the number of enemies knocked down

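Plotted directly like this, the distribution is extremely right-skewed: very few players have a DBNOs value above 6, 7, or 8. It therefore makes sense to merge all values with DBNOs > x into a single category. To choose x in a principled way, compute the 99th percentile and put everything above it into one category.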

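# Compute the 99th percentile once, then relabel only the DBNOs_new column.
# (Assigning through .loc without a column label would overwrite every column of the matching rows.)
dbnos_q99 = origin_data['DBNOs'].quantile(0.99)
origin_data['DBNOs_new'] = origin_data['DBNOs'].astype(str)
origin_data.loc[origin_data['DBNOs'] > dbnos_q99, 'DBNOs_new'] = 'larger'
plt.figure(figsize = (10,6))
sns.countplot(x = origin_data['DBNOs_new'].sort_values())
plt.title('DBNOs')
plt.show()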

Most players knocked down zero enemies.
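The "merge everything above the 99th percentile" step applies to any right-skewed count column, so it can be wrapped in a small helper. This is a sketch, not the author's code: the function name bucket_tail and the toy series are assumptions made for illustration.

```python
import pandas as pd

def bucket_tail(series, q=0.99, label='larger'):
    """Turn a count column into strings, merging values above the q-quantile into one label."""
    threshold = series.quantile(q)
    # Keep the original value (as a string) up to the threshold, otherwise use the label
    return series.astype(str).where(series <= threshold, label)

# Made-up right-skewed counts: mostly zeros with one extreme value
counts = pd.Series([0] * 90 + [1] * 6 + [2] * 3 + [9])
bucketed = bucket_tail(counts)
print(bucketed.value_counts())
```

Here only the single value 9 lies above the 99th percentile, so it is the only one relabelled.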

(2) Distribution of continuous variables

Damage dealt is a continuous variable (total damage minus self-inflicted damage).

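# Distribution of damage dealt (total damage minus self-inflicted damage)
plt.figure(figsize=(10,6))
plt.title("Damage Dealt")
sns.histplot(origin_data['damageDealt'], kde=True, stat='density') # histogram with a KDE curve; replaces the deprecated distplot
plt.show()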

How much damage did players with zero kills deal?

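# Keep only the players with zero kills
data_kill_0 = origin_data[origin_data['kills']==0]
plt.figure(figsize=(10,6))
plt.title("Damage Dealt by 0 killers",fontsize=15)
sns.histplot(data_kill_0['damageDealt'], kde=True, stat='density') # histogram with a KDE curve; replaces the deprecated distplot
plt.show()
#del data_kill_0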

Walking distance is another continuous variable. Start with its mean and 99th percentile.

print('The average person walks for {:.1f}m, 99% of {}m or less, while the marathoner champion walked for {}m.'
      .format(origin_data['walkDistance'].mean(), 
              origin_data['walkDistance'].quantile(0.99), 
              origin_data['walkDistance'].max()
             )
     )

The average person walks for 1154.2m, 99% of 4396.0m or less, 
while the marathoner champion walked for 25780.0m.

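99% of players walked less than 4396 m, while the maximum walking distance is 25780 m. To keep the plot from being excessively right-skewed, the top 1% of runners are left out.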

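# Drop the top 1% of walking distances before plotting
new_data = origin_data[origin_data['walkDistance'] < origin_data['walkDistance'].quantile(0.99)]
plt.figure(figsize=(10,6))
plt.title("The Running Distances")
sns.histplot(new_data['walkDistance'], kde=True, stat='density') # histogram with a KDE curve; replaces the deprecated distplot
plt.show()
del new_data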

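Next, look at the distribution of the target variable winPlacePerc, which takes values between 0 and 1.

# Distribution of the winPlacePerc score, grouped into ten bands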
def winplace_rank(x):
    if x < 0.1:
        return 'rank10'
    elif x < 0.2:
        return 'rank9'
    elif x < 0.3:
        return 'rank8'
    elif x < 0.4:
        return 'rank7'
    elif x < 0.5:
        return 'rank6'
    elif x < 0.6:
        return 'rank5'
    elif x < 0.7:
        return 'rank4'
    elif x < 0.8:
        return 'rank3'
    elif x < 0.9:
        return 'rank2'
    else:
        return 'rank1'

origin_data['winplace_rank'] = origin_data['winPlacePerc'].apply(winplace_rank)
feature_barplot('winplace_rank') # the target variable is roughly evenly distributed

The final placements are spread fairly evenly across the ten bands; there is no obvious imbalance.
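The same ten bands can also be produced without the if/elif ladder. The sketch below uses pd.cut with left-closed bins so that the boundaries behave like the `x < 0.1`, `x < 0.2`, ... comparisons; the sample scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Left-closed bins [0, 0.1), [0.1, 0.2), ..., [0.9, inf) mirror the chain of "<" tests
edges = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, np.inf]
labels = ['rank{}'.format(i) for i in range(10, 0, -1)]  # rank10 ... rank1

scores = pd.Series([0.0, 0.05, 0.1, 0.3, 0.95, 1.0])  # made-up winPlacePerc values
ranks = pd.cut(scores, bins=edges, labels=labels, right=False)
print(ranks.tolist())
```

One difference to keep in mind: a missing winPlacePerc falls through every comparison in the if/elif version and ends up as 'rank1', whereas pd.cut leaves it as a missing value.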

2. Exploratory analysis: two variables

sns.jointplot(x = 'winPlacePerc', y = 'kills', data = origin_data, height=8, ratio = 3)
# height: size of the figure (called size in older seaborn releases)
# ratio: ratio of joint axes height to marginal axes height
# kind: { "scatter" | "reg" | "resid" | "kde" | "hex" }, optional
plt.show()

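If the plot above is hard to read, a box plot shows the relationship between the number of kills and final placement more clearly.

Note:
The distribution of kills is extremely imbalanced, whereas winPlacePerc is roughly balanced.
So when drawing a box plot of the two variables, put kills on the x-axis (independent variable) and winPlacePerc on the y-axis (dependent variable).
With the axes swapped, the relationship is almost invisible (players with 0 kills are so numerous that they drag every box down, and there are many extreme values/outliers), as the figure below shows.

The correct approach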

# Use pd.cut to bin the variable into ranges
# np.inf as the last edge keeps every kill count above 20 inside the top bin
origin_data['kills_rank'] = pd.cut(origin_data['kills'], [-1, 0, 2, 5, 10, 20, np.inf] ,labels = ['0_kills', '1-2_kills', '3-5_kills', '6-10_kills', '11-20_kills', '20+kills'])
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'kills_rank', y = 'winPlacePerc', data = origin_data)
plt.show()
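The numbers behind such a box plot can be inspected directly with a groupby on the binned column. The sketch below assumes a made-up table and an open-ended last bin; it shows that pd.cut bins are right-closed by default, so 2 kills falls in '1-2_kills' and 5 kills in '3-5_kills'.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'kills': [0, 0, 1, 2, 3, 5, 6, 12, 25],
    'winPlacePerc': [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0],
})

bins = [-1, 0, 2, 5, 10, 20, np.inf]  # intervals (-1, 0], (0, 2], ..., (20, inf)
labels = ['0_kills', '1-2_kills', '3-5_kills', '6-10_kills', '11-20_kills', '20+kills']
toy['kills_rank'] = pd.cut(toy['kills'], bins, labels=labels)

# Median placement per kill band: the center line of each box in a box plot
summary = toy.groupby('kills_rank', observed=True)['winPlacePerc'].median()
print(summary)
```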

The relationship between walking distance and final placement:

# height sets the figure size (called size in older seaborn releases)
sns.jointplot(x = 'winPlacePerc', y = 'walkDistance', data = origin_data, height=8, ratio = 3, color = 'g')
plt.show()

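# Count first-place finishes among the zero-kill players
# (data_kill_0 was defined earlier as origin_data[origin_data['kills']==0])
print("{} players have won without a single kill!".format(
      len(data_kill_0[data_kill_0['winPlacePerc']==1])
     ))
# Same count for players who dealt no damage at all
data_damage_0 = origin_data[origin_data['damageDealt'] == 0].copy()
print("{} players have won without dealing damage!".format(
    len(data_damage_0[data_damage_0['winPlacePerc']==1])
     ))

Recorded output (note that 127573 is the number of rows with winPlacePerc == 1 in the whole dataset, which is what the unfiltered expression len(origin_data[origin_data['winPlacePerc']==1]) returns; the zero-kill count comes from the filtered expression above):

127573 players have won without a single kill!
4770 players have won without dealing damage!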

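A point plot shows an estimate of the central tendency of a numeric variable as the position of a point, and uses error bars to indicate the uncertainty of that estimate. Point plots can be more useful than bar plots when the focus is on comparing different levels of one or more categorical variables.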
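What a point plot draws for one category can be reproduced by hand. The sketch below assumes the estimator is the mean and the error bar is a 95% bootstrap percentile interval (which, to my knowledge, matches seaborn's defaults for pointplot); the function name mean_with_ci and the sample values are made up for illustration.

```python
import numpy as np

def mean_with_ci(values, n_boot=1000, ci=95, seed=0):
    """Return (mean, lower, upper) using a bootstrap percentile interval."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement n_boot times and take the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - ci) / 2.0
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high

# Made-up winPlacePerc values for one category
center, low, high = mean_with_ci([0.2, 0.4, 0.5, 0.7, 0.9])
print(center, low, high)
```

The point sits at `center`, and the error bar spans `low` to `high`; categories with few rows get wide bars, which is why the right-hand ends of the plots in this post look unstable.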

The figure below shows the relationship between the number of vehicles destroyed and final placement:

f,ax1 = plt.subplots(figsize =(10,6))
sns.pointplot(x = 'vehicleDestroys', y = 'winPlacePerc', data = origin_data, color='#606060', alpha=0.8)
plt.xlabel('Number of Vehicle Destroys',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Vehicle Destroys/ Win Ratio',fontsize = 20,color='blue')
plt.grid()
plt.show()

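Players who destroyed vehicles finish higher on average than players who did not, and the more vehicles destroyed, the higher the placement (as a PUBG novice I cannot quite explain why; it is rather brute-force, haha).
Final placement can also be compared against the number of kills across a few categories: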

## teamKills: number of teammates killed
#print(origin_data['teamKills'].value_counts())
origin_data['teamKills_rank'] = pd.cut(origin_data['teamKills'], [-1, 0, 13] 
 ,labels = ['No TeamKills', 'Kill Teammates'])
f,ax1 = plt.subplots(figsize =(10,6))
sns.pointplot(x = 'kills_rank', y = 'winPlacePerc', data = origin_data, hue = 'teamKills_rank')
# dodge=True offsets overlapping points so they do not sit on top of each other

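headshotKills is the number of enemies killed with headshots. Splitting players by whether they have any headshot kills shows how the number of kills relates to the final score in each group:

#print(origin_data['headshotKills'].value_counts()) # maximum value is 64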
origin_data['headshotKills_rank'] = pd.cut(origin_data['headshotKills'], [-1, 0, 65] ,labels = ['No headshotKills', 'Did headshotKills'])
f,ax1 = plt.subplots(figsize =(10,6))
sns.pointplot(x = 'kills_rank', y = 'winPlacePerc', data = origin_data, hue = 'headshotKills_rank')

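In general, at the same number of kills, players with headshot kills score slightly higher than players without any.

Relationship between heals, boosts, and the final score: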

new_data = origin_data.copy()
new_data = new_data[new_data['heals'] < new_data['heals'].quantile(0.99)]
new_data = new_data[new_data['boosts'] < new_data['boosts'].quantile(0.99)]
f, ax1 = plt.subplots(figsize = (10, 6))
sns.pointplot(x = 'heals', y = 'winPlacePerc', data = new_data, color = 'lime', alpha = 0.8)
sns.pointplot(x = 'boosts', y = 'winPlacePerc', data = new_data, color = 'blue', alpha = 0.8)
plt.text(5, 0.45, 'Heals', color = 'lime', style = 'italic')
plt.text(5, 0.5, 'Boosts', color = 'blue', style = 'italic')
plt.xlabel('Number of heal/boost items',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Heals vs Boosts',fontsize = 20,color='blue')
plt.grid()
plt.show()
del new_data

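Using boost items (boosts) improves final placement;
With fewer than 4 healing items (first-aid kits and the like), players who use more of them finish higher, but using ever more healing items does not keep improving placement: it levels off at 3 to 4 items.

Next, split the data by match mode and look at how final placement relates to each variable within each mode.

If a match has more than 50 groups, treat it as solos
If it has more than 25 and at most 50 groups, treat it as duos
If it has 25 or fewer groups, treat it as squads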

solos = origin_data[origin_data['numGroups']>50]
duos = origin_data[(origin_data['numGroups']>25) & (origin_data['numGroups']<=50)]
squads = origin_data[origin_data['numGroups']<=25]
# Each row is one player, so these counts are player rows per mode.
# 100.0 forces float division, so the percentages are not truncated to whole numbers under Python 2.
print("There are {} ({:.2f}%) solo games, {} ({:.2f}%) duo games and {} ({:.2f}%) squad games."
      .format(len(solos), 
              100.0*len(solos)/len(origin_data), 
              len(duos), 
              100.0*len(duos)/len(origin_data), 
              len(squads), 
              100.0*len(squads)/len(origin_data)
             )
     )

There are 709111 (15.95%) solo games, 3295326 (74.10%) duo games and 442529 (9.95%) squad games.
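The three numGroups thresholds can also be expressed as a single binning step that adds a mode label to every row. This is a sketch of that heuristic only; the helper name mode_from_num_groups and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def mode_from_num_groups(num_groups):
    """Label rows by the numGroups heuristic: <=25 squad, 26-50 duo, >50 solo."""
    return pd.cut(num_groups, bins=[0, 25, 50, np.inf],
                  labels=['squad', 'duo', 'solo'])

# Made-up numGroups values, including the boundary cases 25/26 and 50/51
num_groups = pd.Series([1, 25, 26, 50, 51, 100])
modes = mode_from_num_groups(num_groups)
print(modes.tolist())
```

Since the dataset also carries a matchType column, cross-tabulating the heuristic label against matchType would show how often the two disagree.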

Relationship between final placement and the number of kills in each mode:

f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='kills',y='winPlacePerc',data=solos,color='black',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=duos,color='#CC0000',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=squads,color='#3399FF',alpha=0.8)
plt.text(37,0.6,'Solos',color='black',fontsize = 17,style = 'italic')
plt.text(37,0.55,'Duos',color='#CC0000',fontsize = 17,style = 'italic')
plt.text(37,0.5,'Squads',color='#3399FF',fontsize = 17,style = 'italic')
plt.xlabel('Number of kills',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Solo vs Duo vs Squad Kills',fontsize = 20,color='blue')
plt.grid()
plt.show()

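As the figure shows, in solo matches final placement rises fastest as kills increase (extremely high kill counts are either extraordinary luck or cheating, so set them aside and consider only a normal range, say under 30). Duos come next, and in squad matches placement rises most slowly with kills. In solo and duo matches, 7 or more kills is generally enough to finish in the top 10% (mean winPlacePerc above 0.9). In squad matches, placement improves steadily with kills up to 7, after which the relationship between placement and kills becomes unstable.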
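The "rises faster in solos than in squads" reading can be checked numerically by tabulating the mean placement per kill count and mode, which is what each point in the plot represents. A sketch on made-up numbers (the column name game_mode and all values are invented for illustration):

```python
import pandas as pd

# Made-up rows: two modes, kill counts 0 and 1
toy = pd.DataFrame({
    'game_mode': ['solo', 'solo', 'solo', 'solo', 'squad', 'squad', 'squad', 'squad'],
    'kills': [0, 0, 1, 1, 0, 0, 1, 1],
    'winPlacePerc': [0.2, 0.4, 0.6, 0.8, 0.3, 0.5, 0.4, 0.6],
})

# Rows: kill count, columns: mode, cells: mean winPlacePerc
mean_by_kills = (toy.groupby(['game_mode', 'kills'])['winPlacePerc']
                    .mean()
                    .unstack('game_mode'))
# Gain in mean placement when going from 0 kills to 1 kill, per mode
gain_per_kill = mean_by_kills.loc[1] - mean_by_kills.loc[0]
print(mean_by_kills)
print(gain_per_kill)
```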

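The next figure shows how final placement relates to the number of enemies knocked down (DBNOs), the number of assists (enemies damaged and then killed by teammates), and the number of teammates revived: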

f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='DBNOs',y='winPlacePerc',data=duos,color='#CC0000',alpha=0.8)
sns.pointplot(x='DBNOs',y='winPlacePerc',data=squads,color='#3399FF',alpha=0.8)
sns.pointplot(x='assists',y='winPlacePerc',data=duos,color='#FF6666',alpha=0.8)
sns.pointplot(x='assists',y='winPlacePerc',data=squads,color='#CCE5FF',alpha=0.8)
sns.pointplot(x='revives',y='winPlacePerc',data=duos,color='#660000',alpha=0.8)
sns.pointplot(x='revives',y='winPlacePerc',data=squads,color='#000066',alpha=0.8)
plt.text(14,0.5,'Duos - Assists',color='#FF6666',fontsize = 12,style = 'italic')
plt.text(14,0.45,'Duos - DBNOs',color='#CC0000',fontsize = 12,style = 'italic')
plt.text(14,0.4,'Duos - Revives',color='#660000',fontsize = 12,style = 'italic')
plt.text(14,0.35,'Squads - Assists',color='#CCE5FF',fontsize = 12,style = 'italic')
plt.text(14,0.3,'Squads - DBNOs',color='#3399FF',fontsize = 12,style = 'italic')
plt.text(14,0.25,'Squads - Revives',color='#000066',fontsize = 12,style = 'italic')
plt.xlabel('Number of DBNOs/Assists/Revives',fontsize = 10,color='blue')
plt.ylabel('Win Percentage',fontsize = 10,color='blue')
plt.title('Duo vs Squad DBNOs, Assists, and Revives',fontsize = 15,color='blue')
plt.grid()
plt.show()
del solos
del duos
del squads

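3. Exploratory analysis: multiple variables (heatmap + pairplot)

Meaning of some of the heatmap parameters:

cmap: the mapping from data values to color space; a matplotlib colormap name or object, or a list of colors. If it is not given, the default depends on whether center is set.
center: the value at which to center the colormap when the data is divergent. Changing center shifts how light or dark the whole figure looks. When center is set, the colormap is re-centered around it, so the color range implied by vmin and vmax may shift.
annot (short for annotate): default False. If True, the data value is written in each cell; if a matrix is passed, the value at the corresponding position of that matrix is written in each cell.
square: whether to force each cell of the heatmap to be square; default False.
fmt: string formatting code for the numbers written in the cells, for example how many decimal places to keep.
annot_kws: a dict of keyword arguments passed to the matplotlib text calls that draw the annotations (font size, color, font, and so on); default None.
cbar: whether to draw a colorbar beside the heatmap; default True.
cbar_kws: keyword arguments for the colorbar; default None.
cbar_ax: the axes in which to draw the colorbar; default None.
xticklabels, yticklabels: xticklabels controls the column labels and yticklabels controls the row labels. The default is auto. If True, the DataFrame's column names are used as labels. If False, no labels are drawn. If a list, the labels are taken from the list. If an integer K, every K-th label is drawn. If auto, the spacing is chosen automatically so that the labels (some or all of them) are drawn without overlapping.

Below, origin_data.corr() is used to obtain the correlation matrix of all pairs of variables.
From the top k entries ranked by correlation with the target, the target itself is dropped and the remaining k-1 variables are shown in a heatmap.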

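k = 10
# Correlate numeric columns only; the helper columns added above hold strings/categories
corr = origin_data.select_dtypes(include='number').corr()
# winPlacePerc ranks first against itself, so index[1:k] keeps the k-1 most correlated features
cols = corr.nlargest(k, 'winPlacePerc')['winPlacePerc'].index[1:k]
k_corr = np.corrcoef(origin_data[cols].values.T)
sns.set(font_scale=1.25)
f,ax = plt.subplots(figsize=(15, 15))
sns.heatmap(k_corr, cbar = True, annot = True, square = True, fmt = '.2f', center =0.6
           ,annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()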
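The feature-selection step behind the heatmap can be isolated and tried on a small table. The sketch below uses a made-up DataFrame and a hypothetical helper top_k_correlated; it restricts the correlation to numeric columns and drops the target from its own ranking.

```python
import pandas as pd

def top_k_correlated(df, target, k):
    """Return the correlations of the k-1 features ranked highest against target."""
    corr = df.select_dtypes(include='number').corr()
    ranked = corr.nlargest(k, target)[target]  # target itself comes first (correlation 1)
    return ranked.drop(target)

# Made-up data: x1 tracks y closely, x2 is perfectly anti-correlated, x3 is weakly related
toy = pd.DataFrame({
    'y':  [0, 1, 2, 3, 4, 5],
    'x1': [0, 1, 2, 3, 5, 4],
    'x2': [5, 4, 3, 2, 1, 0],
    'x3': [1, 0, 1, 0, 1, 0],
    'label': list('abcdef'),  # non-numeric column, ignored by select_dtypes
})
top = top_k_correlated(toy, 'y', k=3)
print(top)
```

Because nlargest ranks by signed correlation, x2 is never selected even though it is the most strongly (negatively) related column; ranking by absolute correlation would be an alternative if negative relationships matter.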

Pair plots

new_data = origin_data.loc[:,['weaponsAcquired','DBNOs','kills','matchType']]
sns.pairplot(new_data,hue='matchType')

new_data_0 = origin_data.loc[:,['DBNOs','heals','boosts','walkDistance']]
sns.pairplot(new_data_0)
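A pair plot draws one scatter plot per pair of columns, so on several million rows it can be slow. One option, sketched here as an assumption rather than something the original post does, is to plot a fixed-size random sample; the table below is randomly generated stand-in data.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data, not the competition data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'DBNOs': rng.integers(0, 5, size=10000),
    'heals': rng.integers(0, 10, size=10000),
    'walkDistance': rng.uniform(0, 4000, size=10000),
})

# Draw a reproducible random sample of rows, without replacement
sample = toy.sample(n=1000, random_state=42)
print(sample.shape)
# The sample can then be passed to the plotting call, e.g. sns.pairplot(sample)
```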

Note:

Most of the plotting approaches in this post come from work shared by Dimitrios Effrosynidis.
Corrections are welcome.
