Python Study Notes (6): Pandas Data Analysis in Practice, Based on the Kaggle Video Game Sales Dataset

I. Getting to Know the Dataset

Size: 16,598 records in total
Source: Video Games Sales
Fields:

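Field	Meaning
Rank	Sales rank of the game
Name	Game title
Platform	Release platform
Year	Release year
Genre	Game genre
Publisher	Publisher
NA_Sales	Sales in North America (in millions)
EU_Sales	Sales in Europe (in millions)
JP_Sales	Sales in Japan (in millions)
Other_Sales	Sales in other regions (in millions)
Global_Sales	Total worldwide sales (in millions)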

II. Data Loading and Preprocessing

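import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
os.chdir('C:/Users/dell/Desktop')  # folder that contains vgsales.csv; adjust to your machine
plt.style.use('ggplot')  # use the ggplot style
na_values=['N/A']  # missing values are recorded as N/A
df=pd.read_csv('vgsales.csv',na_values=na_values)
df=df.dropna(how='any',axis=0)  # drop every row that has a missing value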
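
Before dropping rows, it can be useful to see how many values are actually missing in each column. The following is a minimal sketch on a tiny made-up sample (the rows and the names `csv_text` and `sample` are illustrative and not taken from vgsales.csv); it parses the N/A markers in the same way and then counts them.

```python
import io
import pandas as pd

# Tiny made-up sample laid out like vgsales.csv (illustrative rows only)
csv_text = """Rank,Name,Platform,Year,Genre,Publisher,Global_Sales
1,Game A,Wii,2006,Sports,Pub X,10.0
2,Game B,NES,N/A,Platform,Pub Y,5.0
3,Game C,DS,2008,Action,N/A,2.5
"""

sample = pd.read_csv(io.StringIO(csv_text), na_values=['N/A'])

# Count missing values per column before deciding to drop rows
missing_per_column = sample.isna().sum()

# Drop every row that has at least one missing value
cleaned = sample.dropna(how='any', axis=0)

print(missing_per_column)
print(len(sample), '->', len(cleaned))
```

If only a few rows are affected, as in this toy case, dropping them is a simple choice; with many missing values a different strategy may be preferable.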

III. Descriptive Statistics

(1) Number of titles by genre, platform, and publisher

types=df['Genre'].value_counts()   # number of titles per genre, sorted in descending order

types.plot(kind='bar',alpha=0.6,color='blue',figsize=(12,8),
           legend=False,
           title='The counts of all Genres')

# add a data label above each bar
for index,count in enumerate(types):
    plt.text(index,count+50,count,ha='center',va='bottom')

plt.show()

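As the chart shows, action games are the most numerous, followed by sports games.

In the same way we can count the titles per platform and per publisher. Here the top five of each are shown.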

f,ax=plt.subplots(1,3,figsize=(15,6))
# value_counts() sorts in descending order, so the first five entries are the top five
types=df['Genre'].value_counts()[:5]
platform=df['Platform'].value_counts()[:5]
publisher=df['Publisher'].value_counts()[:5]

types.plot(kind='bar',
           alpha=0.6,color='blue',
           title='The counts of top 5 types',ax=ax[0])
for index,count in enumerate(types):      # data labels for the genre chart
    ax[0].text(index,count+50,count,ha='center',va='center')

platform.plot(kind='bar',
           alpha=0.6,color='red',
           title='The counts of top 5 Platforms',ax=ax[1])
for index,count in enumerate(platform):   # data labels for the platform chart
    ax[1].text(index,count+30,count,ha='center',va='center')

publisher.plot(kind='bar',
           alpha=0.6,color='green',
           title='The counts of top 5 Publisher',ax=ax[2])
for index,count in enumerate(publisher):  # data labels for the publisher chart
    ax[2].text(index,count+20,count,ha='center',va='center')

plt.show()

The platforms with the most titles are DS and PS2, and the publishers with the most titles are Electronic Arts and Activision.
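
The top-five selection relies on `value_counts()` returning counts sorted from largest to smallest. A small sketch on made-up genre labels (the values and the names `genres`, `top2`, and `labels` are illustrative assumptions) shows the ordering and how a bar position can be paired with each count for labelling.

```python
import pandas as pd

# Made-up genre labels, for illustration only
genres = pd.Series(
    ['Action', 'Sports', 'Action', 'Racing', 'Action', 'Sports'],
    name='Genre',
)

# Counts come back sorted in descending order
counts = genres.value_counts()

# Keep the two most frequent categories
top2 = counts.head(2)

# Pair each bar position with its category and count, as needed for text labels
labels = [(pos, name, int(n)) for pos, (name, n) in enumerate(top2.items())]

print(labels)
```

Iterating over the counts directly avoids depending on the column name that `value_counts()` produces, which differs between pandas versions.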

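(2) Most popular genres, platforms, and publishers
Having the most titles does not necessarily mean being the most popular. Here we put sales on the vertical axis to see which genres, platforms, and publishers sell best. To keep the charts readable, only the top five of each are shown.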

f,ax=plt.subplots(1,3,figsize=(15,6))
# group, sum the global sales, sort in descending order, and keep the top five
# (the chains are wrapped in parentheses so they can span several lines)
types=(df.groupby('Genre')
         .agg({'Global_Sales':'sum'})
         .sort_values(by='Global_Sales',ascending=False)[:5])

platform=(df.groupby('Platform')
            .agg({'Global_Sales':'sum'})
            .sort_values(by='Global_Sales',ascending=False)[:5])

publisher=(df.groupby('Publisher')
             .agg({'Global_Sales':'sum'})
             .sort_values(by='Global_Sales',ascending=False)[:5])

font={
    'family':'DejaVu Sans',
    'weight':'normal',
    'size':12
}
types.plot(kind='bar',
           alpha=0.6,color='blue',
           title='The Sales of top 5 Game Genres',
           ax=ax[0],legend=False)
ax[0].set_ylabel('Genre_Sales',fontdict=font)

platform.plot(kind='bar',
           alpha=0.6,color='red',
           title='The Sales of top 5 Game Platforms',
           ax=ax[1],legend=False)
ax[1].set_ylabel('Platform_Sales',fontdict=font)

publisher.plot(kind='bar',
           alpha=0.6,color='green',
           title='The Sales of top 5 Game Publisher',
           ax=ax[2],legend=False)
ax[2].set_ylabel('Publisher_Sales',fontdict=font)

plt.subplots_adjust(wspace=0.3)  # adjust the horizontal spacing between subplots
plt.show()

In terms of sales, the most popular genre is action, the most popular platform is PS2, and the most popular publisher is Nintendo.
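
That a ranking by number of titles and a ranking by sales can disagree is easy to see on a toy example. The sketch below uses made-up numbers (the frame `toy` and its values are assumptions for illustration, not figures from the dataset).

```python
import pandas as pd

# Made-up records: many low-selling Action titles, few high-selling Platform titles
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Action', 'Platform', 'Platform'],
    'Global_Sales': [1.0, 2.0, 1.5, 30.0, 10.0],
})

# Ranking by number of titles
by_count = toy['Genre'].value_counts()

# Ranking by total sales
by_sales = (toy.groupby('Genre')['Global_Sales']
               .sum()
               .sort_values(ascending=False))

print(by_count.index[0], by_sales.index[0])
```

Here the genre with the most titles is not the one with the highest total sales, which is why the two views are plotted separately.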

(3) Most popular genres, platforms, and publishers by region

data=df.pivot_table(index='Genre',
                    values=['JP_Sales','EU_Sales',
                    'NA_Sales','Global_Sales'],
                    aggfunc='sum')

# share of each region in global sales
data['NA_prop']=data['NA_Sales']/data['Global_Sales']
data['JP_prop']=data['JP_Sales']/data['Global_Sales']
data['EU_prop']=data['EU_Sales']/data['Global_Sales']

f,ax=plt.subplots(figsize=(12,8))
index=np.arange(len(data))
minColor = (31/256,78/256,95/256)
midColor = (121/256,168/256,169/256)
maxColor = (170/256,207/256,208/256)

# draw the stacked bar chart
plt.bar(index,data.NA_prop,color=minColor)
plt.bar(
        index,data.JP_prop,
        bottom=data.NA_prop,               # JP sits on top of NA
        color=midColor
        )
plt.bar(
        index,data.EU_prop,
        bottom=data.NA_prop+data.JP_prop,  # EU sits on top of NA + JP
        color=maxColor
        )
font={
    'family':'DejaVu Sans',
    'weight':'normal',
    'size':12
}
plt.xticks(index,data.index,rotation=90)
plt.title('The Proportion of Different Areas',font)
plt.ylabel('Proportion',font)
plt.legend(['NA_Sales','JP_Sales','EU_Sales'],
           loc='upper center',ncol=3,framealpha=0.6)
plt.show()

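North America accounts for a large share of sales in every genre, whereas Japan has a notably large share only for role-playing games. The same method gives the regional sales distribution for platforms and publishers.

The platforms X360, GBA, and XB are more popular in North America, while PS4 and PC are more popular in Europe.

The publishers Activision, THQ, and Microsoft Game Studios are more popular in North America, while Namco Bandai Games is more popular in Japan.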
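
Since the same computation is repeated for genres, platforms, and publishers, it could be wrapped in a small helper. The sketch below is one possible way to do that; `region_share` is a hypothetical name, and the toy frame contains made-up numbers. It also returns the `bottom` offsets for a stacked bar chart, where each segment starts at the sum of the shares drawn before it.

```python
import pandas as pd

def region_share(frame, by):
    """Share of NA/JP/EU sales in global sales for each group of `by`."""
    regions = ['NA_Sales', 'JP_Sales', 'EU_Sales']
    totals = frame.groupby(by)[regions + ['Global_Sales']].sum()
    share = totals[regions].div(totals['Global_Sales'], axis=0)
    # Offset of each stacked segment: cumulative share minus the segment itself
    bottoms = share.cumsum(axis=1) - share
    return share, bottoms

# Made-up sales figures, for illustration only
toy = pd.DataFrame({
    'Platform': ['A', 'A', 'B'],
    'NA_Sales': [2.0, 2.0, 1.0],
    'JP_Sales': [1.0, 1.0, 6.0],
    'EU_Sales': [1.0, 1.0, 2.0],
    'Global_Sales': [5.0, 5.0, 10.0],
})

share, bottoms = region_share(toy, 'Platform')
print(share)
print(bottoms)
```

Calling it with `'Genre'`, `'Platform'`, or `'Publisher'` on the full data would give the three tables behind the regional charts, under the assumption that the column names match those listed in section I.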

IV. Time Series Analysis

(1) Global sales over time

time=df.groupby('Year').agg({'Global_Sales':'sum'})

time.plot(alpha=0.6,figsize=(12,8),
         legend=False,color='blue')

plt.xticks(np.arange(time.index.min(),   # one tick per year, last year included
           time.index.max()+1),
           rotation=90)

font1={'family' : 'Times New Roman',
'weight' : 'normal',
'size'   : 12,
}
plt.title('The Global_Sales Changing Chart',font1)
plt.ylabel('Global_Sales',font1)

peak_year=time['Global_Sales'].idxmax()      # year with the highest sales (a scalar)
peak_sales=int(time['Global_Sales'].max())   # highest yearly sales, in millions

plt.vlines(peak_year,0,peak_sales,           # add a vertical line at the peak
           linestyles='--',colors='red')

plt.annotate('The Highest Sales is {} million'.format(peak_sales),  # label the peak
             xy=(peak_year,peak_sales),
             xytext=(peak_year+1,peak_sales+15),
             arrowprops=dict(color='red',headwidth=8,headlength=8),
             family='fantasy')
plt.show()

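As the chart shows, global game sales rose overall in the years up to 2008, peaked in 2008 at 678 million units, and then fell sharply as the video game market gradually declined.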
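
The peak year and its value can be read off programmatically with `idxmax()` and `max()`. The sketch below uses made-up yearly totals (the numbers and the name `yearly` are illustrative assumptions) and shows that `idxmax()` on a whole DataFrame returns one label per column, while on a single column it returns the year itself.

```python
import pandas as pd

# Made-up yearly totals (millions of units), for illustration only
yearly = pd.DataFrame(
    {'Global_Sales': [10.0, 25.0, 40.0, 30.0]},
    index=pd.Index([2001, 2002, 2003, 2004], name='Year'),
)

# On a DataFrame, idxmax() returns a Series with one label per column
per_column = yearly.idxmax()

# On a single column, it returns a scalar label
peak_year = yearly['Global_Sales'].idxmax()
peak_sales = yearly['Global_Sales'].max()

print(peak_year, peak_sales)
```

A scalar year is what plotting calls such as a vertical line or an annotation position expect, so selecting the column first keeps those calls simple.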

(2) Regional sales shares over time (North America, Japan, Europe)

# aggregate the regional sales by year
data=df.pivot_table(index='Year',
                    values=['JP_Sales','EU_Sales',
                    'NA_Sales','Global_Sales'],
                    aggfunc='sum')
# compute each region's share of global sales
data['NA_prop']=data['NA_Sales']/data['Global_Sales']
data['JP_prop']=data['JP_Sales']/data['Global_Sales']
data['EU_prop']=data['EU_Sales']/data['Global_Sales']

f,ax=plt.subplots(figsize=(12,8))
index=np.arange(len(data))
minColor = (117/256,79/256,68/256)
midColor = (236/256,115/256,87/256)
maxColor = (253/256,214/256,146/256)

plt.bar(index,data.NA_prop,color=minColor)
plt.bar(
        index,data.JP_prop,
        bottom=data.NA_prop,               # bottom sets where this bar starts: on top of NA
        color=midColor
        )
plt.bar(
        index,data.EU_prop,
        bottom=data.NA_prop+data.JP_prop,  # EU starts on top of NA + JP
        color=maxColor
        )
font={
    'family':'DejaVu Sans',
    'weight':'normal',
    'size':12
}
plt.xticks(index,data.index,rotation=90)
plt.title('The Proportion of Different Areas',font)
plt.ylabel('Proportion',font)
plt.legend(['NA_Sales','JP_Sales','EU_Sales'],loc='upper center',ncol=3,framealpha=0.6)
plt.show()

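The stacked bar chart shows that global video game sales come mainly from North America. Between 1983 and 1995 Japan held a large share of the market, after which its share faded; from 1996 onward, Europe's share grew steadily. (For 2017 and 2020, global sales come only from Japan and North America, which suggests that the data for the most recent years is incomplete.)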
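
One way to check the suspicion about incomplete recent years is to count the records per year and list the years that are missing altogether. The following is a rough sketch on a made-up Year column (the values stand in for `df['Year']` and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up Year column with a gap, standing in for df['Year'] (illustrative values)
years = pd.Series([2015.0, 2015.0, 2016.0, 2016.0, 2016.0, 2017.0, 2020.0])

# Number of records per year: a sudden drop hints at incomplete coverage
records_per_year = years.astype(int).value_counts().sort_index()

# Years absent from an otherwise consecutive range
present = records_per_year.index.to_numpy()
full_range = np.arange(present.min(), present.max() + 1)
missing_years = np.setdiff1d(full_range, present).tolist()

print(records_per_year)
print(missing_years)
```

Years with very few records, or years missing from the range, are candidates for exclusion before fitting a time series model.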

(3) Modeling and forecasting

1. Keep the years consecutive by dropping the 2020 data, then difference the series

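new_data=df.groupby('Year').agg({'Global_Sales':'sum'})
new_data=new_data.drop(new_data.index[38])   # position 38 is the last row, the year 2020

From the time series plot in (1), the series is not stationary. We therefore take the first difference, run a stationarity test on it, and plot the differenced series.

from statsmodels.tsa.stattools import adfuller as ADF
diff_data=new_data.diff().dropna()   # first difference; drop the leading NaN
print('ADF test p-value of the first-differenced series: {}'.format(ADF(diff_data['Global_Sales'])[1]))
diff_data.plot()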

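The ADF test gives a p-value below 0.05, so there is sufficient evidence to reject the null hypothesis of a unit root, and the differenced series can be regarded as stationary. The plot of the differenced series also shows it fluctuating roughly around 0.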

2. Plot the ACF and PACF to choose a preliminary model order

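from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plot_acf(diff_data['Global_Sales'],lags=15)    # autocorrelation of the differenced series
plot_pacf(diff_data['Global_Sales'],lags=15)   # partial autocorrelation of the differenced series
plt.show()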

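Both the autocorrelation plot and the partial autocorrelation plot tail off, so an ARMA(1,1) model can be fitted.

3. Fitting and forecasting

Here only the differenced series is fitted. (Strictly speaking, the differencing should be undone before forecasting.)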

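# the old statsmodels.tsa.arima_model.ARMA class is no longer available in recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(diff_data['Global_Sales'],order=(1,0,1)).fit()   # ARMA(1,1) is ARIMA with d=0
predict_ts = model.predict()   # in-sample predictions for the differenced series
plt.plot(diff_data.index,diff_data['Global_Sales'].values,label='Origin_diff')
plt.plot(diff_data.index,predict_ts.values,label='Predict_diff')
plt.legend(loc='best')
plt.show()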

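As for undoing the differencing of the ARMA model, I tried several tutorials found online, but every one of them ended up producing NaN. I also ran the forecast in R, and the predicted sales for the following years came out negative. (Possibly the data for the most recent years in the original dataset is simply incomplete: global sales were 70.9 million in 2016 but only 0.05 million in 2017, so the data looks anomalous.)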
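
For reference, a first difference can be undone with a cumulative sum plus the first original value. The sketch below uses a made-up series (the numbers and variable names are illustrative assumptions). It also shows one common way to end up with NaN, namely adding two Series whose indexes do not match; whether that was the cause here is only a guess.

```python
import numpy as np
import pandas as pd

# Made-up yearly series, for illustration only
series = pd.Series([10.0, 12.0, 15.0, 11.0], index=[2000, 2001, 2002, 2003])

# First difference, as used for the model input
diff = series.diff().dropna()

# Undo the differencing: cumulative sum of the differences plus the first value
restored_tail = diff.cumsum() + series.iloc[0]
restored = pd.concat([series.iloc[:1], restored_tail])

# Pitfall: pandas aligns on the index, so mismatched labels produce NaN everywhere
misaligned = diff.cumsum().reset_index(drop=True) + series

print(restored)
print(misaligned)
```

The same idea would apply to predicted differences: accumulate them and add the last observed level, keeping the indexes aligned throughout.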
