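Python Data Analysis: Kobe Bryant's Career

Reading Notes

Not long ago, Lakers star Kobe Bryant was tragically killed. For countless fans this was a bolt from the blue. He defied the odds and still could not change his fate, but fate never managed to change him either, and the Mamba Mentality will live on. With the arrival of the big-data era, almost anything seems to connect to the words "big data". Big-data analysis has long been widely applied to athletes' career planning, medicine, finance, and other fields. In this article we use Python to analyze Kobe's career along several dimensions, as a tribute to "the Boss"!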

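1. Background

It was the early hours of January 27, 2020, and I couldn't sleep. After tossing and turning in bed until 4 a.m., I unlocked my phone, squinted at the glaring screen, and started scrolling Weibo, only to come across shocking news: Kobe Bryant had died in an accident. Normally I would have reported such a post on the spot as clickbait that deserved it, but more and more similar posts kept appearing, until I finally saw the official announcement.

As I wrote in my post: I have never seen Los Angeles at 4 a.m., but at 4 a.m. I learned of your passing. 1978-2020.

As fans, all we can do is mourn and remember: spread no rumors, and do not exploit the "Mamba Mentality".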

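1. Data Acquisition

Source: a dataset of Kobe Bryant's shots over his roughly twenty-year career, provided by the NBA (the dataset is fairly large, about 30,000 rows).

2. Data Processing

Browsing the file, it is easy to spot many missing values. The crude approach is to delete every row that contains a null, but that comes at the cost of sample completeness and prediction accuracy, so it should be used with care.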
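Before deleting anything, it helps to see where the nulls actually are. The snippet below is a minimal sketch on a tiny made-up table (the values are invented for illustration; only the column names mirror the dataset). It counts nulls per column and then drops only the rows whose label is missing, instead of every row with any null.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the shot log (values are made up).
shots = pd.DataFrame({
    "shot_id": [1, 2, 3, 4, 5],
    "shot_distance": [18, 15, 0, 22, 14],
    "shot_made_flag": [np.nan, 0.0, 1.0, np.nan, 1.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = shots.isnull().sum()
print(missing_per_column)

# Drop only the rows whose label is missing; other columns are left untouched.
labeled = shots.dropna(subset=["shot_made_flag"]).reset_index(drop=True)
print(labeled)
```

Restricting `dropna` to a `subset` keeps rows that are complete in the columns that matter, which is usually gentler than a blanket `dropna()`.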

First, we run a simple outlier check on the shot distance, shown here as a box plot.

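# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt  # plotting library

catering_sale = '2.csv'
data = pd.read_csv(catering_sale, index_col='shot_id')  # read the data, using the "shot_id" column as the index

plt.rcParams['font.sans-serif'] = ['SimHei']  # needed to render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # needed to render the minus sign correctly

plt.figure()  # create the figure
p = data.boxplot(column='shot_distance', return_type='dict')  # draw the box plot with the DataFrame method
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier points
y = p['fliers'][0].get_ydata()
y.sort()  # sort ascending; this modifies the array in place
print('{} records in total, {} of them outliers'.format(len(data), len(y)))

# Annotate the outliers with annotate().
# Labels of nearby points overlap and become hard to read, so the text offset is nudged a little.
for i in range(len(x)):
    if i > 0 and y[i] != y[i - 1]:  # skip the spacing trick for equal values to avoid dividing by zero
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i - 1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()  # display the box plot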

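We get the following result:

According to the plot, this column contains 68 outliers. Here the rows containing these outliers are deleted, and the other columns are handled the same way.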
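The deletion step itself is not shown above, so here is a minimal sketch of one way to do it. It assumes the usual box-plot convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The helper name `drop_iqr_outliers` and the demo values are my own, not part of the original script.

```python
import pandas as pd


def drop_iqr_outliers(frame, column, whisker=1.5):
    """Return the rows of `frame` whose `column` value lies inside the box-plot whiskers."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return frame[frame[column].between(lower, upper)]


# Made-up distances: one half-court heave among ordinary shots.
demo = pd.DataFrame({"shot_distance": [0, 5, 10, 12, 15, 18, 20, 22, 25, 79]})
cleaned = drop_iqr_outliers(demo, "shot_distance")
print(len(demo), "->", len(cleaned))
```

Whether long-distance heaves should really be thrown away is a judgment call: they are rare, but they are genuine shots.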

3. Data Integration

We load the data, then combine fields and add new columns as needed.

import pandas as pd


allData = pd.read_csv('data.csv')
data = allData[allData['shot_made_flag'].notnull()].reset_index()

# Add new columns
data['game_date_DT'] = pd.to_datetime(data['game_date'])
data['dayOfWeek'] = data['game_date_DT'].dt.dayofweek
data['dayOfYear'] = data['game_date_DT'].dt.dayofyear
data['secondsFromPeriodEnd'] = 60 * data['minutes_remaining'] + data['seconds_remaining']
data['secondsFromPeriodStart'] = 60 * (11 - data['minutes_remaining']) + (60 - data['seconds_remaining'])
data['secondsFromGameStart'] = (data['period'] <= 4).astype(int) * (data['period'] - 1) * 12 * 60 + (
        data['period'] > 4).astype(int) * ((data['period'] - 4) * 5 * 60 + 3 * 12 * 60) + data['secondsFromPeriodStart']

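'''
Where:
secondsFromPeriodEnd    seconds remaining until the end of the current period
secondsFromPeriodStart  seconds elapsed since the start of the current period
secondsFromGameStart    seconds elapsed since the start of the game
'''

# Sanity-check the data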
print(data.loc[:10, ['period', 'minutes_remaining', 'seconds_remaining', 'secondsFromGameStart']])

Running it gives the following result:

Everything looks normal so far.

    period  minutes_remaining  seconds_remaining  secondsFromGameStart
0        1                 10                 22                    98
1        1                  7                 45                   255
2        1                  6                 52                   308
3        2                  6                 19                  1061
4        3                  9                 32                  1588
5        3                  8                 52                  1628
6        3                  6                 12                  1788
7        3                  3                 36                  1944
8        3                  1                 56                  2044
9        1                 11                  0                    60
10       1                  7                  9                   291
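To double-check the arithmetic, the same quantity can be computed for a single shot with a small standalone function. This is a sketch of my own (the function and constant names are not from the script above), assuming 12-minute regulation periods and 5-minute overtime periods. It reproduces the values in the table.

```python
REGULATION_PERIODS = 4
REGULATION_PERIOD_SECONDS = 12 * 60
OVERTIME_PERIOD_SECONDS = 5 * 60


def seconds_from_game_start(period, minutes_remaining, seconds_remaining):
    """Seconds elapsed since tip-off for a shot taken at the given game clock."""
    remaining = 60 * minutes_remaining + seconds_remaining
    if period <= REGULATION_PERIODS:
        period_length = REGULATION_PERIOD_SECONDS
        elapsed_before = (period - 1) * REGULATION_PERIOD_SECONDS
    else:
        period_length = OVERTIME_PERIOD_SECONDS
        elapsed_before = (REGULATION_PERIODS * REGULATION_PERIOD_SECONDS
                          + (period - REGULATION_PERIODS - 1) * OVERTIME_PERIOD_SECONDS)
    return elapsed_before + (period_length - remaining)


# (period, minutes_remaining, seconds_remaining) taken from rows 0, 3 and 4 of the table.
rows = [(1, 10, 22), (2, 6, 19), (3, 9, 32)]
values = [seconds_from_game_start(*row) for row in rows]
print(values)
```

For example, a shot in period 2 with 6:19 left is 12 * 60 seconds of period 1 plus 720 - 379 = 341 seconds of period 2, which gives 1061.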


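Plotting Shot Attempts

We plot the shot attempts as a function of the time elapsed since the start of the game.

We will use the matplotlib package here.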

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


plt.rcParams['figure.figsize'] = (16, 16)
plt.rcParams['font.size'] = 16
binsSizes = [24, 12, 6]
plt.figure()

for k, binSizeInSeconds in enumerate(binsSizes):
    timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
    attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)

    maxHeight = max(attemptsAsFunctionOfTime) + 30
    barWidth = 0.999 * (timeBins[1] - timeBins[0])
    plt.subplot(len(binsSizes), 1, k + 1)
    plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
    plt.title(str(binSizeInSeconds) + ' second time bins')
    plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
                  4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
    plt.xlim((-20, 3200))
    plt.ylim((0, maxHeight))
    plt.ylabel('attempts')
plt.xlabel('time [seconds from start of game]')
plt.show()

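Let's look at the result:

It can be seen that Kobe's number of shot attempts tends to rise as the game goes on.

Plotting the Accuracy Comparison

Here we draw a comparison to get a sense of Kobe's shooting accuracy.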

# Plot the shot accuracy as a function of time during the game.
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 16

binSizeInSeconds = 20
timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)
madeAttemptsAsFunctionOfTime, b = np.histogram(data.loc[data['shot_made_flag'] == 1, 'secondsFromGameStart'],
                                               bins=timeBins)
attemptsAsFunctionOfTime[attemptsAsFunctionOfTime < 1] = 1
accuracyAsFunctionOfTime = madeAttemptsAsFunctionOfTime.astype(float) / attemptsAsFunctionOfTime
accuracyAsFunctionOfTime[attemptsAsFunctionOfTime <= 50] = 0  # zero accuracy in bins that don't have enough samples

maxHeight = max(attemptsAsFunctionOfTime) + 30
barWidth = 0.999 * (timeBins[1] - timeBins[0])

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth);
plt.xlim((-20, 3200))
plt.ylim((0, maxHeight))

# y axis of the upper plot: number of shot attempts
plt.ylabel('attempts')
plt.title(str(binSizeInSeconds) + ' second time bins')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], accuracyAsFunctionOfTime, align='edge', width=barWidth);
plt.xlim((-20, 3200))
# y axis of the lower plot: accuracy
plt.ylabel('accuracy')
plt.xlabel('time [seconds from start of game]')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0.0, ymax=0.7, colors='r')
plt.show()

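Let's see how it looks.

The analysis shows that Kobe's shooting accuracy hovers around 0.4, but this is not yet the kind of insight we are after.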
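One detail of the accuracy plot is worth spelling out: the script sets the accuracy to 0 in bins with too few attempts, which makes "not enough data" look the same as "missed everything". The sketch below is a variation of my own (the function name and the toy data are assumptions) that marks such bins as NaN instead, so they can be told apart.

```python
import numpy as np


def accuracy_per_bin(times, made, bin_edges, min_attempts=1):
    """Return (attempts, accuracy) per time bin; bins below min_attempts get NaN."""
    times = np.asarray(times, dtype=float)
    made = np.asarray(made, dtype=bool)
    attempts, _ = np.histogram(times, bins=bin_edges)
    hits, _ = np.histogram(times[made], bins=bin_edges)
    accuracy = np.full(len(attempts), np.nan)
    enough = attempts >= min_attempts
    accuracy[enough] = hits[enough] / attempts[enough]
    return attempts, accuracy


# Toy data: shot times in seconds and whether each shot went in.
times = [5, 8, 12, 15, 18, 25]
made = [1, 0, 1, 1, 0, 1]
attempts, accuracy = accuracy_per_bin(times, made, bin_edges=[0, 10, 20, 30, 40])
print(attempts, accuracy)
```

matplotlib's `plt.bar` generally leaves NaN bars undrawn, so the empty bins show up as gaps instead of zero-height bars.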

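To dig further into the data, we need to bring in some algorithms.

4. GMM Clustering

So what is GMM clustering?

GMM is short for Gaussian mixture model. The rough idea is that a distribution can be viewed as the combination of several Gaussian distributions, so a complicated distribution can be approximated by a mixture of Gaussian components.
Many natural phenomena follow a Gaussian (normal) distribution, but in practice a distribution is shaped by several factors, some of them man-made. Each factor may give rise to its own Gaussian, and together they produce a mixture of Gaussians. (This is my personal understanding.)
Hence the principle of GMM clustering: estimate from the samples the means and variances of K Gaussian distributions, which fixes the K Gaussian components. During clustering, a sample is not assigned to a single class outright; instead, the model computes the probability that the sample belongs to each component.
Gaussian mixture models are usually fitted with the EM algorithm, which serves as the maximum-likelihood estimation procedure.

Readers who want to dig deeper into clustering algorithms can refer to: Three Common Clustering Algorithms.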
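To make the idea of "soft" assignment and the EM loop concrete, here is a small one-dimensional sketch written with NumPy only. It is not how scikit-learn implements `GaussianMixture`; the function name `fit_gmm_1d`, the quantile-based initialisation, and the synthetic data are all my own assumptions for illustration.

```python
import numpy as np


def fit_gmm_1d(x, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture with plain EM.

    Returns weights, means, variances and the responsibilities from the last E-step.
    """
    x = np.asarray(x, dtype=float)
    # Deterministic initialisation: spread the means over the data quantiles.
    means = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        diff = x[:, None] - means[None, :]
        density = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
        weighted = weights * density
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        totals = resp.sum(axis=0)
        weights = totals / len(x)
        means = (resp * x[:, None]).sum(axis=0) / totals
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / totals
    return weights, means, variances, resp


# Synthetic data: two well-separated groups.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(10.0, 1.0, 300)])
weights, means, variances, resp = fit_gmm_1d(samples)
print(means, weights)
```

Each row of `resp` sums to 1 and gives the probability that the sample belongs to each component, which is exactly the "probability instead of a hard label" described above.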

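'''
Now let's continue our initial exploration and look at the spatial locations of Kobe's shots.
We do this by building a Gaussian mixture model that tries to give a compact summary of his shot locations.
Cluster Kobe's shot attempts by shot location using a GMM.
'''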

from sklearn import mixture

numGaussians = 13
gaussianMixtureModel = mixture.GaussianMixture(n_components=numGaussians, covariance_type='full',
                                               init_params='kmeans', n_init=50,
                                               verbose=0, random_state=5)
gaussianMixtureModel.fit(data.loc[:, ['loc_x', 'loc_y']])

# Add the GMM cluster assignment to the dataset as a new field
data['shotLocationCluster'] = gaussianMixtureModel.predict(data.loc[:, ['loc_x', 'loc_y']])

5. Court Visualization

This borrows the draw_court() function from MichaelKrueger's excellent script.

The draw_court() function

from matplotlib.patches import Circle, Rectangle, Arc


def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    # If no axes object is provided to plot on, use the current one
    if ax is None:
        ax = plt.gca()

    # Create the elements of an NBA court
    # Create the hoop
    # Its diameter is 18 inches, i.e. a radius of 9 inches,
    # which is 7.5 in this coordinate system (tenths of a foot)
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # Create the backboard
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)

    # The paint
    # Outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # Inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # Create the top arc of the free-throw circle
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)

    # Create the bottom (dashed) arc of the free-throw circle
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')

    # Restricted zone: an arc with a 4ft radius around the center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # Three-point line
    # Create the corner three-point lines, 14ft long
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)

    # Three-point arc: centered on the hoop, 23'9" away from it
    # Adjust theta1 and theta2 until the arc lines up with the corner lines
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)

    # Center court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # List of the court elements to be drawn on the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # Draw the half-court line, baseline and sidelines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # Add the court elements to the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax

2D Gaussian Plot

We define a function that draws the 2D Gaussians.

Draw2DGaussians()

import matplotlib as mpl
import matplotlib.patches  # makes mpl.patches available


def Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages):
    fig, h = plt.subplots()
    for i, (mean, covarianceMatrix) in enumerate(zip(gaussianMixtureModel.means_, gaussianMixtureModel.covariances_)):
        # Get the eigenvalues and eigenvectors of the covariance matrix
        v, w = np.linalg.eigh(covarianceMatrix)
        v = 2.5 * np.sqrt(v)  # go to units of standard deviation instead of variance

        # Compute the ellipse angle and the two axis lengths, then draw it
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan(u[1] / u[0])
        angle = 180 * angle / np.pi  # convert to degrees
        # angle is passed by keyword because recent matplotlib versions no longer accept it positionally
        currEllipse = mpl.patches.Ellipse(mean, v[0], v[1], angle=180 + angle, color=ellipseColors[i])
        currEllipse.set_alpha(0.5)
        h.add_artist(currEllipse)
        h.text(mean[0] + 7, mean[1] - 1, ellipseTextMessages[i], fontsize=13, color='blue')
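For readers who want to see the geometry in isolation, the sketch below turns a 2x2 covariance matrix into ellipse parameters. It is an independent illustration, not a drop-in replacement for Draw2DGaussians: the helper name is hypothetical, and it follows a slightly different convention from the function above (eigenvectors are read as columns, as the NumPy documentation for `eigh` describes, and each axis spans `n_std` standard deviations on both sides of the mean).

```python
import numpy as np


def covariance_to_ellipse(covariance, n_std=2.5):
    """Return (width, height, angle_deg) of the ellipse for a 2x2 covariance matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # eigenvalues in ascending order
    # Columns of `eigenvectors` are the principal directions; take the larger one.
    major = eigenvectors[:, 1]
    angle_deg = np.degrees(np.arctan2(major[1], major[0]))
    # Full axis lengths: n_std standard deviations on each side of the mean.
    width = 2 * n_std * np.sqrt(eigenvalues[1])
    height = 2 * n_std * np.sqrt(eigenvalues[0])
    return width, height, angle_deg


# Axis-aligned example: variance 4 along x and variance 1 along y.
width, height, angle = covariance_to_ellipse(np.array([[4.0, 0.0], [0.0, 1.0]]), n_std=1.0)
print(width, height, angle)
```

The sign of an eigenvector is arbitrary, so the angle is only meaningful modulo 180 degrees, which is all an ellipse needs.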

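Now we draw the 2D Gaussian plot of shot attempts. Each ellipse in the plot is the contour 2.5 standard deviations away from the center of its Gaussian, and each blue number is the percentage of shots observed from that Gaussian.

# Show the Gaussian mixture ellipses for the shot attempts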
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

ellipseTextMessages = [str(100 * gaussianMixtureModel.weights_[x])[:4] + '%' for x in range(numGaussians)]
ellipseColors = ['red', 'green', 'purple', 'cyan', 'magenta', 'yellow', 'blue', 'orange', 'silver', 'maroon', 'lime',
                 'olive', 'brown', 'darkblue']
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot attempts')
plt.show()

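Let's look at the result:

In the colored 2D Gaussian plot we can see that Kobe took more shot attempts from the left side of the court (or the right side from his point of view), possibly because he is right-handed. We can also see that a large share of the attempts (16.8%) came from directly under the basket, and a further 5.06% came from very close to it.

It doesn't look perfect, but it does show some useful things.

Next we plot the shooting accuracy of each Gaussian cluster. The blue numbers now show the accuracy of the shots taken from that cluster, which gives us a sense of which shots are easy and which are hard.

For each cluster, we compute its accuracy and plot it.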

plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

variableCategories = data['shotLocationCluster'].value_counts().index.tolist()

clusterAccuracy = {}
for category in variableCategories:
    shotsAttempted = np.array(data['shotLocationCluster'] == category).sum()
    shotsMade = np.array(data.loc[data['shotLocationCluster'] == category, 'shot_made_flag'] == 1).sum()
    clusterAccuracy[category] = float(shotsMade) / shotsAttempted

ellipseTextMessages = [str(100 * clusterAccuracy[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot accuracy')
plt.show()

Let's look at the resulting plot.

We can clearly see the relationship between shot distance and accuracy.
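As a side note, the per-cluster accuracy computed in the loop above is just the mean of `shot_made_flag` within each cluster, so it can also be written as a pandas `groupby`. The sketch below uses a made-up table with the same two column names to show the idea.

```python
import pandas as pd

# Made-up cluster labels and shot outcomes.
demo = pd.DataFrame({
    "shotLocationCluster": [0, 0, 0, 1, 1, 2, 2, 2, 2],
    "shot_made_flag": [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 flag within each cluster is the cluster's accuracy.
cluster_accuracy = demo.groupby("shotLocationCluster")["shot_made_flag"].mean()

# Sort clusters from most to least accurate.
ranked = cluster_accuracy.sort_values(ascending=False)
print(ranked)
```

Calling `cluster_accuracy.to_dict()` gives a dictionary with the same shape as `clusterAccuracy` in the script.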

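Plotting a 2D Spatio-Temporal Chart

Another interesting fact: Kobe not only took more shot attempts from the right side (from his point of view), he was also more accurate on those attempts.

Now let's draw a 2D spatio-temporal chart of Kobe's career. The x axis is the time elapsed since the start of the game. The y axis is the index of the cluster the shot came from (sorted by cluster accuracy). The intensity of each cell reflects how many attempts Kobe took from that cluster at that time, and the red vertical lines separate the periods of the game.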

# Draw a 2D spatio-temporal histogram of Kobe's games over his whole career
import operator

plt.rcParams['figure.figsize'] = (18, 10)  # figure size
plt.rcParams['font.size'] = 18  # font size


# Sort the clusters by their accuracy
sortedClustersByAccuracyTuple = sorted(clusterAccuracy.items(), key=operator.itemgetter(1), reverse=True)
sortedClustersByAccuracy = [x[0] for x in sortedClustersByAccuracyTuple]

binSizeInSeconds = 12
timeInUnitsOfBins = ((data['secondsFromGameStart'] + 0.0001) / binSizeInSeconds).astype(int)
locationInUintsOfClusters = np.array(
    [sortedClustersByAccuracy.index(data.loc[x, 'shotLocationCluster']) for x in range(data.shape[0])])


# Build the spatio-temporal histogram of Kobe's games
shotAttempts = np.zeros((gaussianMixtureModel.n_components, 1 + max(timeInUnitsOfBins)))
for shot in range(data.shape[0]):
    shotAttempts[locationInUintsOfClusters[shot], timeInUnitsOfBins[shot]] += 1


# Stretch the y axis so that each cluster row is easier to see
shotAttempts = np.kron(shotAttempts, np.ones((5, 1)))

# Positions where each period ends
vlinesList = 0.5001 + np.array([0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60]).astype(
    int) / binSizeInSeconds

plt.figure(figsize=(13, 8))  # set the width and height
plt.imshow(shotAttempts, cmap='copper', interpolation="nearest")  # "nearest" draws each cell without smoothing
plt.xlim(0, float(4 * 12 * 60 + 6 * 60) / binSizeInSeconds)
plt.vlines(x=vlinesList, ymin=-0.5, ymax=shotAttempts.shape[0] - 0.5, colors='r')
plt.xlabel('time from start of game [sec]')
plt.ylabel('cluster (sorted by accuracy)')
plt.show()

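Let's look at the output:

The clusters are sorted by accuracy in descending order: high-accuracy shots are at the top, and the low-accuracy half-court shots are at the bottom. We can now see that the "last-second shots" in the first, second, and third periods are actually desperation heaves from very far away. Interestingly, in the fourth period the last-second shots do not belong to that desperation cluster but to the regular three-point clusters (still hard to make, but not hopeless).

In the analysis that follows, we estimate shot difficulty from shot attributes (such as shot type and shot distance).

Next we create a new table for the shot-difficulty model.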

def FactorizeCategoricalVariable(inputDB, categoricalVarName):
    opponentCategories = inputDB[categoricalVarName].value_counts().index.tolist()

    outputDB = pd.DataFrame()
    for category in opponentCategories:
        featureName = categoricalVarName + ': ' + str(category)
        outputDB[featureName] = (inputDB[categoricalVarName] == category).astype(int)

    return outputDB


featuresDB = pd.DataFrame()
featuresDB['homeGame'] = data['matchup'].apply(lambda x: 1 if (x.find('@') < 0) else 0)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'opponent')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'action_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'combined_shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_basic')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_area')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_range')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shotLocationCluster')], axis=1)

featuresDB['playoffGame'] = data['playoffs']
featuresDB['locX'] = data['loc_x']
featuresDB['locY'] = data['loc_y']
featuresDB['distanceFromBasket'] = data['shot_distance']
featuresDB['secondsFromPeriodEnd'] = data['secondsFromPeriodEnd']

featuresDB['dayOfWeek_cycX'] = np.sin(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['dayOfWeek_cycY'] = np.cos(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['timeOfYear_cycX'] = np.sin(2 * np.pi * (data['dayOfYear'] / 365))
featuresDB['timeOfYear_cycY'] = np.cos(2 * np.pi * (data['dayOfYear'] / 365))

labelsDB = data['shot_made_flag']
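The `_cycX` / `_cycY` columns above encode a cyclic quantity as a point on a circle, so that the end of the cycle sits next to its beginning. The sketch below illustrates why this matters for the day of the week, where pandas numbers Monday as 0 and Sunday as 6. The two helper functions are my own names, introduced only for this illustration.

```python
import numpy as np


def cyclic_encode(value, period):
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)


def cyclic_distance(a, b, period):
    """Euclidean distance between two cyclic values after encoding."""
    ax, ay = cyclic_encode(a, period)
    bx, by = cyclic_encode(b, period)
    return float(np.hypot(ax - bx, ay - by))


# As raw integers, Sunday (6) and Monday (0) look far apart; on the circle they are neighbors.
sunday_to_monday = cyclic_distance(6, 0, period=7)
monday_to_tuesday = cyclic_distance(0, 1, period=7)
monday_to_thursday = cyclic_distance(0, 3, period=7)
print(sunday_to_monday, monday_to_tuesday, monday_to_thursday)
```

Both the sine and the cosine are needed: with only one of them, two different days would map to the same value.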

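We build a model on the featuresDB table and make sure it does not overfit (that is, the training error and the validation error should be close to each other).

We use an Extra-Trees classifier (ExtraTreesClassifier).

Build a simple model and confirm that it does not overfit.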

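import time

from sklearn import ensemble, model_selection
from sklearn.metrics import accuracy_score as accuracy, log_loss

randomSeed = 1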
numFolds = 4

stratifiedCV = model_selection.StratifiedKFold(n_splits=numFolds, shuffle=True, random_state=randomSeed)

mainLearner = ensemble.ExtraTreesClassifier(n_estimators=500, max_depth=5,
                                            min_samples_leaf=120, max_features=120,
                                            criterion='entropy', bootstrap=False,
                                            n_jobs=-1, random_state=randomSeed)

startTime = time.time()
trainAccuracy = []
validAccuracy = []
trainLogLosses = []
validLogLosses = []
for trainInds, validInds in stratifiedCV.split(featuresDB, labelsDB):
    # Split into training and validation sets
    X_train_CV = featuresDB.iloc[trainInds, :]
    y_train_CV = labelsDB.iloc[trainInds]
    X_valid_CV = featuresDB.iloc[validInds, :]
    y_valid_CV = labelsDB.iloc[validInds]

    # Train
    mainLearner.fit(X_train_CV, y_train_CV)

    # Predict
    y_train_hat_mainLearner = mainLearner.predict_proba(X_train_CV)[:, 1]
    y_valid_hat_mainLearner = mainLearner.predict_proba(X_valid_CV)[:, 1]

    # Store the results
    trainAccuracy.append(accuracy(y_train_CV, y_train_hat_mainLearner > 0.5))
    validAccuracy.append(accuracy(y_valid_CV, y_valid_hat_mainLearner > 0.5))
    trainLogLosses.append(log_loss(y_train_CV, y_train_hat_mainLearner))
    validLogLosses.append(log_loss(y_valid_CV, y_valid_hat_mainLearner))

print("-----------------------------------------------------")
print("total (train,valid) Accuracy = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainAccuracy), np.mean(validAccuracy), (time.time() - startTime) / 60))
print("total (train,valid) Log Loss = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainLogLosses), np.mean(validLogLosses), (time.time() - startTime) / 60))
print("-----------------------------------------------------")

mainLearner.fit(featuresDB, labelsDB)
data['shotDifficulty'] = mainLearner.predict_proba(featuresDB)[:, 1]

# To gain further insight, let's look at the feature importances
featureInds = mainLearner.feature_importances_.argsort()[::-1]
featureImportance = pd.DataFrame(
    np.concatenate((featuresDB.columns.values[featureInds, None], mainLearner.feature_importances_[featureInds, None]),
                   axis=1),
    columns=['featureName', 'importanceET'])

print(featureImportance.iloc[:30, :])

Let's see the output:

total (train,valid) Accuracy = (0.67912,0.67860). took 0.29 minutes
total (train,valid) Log Loss = (0.60812,0.61100). took 0.29 minutes
-----------------------------------------------------
                         featureName importanceET
0             action_type: Jump Shot     0.578036
1            action_type: Layup Shot     0.173274
2           combined_shot_type: Dunk     0.113341
3                           homeGame    0.0288043
4             action_type: Dunk Shot    0.0161591
5             shotLocationCluster: 9    0.0136386
6          combined_shot_type: Layup   0.00949568
7                 distanceFromBasket    0.0084703
8         shot_zone_range: 16-24 ft.    0.0072107
9        action_type: Slam Dunk Shot   0.00690316
10     combined_shot_type: Jump Shot   0.00592586
11              secondsFromPeriodEnd   0.00589391
12    action_type: Running Jump Shot   0.00544904
13           shotLocationCluster: 11   0.00449125
14                              locY   0.00388509
15   action_type: Driving Layup Shot   0.00364757
16  shot_zone_range: Less Than 8 ft.   0.00349615
17      combined_shot_type: Tip Shot   0.00260399
18         shot_zone_area: Center(C)    0.0011585
19                     opponent: DEN  0.000882106
20    action_type: Driving Dunk Shot  0.000848156
21  shot_zone_basic: Restricted Area  0.000650022
22            shotLocationCluster: 2  0.000513476
23             action_type: Tip Shot  0.000489918
24        shot_zone_basic: Mid-Range  0.000487306
25     action_type: Pullup Jump shot  0.000453641
26         shot_zone_range: 8-16 ft.  0.000452574
27                   timeOfYear_cycX  0.000432267
28                    dayOfWeek_cycX   0.00039668
29            shotLocationCluster: 8  0.000254077
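To put the two reported metrics in context, here is a small sketch that computes them by hand with NumPy on four made-up predictions. The function names are my own; the script above uses scikit-learn's `accuracy_score` and `log_loss` for the same purpose.

```python
import numpy as np


def accuracy_at_half(y_true, p_hat):
    """Fraction of samples where thresholding p_hat at 0.5 matches the label."""
    return float(np.mean((np.asarray(p_hat) > 0.5) == (np.asarray(y_true) == 1)))


def binary_log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Made-up labels and predicted make probabilities.
y_true = [1, 0, 1, 0]
p_hat = [0.8, 0.3, 0.4, 0.6]
acc = accuracy_at_half(y_true, p_hat)
loss = binary_log_loss(y_true, p_hat)

# Reference point: always predicting 0.5 gives a log loss of ln(2), about 0.693.
baseline = binary_log_loss(y_true, [0.5] * 4)
print(acc, loss, baseline)
```

Seen this way, the validation log loss of about 0.611 in the output is better than the 0.693 of a coin-flip prediction, and the small gap between the training and validation numbers is what suggests the model is not overfitting.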


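Here I would like to look at some aspects of Kobe Bryant's decision-making. To do that, we collect two different groups of shots and analyze the differences between them:

Shots taken right after a made shot
Shots taken right after a missed shot

We collect the data conditioned on whether Kobe made or missed his previous shot.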

timeBetweenShotsDict = {}
timeBetweenShotsDict['madeLast'] = []
timeBetweenShotsDict['missedLast'] = []

changeInDistFromBasketDict = {}
changeInDistFromBasketDict['madeLast'] = []
changeInDistFromBasketDict['missedLast'] = []

changeInShotDifficultyDict = {}
changeInShotDifficultyDict['madeLast'] = []
changeInShotDifficultyDict['missedLast'] = []

afterMadeShotsList = []
afterMissedShotsList = []

for shot in range(1, data.shape[0]):

    # Make sure the current shot and the previous shot are in the same period of the same game
    sameGame = data.loc[shot, 'game_date'] == data.loc[shot - 1, 'game_date']
    samePeriod = data.loc[shot, 'period'] == data.loc[shot - 1, 'period']

    if samePeriod and sameGame:
        madeLastShot = data.loc[shot - 1, 'shot_made_flag'] == 1
        missedLastShot = data.loc[shot - 1, 'shot_made_flag'] == 0

        timeDifferenceFromLastShot = data.loc[shot, 'secondsFromGameStart'] - data.loc[shot - 1, 'secondsFromGameStart']
        distDifferenceFromLastShot = data.loc[shot, 'shot_distance'] - data.loc[shot - 1, 'shot_distance']
        shotDifficultyDifferenceFromLastShot = data.loc[shot, 'shotDifficulty'] - data.loc[shot - 1, 'shotDifficulty']

        # check for corrupt data points (assuming all samples should have been chronologically ordered)
        if timeDifferenceFromLastShot < 0:
            continue

        if madeLastShot:
            timeBetweenShotsDict['madeLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['madeLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['madeLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMadeShotsList.append(shot)

        if missedLastShot:
            timeBetweenShotsDict['missedLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['missedLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['missedLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMissedShotsList.append(shot)

afterMissedData = data.iloc[afterMissedShotsList, :]
afterMadeData = data.iloc[afterMadeShotsList, :]

shotChancesListAfterMade = afterMadeData['shotDifficulty'].tolist()
totalAttemptsAfterMade = afterMadeData.shape[0]
totalMadeAfterMade = np.array(afterMadeData['shot_made_flag'] == 1).sum()

shotChancesListAfterMissed = afterMissedData['shotDifficulty'].tolist()
totalAttemptsAfterMissed = afterMissedData.shape[0]
totalMadeAfterMissed = np.array(afterMissedData['shot_made_flag'] == 1).sum()
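The same conditioning on the previous shot can also be expressed without an explicit loop. The sketch below is an alternative of my own on a tiny made-up table: it assumes the rows are already in chronological order and uses `groupby(...).shift(1)` so that the "previous shot" never crosses a game or period boundary. It also computes the make rate in each group, which is the quantity a hot-hand check would compare.

```python
import pandas as pd

# Made-up shot log, assumed to be in chronological order.
shots = pd.DataFrame({
    "game_date": ["2000-01-01"] * 5 + ["2000-01-03"] * 3,
    "period": [1, 1, 1, 2, 2, 1, 1, 1],
    "shot_made_flag": [1, 0, 1, 1, 0, 0, 1, 1],
})

# Outcome of the previous shot within the same game and period (NaN for the first shot of each).
previous = shots.groupby(["game_date", "period"])["shot_made_flag"].shift(1)

after_made = shots[previous == 1]
after_missed = shots[previous == 0]

rate_after_made = after_made["shot_made_flag"].mean()
rate_after_missed = after_missed["shot_made_flag"].mean()
print(len(after_made), rate_after_made, len(after_missed), rate_after_missed)
```

On real data the two rates would still need to be compared at equal shot difficulty, since the shots taken after a make may simply be harder ones.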

Histograms

We plot a histogram of the "time since last shot" for each of the two groups.

plt.rcParams['figure.figsize'] = (13, 10)

jointHist, timeBins = np.histogram(timeBetweenShotsDict['madeLast'] + timeBetweenShotsDict['missedLast'], bins=200)
barWidth = 0.999 * (timeBins[1] - timeBins[0])

timeDiffHist_GivenMadeLastShot, b = np.histogram(timeBetweenShotsDict['madeLast'], bins=timeBins)
timeDiffHist_GivenMissedLastShot, b = np.histogram(timeBetweenShotsDict['missedLast'], bins=timeBins)
maxHeight = max(max(timeDiffHist_GivenMadeLastShot), max(timeDiffHist_GivenMissedLastShot)) + 30

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], timeDiffHist_GivenMadeLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('made last shot')
plt.ylabel('counts')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], timeDiffHist_GivenMissedLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('missed last shot')
plt.xlabel('time since last shot')
plt.ylabel('counts')
plt.show()

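Let's look at the output:

The plot suggests that after making a shot, Kobe was somewhat more eager to take the next one. The flatter stretches probably correspond to the other team having possession, so that it took some time to get the ball back.

Cumulative Histogram

To visualize the difference between the two histograms more clearly, let's look at the cumulative histograms.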

plt.rcParams['figure.figsize'] = (13, 6)

timeDiffCumHist_GivenMadeLastShot = np.cumsum(timeDiffHist_GivenMadeLastShot).astype(float)
timeDiffCumHist_GivenMadeLastShot = timeDiffCumHist_GivenMadeLastShot / max(timeDiffCumHist_GivenMadeLastShot)
timeDiffCumHist_GivenMissedLastShot = np.cumsum(timeDiffHist_GivenMissedLastShot).astype(float)
timeDiffCumHist_GivenMissedLastShot = timeDiffCumHist_GivenMissedLastShot / max(timeDiffCumHist_GivenMissedLastShot)

maxHeight = max(timeDiffCumHist_GivenMadeLastShot[-1], timeDiffCumHist_GivenMissedLastShot[-1])

plt.figure()
madePrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMadeLastShot, label='made Prev')
plt.xlim((0, 500))
missedPrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMissedLastShot, label='missed Prev')
plt.xlim((0, 500))
plt.ylim((0, 1))
plt.title('cumulative density function - CDF')
plt.xlabel('time since last shot')
plt.legend(loc='lower right')
plt.show()

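The result looks like this:

A difference between the two distributions can be observed, but it is not very clear, so let's go back to the Gaussian-ellipse view to display the data.

# Show the shot attempts after made shots and after missed shots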
plt.rcParams['figure.figsize'] = (13, 10)

variableCategories = afterMadeData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMadeData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMadeData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after made shots')

variableCategories = afterMissedData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMissedData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMissedData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after missed shots')
plt.show()

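Let's look at the final result:

Conclusion

It is now apparent that after missing a shot, Kobe was more likely to take his next shot from right under the basket. The plots also suggest that after making a shot he was more likely to attempt a three-pointer next. However, the data in this case study offer no solid evidence that Kobe had a hot hand. It is also easy to see that Kobe was a player who valued his work under the basket and around the free-throw line, and a supremely confident leader, truly worthy of being our Boss!

Areas for Improvement

The dataset used here is very large and rich; it even includes detailed data on every type of jump shot and layup. Readers who are interested are welcome to explore the information not yet mined from this data on their own. I am sure it will be rewarding!

Note: this analysis may contain some mistakes; corrections from readers are welcome. Thanks for reading.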
