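Continued from: 2020 MCM/ICM Finalist Paper (Part 2): Building the Passing Network Model (PNM) and Analysing Its Influencing Factors
Full text: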
4 Football Team Indicators and Machine-Learning-Based Prediction of Team Performance
Successful teamwork in a football team can be measured by many indicators. Based on data analysis and practical experience, we mainly consider two kinds: static indicators and dynamic indicators. First, we use $\text{Goal}(G_i)$ to evaluate the team's overall performance in a match and take it as the per-match performance label, defined as:
$$\text{Goal}(G_i) = \begin{cases} -1, & OwnScore - OpponentScore < -1 \\ 0, & OwnScore - OpponentScore \in [-1, 1] \\ 1, & OwnScore - OpponentScore > 1 \end{cases}$$
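As a minimal sketch, the label can be computed from the two scores as follows. The function name `goal_label` and its argument names are illustrative and do not come from the paper.

```python
def goal_label(own_score: int, opponent_score: int) -> int:
    """Map a match result to the three-class Goal label.

    -1: lost by more than one goal
     0: goal difference within [-1, 1]
     1: won by more than one goal
    """
    diff = own_score - opponent_score
    if diff < -1:
        return -1
    if diff > 1:
        return 1
    return 0
```

Note that one-goal wins and losses fall into class 0 together with draws, so the label separates clear results from close ones.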
4.1 Static Indicators
To account for how players are distributed on the pitch, we extract each player's position coordinates over the whole season and draw a heatmap of the player's positions. The value at each point of the heatmap is defined as:
$$\text{Heatmap}_{p_k}[i,j] = \frac{1}{4\delta^{2}}\int_{x-\delta}^{x+\delta}\int_{y-\delta}^{y+\delta} \begin{cases} 1, & \text{player has been here} \\ 0, & \text{player never passed} \end{cases}\,dx\,dy,\quad \delta > 0$$
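A discrete version of this definition can be sketched as below. It assumes coordinates in the range [0, 100], a rectangular grid of cells, and a window of `radius` cells standing in for $\delta$; all of these, and the name `heatmap`, are choices made for illustration rather than details given in the paper.

```python
import numpy as np

def heatmap(xs, ys, grid=(100, 100), radius=2, extent=100.0):
    """Fraction of visited cells inside a (2*radius+1)^2 window around each cell.

    Assumptions (not from the paper): coordinates lie in [0, extent],
    the pitch is discretised into grid cells, and delta equals `radius` cells.
    """
    nx, ny = grid
    ix = np.clip((np.asarray(xs, dtype=float) / extent * nx).astype(int), 0, nx - 1)
    iy = np.clip((np.asarray(ys, dtype=float) / extent * ny).astype(int), 0, ny - 1)

    visited = np.zeros((nx, ny))
    visited[ix, iy] = 1.0  # indicator: the player has been in this cell

    padded = np.pad(visited, radius)  # cells outside the pitch count as unvisited
    window = 2 * radius + 1
    total = np.zeros_like(visited)
    for dx in range(window):
        for dy in range(window):
            total += padded[dx:dx + nx, dy:dy + ny]
    return total / window**2  # discrete analogue of the 1 / (4 * delta^2) factor
```

The result can be passed to any image-plotting routine, with larger values drawn in darker colours.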
Darker colours indicate that the player appeared at that location more often, and lighter colours less often. After computing $\text{Heatmap}_{p_k}[i,j]$, the position heatmaps of the 11 main starters are as follows:
A team's formation plays an important role in teamwork, so we also consider the players' formation in each match. We take every player's movement coordinates in each match and integrate the coordinates over time to find each player's average position in that match. The time points available in the data (when the player appears as Origin or Destination) serve as the new horizontal axis and the X or Y coordinate as the new vertical axis, giving the functions $X(t)$ and $Y(t)$. We assume that between any two recorded time points the player moves at constant speed in the X and Y directions, which turns the discrete data set into a continuous one for each player. The average coordinate is then the time integral divided by the match duration, shown here for X (Y is analogous):
$$X(t)\ \text{is a piecewise function},\quad X_{t}\ \text{is the value of } X \text{ exactly at time } t.$$
$$\begin{cases} \text{AvgX}(p_k) = \dfrac{1}{90\,\text{min}}\displaystyle\int_{0}^{90\,\text{min}} X(t)\,dt \approx \dfrac{1}{90\,\text{min}}\sum_{i=1}^{n}\left[\dfrac{1}{2}\left(t_{i+1} - t_{i-1}\right) \times X_{t_i}\right] \\ n = \text{number of our events} \end{cases}$$
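The weighted sum can be sketched in a few lines. Two points here are assumptions rather than details from the paper: the first and last samples reuse their own time as the missing neighbour, and the sum is divided by the total weight (the time span covered by the events) so that the result is a coordinate rather than a coordinate multiplied by time. The name `average_coordinate` is hypothetical.

```python
def average_coordinate(times, values):
    """Time-weighted mean of one coordinate sampled at event times.

    Sample i gets weight (t[i+1] - t[i-1]) / 2, which is the trapezoidal
    rule for a piecewise-linear X(t).
    Assumption: boundary samples reuse their own time as the missing
    neighbour, and the sum is normalised by the total weight.
    """
    if len(times) != len(values) or not times:
        raise ValueError("times and values must be non-empty and equally long")
    n = len(times)
    weighted_sum = 0.0
    total_weight = 0.0
    for i in range(n):
        t_prev = times[max(i - 1, 0)]
        t_next = times[min(i + 1, n - 1)]
        weight = 0.5 * (t_next - t_prev)
        weighted_sum += weight * values[i]
        total_weight += weight
    if total_weight == 0:
        # All events share one timestamp: fall back to the plain mean.
        return sum(values) / n
    return weighted_sum / total_weight
```

Calling it once with a player's X samples and once with the Y samples gives the point to draw for that player in the formation diagram.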
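Plotting the average positions of these 11 players gives the formation diagram for each match. Some of the formation diagrams are shown below: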
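4.2 Dynamic Indicators
Dynamic indicators cover the human factors that affect the team and the technical data produced during matches. Human factors include the coach, the opponent's strength, and whether the match is home or away; technical data include statistics on all kinds of events such as shots, passes, and clearances. The raw data use a single event as the sample unit, whereas we aggregate them into dynamic data with one match as the unit. By examining the data stored in this new structure, we extract a number of features.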
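4.2.1 Data Cleaning and Feature Engineering
To reduce the feature dimensionality, we not only use PCA to screen out features with insignificant influence, but also use ChiMerge, a feature-binning method, to group EventSubTypes into four aspects: passing, attack, defence, and Fail. Together with the coach, home or away, and the opponent's strength, these form the features of a match. We quantify the match features by processing the aggregated data with standardization, dummy variables, and combined analysis: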
(1) Statistical data
$$\text{Defence}(G_i) = Clearance + Blocks + Interruption + Aerial\ Duel + Saves$$
$$\text{Attack}(G_i) = Shots + Dribbles + Touch + Corners + Offside$$
$$\text{Fail}(G_i) = Loss\ of\ Possession + Fouls$$
$$\text{Oppo}(G_i) = Pts(\text{OpponentID}) + \sum_{j=1}^{38}\text{GD}_j(\text{OpponentID})$$
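A hedged sketch of this aggregation step: count event subtypes within one match and sum them into the three groups, then compute the opponent-strength score from points and per-match goal differences. The subtype strings and the names `match_features` and `oppo` are placeholders; the actual labels in the data set may differ.

```python
from collections import Counter

# Hypothetical subtype labels, grouped as in the definitions above.
DEFENCE = ("Clearance", "Blocks", "Interruption", "Aerial Duel", "Saves")
ATTACK = ("Shots", "Dribbles", "Touch", "Corners", "Offside")
FAIL = ("Loss of Possession", "Fouls")

def match_features(event_subtypes):
    """Aggregate one match's event subtypes into Defence, Attack and Fail counts."""
    counts = Counter(event_subtypes)  # missing keys count as 0
    return {
        "Defence": sum(counts[name] for name in DEFENCE),
        "Attack": sum(counts[name] for name in ATTACK),
        "Fail": sum(counts[name] for name in FAIL),
    }

def oppo(points, goal_differences):
    """Opponent strength: season points plus the sum of per-match goal differences."""
    return points + sum(goal_differences)
```

For example, `match_features(["Shots", "Shots", "Fouls", "Saves"])` yields one defensive action, two attacking actions, and one failure for that match.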
(2) Multi-event combined analysis data
$$\text{Possession}(G_i) = \frac{1}{90\,\text{min}}\sum_{j=2}^{n}\left(t_j - t_{j-1}\right),\quad n \text{ is the number of Huskies' data records}$$
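One possible reading of this indicator is sketched below: time counts as Huskies possession when two consecutive events both belong to the Huskies. That reading, the event tuple layout, and the function name `possession` are assumptions made for illustration.

```python
def possession(events, match_minutes=90.0, team="Huskies"):
    """Share of match time between consecutive events of the same team.

    events: list of (time_in_minutes, team_name) tuples sorted by time.
    Assumption: a gap counts as possession only when both the earlier
    and the later event belong to `team`.
    """
    held = 0.0
    for (t_prev, team_prev), (t_curr, team_curr) in zip(events, events[1:]):
        if team_prev == team and team_curr == team:
            held += t_curr - t_prev
    return held / match_minutes
```

With this reading, an opponent event in between breaks the streak, so the gaps on either side of it are not credited to the Huskies.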
(3) One-hot encoded dummy variable data
$$\text{Side}(G_i) = \begin{cases} 0, & \text{home} \\ 1, & \text{away} \end{cases} = \begin{cases} [1,0], & \text{home} \\ [0,1], & \text{away} \end{cases}$$
$$\begin{cases} \text{Coach}(1) = [1,0,0] \\ \text{Coach}(2) = [0,1,0] \\ \text{Coach}(3) = [0,0,1] \end{cases}$$
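The encoding above can be reproduced with a small helper. The function names are illustrative only.

```python
def one_hot(index, size):
    """Return a list of length `size` with a 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if k == index else 0 for k in range(size)]

def encode_side(is_home):
    """Home -> [1, 0], away -> [0, 1]."""
    return one_hot(0 if is_home else 1, 2)

def encode_coach(coach_number):
    """Coach 1, 2, 3 -> [1, 0, 0], [0, 1, 0], [0, 0, 1]."""
    return one_hot(coach_number - 1, 3)
```

One-hot vectors avoid implying an order between categories, which a single integer code for the coach would do.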
4.2.2 Visual Analysis
Effect of $\text{Side}(G_i)$ on $\text{Goal}(G_i)$ and $\text{Ratings}(G_i)$:
When $\text{Side}(G_i) = 0$, matches with $\text{Goal}(G_i) = 0$ or $1$ are more frequent and $\text{Ratings}(G_i)$ is distributed higher, so overall the team performs better at home than away.
Coaching level of the different coaches and their effect on the team's $\text{Attack}(G_i)$, $\text{Defence}(G_i)$, $\text{Passes}(G_i)$ and $\text{Fail}(G_i)$:
From the boxen plot we can see that under Coach 3 the team's $\text{Goal}(G_i)$, $\text{Attack}(G_i)$, and related figures are the best, followed by Coach 2 and then Coach 1. We can also infer the coaches' styles: Coach 1 is more aggressive, with mediocre defence; Coach 2 emphasizes tough defence; Coach 3 is more balanced and has the best record.
Contribution of $\text{Attack}(G_i)$ and $\text{Passes}(G_i)$ to $\text{Goal}(G_i)$:
From the figure we can see that, for each goal-difference class, attack and passing are roughly linearly related with a positive slope. The values mainly fall in the following ranges:
$$\begin{cases} \text{Passes}(G_i) \in [0.0, 1.0],\ \text{Attack}(G_i) \in [0.0, 0.9], & \text{Goal}(G_i) < 0 \\ \text{Passes}(G_i) \in [0.0, 1.0],\ \text{Attack}(G_i) \in [0.1, 1.0], & \text{Goal}(G_i) = 0 \\ \text{Passes}(G_i) \in [0.5, 0.8],\ \text{Attack}(G_i) \in [0.6, 1.0], & \text{Goal}(G_i) > 0 \end{cases}$$
$\text{Goal}(G_i)$ is positively correlated with $\text{Passes}(G_i)$ and $\text{Attack}(G_i)$, and the more concentrated the distribution, the smaller the variance of $\text{Passes}(G_i)$ and $\text{Attack}(G_i)$. We conclude that, within a single match and across the whole season, a higher $\text{Goal}(G_i)$ most likely comes with higher $\text{Passes}(G_i)$ and $\text{Attack}(G_i)$.
Contribution of $\text{Defence}(G_i)$ and $\text{Fail}(G_i)$ to $\text{Goal}(G_i)$:
The values mainly fall in the following ranges:
$$\begin{cases} \text{Fail}(G_i) \in [-1.0, -0.2],\ \text{Defence}(G_i) \in [0.0, 0.5], & \text{Goal}(G_i) < 0 \\ \text{Fail}(G_i) \in [-1.0, 0.0],\ \text{Defence}(G_i) \in [0.0, 1.0], & \text{Goal}(G_i) = 0 \\ \text{Fail}(G_i) \in [-0.6, -0.2],\ \text{Defence}(G_i) \in [0.0, 0.7], & \text{Goal}(G_i) > 0 \end{cases}$$
$\text{Goal}(G_i)$ is positively correlated with $\text{Defence}(G_i)$ and negatively correlated with $\left|\text{Fail}(G_i)\right|$, and the more concentrated the distribution, the smaller the variance of $\text{Defence}(G_i)$ and $\text{Fail}(G_i)$. Two observations follow. In the leftmost panel of Figure 2 the points lie in the lower part, so poor defence leads to losing. In the rightmost panel the left half contains no points, so a team that expects to win cannot make many mistakes.
Taking $\text{Attack}(G_i)$, $\text{Defence}(G_i)$, and $\text{Passes}(G_i)$ as positive indicators of the team's overall performance, and combining them with the $\text{Passes}(G_i)$ and $\text{Oppo}(G_i)$ indicators, we analyse the data from several angles:
In the left plot the centre of mass of the data lies in the lower-right corner, so we consider that over the whole season $\text{Attack}(G_i)$ (attacking performance) is significantly better than $\text{Defence}(G_i)$ (defensive performance). In the right plot, whether at home or away, $\text{Passes}(G_i) \propto \left[\alpha\frac{1}{\text{Oppo}(G_i)} + \beta\right]$, although home matches are more likely to show a small improvement. The conclusion is that the stronger the opponent, the lower our passing rate.
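If one wanted to estimate $\alpha$ and $\beta$ for such an inverse relationship, an ordinary least-squares fit on $1/\text{Oppo}$ would be one option. The sketch below uses made-up numbers purely to show the call; neither the data nor the function name `fit_inverse_model` comes from the paper.

```python
import numpy as np

def fit_inverse_model(oppo, passes):
    """Least-squares fit of passes ~ alpha / oppo + beta."""
    inverse = 1.0 / np.asarray(oppo, dtype=float)
    # A degree-1 polynomial in 1/oppo has slope alpha and intercept beta.
    alpha, beta = np.polyfit(inverse, np.asarray(passes, dtype=float), 1)
    return float(alpha), float(beta)

# Made-up values, generated exactly from alpha = 5 and beta = 0.2.
oppo_scores = [10.0, 20.0, 40.0, 50.0]
pass_values = [5.0 / o + 0.2 for o in oppo_scores]
alpha, beta = fit_inverse_model(oppo_scores, pass_values)
```

Fitting home and away matches separately would give two $(\alpha, \beta)$ pairs that can be compared directly.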
Combining all the processed features, we compute the Pearson correlation coefficient to estimate the pairwise correlation between features:
$$r_{xy} = \frac{N\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{N\sum x_i^{2} - \left(\sum x_i\right)^{2}}\,\sqrt{N\sum y_i^{2} - \left(\sum y_i\right)^{2}}}$$
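The coefficient can be evaluated directly from the sums in the formula, and applying it to every pair of feature columns gives a matrix of pairwise values. The function names below are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from the raw sums in the formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    denominator = math.sqrt(n * sum_xx - sum_x**2) * math.sqrt(n * sum_yy - sum_y**2)
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant column")
    return (n * sum_xy - sum_x * sum_y) / denominator

def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a list of equally long columns."""
    return [[pearson(a, b) for b in columns] for a in columns]
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle carries new information.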
Letting the matrix $\text{Arr}[i,j] = r_{ij}$, we obtain:
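4.2.3 Model Building and Training
We take $\text{Goal}(G_i)$ as the evaluation label of each match and want the trained model to classify matches from the processed data into the corresponding $\text{Goal}(G_i)$ label. Because the number of features, $M = 10$, is fairly large and their correlations with the label vary, a linear model is not suitable for classification. Because the sample size, $N = 38$, is very small, deep learning algorithms tend to show large deviations. We therefore choose a random forest model to build the $\text{Goal}(G_i)$ label classifier.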
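A random forest is a classifier consisting of multiple decision trees, whose output class is the mode of the classes output by the individual trees. For many kinds of data it produces a highly accurate classifier; it can assess the importance of variables when deciding the class; and while building the forest it produces an internal unbiased estimate of the generalization error. The Random Forest Classifier is built as follows:
Choose the number of input features $m$ used to determine the decision at a node of a decision tree, with $m < \sqrt{M}$;
Use bootstrap sampling: draw $N$ samples with replacement from the $N$ training cases to form a training set, and use the cases that were not drawn to make predictions and estimate the error;
At each node, randomly select $m$ features; the decision at every node of the tree is based on these features. Compute the best split from these $m$ features;
Each tree is grown fully without pruning (pruning is a step that might otherwise be applied after building a normal tree classifier).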
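The sampling steps above can be sketched with the standard library alone. This shows only the bootstrap draw, the out-of-bag cases, and the random feature subset, not tree construction; the function names and the seed are illustrative.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag

def pick_features(n_features, m, rng):
    """Choose m distinct feature indices at random for one node."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)            # fixed seed so the draw is repeatable
in_bag, oob = bootstrap_split(38, rng)   # N = 38 matches
features = pick_features(10, 3, rng)     # M = 10 features, m = 3
```

Each tree would be fitted on the `in_bag` rows and checked on the `oob` rows, which is what provides the internal error estimate.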
After training the random forest classifier, we tune its parameters with grid search and select
$$\begin{cases} \text{n\_estimator} = 50 \\ \text{random\_rate} = 0 \\ \text{max\_depth} = 3 \\ \text{max\_feature} = \sqrt{M} \end{cases}$$
as the parameters. We then compute the accuracy score with K-fold cross-validation to evaluate the model's accuracy.
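A workflow of this kind could look like the sketch below, written with scikit-learn. The data are synthetic stand-ins with the same shape as in the text ($N = 38$, $M = 10$), the grid values and fold count are arbitrary, and mapping the listed parameters onto scikit-learn's `n_estimators`, `random_state`, `max_depth` and `max_features` is an assumption, since the paper does not name its library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

def tune_and_score(X, y, n_splits=5, seed=0):
    """Grid-search a random forest, then report mean K-fold accuracy."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    param_grid = {"n_estimators": [10, 50], "max_depth": [2, 3]}
    search = GridSearchCV(
        RandomForestClassifier(max_features="sqrt", random_state=seed),
        param_grid,
        scoring="accuracy",
        cv=cv,
    )
    search.fit(X, y)
    scores = cross_val_score(search.best_estimator_, X, y, scoring="accuracy", cv=cv)
    return search.best_params_, float(scores.mean())

# Synthetic stand-in data: random features and balanced labels in {-1, 0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 10))
y = np.array([-1, 0, 1] * 13)[:38]
best_params, mean_accuracy = tune_and_score(X, y)
```

Scoring on the same folds that were used for tuning gives an optimistic estimate; with so few samples, repeating the procedure over several seeds is one way to see how much the score varies.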
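After some data adjustment and repeated simulations, the average score is $65.8\%$, and in the best case it reaches $80$ to $90\%$. With a sample size of only $N = 38$, we consider this accuracy acceptable for a model that predicts a match's goal-difference class from dynamic indicators.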
Continued in: 2020 MCM/ICM Finalist Paper (Part 4): Structural Strategy Design Driven by Simulated Annealing
Full text: