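Machine Learning Techniques: Matrix Factorization

Linear Network Hypothesis

Matrix factorization is introduced here through a recommender system application.

Consider a movie rating prediction problem. For the $m$-th movie, the dataset consists of: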

$\left\{ \left( \tilde { \mathbf { x } } _ { n } = ( n ) , y _ { n } = r _ { n m } \right) : \text { user } n \text { rated movie } m \right\}$

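where $\tilde { \mathbf { x } } _ { n } = (n)$ is an abstract categorical feature.

What is a categorical feature? Examples include ID numbers, blood types (A, B, AB, O), and programming languages (C++, Python, Java).

Most machine learning algorithms, however, operate on numerical features; decision trees are a notable exception. The categorical feature therefore has to be converted (encoded) into a numerical one. Here the feature to convert is the user ID, and the tool is binary vector encoding: every element of the vector takes one of two values. With 0/1 encoding, the element at position ID is 1 and all other elements are 0.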
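
The encoding described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `binary_vector_encoding` and the choice of 1-based IDs are assumptions made for this example, not something fixed by the text.

```python
import numpy as np

def binary_vector_encoding(user_id: int, num_users: int) -> np.ndarray:
    """Return the 0/1 vector for a 1-based user ID (hypothetical helper)."""
    if not 1 <= user_id <= num_users:
        raise ValueError("user_id must be in 1..num_users")
    x = np.zeros(num_users)
    x[user_id - 1] = 1.0  # the ID-th element is 1, all others stay 0
    return x

x3 = binary_vector_encoding(3, num_users=5)
print(x3)  # [0. 0. 1. 0. 0.]
```

The vector has length $N$ (the number of users), so it is never stored explicitly for large $N$ in practice; the index alone carries the same information.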

The encoded dataset $\mathcal D_m$ for the $m$-th movie can then be written as:

$\left\{ \left( \mathbf { x } _ { n } = \text { BinaryVectorEncoding } ( n ) , y _ { n } = r _ { n m } \right) : \text { user } n \text { rated movie } m \right\}$

Combining the data for all movies gives the dataset $\mathcal D$:

$\left\{ \left( \mathbf { x } _ { n } = \text { BinaryVectorEncoding } ( n ) , \mathbf { y } _ { n } = \left[ \begin{array} { l l l l l l l } r _ { n 1 } & ? & ? & r _ { n 4 } & r _ { n 5 } & \ldots & r _ { n M } \end{array} \right] ^ { T } \right) \right\}$

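where $?$ marks a movie that user $n$ has not rated.

The idea is to extract features with an $N - \tilde { d } - M$ neural network: $N$ inputs (one per user), $\tilde { d }$ hidden units, and $M$ outputs (one per movie).

For now, use a linear activation function, which turns this into a linear network.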

Basic Matrix Factorization

Now rename the weight matrices:

$\mathrm { V } ^ { T } \text { for } \left[ w _ { n i } ^ { ( 1 ) } \right] \text { and } \mathrm { W } \text { for } \left[ w _ { i m } ^ { ( 2 ) } \right]$

The hypothesis can then be written as:

$\mathrm { h } ( \mathrm { x } ) = \mathrm { W } ^ { T } \underbrace { \mathrm { Vx } } _ { \Phi ( \mathrm { x } ) }$

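The matrix $\mathrm { V }$ is the feature transform $\Phi ( \mathrm { x } )$, and $\mathrm { W }$ then implements a linear model on the transformed data. Because $\mathrm { x } _ { n }$ is the 0/1 encoding of the ID, $\mathrm { V } \mathrm { x } _ { n }$ simply selects one column of $\mathrm { V }$, so the hypothesis for the $n$-th user is: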

$\mathrm { h } \left( \mathrm { x } _ { n } \right) = \mathrm { W } ^ { T } \mathbf { v } _ { n } , \text { where } \mathbf { v } _ { n } \text { is } n \text { -th column of } \mathrm { V }$
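
A quick numerical check of this identity, as a sketch: the sizes, the random matrices, and the 0-based index `n` are all illustrative assumptions. `V` is stored as a $\tilde d \times N$ array so that column `n` is $\mathbf v_n$, and `W` as $\tilde d \times M$ so that column `m` is $\mathbf w_m$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 4, 3, 2            # users, movies, hidden dimension (illustrative sizes)
V = rng.normal(size=(d, N))  # column n is the user vector v_n
W = rng.normal(size=(d, M))  # column m is the movie vector w_m

n = 2                        # 0-based user index
x_n = np.zeros(N)
x_n[n] = 1.0                 # binary vector encoding of user n

h_full = W.T @ (V @ x_n)     # h(x_n) = W^T V x_n
h_short = W.T @ V[:, n]      # h(x_n) = W^T v_n

print(np.allclose(h_full, h_short))  # True
```

Both expressions give the same length-$M$ vector of predicted ratings, one entry per movie.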

The hypothesis for the $m$-th movie is:

$h _ { m } ( \mathbf { x } ) = \mathbf { w } _ { m } ^ { T } \mathbf { \Phi } ( \mathbf { x } )$

For the recommender system, the task is now to find the best $\mathrm { W }$ and $\mathrm { V }$.

Ideally, $\mathrm { W }$ and $\mathrm { V }$ satisfy:

$r _ { n m } = y _ { n } \approx \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n }= \mathbf { v } _ { n } ^ { T } \mathbf { w } _ { m } \Longleftrightarrow \mathbf { R } \approx \mathbf { V } ^ { T } \mathbf { W }$

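In other words, the product $\mathbf { V } ^ { T } \mathbf { W }$ of the feature transform matrix and the linear model matrix approximates the rating matrix $\mathbf { R }$.

Recall the rating prediction illustration from Machine Learning Foundations: viewers and movies each have a feature vector, and the rating is predicted from the similarity (inner product) of the two vectors. Here the viewer and movie vectors are $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$.

For the dataset $\mathcal D$, the squared-error measure of this hypothesis is: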

$E _ { \mathrm { in } } \left( \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \right) = \frac { 1 } { \sum _ { m = 1 } ^ { M } \left| \mathcal { D } _ { m } \right| } \sum _ { \text {user } n \text { rated movie } m } \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 }$
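
As a sketch of how this error could be computed, the example below averages the squared residual over the known ratings only. The function name `squared_error_in`, the boolean `known` mask, and the use of 0.0 as a placeholder for unrated entries are assumptions made for illustration.

```python
import numpy as np

def squared_error_in(R, known, V, W):
    """Mean squared error over known ratings only.

    R: (N, M) ratings; known: (N, M) boolean mask of rated entries;
    V: (d, N) user vectors as columns; W: (d, M) movie vectors as columns.
    """
    pred = V.T @ W                # entry (n, m) is v_n^T w_m
    residual = (R - pred)[known]  # unrated entries are ignored
    return float(np.mean(residual ** 2))

R = np.array([[5.0, 0.0], [0.0, 1.0]])            # 0.0 is a placeholder for "?"
known = np.array([[True, False], [False, True]])
V = np.array([[1.0, 0.0]])   # d = 1: v_1 = 1, v_2 = 0
W = np.array([[4.0, 2.0]])   # d = 1: w_1 = 4, w_2 = 2
e = squared_error_in(R, known, V, W)
print(e)  # 1.0
```

Here the two known residuals are $5 - 4 = 1$ and $1 - 0 = 1$, so the mean squared error is 1.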

The goal is to learn $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$ from $\mathcal D$ so that this error is minimized:

$\begin{aligned} \min _ { \mathrm { W } , \mathrm { V } } E _ { \mathrm { in } } \left( \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \right) & \propto \sum _ { \text { user } n \text { rated movie } m } \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 } \\ & = \sum _ { m = 1 } ^ { M } \left( \sum _ { \left( \mathbf { x } _ { n } , r _ { n m } \right) \in \mathcal { D } _ { m } } \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 } \right) \quad (1) \\ & = \sum _ { n = 1 } ^ { N } \left( \sum _ { m : \text { user } n \text { rated movie } m } \left( r _ { n m } - \mathbf { v } _ { n } ^ { T } \mathbf { w } _ { m } \right) ^ { 2 } \right) \quad (2) \end{aligned}$

Because the expression involves two sets of variables, $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$, optimizing both at once is difficult. The basic idea is alternating minimization:

Fix $\mathbf { v } _ { n }$, i.e. fix the user feature vectors, and solve for each $\mathbf { w } _ { m }$ by minimizing $E _ { \text {in } }$ within $\mathcal { D } _ { m }$ (the ratings of movie $m$).
Fix $\mathbf { w } _ { m }$, i.e. fix the movie feature vectors, and solve for each $\mathbf { v } _ { n }$ by minimizing $E _ { \text {in } }$ over the ratings given by user $n$.

Each of these subproblems is an ordinary linear regression, so the procedure is called the alternating least squares algorithm. Its steps are:

$\begin{array} { l } \text { initialize } \tilde { d } \text { dimension vectors } \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \\ \text { alternating optimization of } E _ { \text {in } } : \text { repeatedly } \\ \qquad \text { optimize } \mathbf { w } _ { 1 } , \mathbf { w } _ { 2 } , \ldots , \mathbf { w } _ { M } \text { : } \text { update } \mathbf { w } _ { m } \text { by } m \text { -th-movie linear regression on } \left\{ \left( \mathbf { v } _ { n } , r _ { n m } \right) \right\} \\ \qquad \text { optimize } \mathbf { v } _ { 1 } , \mathbf { v } _ { 2 } , \ldots , \mathbf { v } _ { N } \text { : } \text { update } \mathbf { v } _ { n } \text { by } n \text { -th-user linear regression on } \left\{ \left( \mathbf { w } _ { m } , r _ { n m } \right) \right\} \\ \text { until converge } \end{array}$

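The vectors are initialized randomly. Each alternating step cannot increase $E _ { \text {in } }$, and $E _ { \text {in } }$ is bounded below by 0, so $E _ { \text {in } }$ converges. The alternation resembles users and movies dancing a tango.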
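
The alternating procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the function name `als`, the `known` mask, the fixed iteration count in place of a convergence test, the absence of regularization, and the synthetic rank-2 data are all choices made for this example. Each subproblem is solved with `np.linalg.lstsq`.

```python
import numpy as np

def als(R, known, d=2, iters=20, seed=0):
    """Alternating least squares on the known entries of R (shape (N, M))."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    V = rng.normal(scale=0.1, size=(d, N))  # column n is v_n
    W = rng.normal(scale=0.1, size=(d, M))  # column m is w_m
    history = []
    for _ in range(iters):
        # Fix V: the m-th movie's regression on {(v_n, r_nm)}.
        for m in range(M):
            users = known[:, m]
            if users.any():
                W[:, m] = np.linalg.lstsq(V[:, users].T, R[users, m], rcond=None)[0]
        # Fix W: the n-th user's regression on {(w_m, r_nm)}.
        for n in range(N):
            movies = known[n, :]
            if movies.any():
                V[:, n] = np.linalg.lstsq(W[:, movies].T, R[n, movies], rcond=None)[0]
        # Sum of squared residuals over known ratings after this round.
        history.append(float(np.sum((R - V.T @ W)[known] ** 2)))
    return V, W, history

# Synthetic example: a rank-2 rating matrix with roughly 70% of entries known.
rng = np.random.default_rng(1)
R_true = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
known = rng.random((6, 5)) < 0.7
V, W, history = als(R_true, known, d=2, iters=30)
```

The recorded `history` is non-increasing from one round to the next, which is the monotonic decrease of $E _ { \text {in } }$ that the convergence argument relies on.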

Linear Autoencoder versus Matrix Factorization

$\begin{array}{c|c|c} &\text{Linear Autoencoder}&\text{Matrix Factorization}\\ \hline \text{goal} &\mathrm { X } \approx \mathrm { W } \left( \mathrm { W } ^ { T } \mathrm { X } \right)&\mathbf { R } \approx \mathbf { V } ^ { T } \mathbf { W }\\ \hline \text{motivation}&\text { special } d - \tilde { d } - d \text { linear NNet }&N - \tilde { d } - M \text { linear NNet }\\ \hline \text{solution} & \text { global optimal at eigenvectors of } \mathrm { X } ^ { T } \mathrm { X } & \text { local optimal via alternating least squares } \\ \hline \text { usefulness}& \text { extract dimension-reduced features } & \text { extract hidden user/movie features } \end{array}$

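So a linear autoencoder is a special matrix factorization of the complete matrix $\mathrm{X}$.

Stochastic Gradient Descent

Besides alternating optimization, another option is stochastic gradient descent (SGD).

Recall the error measure for matrix factorization: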

$E _ { \mathrm { in } } \left( \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \right) \propto \sum _ { \text {user } n \text { rated movie } m } \underbrace { \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 } } _ { \text {err(user } n , \text { movie } m , \text { rating } r_{nm} )}$

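SGD is cheap per iteration and simple to implement, and it extends easily to other error measures.

Since each step optimizes on a single example, first look at the per-example error: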

$\operatorname { err } \left( \text {user } n , \text { movie } m , \text { rating } r _ { n m } \right) = \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 }$

Its gradients are:

$\begin{array} { r l } \nabla _ { \mathbf { v } _ { n } } & \operatorname { err } \left( \text { user } n , \text { movie } m , \text { rating } r _ { n m } \right) = - 2 \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) \mathbf { w } _ { m } \\ \nabla _ { \mathbf { w } _ { m } } & \operatorname { err } \left( \text { user } n , \text { movie } m , \text { rating } r _ { n m } \right) = - 2 \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) \mathbf { v } _ { n } \end{array}$
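
These two gradients can be checked numerically. The sketch below compares them with central finite differences; the helper names (`err`, `grad_v`, `grad_w`, `numeric_grad`) and the test values are made up for this example.

```python
import numpy as np

def err(v, w, r):
    """Per-example squared error (r - w^T v)^2."""
    return (r - w @ v) ** 2

def grad_v(v, w, r):
    return -2.0 * (r - w @ v) * w  # -2 * residual * w_m

def grad_w(v, w, r):
    return -2.0 * (r - w @ v) * v  # -2 * residual * v_n

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

v = np.array([0.5, -1.0, 2.0])
w = np.array([1.5, 0.3, -0.7])
r = 4.0
num_v = numeric_grad(lambda z: err(z, w, r), v)
num_w = numeric_grad(lambda z: err(v, z, r), w)

print(np.allclose(num_v, grad_v(v, w, r), atol=1e-5))  # True
print(np.allclose(num_w, grad_w(v, w, r), atol=1e-5))  # True
```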

That is, one example only affects its own $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$; the gradient with respect to every other parameter is zero. In summary:

$\text {per-example gradient } \propto - ( \text { residual } ) ( \text { the other feature vector } )$

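The steps for solving matrix factorization with SGD are then as follows (the constant factor 2 from the gradient is absorbed into the learning rate $\eta$):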

$\begin{array} { l } \text { initialize } \tilde { d } \text { dimension vectors } \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \text { randomly } \\ \text{ for } t = 0,1 , \ldots , T \\ \qquad \text { (1) randomly pick } ( n , m ) \text { within all known } r _ { n m } \\ \qquad \text { (2) calculate residual } \tilde { r } _ { n m } = \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) \\ \qquad \text { (3) SGD-update: } \\ \qquad\qquad\qquad \begin{aligned} \mathbf { v } _ { n } ^ { n e w } & \leftarrow \mathbf { v } _ { n } ^ { o l d } + \eta \cdot \tilde { r } _ { n m } \mathbf { w } _ { m } ^ { o l d } \\ \mathbf { w } _ { m } ^ { n e w } & \leftarrow \mathbf { w } _ { m } ^ { o l d } + \eta \cdot \tilde { r } _ { n m } \mathbf { v } _ { n } ^ { o l d } \end{aligned} \end{array}$
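
A minimal sketch of these steps in NumPy follows. The function name `sgd_mf`, the `(n, m, r)` triple format with 0-based indices, the default values of `eta` and `T`, and the tiny hand-made rating list are all assumptions for illustration; a practical version would typically add regularization and a learning-rate schedule.

```python
import numpy as np

def sgd_mf(ratings, N, M, d=2, eta=0.02, T=20000, seed=0):
    """SGD for matrix factorization; ratings is a list of (n, m, r_nm)."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(d, N))  # column n is v_n
    W = rng.normal(scale=0.1, size=(d, M))  # column m is w_m
    for _ in range(T):
        # (1) randomly pick (n, m) among the known ratings
        n, m, r = ratings[rng.integers(len(ratings))]
        # (2) residual of the current prediction
        residual = r - W[:, m] @ V[:, n]
        # (3) SGD update; both updates use the old vectors
        v_old = V[:, n].copy()
        V[:, n] += eta * residual * W[:, m]
        W[:, m] += eta * residual * v_old
    return V, W

def mse(ratings, V, W):
    """Mean squared error over the given known ratings."""
    return float(np.mean([(r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings]))

ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 2.5), (1, 0, 2.0), (1, 1, 4.0)]
V0, W0 = sgd_mf(ratings, N=2, M=3, T=0)  # T=0 returns the random initialization
V, W = sgd_mf(ratings, N=2, M=3)
print(mse(ratings, V0, W0), mse(ratings, V, W))
```

Copying `v_old` before the update matters: without it, the update of $\mathbf w_m$ would use the already-updated $\mathbf v_n$ instead of $\mathbf v_n^{old}$.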

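One point to keep in mind: SGD optimizes on randomly picked examples. For time-sensitive tasks, where recent data is more informative, biasing the random pick toward recent examples may give better results. Understanding this makes it easier to adapt the algorithm in practice.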
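
One possible way to bias the pick toward recent ratings is sketched below. The exponential half-life weighting, the function name `recency_probabilities`, and the toy timestamps are assumptions chosen for illustration; the text does not prescribe a particular weighting scheme.

```python
import numpy as np

def recency_probabilities(timestamps, half_life):
    """Sampling probabilities that halve for every `half_life` units of age."""
    t = np.asarray(timestamps, dtype=float)
    age = t.max() - t                  # the newest rating has age 0
    weights = 0.5 ** (age / half_life)
    return weights / weights.sum()

timestamps = [0.0, 10.0, 20.0, 30.0]   # one timestamp per known rating
p = recency_probabilities(timestamps, half_life=10.0)

rng = np.random.default_rng(0)
picks = rng.choice(len(timestamps), size=5, p=p)  # indices of ratings to visit
```

With these timestamps the weights are 1/8, 1/4, 1/2 and 1, so the newest rating is picked with probability 8/15 and the oldest with probability 1/15. The sampled indices would replace the uniform pick in step (1) of the SGD procedure.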

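Map of Extraction Models

Extraction models: the idea is to embed the feature transform as hidden variables inside a linear model (or another base model). Besides learning the model itself, the algorithm also learns from the data which transform represents the data effectively.

In neural networks and deep learning, the first $L - 1$ layers ( $\text { weights } w _ { i j } ^ { ( \ell ) }$ ) perform the feature transform, and the last layer ( $\text { weights } w _ { i j } ^ { ( L ) }$ ) is a linear model. The hidden transform is therefore learned together with the linear model.

In an RBF network, the last layer is also a linear model ( $\text { weights } \beta _{ m }$ ), while the hidden variables, the representative centers ( $\text { RBF centers } \mu _ { m }$ ), are likewise learned features.

In matrix factorization, two sets of vectors are learned, $\mathbf { w } _ { m }$ and $\mathbf { v } _ { n }$. Either set can be viewed as linear model weights or as feature vectors, depending on whether one takes the user's or the movie's point of view.

In adaptive and gradient boosting (Adaptive/Gradient Boosting), finding each hypothesis $g_t$ is a form of feature learning, and the learned coefficients $\alpha_t$ are the weights of the linear model.

By comparison, in the k nearest neighbor algorithm the top-k neighbors act as the feature transform, and each neighbor's vote $y_n$ acts as a linear model weight.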

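Map of Extraction Techniques

Neural networks and deep learning use backpropagation (backprop) based on stochastic gradient descent (SGD). One special case is the autoencoder, which sets the output equal to the input and learns a compressed encoding.

RBF networks use k-means clustering to find the centers.

Matrix factorization can use alternating least squares or stochastic gradient descent (SGD).

Adaptive and gradient boosting (Adaptive/Gradient Boosting) obtain each hypothesis $g_t$ through functional gradient descent.

By comparison, the k nearest neighbor algorithm uses lazy learning: almost nothing is done during training, and at test time the stored data is used to make the inference.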

Pros and Cons of Extraction Models

The pros and cons of extraction models (Neural Network/Deep Learning, RBF Network, Matrix Factorization) are as follows.

Pros:

easy: reduces the human burden of designing features
powerful: if enough hidden variables are considered

Cons:

hard: generally non-convex optimization problems
overfitting: because the models are so powerful, proper regularization/validation is needed
