A Collection of LaTeX Code for Common AI/Machine Learning Equations

When writing papers or blog posts on AI/machine learning, I often need LaTeX code for common equations. Yet when I searched online, as a seasoned "copy-paster", I was surprised to find no ready-made collection.

So I have compiled the formulas I encounter most often, focusing on NLP and common metric functions. Feel free to take what you need, and corrections or additions are welcome if you spot any problems or omissions. (I have also synced this to GitHub ( https://github.com/blmoistawinde/ml_equations_latex ); issues and PRs are welcome, and of course stars~)

Classical ML Equations in LaTeX

A collection of classical ML equations in LaTeX. Some of them come with brief notes and links to the original papers. Hopefully this helps with writing papers and blog posts.

Better viewed at https://blmoistawinde.github.io/ml_equations_latex/

Model

RNNs(LSTM, GRU)

encoder hidden state $h_t$ at time step $t$
$$h_t = RNN_{enc}(x_t, h_{t-1})$$

decoder hidden state $s_t$ at time step $t$

$$s_t = RNN_{dec}(y_t, s_{t-1})$$

h_t = RNN_{enc}(x_t, h_{t-1})
s_t = RNN_{dec}(y_t, s_{t-1})

The $RNN_{enc}$, $RNN_{dec}$ are usually either an LSTM or a GRU cell.
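
For intuition, here is a minimal NumPy sketch of one encoder step, using a vanilla RNN cell as a stand-in for $RNN_{enc}$ (an LSTM/GRU keeps the same $(x_t, h_{t-1}) \mapsto h_t$ signature but adds gating); the weight names and sizes below are made up for illustration.

```python
import numpy as np

def rnn_enc_step(x_t, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN encoder: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
    An LSTM/GRU keeps the same (x_t, h_{t-1}) -> h_t signature but adds gating."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# toy usage: input dim 4, hidden dim 3, a sequence of 5 input vectors
rng = np.random.default_rng(0)
W_xh, W_hh, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h = rnn_enc_step(x_t, h, W_xh, W_hh, b)
print(h.shape)  # (3,)
```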

Attentional Seq2seq

The attention weight $\alpha_{ij}$ of the $i$th decoder step over the $j$th encoder step, resulting in context vector $c_i$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$e_{ij} = a(s_{i-1}, h_j)$$

c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

e_{ij} = a(s_{i-1}, h_j)

$a$ is a specific attention function, which can be one of the following:

Bahdanau Attention

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

$$e_{ij} = v^T tanh(W[s_{i-1}; h_j])$$

e_{ij} = v^T tanh(W[s_{i-1}; h_j])
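
A small NumPy sketch of the additive score above; $W$, $v$ and the dimensions are illustrative placeholders.

```python
import numpy as np

def bahdanau_score(s_prev, h_j, W, v):
    """Additive (Bahdanau) score: e_ij = v^T tanh(W [s_{i-1}; h_j])."""
    return v @ np.tanh(W @ np.concatenate([s_prev, h_j]))

# toy usage: decoder/encoder states of dim 8, attention dim 10
rng = np.random.default_rng(0)
s_prev, h_j = rng.normal(size=8), rng.normal(size=8)
W, v = rng.normal(size=(10, 16)), rng.normal(size=10)
print(bahdanau_score(s_prev, h_j, W, v))  # a scalar energy e_ij
```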

Luong(Dot-Product) Attention

Paper: Effective Approaches to Attention-based Neural Machine Translation

If $s_i$ and $h_j$ have the same number of dimensions:

$$e_{ij} = s_{i-1}^T h_j$$

otherwise

$$e_{ij} = s_{i-1}^T W h_j$$

e_{ij} = s_{i-1}^T h_j

e_{ij} = s_{i-1}^T W h_j

Finally, the output $o_t$ is produced by:

$$s_t = tanh(W[s_{t-1};y_t;c_t])$$
$$o_t = softmax(Vs_t)$$

s_t = tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
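
Putting the pieces together, here is a minimal NumPy sketch of one attentional decoder step, assuming the Luong dot-product score and illustrative weight matrices $W$, $V$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_decoder_step(s_prev, y_t, enc_states, W, V):
    """One decoder step with dot-product attention:
    e_ij = s_{i-1}^T h_j, alpha_ij = softmax_j(e_ij), c_i = sum_j alpha_ij h_j,
    s_t = tanh(W [s_{t-1}; y_t; c_t]), o_t = softmax(V s_t)."""
    e = enc_states @ s_prev        # dot-product scores over the T_x encoder steps
    alpha = softmax(e)             # attention weights
    c = alpha @ enc_states         # context vector
    s_t = np.tanh(W @ np.concatenate([s_prev, y_t, c]))
    o_t = softmax(V @ s_t)
    return s_t, o_t

# toy usage: hidden dim 8, embedding dim 4, vocab size 10, T_x = 6
rng = np.random.default_rng(0)
d_h, d_y, vocab = 8, 4, 10
enc_states = rng.normal(size=(6, d_h))
W = rng.normal(size=(d_h, d_h + d_y + d_h))
V = rng.normal(size=(vocab, d_h))
s_t, o_t = attention_decoder_step(rng.normal(size=d_h), rng.normal(size=d_y), enc_states, W, V)
print(s_t.shape, o_t.sum())  # (8,) ~1.0
```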

Transformer

Paper: Attention Is All You Need

Scaled Dot-Product attention

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V

where $d_k$ is the dimension of the key vectors $k$ and query vectors $q$.
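
A minimal NumPy sketch of the formula above (row-wise softmax over the keys); the shapes used here are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (n_queries, d_v)

# toy usage: 5 queries, 7 keys/values, d_k = 16, d_v = 32
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 16)), rng.normal(size=(7, 16)), rng.normal(size=(7, 32))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 32)
```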

Multi-head attention

$$MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O$$

where

$$head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)$$

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
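
A NumPy sketch of multi-head attention built on the scaled dot-product attention above; the head count and dimensions are arbitrary illustrative choices.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention, as above."""
    s = q @ k.T / np.sqrt(k.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return (s / s.sum(axis=-1, keepdims=True)) @ v

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W_O,
    head_i = Attention(Q W_Q[i], K W_K[i], V W_V[i])."""
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(len(W_Q))]
    return np.concatenate(heads, axis=-1) @ W_O

# toy usage: d_model = 16, h = 4 heads, per-head dim 4, self-attention over 5 positions
rng = np.random.default_rng(0)
h, d_model, d_head = 4, 16, 4
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))
Q = K = V = rng.normal(size=(5, d_model))
print(multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```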

Generative Adversarial Networks(GAN)

Paper: Generative Adversarial Networks

Minimax game objective

$$\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{\text{generated}}(z)}[\log{(1 - D(G(z)))}]$$

\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{\text{generated}}(z)}[\log{(1 - D(G(z)))}]
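
As a sanity check of the objective, here is a NumPy sketch that evaluates the value function $V(D, G)$ on a batch; the stand-in discriminator and generator below are toy functions, not real networks.

```python
import numpy as np

def gan_value(D, G, x_real, z):
    """V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].
    D maps samples to probabilities in (0, 1); G maps noise to samples.
    The discriminator maximizes V while the generator minimizes it."""
    eps = 1e-12  # avoid log(0)
    d_real = np.clip(D(x_real), eps, 1 - eps)
    d_fake = np.clip(D(G(z)), eps, 1 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# toy usage with stand-in "networks" (not real models)
rng = np.random.default_rng(0)
D = lambda x: 1 / (1 + np.exp(-x.sum(axis=-1)))  # toy discriminator
G = lambda z: 2 * z                              # toy generator
print(gan_value(D, G, rng.normal(size=(64, 2)), rng.normal(size=(64, 2))))
```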

Variational Auto-Encoder(VAE)

Paper: Auto-Encoding Variational Bayes

Reparameterization trick

To produce a latent variable $z$ such that $z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)$, we sample $\epsilon \sim \mathcal{N}(0,1)$, then $z$ is produced by

$$z = \mu + \epsilon \cdot \sigma$$

z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)
\epsilon \sim \mathcal{N}(0,1)
z = \mu + \epsilon \cdot \sigma

Above is for 1-D case. For a multi-dimensional (vector) case we use:

$$\vec{\epsilon} \sim \mathcal{N}(0, \textbf{I})$$

$$\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})$$

\vec{\epsilon} \sim \mathcal{N}(0, \textbf{I})
\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})
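
A NumPy sketch of the reparameterization trick for the vector case, assuming a per-dimension (diagonal) $\sigma$.

```python
import numpy as np

def reparameterize(mu, sigma, rng=None):
    """Sample z ~ N(mu, sigma^2 I) via z = mu + sigma * eps with eps ~ N(0, I).
    The randomness lives in eps, so gradients can flow through mu and sigma."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal(size=np.shape(mu))
    return mu + sigma * eps

mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 0.1])   # per-dimension standard deviation
print(reparameterize(mu, sigma))    # one sample of the 3-D latent vector z
```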

Activations

Sigmoid

Related to Logistic Regression. Used for binary classification, whether single-label or multi-label (one sigmoid per label).

$$\sigma(z) = \frac{1} {1 + e^{-z}}$$

\sigma(z) = \frac{1} {1 + e^{-z}}

Softmax

For multi-class single label classification.

$$\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K$$

\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K

Relu

$$Relu(z) = max(0, z)$$

Relu(z) = max(0, z)
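
The three activations above, sketched in NumPy.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z}), element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """sigma(z_i) = e^{z_i} / sum_j e^{z_j}; max is subtracted for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    """Relu(z) = max(0, z), element-wise."""
    return np.maximum(0.0, z)

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), softmax(z), relu(z))
```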

Loss

Regression

Below, $x$ and $y$ are $D$-dimensional vectors, and $x_i$ denotes the value on the $i$th dimension of $x$.

Mean Absolute Error(MAE)

$$\sum_{i=1}^{D}|x_i-y_i|$$

\sum_{i=1}^{D}|x_i-y_i|

Mean Squared Error(MSE)

$$\sum_{i=1}^{D}(x_i-y_i)^2$$

\sum_{i=1}^{D}(x_i-y_i)^2

Huber loss

It is less sensitive to outliers than the MSE because it treats the error quadratically only inside an interval.

$$L_{\delta}= \left\{\begin{matrix} \frac{1}{2}(y - \hat{y})^{2} & if \left | (y - \hat{y}) \right | < \delta\\ \delta ((y - \hat{y}) - \frac1 2 \delta) & otherwise \end{matrix}\right.$$

L_{\delta}=
    \left\{\begin{matrix}
        \frac{1}{2}(y - \hat{y})^{2} & if \left | (y - \hat{y})  \right | < \delta\\
        \delta ((y - \hat{y}) - \frac1 2 \delta) & otherwise
    \end{matrix}\right.
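
The regression losses above (MAE, MSE and Huber), sketched in NumPy; the MAE/MSE sums are written exactly as in the formulas, and the Huber loss is returned element-wise.

```python
import numpy as np

def mae(x, y):
    """Sum of absolute errors over the D dimensions, as written above."""
    return np.sum(np.abs(x - y))

def mse(x, y):
    """Sum of squared errors over the D dimensions, as written above."""
    return np.sum((x - y) ** 2)

def huber(y, y_hat, delta=1.0):
    """Element-wise Huber loss: quadratic for |y - y_hat| < delta, linear otherwise."""
    err = np.abs(y - y_hat)
    return np.where(err < delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 5.0])
print(mae(x, y), mse(x, y), huber(x, y, delta=1.0))  # 2.5 4.25 [0.125 0. 1.5]
```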

Classification

Cross Entropy

  • In binary classification, where the number of classes $M$ equals 2, Binary Cross-Entropy(BCE) can be calculated as:

$$-{(y\log(p) + (1 - y)\log(1 - p))}$$

  • If $M > 2$ (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.

$$-\sum_{c=1}^My_{o,c}\log(p_{o,c})$$

-{(y\log(p) + (1 - y)\log(1 - p))}

-\sum_{c=1}^My_{o,c}\log(p_{o,c})

  • M - number of classes
  • log - the natural log
  • y - binary indicator (0 or 1) if class label c is the correct classification for observation o
  • p - predicted probability observation o is of class c
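
A NumPy sketch of both cases for a single observation; clipping is added only to avoid log(0).

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """BCE = -(y log(p) + (1 - y) log(1 - p)) for one observation."""
    p = np.clip(p, eps, 1 - eps)   # clip only to avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, p, eps=1e-12):
    """Multi-class CE = -sum_c y_{o,c} log(p_{o,c}) for one observation o."""
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

print(binary_cross_entropy(1, 0.8))                                   # ~0.223
print(cross_entropy(np.array([0, 1, 0]), np.array([0.1, 0.7, 0.2])))  # ~0.357
```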

Negative Loglikelihood

$$NLL(y) = -{\log(p(y))}$$

Minimizing negative loglikelihood

$$\min_{\theta} \sum_y {-\log(p(y;\theta))}$$

is equivalent to Maximum Likelihood Estimation(MLE).

$$\max_{\theta} \prod_y p(y;\theta)$$

Here $p(y)$ is a scalar rather than a vector: it is the predicted probability at the single dimension where the ground truth $y$ lies. It is thus equivalent to cross entropy (see wiki).

NLL(y) = -{\log(p(y))}

\min_{\theta} \sum_y {-\log(p(y;\theta))}

\max_{\theta} \prod_y p(y;\theta)

Hinge loss

Used in Support Vector Machine(SVM).

$$max(0, 1 - y \cdot \hat{y})$$

max(0, 1 - y \cdot \hat{y})

KL/JS divergence

$$KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}$$

$$JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))$$

KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}

JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))
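
A NumPy sketch of both divergences for discrete distributions; clipping is added only to avoid log(0).

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_c p_c log(p_c / q_c) for discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)   # clip only to avoid log(0)
    return np.sum(p * np.log(p / q))

def js_divergence(p, q):
    """JS(p || q) = (KL(p || m) + KL(q || m)) / 2 with m = (p + q) / 2."""
    m = 0.5 * (np.asarray(p) + np.asarray(q))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q), js_divergence(p, q))
```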

Regularization

The $Error$ below can be any of the losses above.

L1 regularization

A regression model that uses the L1 regularization technique is called Lasso Regression.

$$Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|$$

Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|

L2 regularization

A regression model that uses the L2 regularization technique is called Ridge Regression.

$$Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n w_i^{2}$$

Loss = Error(Y - \widehat{Y}) +  \lambda \sum_1^n w_i^{2}
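
A NumPy sketch of both regularized losses; the base error value and weights are illustrative.

```python
import numpy as np

def l1_regularized_loss(error, w, lam):
    """Loss = Error + lambda * sum_i |w_i|   (Lasso)."""
    return error + lam * np.sum(np.abs(w))

def l2_regularized_loss(error, w, lam):
    """Loss = Error + lambda * sum_i w_i^2   (Ridge)."""
    return error + lam * np.sum(w ** 2)

w = np.array([0.5, -1.0, 2.0])
base_error = 0.42                                    # any of the losses above
print(l1_regularized_loss(base_error, w, lam=0.1))   # ≈ 0.77
print(l2_regularized_loss(base_error, w, lam=0.1))   # ≈ 0.945
```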

Metrics

Some of these overlap with the losses above, such as MAE and KL-divergence.

Classification

Accuracy, Precision, Recall, F1

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$

$$Precision = \frac{TP}{TP+FP}$$

$$Recall = \frac{TP}{TP+FN}$$

$$F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}$$

Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}
F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}
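
A NumPy sketch computing all four metrics from binary labels (no zero-division guards, for brevity).

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels (1 = positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
# all four come out to 2/3 for this toy example
```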

Sensitivity, Specificity and AUC

$$Sensitivity = Recall = \frac{TP}{TP+FN}$$

$$Specificity = \frac{TN}{FP+TN}$$

Sensitivity = Recall = \frac{TP}{TP+FN}
Specificity = \frac{TN}{FP+TN}

AUC is calculated as the Area Under the $Sensitivity$(TPR)-$(1-Specificity)$(FPR) Curve.

Regression

MAE, MSE: see the equations above.

Clustering

(Normalized) Mutual Information (NMI)

The Mutual Information is a measure of the similarity between two labelings of the same data. Where $|U_i|$ is the number of samples in cluster $U_i$ and $|V_j|$ is the number of samples in cluster $V_j$, the Mutual Information between clusterings $U$ and $V$ is given as:

$$MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N} \log\frac{N|U_i \cap V_j|}{|U_i||V_j|}$$

MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}
\log\frac{N|U_i \cap V_j|}{|U_i||V_j|}

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score that scales the result between 0 (no mutual information) and 1 (perfect correlation). Here, the mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred); see wiki.
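
A NumPy sketch of MI and of one common NMI normalization (the arithmetic mean of the two entropies); the helper names below are made up.

```python
import numpy as np

def mutual_information(labels_u, labels_v):
    """MI(U, V) = sum_ij |Ui ∩ Vj| / N * log(N |Ui ∩ Vj| / (|Ui| |Vj|))."""
    labels_u, labels_v = np.asarray(labels_u), np.asarray(labels_v)
    n, mi = len(labels_u), 0.0
    for u in np.unique(labels_u):
        for v in np.unique(labels_v):
            n_uv = np.sum((labels_u == u) & (labels_v == v))  # |Ui ∩ Vj|
            if n_uv == 0:
                continue
            n_u, n_v = np.sum(labels_u == u), np.sum(labels_v == v)
            mi += n_uv / n * np.log(n * n_uv / (n_u * n_v))
    return mi

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(labels_u, labels_v):
    """MI normalized by the arithmetic mean of the two entropies (one common choice)."""
    return mutual_information(labels_u, labels_v) / (0.5 * (entropy(labels_u) + entropy(labels_v)))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions up to relabeling -> 1.0
```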

Skipping RI and ARI due to their complexity.

Also skipping metrics for related tasks (e.g. modularity for community detection [graph clustering], coherence score for topic modeling [soft clustering]).

Ranking

Skipping nDCG (Normalized Discounted Cumulative Gain) due to its complexity.

(Mean) Average Precision(MAP)

Average Precision is calculated as:

$$\text{AP} = \sum_n (R_n - R_{n-1}) P_n$$

\text{AP} = \sum_n (R_n - R_{n-1}) P_n

where $R_n$ and $P_n$ are the recall and precision at the $n$th threshold.

MAP is the mean of AP over all the queries.
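
A NumPy sketch of AP computed from relevance labels and ranking scores, evaluating precision and recall at every rank cutoff.

```python
import numpy as np

def average_precision(y_true, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with P_n, R_n evaluated at every rank cutoff."""
    order = np.argsort(-np.asarray(scores))          # rank items by descending score
    y_true = np.asarray(y_true)[order]
    tp_cum = np.cumsum(y_true)                       # true positives within the top n
    precision = tp_cum / np.arange(1, len(y_true) + 1)
    recall = tp_cum / y_true.sum()
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return np.sum((recall - prev_recall) * precision)

# toy usage: 2 relevant items among 4, ranked 1st and 3rd
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # (1 + 2/3) / 2 ≈ 0.833
```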

Similarity/Relevance

Cosine

$$Cosine(x,y) = \frac{x \cdot y}{|x||y|}$$

Cosine(x,y) = \frac{x \cdot y}{|x||y|}

Jaccard

Similarity of two sets $U$ and $V$.

$$Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}$$

Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}

Pointwise Mutual Information(PMI)

Relevance of two events $x$ and $y$.

$$PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}$$

PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}

For example, $p(x)$ and $p(y)$ are the frequencies of words $x$ and $y$ appearing in a corpus, and $p(x,y)$ is the frequency of their co-occurrence.
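
The three similarity/relevance measures above, sketched in NumPy and plain Python.

```python
import numpy as np

def cosine(x, y):
    """Cosine(x, y) = (x . y) / (|x| |y|)."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(u, v):
    """Jaccard(U, V) = |U ∩ V| / |U ∪ V| for two sets."""
    u, v = set(u), set(v)
    return len(u & v) / len(u | v)

def pmi(p_xy, p_x, p_y):
    """PMI(x; y) = log(p(x, y) / (p(x) p(y)))."""
    return np.log(p_xy / (p_x * p_y))

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))           # 2 / 4 = 0.5
print(pmi(p_xy=0.01, p_x=0.05, p_y=0.04))                  # log(5) ≈ 1.609
```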

Notes

This repository currently contains only simple ML equations, mainly for deep learning and NLP, reflecting personal research interests.

Due to time constraints, elegant equations from traditional ML approaches like SVM, SVD, PCA and LDA are not included yet.

Moreover, there is a trend towards more complex metrics, which have to be computed with complicated programs (e.g. BLEU, ROUGE, METEOR), iterative algorithms (e.g. PageRank), optimization (e.g. Earth Mover's Distance), or even learned models (e.g. BERTScore). They thus cannot be described by simple equations.

Reference

Pytorch Documentation

Scikit-learn Documentation

Machine Learning Glossary

Wikipedia

https://blog.floydhub.com/gans-story-so-far/

https://ermongroup.github.io/cs228-notes/extras/vae/

Thanks to a-rodin's solution for showing LaTeX in GitHub markdown, which I have wrapped into latex2pic.py.
