Cross-lingual word embeddings, notes 4: sentence-level alignment methods

Sentence-level methods using parallel data

These notes are based entirely on the book by Anders Søgaard et al.: [Søgaard2019] Søgaard, A., Vulić, I., Ruder, S., & Faruqui, M. (2019). Cross-Lingual Word Embeddings.

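Methods that use sentence-aligned data are mostly extensions of successful monolingual models, and they fall roughly into three categories.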

Compositional methods

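The compositional approach was introduced by [Hermann2013]. The idea is to compose word vectors into sentence vectors and then train the model so that the vectors of parallel sentences move close to each other. For a pair of parallel sentences ${sent}^s$ and ${sent}^t$, the vector $\boldsymbol{y}^s$ of the sentence ${sent}^s$ in language $s$ is the sum of the embeddings of its words: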
$\boldsymbol{y}^s = \sum_{i=1}^{|sent^s|}\boldsymbol{x}_i^s$

Training minimizes the distance between the aligned sentences $sent^s$ and $sent^t$:

$E_{\rm dist}(sent^s, sent^t) = \|\boldsymbol{y}^s - \boldsymbol{y}^t\|^2$

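The objective that is optimized is the max-margin hinge loss (MMHL), which pushes the distance between a source sentence and its aligned target sentence to be smaller than its distance to randomly sampled negative examples: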
$\Omega_{\rm MMHL} = \sum_{(sent^s, sent^t) \in \mathcal{C}}\sum_{i=1}^k \max\left(0, 1 + E_{\rm dist}(sent^s, sent^t) - E_{\rm dist}(sent^s, s_i^t)\right)$
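where $k$ is the number of negative samples, $s_i^t$ is the $i$-th negative target sentence, and $\mathcal{C}$ is the set of aligned sentence pairs. In addition, an $\ell_2$ regularization term $\Omega_{\ell_2} = \frac{\lambda}{2} \|\boldsymbol{X}\|^2$ is applied for each language. The final loss is therefore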
$J = \Omega_{\ell_2}(\boldsymbol{X}^s) + \Omega_{\ell_2}(\boldsymbol{X}^t) + \Omega_{\rm MMHL}(\boldsymbol{X}^s, \boldsymbol{X}^t, \mathcal{C})$
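All of these losses are optimized jointly, and there is no separately designed monolingual objective. [Soyer2015] extend the approach with a phrase-level monolingual objective.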
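The loss above can be sketched in plain Python to make the pieces concrete. This is a minimal illustration on toy data, not the implementation from [Hermann2013]: the function names (`compose`, `e_dist`, `mmhl`, `l2`, `total_loss`), the toy vocabularies, and the choice of drawing negatives uniformly from the other target sentences in the corpus are all assumptions made for this example.

```python
import random

def compose(sentence, emb):
    """Additive composition: sum the word vectors of a sentence."""
    dim = len(next(iter(emb.values())))
    y = [0.0] * dim
    for word in sentence:
        for d, value in enumerate(emb[word]):
            y[d] += value
    return y

def e_dist(y_s, y_t):
    """Squared Euclidean distance between two sentence vectors."""
    return sum((a - b) ** 2 for a, b in zip(y_s, y_t))

def mmhl(pairs, emb_s, emb_t, k=2, seed=0):
    """Max-margin hinge loss with margin 1 and k negatives per aligned pair."""
    rng = random.Random(seed)
    loss = 0.0
    for idx, (sent_s, sent_t) in enumerate(pairs):
        y_s = compose(sent_s, emb_s)
        pos = e_dist(y_s, compose(sent_t, emb_t))
        # Assumption: negatives are other target sentences, sampled uniformly.
        negatives = [t for j, (_, t) in enumerate(pairs) if j != idx]
        for _ in range(k):
            neg = e_dist(y_s, compose(rng.choice(negatives), emb_t))
            loss += max(0.0, 1.0 + pos - neg)
    return loss

def l2(emb, lam):
    """L2 regularizer: lambda / 2 times the squared norm of the embedding matrix."""
    return lam / 2 * sum(v * v for vec in emb.values() for v in vec)

def total_loss(pairs, emb_s, emb_t, lam=0.1, k=2):
    """J = l2(X^s) + l2(X^t) + MMHL."""
    return l2(emb_s, lam) + l2(emb_t, lam) + mmhl(pairs, emb_s, emb_t, k)

# Toy embeddings and a two-sentence parallel corpus (hypothetical data).
emb_s = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [1.0, 1.0]}
emb_t = {"le": [1.0, 0.0], "chat": [0.0, 1.0], "chien": [1.0, 1.0]}
pairs = [(["the", "cat"], ["le", "chat"]), (["the", "dog"], ["le", "chien"])]

print(total_loss(pairs, emb_s, emb_t))
```

In this toy setup the aligned sentences already have identical vectors and every negative is exactly one unit of squared distance away, so the hinge term vanishes and only the regularizers contribute.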

Bilingual autoencoders

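[Lauly2013] instead try to reconstruct the target sentence from the source sentence. They also encode a sentence as the sum of its word embeddings, and then train an autoencoder with language-specific encoders and decoders and a hierarchical softmax to reconstruct both the sentence itself and its translation. In this setup the encoder parameters are the word embedding matrices $\boldsymbol{X}^s$ and $\boldsymbol{X}^t$, and the decoder parameters are matrices that project the encoded representation into the output language space. The objective is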
$J = \mathcal{L}_{\rm AUTO}^{s\rightarrow s} + \mathcal{L}_{\rm AUTO}^{t\rightarrow t} + \mathcal{L}_{\rm AUTO}^{s\rightarrow t} + \mathcal{L}_{\rm AUTO}^{t\rightarrow s}$
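where $\mathcal{L}_{\rm AUTO}^{s \rightarrow t}$ is the loss for reconstructing a sentence in the target language $t$ from a sentence in the source language $s$. Aligned sentence pairs are sampled from the parallel corpus, and all losses are optimized jointly. In a follow-up from the same group, [Chandar2014] replace the hierarchical softmax with a binary bag-of-words. Because the bag-of-words raises the dimensionality and makes the model more expensive, they propose merging all bags-of-words in a mini-batch into a single bag-of-words and performing the update on that merged bag. They also add a term to the objective that encourages correlation between the source and target sentence vectors by summing the scalar correlations over all dimensions of the two vectors. (The book is fairly brief on this point, so the cited paper may be worth reading for the details.)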
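One possible reading of "summing the scalar correlations over all dimensions" is sketched below: for a mini-batch of encoded source and target sentences, compute the Pearson correlation of each dimension across the batch and add the results up. This is only an interpretation of the description above and should be checked against [Chandar2014]; the names `pearson` and `correlation_term`, and the choice to return 0 for a constant dimension, are assumptions of this example.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equally long lists of numbers."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a constant dimension counts as uncorrelated
    return cov / (norm_u * norm_v)

def correlation_term(enc_s, enc_t):
    """Sum of per-dimension correlations between two batches of sentence vectors.

    enc_s and enc_t are lists of vectors; row n of each is one aligned pair.
    """
    dim = len(enc_s[0])
    return sum(
        pearson([row[d] for row in enc_s], [row[d] for row in enc_t])
        for d in range(dim)
    )

# Toy batch of three aligned sentence encodings (hypothetical numbers).
enc_s = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
enc_t = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
print(correlation_term(enc_s, enc_t))
```

Both dimensions of the toy batch are perfectly correlated, so the term reaches its maximum for two dimensions. During training, a term like this would be maximized alongside the reconstruction losses.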

Bilingual skip-gram

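Several methods extend the monolingual SGNS (skip-gram with negative sampling) model to learn cross-lingual word embeddings. They all jointly optimize the monolingual SGNS losses of the two languages together with an additional cross-lingual regularization term, of the form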
$J = \mathcal{L}_{\rm SGNS}^s + \mathcal{L}_{\rm SGNS}^t + \Omega(\boldsymbol{X}^s, \boldsymbol{X}^t, \mathcal{C})$
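None of these methods uses word alignment information for the aligned sentences; they differ only in the assumptions they make about how the data is aligned.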

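BilBOWA (Bilingual Bag-of-Words without Word Alignments) [Gouws2015] assumes that every word in the source sentence is aligned with every word in the target sentence. When alignments are known, a natural idea is to pull the embeddings of aligned words close together. If every source word is aligned with all target words, this amounts to pulling the means of the two sentences' word embeddings close together, so the BilBOWA objective is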
$\begin{aligned} \boldsymbol{y}^s &= \frac{1}{|sent^s|}\sum_{i=1}^{|sent^s|}\boldsymbol{x}_i^s \\ \Omega_{\rm BilBOWA} &= \sum_{(sent^s, sent^t) \in \mathcal{C}} \|\boldsymbol{y}^s - \boldsymbol{y}^t\|^2 \end{aligned}$

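This regularizer closely resembles the distance term used by the compositional approach. The sentence vectors differ only in that BilBOWA takes the mean of the word embeddings, whereas the compositional approach takes the sum.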
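The difference between the mean and the sum shows up when parallel sentences have different lengths. The sketch below computes the BilBOWA regularizer on a toy pair and compares it with the same distance computed on summed vectors. The function names and the two-word example are made up for illustration, and the monolingual SGNS terms are left out.

```python
def sum_vector(sentence, emb):
    """Sum of the word vectors of a sentence."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for word in sentence:
        for d, value in enumerate(emb[word]):
            total[d] += value
    return total

def mean_vector(sentence, emb):
    """Mean of the word vectors of a sentence."""
    return [x / len(sentence) for x in sum_vector(sentence, emb)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def bilbowa_term(pairs, emb_s, emb_t):
    """Sum over aligned pairs of the squared distance between mean vectors."""
    return sum(
        sq_dist(mean_vector(s, emb_s), mean_vector(t, emb_t)) for s, t in pairs
    )

# Hypothetical pair: two source words translate to a single target word.
emb_s = {"ice": [1.0, 0.0], "cream": [1.0, 0.0]}
emb_t = {"Eis": [1.0, 0.0]}
pairs = [(["ice", "cream"], ["Eis"])]

mean_based = bilbowa_term(pairs, emb_s, emb_t)
sum_based = sq_dist(sum_vector(["ice", "cream"], emb_s), sum_vector(["Eis"], emb_t))
print(mean_based, sum_based)
```

With the mean, the length mismatch does not contribute to the distance in this example, while the summed vectors are one unit apart even though all three word vectors are identical.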

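Trans-gram [Coulmance2015] makes the same alignment assumption as BilBOWA, but also uses the SGNS objective for the cross-lingual regularization term. Here the center word is the aligned target-language word and the context words come from the source language. Given the alignment assumption above, this means using each word in the source sentence to predict all words in the target sentence, that is,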
$\Omega_{\text{Trans-gram}}=: \Omega_{\rm SGNS}^{s \rightarrow t} = -\sum_{(sent^s,sent^t)\in \mathcal{C}}\frac{1}{|sent^s|}\sum_{t=1}^{|sent^s|}\sum_{j=1}^{|sent^t|}\log P(w_{t+j}|w_t)$

where $P(w_{t+j}|w_t)$ is computed with negative sampling.
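A rough sketch of this cross-lingual term follows. It uses the usual negative-sampling approximation, where each log-probability is replaced by a log-sigmoid score for the observed pair plus log-sigmoid scores for $k$ sampled noise words. The separate context vectors `ctx_t`, the uniform noise distribution, and all names and numbers are assumptions of this example rather than details taken from [Coulmance2015].

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def transgram_term(pairs, emb_s, ctx_t, k=2, seed=0):
    """Every source word predicts every word of the aligned target sentence."""
    rng = random.Random(seed)
    vocab_t = sorted(ctx_t)
    loss = 0.0
    for sent_s, sent_t in pairs:
        pair_loss = 0.0
        for w_s in sent_s:
            for w_t in sent_t:
                # Observed (source word, target word) pair.
                pair_loss -= math.log(sigmoid(dot(emb_s[w_s], ctx_t[w_t])))
                # k noise words; assumption: sampled uniformly from the vocabulary.
                for _ in range(k):
                    noise = rng.choice(vocab_t)
                    pair_loss -= math.log(sigmoid(-dot(emb_s[w_s], ctx_t[noise])))
        loss += pair_loss / len(sent_s)  # the 1 / |sent^s| factor
    return loss

# Toy vectors (hypothetical numbers).
emb_s = {"the": [0.5, 0.1], "cat": [0.2, 0.9]}
ctx_t = {"le": [0.4, 0.0], "chat": [0.1, 0.8], "chien": [0.7, 0.3]}
pairs = [(["the", "cat"], ["le", "chat"])]

print(transgram_term(pairs, emb_s, ctx_t))
```

Unlike BilBOWA, this term scores individual word pairs instead of comparing two averaged sentence vectors, so its cost per sentence pair grows with $|sent^s| \cdot |sent^t| \cdot (1 + k)$.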

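BiSkip [Luong2015] uses the same objective as Trans-gram, but the $i$-th word of the source sentence is only used to predict the word at position $i\cdot \frac{|sent^t|}{|sent^s|}$ in the target sentence. In other words, the method assumes that the words of parallel sentences correspond monotonically, in order.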
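The monotonic alignment assumption is easy to state in code. The helper below maps each source position to a target position by scaling with the length ratio. Using 0-based positions and rounding down is a choice made for this sketch, and `monotonic_alignment` is a hypothetical name, since the text above does not say how fractional positions are rounded.

```python
def monotonic_alignment(len_s, len_t):
    """Map each 0-based source position i to the target position i * len_t // len_s."""
    return [i * len_t // len_s for i in range(len_s)]

print(monotonic_alignment(3, 3))  # [0, 1, 2]
print(monotonic_alignment(4, 6))  # [0, 1, 3, 4]
print(monotonic_alignment(6, 3))  # [0, 0, 1, 1, 2, 2]
```

When the target sentence is longer, some target words are never predicted. When it is shorter, several source words share one target word.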

Other methods

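[Levy2017] show that simply using the IDs of aligned sentence pairs in the parallel corpus as features already gives a strong baseline. (This paper seems remarkable and probably deserves a separate, closer read.)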
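To give a feel for what "sentence IDs as features" can mean, the toy example below represents each word by the set of sentence-pair IDs it occurs in and compares words across languages through the overlap of those sets. The Dice coefficient is chosen here only because it is simple; this is an illustration of the feature representation and makes no claim to reproduce the model in [Levy2017]. All names and data are hypothetical.

```python
from collections import defaultdict

def sentence_id_features(sentences):
    """Map each word to the set of sentence IDs in which it occurs."""
    feats = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        for word in sentence:
            feats[word].add(sent_id)
    return feats

def dice(a, b):
    """Dice coefficient between two sets of sentence IDs."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def best_match(word, feats_s, feats_t):
    """Target word whose sentence-ID set overlaps most with that of `word`."""
    return max(sorted(feats_t), key=lambda w: dice(feats_s[word], feats_t[w]))

# Sentence i on the source side is aligned with sentence i on the target side.
src = [["the", "cat"], ["the", "dog"], ["a", "cat"]]
tgt = [["le", "chat"], ["le", "chien"], ["un", "chat"]]
feats_s, feats_t = sentence_id_features(src), sentence_id_features(tgt)

print(best_match("cat", feats_s, feats_t))  # chat
print(best_match("dog", feats_s, feats_t))  # chien
```

No word alignment is used anywhere: words end up similar purely because they tend to occur in the same aligned sentence pairs.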

Sentence-level alignment methods using comparable data

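As with the word-level methods, this line of work uses captions of the same image in different languages. It is not covered further here.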

References

[Hermann2013]: Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. In Proc. of the International Conference on Learning Representations (Conference Track) (ICLR 2013).
[Soyer2015]: Hubert Soyer, Pontus Stenetorp, and Akiko Aizawa. 2015. Leveraging monolingual data for crosslingual compositional word representations. In Proc. of the 3rd International Conference on Learning Representations (ICLR 2015).
[Lauly2013]: Stanislas Lauly, Alex Boulanger, and Hugo Larochelle. 2013. Learning multilingual word representations using a bag-of-words autoencoder. arXiv preprint arXiv:1401.1803.
[Chandar2014]: Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Proc. of the 27th Annual Conference on Neural Information Processing Systems (NeurIPS 2014), pages 1853–1861.
[Gouws2015]: Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. of International Conference on Machine Learning (ICML 2015).
[Coulmance2015]: Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1109–1113.
[Luong2015]: Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proc. of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.
[Levy2017]: Omer Levy, Anders Søgaard, and Yoav Goldberg. 2017. A strong baseline for learning crosslingual word embeddings from sentence alignments. In Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (EACL 2017).
