Neural Architectures for Named Entity Recognition(用於命名實體識別的神經結構)全文翻譯

## 前言
原文:https://arxiv.org/pdf/1603.01360.pdf
主要使用翻譯軟件:http://fanyi.youdao.com/
人工修改:https://blog.csdn.net/qq_41837900
本文主要使用有道翻譯,由人工對細節修改,力求達到信達雅。

正文:

Neural Architectures for Named Entity Recognition(用於命名實體識別的神經結構)

Guillaume Lample♠ Miguel Ballesteros♣♠
Sandeep Subramanian♠ Kazuya Kawakami♠ Chris Dyer♠
♠Carnegie Mellon University ♣NLP Group, Pompeu Fabra University

{glample,sandeeps,kkawakam,cdyer}@cs.cmu.edu,
[email protected]

Abstract(摘要)

State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn
effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures—one
based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based
approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.1


1The code of the LSTM-CRF and Stack-LSTM NER systems are available at https://github.com/glample/tagger and https://github.com/clab/stack-lstm-ner
LSTM-CRF 和 Stack-LSTM NER 系統的代碼分別公開在 https://github.com/glample/tagger 和 https://github.com/clab/stack-lstm-ner 。

(現在)最先進的命名實體識別系統嚴重依賴手工構造的特徵和特定領域的知識,以便從現有的小規模有監督訓練語料庫中有效地學習。在本文中,我們介紹了兩種新的神經結構——一種基於雙向LSTM和條件隨機場,另一種受移位-歸約(shift-reduce)解析器啓發,使用基於轉移的方法來構造並標註片段。我們的模型依賴於關於單詞的兩種信息來源:從有監督語料庫中學習到的基於字符的單詞表示,以及從未標註語料庫中學習到的無監督單詞表示。我們的模型在四種語言的NER任務上都取得了最先進的性能,而不依賴任何特定於語言的知識或資源(如地名詞典)。

1 Introduction(介紹)

Named entity recognition (NER) is a challenging learning problem. On the one hand, in most languages and domains, there is only a very small amount of supervised training data available. On the other, there are few constraints on the kinds of words that can be names, so generalizing from this small sample of data is difficult. As a result, carefully constructed orthographic features and language-specific knowledge resources, such as gazetteers, are widely used for solving this task. Unfortunately, language-specific resources and features are costly to develop in new languages and new domains, making NER a challenge to adapt. Unsupervised learning from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features (Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers).

命名實體識別(NER)是一個具有挑戰性的學習問題。一方面,在大多數語言和領域中,只有非常少量的有監督訓練數據可用。另一方面,對於哪些單詞可以作爲名稱幾乎沒有限制,因此從這樣的小樣本數據中進行泛化是困難的。因此,精心構造的正字法特徵和特定語言的知識資源(如地名詞典)被廣泛用於解決這一任務。不幸的是,在新語言和新領域中開發特定語言的資源和特徵代價高昂,這使得NER難以遷移適配。從未標註語料庫中進行無監督學習,爲從少量監督中獲得更好的泛化提供了另一種策略。然而,即使是大量依賴無監督特徵的系統(Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b 等),也只是用這些特徵來擴充而不是(完全)替代手工設計的特徵(例如,關於特定語言中大小寫模式和字符類別的知識)和專門的知識資源(例如地名詞典)。

In this paper, we present neural architectures for NER that use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. Our models are designed to capture two intuitions. First, since names often consist of multiple tokens, reasoning jointly over tagging decisions for each token is important. We compare two models here, (i) a bidirectional LSTM with a sequential conditional random layer above it (LSTM-CRF; §2), and (ii) a new model that constructs and labels chunks of input sentences using an algorithm inspired by transition-based parsing with states represented by stack LSTMs (S-LSTM; §3). Second, token-level evidence for “being a name” includes both orthographic evidence (what does the word being tagged as a name look like?) and distributional evidence (where does the word being tagged tend to occur in a corpus?). To capture orthographic sensitivity, we use a character-based word representation model (Ling et al., 2015b); to capture distributional sensitivity, we combine these representations with distributional representations (Mikolov et al., 2013b). Our word representations combine both of these, and dropout training is used to encourage the model to learn to trust both sources of evidence (§4).

在本文中,我們提出了用於NER的神經結構,除了少量有監督訓練數據和未標註語料庫之外,不使用任何特定於語言的資源或特徵。我們的模型旨在捕捉兩種直覺。首先,由於名稱通常由多個詞條(token)組成,對每個詞條的標註決策進行聯合推理非常重要。我們在這裏比較了兩種模型:(i)一個在頂部疊加序列條件隨機場層的雙向LSTM(LSTM-CRF;§2);(ii)一個新模型,它使用受基於轉移的句法分析啓發的算法來構造並標註輸入句子的塊,其狀態由堆棧LSTM表示(S-LSTM;§3)。其次,“是否爲名稱”的詞條級證據既包括正字法證據(被標爲名稱的單詞看起來是什麼樣子?),也包括分佈證據(被標註的單詞在語料庫中傾向於出現在哪裏?)。爲了捕獲正字法敏感性,我們使用基於字符的單詞表示模型(Ling et al., 2015b);爲了捕獲分佈敏感性,我們將這些表示與分佈式表示相結合(Mikolov et al., 2013b)。我們的單詞表示結合了這兩者,並使用dropout訓練(譯者注:Dropout是指在模型訓練時隨機讓網絡某些隱含層節點的權重不工作,不工作的那些節點可以暫時認爲不是網絡結構的一部分,但其權重會保留下來(只是暫時不更新),因爲下次樣本輸入時它可能又要工作。訓練神經網絡模型時,如果訓練樣本較少,爲了防止模型過擬合,Dropout可以作爲一種trick供選擇。)來鼓勵模型學會同時信任這兩種證據來源(§4)。

Experiments in English, Dutch, German, and Spanish show that we are able to obtain state-of-the-art NER performance with the LSTM-CRF model in Dutch, German, and Spanish, and very near the state-of-the-art in English without any hand-engineered features or gazetteers (§5). The transition-based algorithm likewise surpasses the best previously published results in several languages, although it performs less well than the LSTM-CRF model.

在英語、荷蘭語、德語和西班牙語上進行的實驗表明,在不使用任何手工設計的特徵或地名詞典的情況下(§5),LSTM-CRF模型在荷蘭語、德語和西班牙語上取得了最先進的NER性能,在英語上也非常接近最先進水平。基於轉移的算法同樣超越了此前在若干語言上發表的最佳結果,儘管其表現不如LSTM-CRF模型。

2 LSTM-CRF Model (LSTM-CRF模型)

We provide a brief description of LSTMs and CRFs, and present a hybrid tagging architecture. This architecture is similar to the ones presented by Collobert et al. (2011) and Huang et al. (2015).

我們對LSTMs和CRFs進行了簡要的描述,並提出了一種混合的標籤體系結構。這個架構類似於Collobert等人(2011)和Huang等人(2015)提出的架構。

2.1 LSTM

Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors $(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$ and return another sequence $(\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n)$ that represents some information about the sequence at every step in the input. Although RNNs can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards their most recent inputs in the sequence (Bengio et al., 1994). Long Short-term Memory Networks (LSTMs) have been designed to combat this issue by incorporating a memory-cell and have been shown to capture long-range dependencies. They do so using several gates that control the proportion of the input to give to the memory cell, and the proportion from the previous state to forget (Hochreiter and Schmidhuber, 1997). We use the following implementation:

遞歸神經網絡(RNNs)是一類在序列數據上運行的神經網絡。它們以一個向量序列$(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$作爲輸入,返回另一個序列$(\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n)$,表示輸入中每一步處關於該序列的某些信息。雖然RNNs在理論上可以學習長距離依賴,但在實踐中卻難以做到,並且往往偏向於序列中最近的輸入(Bengio et al., 1994)。長短期記憶網絡(LSTMs)通過引入記憶單元來解決這個問題,並已被證明能夠捕獲長距離依賴。它們使用幾個門來控制輸入中交給記憶單元的比例,以及前一狀態中被遺忘的比例(Hochreiter和Schmidhuber, 1997)。我們使用下面的實現:

\begin{aligned} \mathbf{i}_{t} &= \sigma\left(\mathbf{W}_{xi} \mathbf{x}_{t}+\mathbf{W}_{hi} \mathbf{h}_{t-1}+\mathbf{W}_{ci} \mathbf{c}_{t-1}+\mathbf{b}_{i}\right) \\ \mathbf{c}_{t} &= \left(1-\mathbf{i}_{t}\right) \odot \mathbf{c}_{t-1}+\mathbf{i}_{t} \odot \tanh \left(\mathbf{W}_{xc} \mathbf{x}_{t}+\mathbf{W}_{hc} \mathbf{h}_{t-1}+\mathbf{b}_{c}\right) \\ \mathbf{o}_{t} &= \sigma\left(\mathbf{W}_{xo} \mathbf{x}_{t}+\mathbf{W}_{ho} \mathbf{h}_{t-1}+\mathbf{W}_{co} \mathbf{c}_{t}+\mathbf{b}_{o}\right) \\ \mathbf{h}_{t} &= \mathbf{o}_{t} \odot \tanh \left(\mathbf{c}_{t}\right) \end{aligned}

where $\sigma$ is the element-wise sigmoid function, and $\odot$ is the element-wise product.

其中$\sigma$是逐元素(element-wise)的sigmoid函數,$\odot$是逐元素乘積。
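To make the equations above concrete, here is a minimal NumPy sketch of a single step of this LSTM variant (note the coupled gate: the forget factor is $1-\mathbf{i}_t$). The parameter names in the dictionary are illustrative stand-ins, not the authors' released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One step of the LSTM variant above; W holds illustrative parameters
    Wxi, Whi, Wci, bi (input gate), Wxc, Whc, bc (cell), Wxo, Who, Wco, bo (output gate)."""
    i_t = sigmoid(W["Wxi"] @ x_t + W["Whi"] @ h_prev + W["Wci"] @ c_prev + W["bi"])
    c_t = (1.0 - i_t) * c_prev + i_t * np.tanh(W["Wxc"] @ x_t + W["Whc"] @ h_prev + W["bc"])
    o_t = sigmoid(W["Wxo"] @ x_t + W["Who"] @ h_prev + W["Wco"] @ c_t + W["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_lstm(xs, W, hidden_dim):
    """Run the LSTM over a sequence of input vectors and return all hidden states."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W)
        hs.append(h)
    return hs
```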

For a given sentence $(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$ containing $n$ words, each represented as a $d$-dimensional vector, an LSTM computes a representation $\overrightarrow{\mathbf{h}_t}$ of the left context of the sentence at every word $t$. Naturally, generating a representation of the right context $\overleftarrow{\mathbf{h}_t}$ as well should add useful information. This can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the forward LSTM and the latter as the backward LSTM. These are two distinct networks with different parameters. This forward and backward LSTM pair is referred to as a bidirectional LSTM (Graves and Schmidhuber, 2005).

對於一個包含$n$個單詞的句子$(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$,其中每個單詞都表示爲$d$維向量,LSTM在每個單詞$t$處計算句子左側上下文的表示$\overrightarrow{\mathbf{h}_t}$。自然地,生成右側上下文的表示$\overleftarrow{\mathbf{h}_t}$也應該能增加有用的信息。這可以通過第二個反向讀取相同序列的LSTM來實現。我們將前者稱爲前向LSTM,後者稱爲後向LSTM。它們是兩個具有不同參數的不同網絡。這一對前向和後向LSTM稱爲雙向LSTM(Graves和Schmidhuber, 2005)。

The representation of a word using this model is obtained by concatenating its left and right context representations, $\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t}; \overleftarrow{\mathbf{h}_t}]$. These representations effectively include a representation of a word in context, which is useful for numerous tagging applications.

使用該模型得到的單詞表示,是將其左右上下文表示連接起來,即$\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t}; \overleftarrow{\mathbf{h}_t}]$。這些表示有效地包含了單詞在上下文中的表示,這對許多標註應用都很有用。
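A short sketch (illustrative only) of how the bidirectional representation $\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t}; \overleftarrow{\mathbf{h}_t}]$ can be assembled from two independent LSTMs, such as `run_lstm` above, each with its own parameters:

```python
import numpy as np

def bilstm(xs, forward_lstm, backward_lstm):
    """forward_lstm / backward_lstm: callables mapping a list of vectors to a list of
    hidden states (two distinct networks with different parameters)."""
    fwd = forward_lstm(xs)                 # left-context representation at each word
    bwd = backward_lstm(xs[::-1])[::-1]    # right-context representation, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_t = [fwd_t ; bwd_t]
```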

2.2 CRF Tagging Models (CRF 標籤模型)

A very simple—but surprisingly effective—tagging model is to use the $\mathbf{h}_t$'s as features to make independent tagging decisions for each output $y_t$ (Ling et al., 2015b). Despite this model's success in simple problems like POS tagging, its independent classification decisions are limiting when there are strong dependencies across output labels. NER is one such task, since the “grammar” that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC; see §2.4 for details) that would be impossible to model with independence assumptions.

一個非常簡單但出奇有效的標註模型,是把$\mathbf{h}_t$作爲特徵,爲每個輸出$y_t$做出獨立的標註決策(Ling et al., 2015b)。儘管該模型在POS標註這類簡單問題上很成功,但當輸出標籤之間存在很強的依賴關係時,其獨立的分類決策就成爲限制。NER就是這樣一個任務,因爲刻畫可解釋標籤序列的“語法”施加了若干硬性約束(例如,I-PER不能跟在B-LOC之後;詳見§2.4),而這些約束是無法用獨立性假設來建模的。

Therefore, instead of modeling tagging decisions independently, we model them jointly using a conditional random field (Lafferty et al., 2001). For an input sentence

因此,我們沒有對標註決策獨立建模,而是使用條件隨機場(Lafferty et al., 2001)對它們進行聯合建模。對於輸入句子

\mathbf{X}=\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n}\right)

we consider $\mathbf{P}$ to be the matrix of scores output by the bidirectional LSTM network. $\mathbf{P}$ is of size $n \times k$, where $k$ is the number of distinct tags, and $P_{i,j}$ corresponds to the score of the $j^{th}$ tag of the $i^{th}$ word in a sentence. For a sequence of predictions

我們用$\mathbf{P}$表示雙向LSTM網絡輸出的分數矩陣。$\mathbf{P}$的大小爲$n \times k$,其中$k$爲不同標籤的數量,$P_{i,j}$對應句子中第$i$個單詞的第$j$個標籤的得分。對於一個預測序列

\mathbf{y}=\left(y_{1}, y_{2}, \ldots, y_{n}\right)

we define its score to be

我們定義它的分數(得分)爲

s(\mathbf{X}, \mathbf{y})=\sum_{i=0}^{n} A_{y_{i}, y_{i+1}}+\sum_{i=1}^{n} P_{i, y_{i}}

where $\mathbf{A}$ is a matrix of transition scores such that $A_{i,j}$ represents the score of a transition from the tag $i$ to tag $j$. $y_0$ and $y_n$ are the start and end tags of a sentence, that we add to the set of possible tags. $\mathbf{A}$ is therefore a square matrix of size $k+2$.

其中$\mathbf{A}$是轉移得分矩陣,$A_{i,j}$表示從標籤$i$轉移到標籤$j$的得分。$y_0$和$y_n$是句子的開始和結束標籤,我們將它們加入到可能標籤的集合中。因此$\mathbf{A}$是一個大小爲$k+2$的方陣。
譯者注:增加了start與end,所以爲k+2
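A small NumPy sketch (editor's illustration, not the released code) of the score $s(\mathbf{X}, \mathbf{y})$ defined above, with the two extra start/end tags assumed to sit at indices $k$ and $k+1$ of $\mathbf{A}$:

```python
import numpy as np

def crf_score(P, A, y, start, end):
    """s(X, y) = sum_i A[y_i, y_{i+1}] + sum_i P[i, y_i].

    P: (n, k) emission scores from the bidirectional LSTM for one sentence.
    A: (k+2, k+2) transition scores, including the added start/end tags.
    y: list of n gold tag indices; start, end: indices of the start/end tags."""
    tags = [start] + list(y) + [end]
    transition = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    emission = sum(P[i, t] for i, t in enumerate(y))
    return transition + emission
```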

A softmax over all possible tag sequences yields a probability for the sequence $\mathbf{y}$:

對所有可能的標籤序列做softmax,就得到序列$\mathbf{y}$的概率:

p(\mathbf{y} | \mathbf{X})=\frac{e^{s(\mathbf{X}, \mathbf{y})}}{\sum_{\widetilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}} e^{s(\mathbf{X}, \widetilde{\mathbf{y}})}}

During training, we maximize the log-probability of the correct tag sequence:

在訓練過程中,我們最大化了正確標籤序列的log-probability(對數概率):

\begin{aligned} \log (p(\mathbf{y} | \mathbf{X})) &=s(\mathbf{X}, \mathbf{y})-\log \left(\sum_{\tilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}} e^{s(\mathbf{X}, \tilde{\mathbf{y}})}\right) \\ &=s(\mathbf{X}, \mathbf{y})-\underset{\tilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}}{\operatorname{logadd}}\, s(\mathbf{X}, \widetilde{\mathbf{y}}) \qquad (1) \end{aligned}

where $\mathbf{Y}_{\mathbf{X}}$ represents all possible tag sequences (even those that do not verify the IOB format) for a sentence $\mathbf{X}$. From the formulation above, it is evident that we encourage our network to produce a valid sequence of output labels. While decoding, we predict the output sequence that obtains the maximum score given by:

其中$\mathbf{Y}_{\mathbf{X}}$表示句子$\mathbf{X}$的所有可能標籤序列(即使是不符合IOB格式的序列)。從上面的公式可以明顯看出,我們鼓勵網絡生成有效的輸出標籤序列。解碼時,我們預測得分最高的輸出序列:

\mathbf{y}^{*}=\underset{\tilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}}{\operatorname{argmax}}\, s(\mathbf{X}, \widetilde{\mathbf{y}}) \qquad (2)

Since we are only modeling bigram interactions between outputs, both the summation in Eq. 1 and the maximum a posteriori sequence $\mathbf{y}^{*}$ in Eq. 2 can be computed using dynamic programming.

由於我們只對輸出之間的二元(bigram)交互進行建模,因此式(1)中的求和以及式(2)中的最大後驗序列$\mathbf{y}^{*}$都可以使用動態規劃來計算。
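The dynamic programs mentioned above can be sketched as follows (illustrative NumPy code, following the same `P`, `A`, `start`, `end` conventions as the previous sketch): the forward recursion gives the log-partition term of Eq. 1, and the max/argmax variant gives the decoded sequence of Eq. 2. The training loss for one sentence would then be `crf_log_partition(...) - crf_score(...)`.

```python
import numpy as np

def crf_log_partition(P, A, start, end):
    """log sum over all tag sequences of exp(s(X, y~)), by dynamic programming."""
    n, k = P.shape
    alpha = A[start, :k] + P[0]                       # start transition + first emission
    for i in range(1, n):
        scores = alpha[:, None] + A[:k, :k] + P[i][None, :]   # previous tag x next tag
        alpha = np.logaddexp.reduce(scores, axis=0)
    return np.logaddexp.reduce(alpha + A[:k, end])    # transition to the end tag

def crf_viterbi(P, A, start, end):
    """argmax over tag sequences of s(X, y~): same recursion with max instead of log-sum-exp."""
    n, k = P.shape
    delta = A[start, :k] + P[0]
    backpointers = []
    for i in range(1, n):
        scores = delta[:, None] + A[:k, :k] + P[i][None, :]
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    best = [int((delta + A[:k, end]).argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]
```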

2.3 Parameterization and Training(參數與訓練)

The scores associated with each tagging decision for each token (i.e., the $P_{i,y}$'s) are defined to be the dot product between the embedding of a word-in-context computed with a bidirectional LSTM—exactly the same as the POS tagging model of Ling et al. (2015b)—and these are combined with bigram compatibility scores (i.e., the $A_{y,y'}$'s). This architecture is shown in figure 1. Circles represent observed variables, diamonds are deterministic functions of their parents, and double circles are random variables.

每個詞條的每個標註決策對應的分數(即$P_{i,y}$)被定義爲用雙向LSTM計算的上下文相關詞嵌入的點積——與Ling等人(2015b)的POS標註模型完全相同;這些分數再與二元組兼容性分數(即$A_{y,y'}$)相結合。該結構如圖1所示。圓形表示觀測變量,菱形是其父節點的確定性函數,雙圓形是隨機變量。


圖1

Figure 1: Main architecture of the network. Word embeddings are given to a bidirectional LSTM. $l_i$ represents the word $i$ and its left context, $r_i$ represents the word $i$ and its right context. Concatenating these two vectors yields a representation of the word $i$ in its context, $c_i$.

圖1:網絡的主要架構。詞嵌入被輸入到一個雙向LSTM。$l_i$表示單詞$i$及其左側上下文,$r_i$表示單詞$i$及其右側上下文。將這兩個向量連接起來,就得到單詞$i$在其上下文中的表示$c_i$。


The parameters of this model are thus the matrix of bigram compatibility scores $\mathbf{A}$, and the parameters that give rise to the matrix $\mathbf{P}$, namely the parameters of the bidirectional LSTM, the linear feature weights, and the word embeddings. As in part 2.2, let $\mathbf{x}_i$ denote the sequence of word embeddings for every word in a sentence, and $y_i$ be their associated tags. We return to a discussion of how the embeddings $\mathbf{x}_i$ are modeled in Section 4. The sequence of word embeddings is given as input to a bidirectional LSTM, which returns a representation of the left and right context for each word as explained in 2.1.

因此,該模型的參數包括二元組兼容性得分矩陣$\mathbf{A}$,以及產生矩陣$\mathbf{P}$的參數,即雙向LSTM的參數、線性特徵權重和詞嵌入。與2.2節一樣,設$\mathbf{x}_i$表示句子中每個單詞的詞嵌入序列,$y_i$爲它們對應的標籤。我們將在第4節討論如何對嵌入$\mathbf{x}_i$建模。詞嵌入序列作爲雙向LSTM的輸入,如2.1節所述,它返回每個單詞的左右上下文表示。

These representations are concatenated ($c_i$) and linearly projected onto a layer whose size is equal to the number of distinct tags. Instead of using the softmax output from this layer, we use a CRF as previously described to take into account neighboring tags, yielding the final predictions for every word $y_i$. Additionally, we observed that adding a hidden layer between $c_i$ and the CRF layer marginally improved our results. All results reported with this model incorporate this extra-layer. The parameters are trained to maximize Eq. 1 of observed sequences of NER tags in an annotated corpus, given the observed words.

這些表示被連接起來($c_i$),並線性投影到一個大小等於不同標籤數量的層上。我們沒有使用這一層的softmax輸出,而是使用前面描述的CRF來考慮相鄰標籤,從而得到每個單詞$y_i$的最終預測。此外,我們還觀察到,在$c_i$和CRF層之間添加一個隱藏層會略微改善結果。使用該模型報告的所有結果都包含了這個額外的層。在給定觀測單詞的情況下,訓練參數以最大化標註語料庫中觀測到的NER標籤序列的式(1)。
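A minimal sketch (editor's illustration) of the projection described in this subsection: the concatenated vectors $c_i$ pass through an extra hidden layer and are then mapped to one score per tag, producing the matrix $\mathbf{P}$ consumed by the CRF layer. The tanh nonlinearity is an assumption; the paper does not specify it.

```python
import numpy as np

def emission_scores(C, W_h, b_h, W_p, b_p):
    """C: (n, 2d) rows are the c_i = [l_i ; r_i] context vectors.
    W_h, b_h: hidden layer between c_i and the CRF (reported to help marginally).
    W_p, b_p: linear projection onto k tag scores. Returns P of shape (n, k)."""
    H = np.tanh(C @ W_h + b_h)
    return H @ W_p + b_p
```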

2.4 Tagging Schemes(標籤方案)

The task of named entity recognition is to assign a named entity label to every word in a sentence. A single named entity could span several tokens within a sentence. Sentences are usually represented in the IOB format (Inside, Outside, Beginning) where every token is labeled as B-label if the token is the beginning of a named entity, I-label if it is inside a named entity but not the first token within the named entity, or O otherwise. However, we decided to use the IOBES tagging scheme, a variant of IOB commonly used for named entity recognition, which encodes information about singleton entities (S) and explicitly marks the end of named entities (E). Using this scheme, tagging a word as I-label with high-confidence narrows down the choices for the subsequent word to I-label or E-label, however, the IOB scheme is only capable of determining that the subsequent word cannot be the interior of another label. Ratinov and Roth (2009) and Dai et al. (2015) showed that using a more expressive tagging scheme like IOBES improves model performance marginally. However, we did not observe a significant improvement over the IOB tagging scheme.

命名實體識別的任務是爲句子中的每個單詞分配一個命名實體標籤。單個命名實體可以跨越句子中的多個詞條。句子通常以IOB格式(Inside, Outside, Beginning)表示:如果一個詞條是某個命名實體的開頭,則標記爲B-label;如果它在命名實體內部但不是該實體的第一個詞條,則標記爲I-label;否則標記爲O。然而,我們決定使用IOBES標註方案,這是常用於命名實體識別的IOB變體,它額外編碼單例實體(S)的信息,並顯式標記命名實體的結尾(E)。使用這種方案,以高置信度把一個單詞標爲I-label,會把後續單詞的選擇縮小到I-label或E-label;而IOB方案只能確定後續單詞不能是另一個標籤的內部。Ratinov和Roth(2009)以及Dai等人(2015)表明,使用IOBES這類表達能力更強的標註方案可以略微提升模型性能。不過,我們沒有觀察到相對於IOB標註方案的顯著改進。
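For concreteness, a small helper (editor's sketch, assuming well-formed IOB input) that converts IOB tags to the IOBES scheme described above:

```python
def iob_to_iobes(tags):
    """E.g. ["B-PER", "I-PER", "O", "B-LOC"] -> ["B-PER", "E-PER", "O", "S-LOC"]."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
            continue
        continues = (nxt == "I-" + tag[2:])   # next token stays inside the same entity
        if tag.startswith("B-"):
            out.append(tag if continues else "S-" + tag[2:])   # singleton entity -> S-
        else:                                                  # I- tag
            out.append(tag if continues else "E-" + tag[2:])   # last token of an entity -> E-
    return out
```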

3 Transition-Based Chunking Model(基於轉移的分塊模型)

As an alternative to the LSTM-CRF discussed in the previous section, we explore a new architecture that chunks and labels a sequence of inputs using an algorithm similar to transition-based dependency parsing. This model directly constructs representations of the multi-token names (e.g., the name Mark Watney is composed into a single representation).

作爲上一節討論的LSTM-CRF的替代方案,我們探索一種新的體系結構,它使用類似於基於轉移的依存句法分析的算法,對輸入序列進行分塊和標註。該模型直接構造多詞條名稱的表示(例如,名稱 Mark Watney 被組合成單一的表示)。

This model relies on a stack data structure to incrementally construct chunks of the input. To obtain representations of this stack used for predicting subsequent actions, we use the Stack-LSTM presented by Dyer et al. (2015), in which the LSTM is augmented with a “stack pointer.” While sequential LSTMs model sequences from left to right, stack LSTMs permit embedding of a stack of objects that are both added to (using a push operation) and removed from (using a pop operation). This allows the Stack-LSTM to work like a stack that maintains a “summary embedding” of its contents. We refer to this model as Stack-LSTM or S-LSTM model for simplicity.

該模型依賴於堆棧數據結構來增量地構造輸入塊。爲了獲得用於預測後續操作的這個堆棧的表示,我們使用Dyer等人(2015)提出的 Stack-LSTM ,其中LSTM被一個“堆棧指針”擴充。雖然順序LSTMs從左到右建模序列,但是堆棧LSTMs允許嵌入一組對象,這些對象既可以添加(使用push操作),也可以刪除(使用pop操作)。這允許 Stack-LSTM 像堆棧一樣工作,維護其內容的“摘要嵌入”。爲了簡單起見,我們將此模型稱爲 Stack-LSTM 或S-LSTM模型。

Finally, we refer interested readers to the original paper (Dyer et al., 2015) for details about the StackLSTM model since in this paper we merely use the same architecture through a new transition-based algorithm presented in the following Section.

最後,我們希望有興趣的讀者參考原始論文(Dyer et al., 2015),瞭解關於StackLSTM模型的詳細信息,因爲在本文中,我們只是通過在下一節中介紹的基於轉換的新算法使用相同的體系結構。

3.1 Chunking Algorithm (分塊算法)

We designed a transition inventory which is given in Figure 2 that is inspired by transition-based parsers, in particular the arc-standard parser of Nivre (2004). In this algorithm, we make use of two stacks (designated output and stack representing, respectively, completed chunks and scratch space) and a buffer that contains the words that have yet to be processed. The transition inventory contains the following transitions: The SHIFT transition moves a word from the buffer to the stack, the OUT transition moves a word from the buffer directly into the output stack while the REDUCE(y) transition pops all items from the top of the stack creating a “chunk,” labels this with label y, and pushes a representation of this chunk onto the output stack. The algorithm completes when the stack and buffer are both empty. The algorithm is depicted in Figure 2, which shows the sequence of operations required to process the sentence Mark Watney visited Mars.

我們設計了一個轉移清單,如圖2所示,它的靈感來自基於轉移的解析器,特別是Nivre(2004)的arc-standard解析器。在該算法中,我們使用兩個棧(分別稱爲output和stack,對應已完成的塊和暫存空間)以及一個包含尚未處理單詞的緩衝區(buffer)。轉移清單包含以下轉移操作:SHIFT 將一個單詞從緩衝區移動到stack;OUT 將一個單詞從緩衝區直接移動到output棧;REDUCE(y) 則彈出stack頂部的所有項,構成一個“塊(chunk)”,用標籤y對其進行標註,並將該塊的表示壓入output棧。當stack和緩衝區都爲空時,算法結束。該算法如圖2所示,其中顯示了處理句子 Mark Watney visited Mars 所需的操作序列。

The model is parameterized by defining a probability distribution over actions at each time step, given the current contents of the stack, buffer, and output, as well as the history of actions taken. Following Dyer et al. (2015), we use stack LSTMs to compute a fixed dimensional embedding of each of these, and take a concatenation of these to obtain the full algorithm state. This representation is used to define a distribution over the possible actions that can be taken at each time step. The model is trained to maximize the conditional probability of sequences of reference actions (extracted from a labeled training corpus) given the input sentences. To label a new input sequence at test time, the maximum probability action is chosen greedily until the algorithm reaches a termination state. Although this is not guaranteed to find a global optimum, it is effective in practice. Since each token is either moved directly to the output (1 action) or first to the stack and then the output (2 actions), the total number of actions for a sequence of length n is maximally 2n .

給定堆棧、緩衝區和輸出的當前內容以及所採取的操作的歷史,通過在每個時間步上定義操作的概率分佈來參數化模型。在Dyer et al.(2015)之後,我們使用stack LSTMs來計算其中每一個的固定維嵌入,並將它們串聯起來,得到完整的算法狀態。此表示形式用於定義在每個時間步驟上可能採取的操作的分佈。該模型經過訓練,最大限度地提高給定輸入句子的參考動作序列(從標記的訓練語料庫中提取)的條件概率。爲了在測試時標記一個新的輸入序列,貪婪地選擇最大概率動作,直到算法達到終止狀態。雖然不能保證找到全局最優解,但在實踐中是有效的。由於每個令牌要麼直接移動到輸出(1個操作),要麼先移動到堆棧,然後再移動到輸出(2個操作),因此長度爲n的序列的操作總數最大爲2n。
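The transition system itself is easy to simulate; the sketch below (editor's illustration; the real model scores each action with the Stack-LSTM state) replays a plausible action sequence for the sentence Mark Watney visited Mars.

```python
def run_transitions(words, actions):
    """Apply a SHIFT / OUT / REDUCE(y) sequence and return the labeled chunks."""
    buffer = list(words)   # words not yet processed
    stack = []             # scratch space for the chunk being built
    output = []            # completed, labeled chunks
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "OUT":
            output.append((buffer.pop(0), "O"))
        elif act.startswith("REDUCE("):
            label = act[len("REDUCE("):-1]
            output.append((tuple(stack), label))
            stack = []
    assert not buffer and not stack   # the algorithm ends with both empty
    return output

print(run_transitions(
    ["Mark", "Watney", "visited", "Mars"],
    ["SHIFT", "SHIFT", "REDUCE(PER)", "OUT", "SHIFT", "REDUCE(LOC)"]))
# [(('Mark', 'Watney'), 'PER'), ('visited', 'O'), (('Mars',), 'LOC')]
```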



Figure 2: Transitions of the Stack-LSTM model indicating the action applied and the resulting state. Bold symbols indicate (learned) embeddings of words and relations, script symbols indicate the corresponding words and relations.

圖2:Stack-LSTM模型的各個轉移操作,標明所執行的動作及其產生的狀態。粗體符號表示(學習到的)單詞和關係的嵌入,手寫體符號表示相應的單詞和關係。


Figure 3: Transition sequence for Mark Watney visited Mars with the Stack-LSTM model.

圖3:使用Stack-LSTM模型處理 Mark Watney visited Mars 的轉移操作序列。


It is worth noting that the nature of this algorithm model makes it agnostic to the tagging scheme used since it directly predicts labeled chunks.

值得注意的是,這個算法模型的性質使得它與所使用的標記方案無關,因爲它直接預測被標記的塊。

3.2 Representing Labeled Chunks

When the REDUCE($y$) operation is executed, the algorithm shifts a sequence of tokens (together with their vector embeddings) from the stack to the output buffer as a single completed chunk. To compute an embedding of this sequence, we run a bidirectional LSTM over the embeddings of its constituent tokens together with a token representing the type of the chunk being identified (i.e., $y$). This function is given as $g(\mathbf{u}, \dots, \mathbf{v}, r_y)$, where $r_y$ is a learned embedding of a label type. Thus, the output buffer contains a single vector representation for each labeled chunk that is generated, regardless of its length.

當REDUCE($y$)操作被執行時,算法將一個詞條序列(連同它們的向量嵌入)作爲一個完整的塊,從stack移動到輸出緩衝區。爲了計算該序列的嵌入,我們在其組成詞條的嵌入以及一個表示所識別塊類型(即$y$)的詞條上運行一個雙向LSTM。該函數記作$g(\mathbf{u}, \dots, \mathbf{v}, r_y)$,其中$r_y$是標籤類型的可學習嵌入。因此,無論塊的長度如何,輸出緩衝區中每個生成的帶標籤塊都只對應一個向量表示。
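A sketch (editor's illustration; the exact placement of the label embedding in the sequence is an assumption) of the composition function $g(\mathbf{u}, \dots, \mathbf{v}, r_y)$: a bidirectional LSTM is run over the chunk's token embeddings plus the label embedding, and the two final states form the chunk vector pushed onto the output stack.

```python
import numpy as np

def chunk_representation(token_vecs, label_vec, fwd_lstm, bwd_lstm):
    """token_vecs: embeddings of the chunk's tokens; label_vec: learned embedding r_y.
    fwd_lstm / bwd_lstm: callables mapping a list of vectors to a list of hidden states."""
    seq = list(token_vecs) + [label_vec]          # assumed: label token appended at the end
    fwd_last = fwd_lstm(seq)[-1]
    bwd_last = bwd_lstm(seq[::-1])[-1]
    return np.concatenate([fwd_last, bwd_last])   # one fixed-size vector per labeled chunk
```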

4 Input Word Embeddings

The input layers to both of our models are vector representations of individual words. Learning independent representations for word types from the limited NER training data is a difficult problem: there are simply too many parameters to reliably estimate. Since many languages have orthographic or morphological evidence that something is a name (or not a name), we want representations that are sensitive to the spelling of words. We therefore use a model that constructs representations of words from representations of the characters they are composed of (4.1). Our second intuition is that names, which may individually be quite varied, appear in regular contexts in large corpora. Therefore we use embeddings learned from a large corpus that are sensitive to word order (4.2). Finally, to prevent the models from depending on one representation or the other too strongly, we use dropout training and find this is crucial for good generalization performance (4.3).

我們兩個模型的輸入層都是單個單詞的向量表示。從有限的NER訓練數據中學習詞類型的獨立表示是一個難題:參數太多,無法可靠地估計。由於許多語言都存在表明某個詞是(或不是)名稱的正字法或形態學證據,我們希望表示對單詞的拼寫敏感。因此,我們使用一個模型,從組成單詞的字符的表示來構造單詞的表示(4.1)。我們的第二個直覺是,名稱雖然各自可能差異很大,但在大型語料庫中會出現在有規律的上下文中。因此,我們使用從大型語料庫中學習到的、對詞序敏感的嵌入(4.2)。最後,爲了防止模型過於依賴某一種表示,我們使用dropout訓練,並發現這對於良好的泛化性能至關重要(4.3)。

Figure 4: The character embeddings of the word “Mars” are given to a bidirectional LSTMs. We concatenate their last outputs to an embedding from a lookup table to obtain a representation for this word.

圖4:將單詞“Mars”的各個字符嵌入輸入到雙向LSTM中。我們將它們的最終輸出與來自查找表的嵌入連接起來,以獲得該單詞的表示。

4.1 Character-based models of words(基於字符的單詞模型)

An important distinction of our work from most previous approaches is that we learn character-level features while training instead of hand-engineering prefix and suffix information about words. Learning character-level embeddings has the advantage of learning representations specific to the task and domain at hand. They have been found useful for morphologically rich languages and to handle the out-of-vocabulary problem for tasks like part-of-speech tagging and language modeling (Ling et al., 2015b) or dependency parsing (Ballesteros et al., 2015).

我們的工作與之前大多數方法的一個重要區別是,我們在訓練中學習字符級特徵,而不是手工設計單詞的前綴和後綴信息。學習字符級嵌入的優勢在於可以學到特定於當前任務和領域的表示。它們已被證明對形態豐富的語言很有用,並且有助於處理詞性標註、語言建模(Ling等,2015b)或依存句法分析(Ballesteros等,2015)等任務中的詞表外(out-of-vocabulary)問題。

Figure 4 describes our architecture to generate a word embedding for a word from its characters. A character lookup table initialized at random contains an embedding for every character. The character embeddings corresponding to every character in a word are given in direct and reverse order to a forward and a backward LSTM. The embedding for a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional LSTM. This character-level representation is then concatenated with a word-level representation from a word lookup-table. During testing, words that do not have an embedding in the lookup table are mapped to a UNK embedding. To train the UNK embedding, we replace singletons with the UNK embedding with a probability 0.5. In all our experiments, the hidden dimension of the forward and backward character LSTMs are 25 each, which results in our character-based representation of words being of dimension 50.

圖4描述了從字符生成單詞嵌入的體系結構。隨機初始化的字符查找表包含每個字符的嵌入。單詞中每個字符對應的字符嵌入,分別按正序和逆序輸入到前向LSTM和後向LSTM。由字符派生的單詞嵌入,是其在雙向LSTM中前向和後向表示的連接。然後,將這一字符級表示與來自單詞查找表的單詞級表示連接起來。在測試時,查找表中沒有嵌入的單詞會被映射到UNK嵌入。爲了訓練UNK嵌入,我們在訓練中以0.5的概率用UNK嵌入替換單例詞(僅出現一次的詞)。在我們所有的實驗中,前向和後向字符LSTM的隱藏維度均爲25,因此基於字符的單詞表示爲50維。
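A sketch (editor's illustration with hypothetical helper names) of the character-based word representation: the forward and backward character LSTMs (hidden size 25 each in the paper, hence a 50-dimensional character part) are concatenated with the word-lookup embedding, and singletons are replaced by UNK with probability 0.5 during training.

```python
import random
import numpy as np

def char_word_embedding(word, char_vecs, fwd_char_lstm, bwd_char_lstm, word_lookup, unk_vec):
    """char_vecs: dict char -> embedding; fwd/bwd_char_lstm: callables returning hidden states;
    word_lookup: dict word -> embedding; unk_vec: embedding used for out-of-vocabulary words."""
    chars = [char_vecs[c] for c in word]
    char_part = np.concatenate([fwd_char_lstm(chars)[-1],         # suffix-oriented summary
                                bwd_char_lstm(chars[::-1])[-1]])  # prefix-oriented summary
    word_part = word_lookup.get(word, unk_vec)
    return np.concatenate([char_part, word_part])

def maybe_replace_singleton(word, is_singleton, p=0.5):
    """During training, replace words that occur only once by the UNK token with probability 0.5."""
    return "<UNK>" if is_singleton and random.random() < p else word
```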

Recurrent models like RNNs and LSTMs are capable of encoding very long sequences, however, they have a representation biased towards their most recent inputs. As a result, we expect the final representation of the forward LSTM to be an accurate representation of the suffix of the word, and the final state of the backward LSTM to be a better representation of its prefix. Alternative approaches—most notably like convolutional networks—have been proposed to learn representations of words from their characters (Zhang et al., 2015; Kim et al., 2015). However, convnets are designed to discover position-invariant features of their inputs. While this is appropriate for many problems, e.g., image recognition (a cat can appear anywhere in a picture), we argue that important information is position dependent (e.g., prefixes and suffixes encode different information than stems), making LSTMs an a priori better function class for modeling the relationship between words and their characters.

像RNN和LSTM這樣的循環模型能夠編碼非常長的序列,但其表示會偏向最近的輸入。因此,我們預期前向LSTM的最終表示能準確表示單詞的後綴,而後向LSTM的最終狀態能更好地表示其前綴。也有人提出了其他方法——最著名的是卷積網絡——從字符中學習單詞的表示(Zhang et al., 2015; Kim et al., 2015)。然而,卷積網絡旨在發現其輸入的位置不變特徵。雖然這適用於許多問題,例如圖像識別(貓可以出現在圖片中的任何位置),但我們認爲這裏的重要信息是與位置相關的(例如,前綴和後綴編碼的信息與詞幹不同),這使得LSTM在先驗上是建模單詞與其字符之間關係的更合適的函數類。

4.2 Pretrained embeddings(預訓練嵌入)

As in Collobert et al. (2011), we use pretrained word embeddings to initialize our lookup table. We observe significant improvements using pretrained word embeddings over randomly initialized ones. Embeddings are pretrained using skip-n-gram (Ling et al., 2015a), a variation of word2vec (Mikolov et al., 2013a) that accounts for word order. These embeddings are fine-tuned during training. Word embeddings for Spanish, Dutch, German and English are trained using the Spanish Gigaword version 3, the Leipzig corpora collection, the German monolingual training data from the 2010 Machine Translation Workshop and the English Gigaword version 4 (with the LA Times and NY Times portions removed) respectively.2^{2} We use an embedding dimension of 100 for English, 64 for other languages, a minimum word frequency cutoff of 4, and a window size of 8.


2^{2}(Graff, 2011; Biemann et al., 2007; Callison-Burch et al.,2010; Parker et al., 2009)

與Collobert等人(2011)一樣,我們使用預訓練的詞嵌入來初始化查找表。與隨機初始化相比,使用預訓練詞嵌入帶來了顯著的改進。嵌入使用skip-n-gram(Ling et al., 2015a)進行預訓練,這是word2vec(Mikolov et al., 2013a)的一個考慮詞序的變體。這些嵌入在訓練期間會被微調。西班牙語、荷蘭語、德語和英語的詞嵌入分別使用Spanish Gigaword第3版、萊比錫語料庫集合、2010年機器翻譯研討會的德語單語訓練數據,以及English Gigaword第4版(去掉了《洛杉磯時報》和《紐約時報》部分)進行訓練。我們對英語使用100維的嵌入,對其他語言使用64維,最小詞頻截斷爲4,窗口大小爲8。
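How the pretrained vectors are loaded is not detailed in the paper; the sketch below is a generic initializer (file format and helper names are assumptions) that fills the lookup table from a plain-text vector file and falls back to random initialization, after which the table is fine-tuned like any other parameter.

```python
import numpy as np

def init_lookup_table(vocab, vec_path, dim, scale=0.1):
    """vocab: list of words; vec_path: text file with lines 'word v1 ... v_dim'."""
    table = np.random.uniform(-scale, scale, (len(vocab), dim))
    word2id = {w: i for i, w in enumerate(vocab)}
    with open(vec_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) == dim + 1 and parts[0] in word2id:
                table[word2id[parts[0]]] = np.asarray(parts[1:], dtype=float)
    return table, word2id
```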

4.3 Dropout training(Dropout 訓練)

Initial experiments showed that character-level embeddings did not improve our overall performance when used in conjunction with pretrained word representations. To encourage the model to depend on both representations, we use dropout training (Hinton et al., 2012), applying a dropout mask to the final embedding layer just before the input to the bidirectional LSTM in Figure 1. We observe a significant improvement in our model’s performance after using dropout (see table 5).

最初的實驗表明,當與預先訓練好的單詞表示一起使用時,字符級嵌入並沒有提高我們的整體性能。爲了鼓勵模型依賴於這兩種表示,我們使用了dropout訓練(Hinton et al., 2012),在圖1中雙向LSTM輸入之前的最後一個嵌入層上應用了dropout掩碼。使用dropout後,我們發現模型的性能有了顯著的改進(見表5)。
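A minimal sketch of the dropout mask applied to the final embedding layer (the inverted-dropout scaling used here is a common implementation choice, not necessarily the authors'):

```python
import numpy as np

def embedding_dropout(x, rate=0.5, train=True):
    """Apply a dropout mask to the concatenated [char-level ; word-level] embedding
    just before it is fed to the bidirectional LSTM."""
    if not train or rate == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= rate) / (1.0 - rate)   # drop units and rescale the rest
    return x * mask
```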

5 Experiments

This section presents the methods we use to train our models, the results we obtained on various tasks and the impact of our networks’ configuration on model performance.

本節介紹了我們用來訓練模型的方法、我們在各種任務上獲得的結果以及網絡配置對模型性能的影響。

5.1 Training

For both models presented, we train our networks using the back-propagation algorithm updating our parameters on every training example, one at a time, using stochastic gradient descent (SGD) with a learning rate of 0.01 and a gradient clipping of 5.0. Several methods have been proposed to enhance the performance of SGD, such as Adadelta (Zeiler, 2012) or Adam (Kingma and Ba, 2014). Although we observe faster convergence using these methods, none of them perform as well as SGD with gradient clipping.

對於提出的兩個模型,我們使用反向傳播算法訓練我們的網絡,每次更新一個訓練實例的參數,使用隨機梯度下降(SGD),學習率爲0.01,梯度裁剪爲5.0。已經提出了幾種提高SGD性能的方法,如Adadelta (Zeiler, 2012)或Adam (Kingma and Ba, 2014)。雖然我們觀察到使用這些方法收斂速度更快,但它們都沒有使用梯度裁剪的SGD那麼好。

Our LSTM-CRF model uses a single layer for the forward and backward LSTMs whose dimensions are set to 100. Tuning this dimension did not significantly impact model performance. We set the dropout rate to 0.5. Using higher rates negatively impacted our results, while smaller rates led to longer training time.
The stack-LSTM model uses two layers each of dimension 100 for each stack.

我們的LSTM-CRF模型對前向和後向LSTM各使用單層,其維度設置爲100。調整這個維度不會顯著影響模型性能。我們將dropout率設置爲0.5。使用更高的dropout率會對結果產生負面影響,而更低的dropout率則會導致更長的訓練時間。Stack-LSTM模型對每個棧使用兩層,每層維度爲100。
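A sketch of the per-example update with the reported hyper-parameters (learning rate 0.01, gradient clipping 5.0); rescaling by the global gradient norm is one common reading of "gradient clipping" and is an assumption here.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01, clip=5.0):
    """params, grads: dicts mapping name -> np.ndarray; updates params in place."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads.values()))
    scale = clip / norm if norm > clip else 1.0   # shrink the update if the gradient is too large
    for name, g in grads.items():
        params[name] -= lr * scale * g
```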

The embeddings of the actions used in the composition functions have 16 dimensions each, and the output embedding is of dimension 20. We experimented with different dropout rates and reported the scores using the best dropout rate for each language.3^3 It is a greedy model that apply locally optimal actions until the entire sentence is processed, further improvements might be obtained with beam search (Zhang and Clark, 2011) or training with exploration (Ballesteros et al., 2016).

複合函數中使用的動作嵌入各有16個維度,輸出嵌入的維度爲20。我們對不同的dropout率進行了實驗,並報告了使用每種語言最佳dropout率得到的分數3^3。這是一個貪心模型,它不斷採取局部最優的動作,直到整個句子處理完畢;使用束搜索(Zhang and Clark, 2011)或探索式訓練(Ballesteros et al., 2016)可能會得到進一步的改進。


3^3English (D=0.2), German, Spanish and Dutch (D=0.3)

5.2 Data Sets(數據集)

We test our model on different datasets for named entity recognition. To demonstrate our model’s ability to generalize to different languages, we present results on the CoNLL-2002 and CoNLL- 2003 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) that contain independent named entity labels for English, Spanish, German and Dutch. All datasets contain four different types of named entities: locations, persons, organizations, and miscellaneous entities that do not belong in any of the three previous categories. Although POS tags were made available for all datasets, we did not include them in our models. We did not perform any dataset preprocessing, apart from replacing every digit with a zero in the English NER dataset.

我們在不同的數據集上測試模型以進行命名實體識別。爲了證明我們的模型泛化到不同語言的能力,我們在CoNLL-2002和CoNLL- 2003數據集上給出了結果(Tjong Kim Sang, 2002;Tjong Kim Sang和De Meulder, 2003),其中包含英語、西班牙語、德語和荷蘭語的獨立命名實體標籤。所有數據集包含四種不同類型的命名實體:位置、人員、組織和其他不屬於前三種類別中的任何一種的實體。雖然POS標籤對所有數據集都可用,但是我們沒有在模型中包含它們。我們沒有執行任何數據集預處理,只是將英文NER數據集中的每個數字替換爲零。
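The only preprocessing mentioned is the digit replacement on the English data; as a sketch:

```python
import re

def zero_digits(text):
    """Replace every digit with 0 (applied to the English NER dataset only)."""
    return re.sub(r"\d", "0", text)

# zero_digits("1996-08-22") -> "0000-00-00"
```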

5.3 Results

Table 1 presents our comparisons with other models for named entity recognition in English. To make the comparison between our model and others fair, we report the scores of other models with and without the use of external labeled data such as gazetteers and knowledge bases. Our models do not use gazetteers or any external labeled resources. The best score reported on this task is by Luo et al. (2015). They obtained a F1 of 91.2 by jointly modeling the NER and entity linking tasks (Hoffart et al., 2011). Their model uses a lot of hand-engineered features including spelling features, WordNet clusters, Brown clusters, POS tags, chunks tags, as well as stemming and external knowledge bases like Freebase and Wikipedia. Our LSTM-CRF model outperforms all other systems, including the ones using external labeled data like gazetteers. Our StackLSTM model also outperforms all previous models that do not incorporate external features, apart from the one presented by Chiu and Nichols (2015).

表1展示了我們與其他英語命名實體識別模型的比較。爲了使我們的模型和其他模型之間的比較公平,我們同時報告了其他模型在使用和不使用外部標註數據(如地名詞典和知識庫)情況下的得分。我們的模型不使用地名詞典或任何外部標註資源。在這項任務上此前報告的最好成績來自Luo等人(2015)。他們通過聯合建模NER和實體鏈接任務(Hoffart et al., 2011)獲得了91.2的F1。他們的模型使用了大量手工設計的特徵,包括拼寫特徵、WordNet聚類、Brown聚類、POS標籤、chunk標籤,以及詞幹提取和Freebase、Wikipedia等外部知識庫。我們的LSTM-CRF模型優於所有其他系統,包括使用地名詞典等外部標註數據的系統。除了Chiu和Nichols(2015)提出的模型外,我們的Stack-LSTM模型也優於此前所有不使用外部特徵的模型。

Tables 2, 3 and 4 present our results on NER for German, Dutch and Spanish respectively in comparison to other models. On these three languages, the LSTM-CRF model significantly outperforms all previous methods, including the ones using external labeled data. The only exception is Dutch, where the model of Gillick et al. (2015) can perform better by leveraging the information from other NER datasets. The Stack-LSTM also consistently presents statethe-art (or close to) results compared to systems that do not use external data.

表2、表3和表4分別展示了我們在德語、荷蘭語和西班牙語NER上與其他模型的比較結果。在這三種語言上,LSTM-CRF模型顯著優於此前所有方法,包括使用外部標註數據的方法。唯一的例外是荷蘭語,在這一語言上Gillick等人(2015)的模型可以通過利用其他NER數據集的信息取得更好的表現。與不使用外部數據的系統相比,Stack-LSTM也始終給出最先進(或接近最先進)的結果。

As we can see in the tables, the Stack-LSTM model is more dependent on character-based representations to achieve competitive performance; we hypothesize that the LSTM-CRF model requires less orthographic information since it gets more contextual information out of the bidirectional LSTMs; however, the Stack-LSTM model consumes the words one by one and it just relies on the word representations when it chunks words.

從表中可以看出,Stack-LSTM模型更依賴基於字符的表示來取得有競爭力的性能;我們推測LSTM-CRF模型需要的正字法信息較少,因爲它能從雙向LSTM中獲得更多的上下文信息;而Stack-LSTM模型逐個處理單詞,在對單詞進行分塊時只依賴單詞表示。

(表1–表4:分別爲英語、德語、荷蘭語和西班牙語上的NER結果對比;表格見原文。)

5.4 Network architectures(網絡體系結構)

我們的模型有幾個組件,可以通過調整它們來理解其對整體性能的影響。我們探討了CRF層、字符級表示、預訓練詞嵌入和dropout對LSTM-CRF模型的影響。我們觀察到,預訓練詞嵌入帶來的提升最大,整體F1提高了+7.31;CRF層帶來+1.79,使用dropout帶來+1.17,最後,學習字符級詞嵌入帶來約+0.74。對於Stack-LSTM,我們進行了類似的一組實驗。不同架構的結果如表5所示。

Table 5: English NER results with our models, using different configurations. “pretrain” refers to models that include pretrained word embeddings, “char” refers to models that include character-based modeling of words, “dropout” refers to models that include dropout rate.

表5:使用不同配置時我們模型的英語NER結果。“pretrain”指包含預訓練詞嵌入的模型,“char”指包含基於字符的單詞建模的模型,“dropout”指使用dropout的模型。

6 Related Work

In the CoNLL-2002 shared task, Carreras et al. (2002) obtained among the best results on both Dutch and Spanish by combining several small fixed-depth decision trees. Next year, in the CoNLL- 2003 Shared Task, Florian et al. (2003) obtained the best score on German by combining the output of four diverse classifiers. Qi et al. (2009) later improved on this with a neural network by doing unsupervised learning on a massive unlabeled corpus.
Several other neural architectures have previously been proposed for NER. For instance, Collobert et al. (2011) uses a CNN over a sequence of word embeddings with a CRF layer on top. This can be thought of as our first model without character-level embeddings and with the bidirectional LSTM being replaced by a CNN. More recently, Huang et al. (2015) presented a model similar to our LSTM-CRF, but using hand-crafted spelling features. Zhou and Xu (2015) also used a similar model and adapted it to the semantic role labeling task. Lin and Wu (2009) used a linear chain CRF with L2 regularization, they added phrase cluster features extracted from the web data and spelling features. Passos et al. (2014) also used a linear chain CRF with spelling features and gazetteers.
Language independent NER models like ours have also been proposed in the past. Cucerzan and Yarowsky (1999; 2002) present semi-supervised bootstrapping algorithms for named entity recognition by co-training character-level (word-internal) and token-level (context) features. Eisenstein et al. (2011) use Bayesian nonparametrics to construct a database of named entities in an almost unsupervised setting. Ratinov and Roth (2009) quantitatively compare several approaches for NER and build their own supervised model using a regularized average perceptron and aggregating context information.
Finally, there is currently a lot of interest in models for NER that use letter-based representations. Gillick et al. (2015) model the task of sequence labeling as a sequence to sequence learning problem and incorporate character-based representations into their encoder model. Chiu and Nichols (2015) employ an architecture similar to ours, but instead use CNNs to learn character-level features, in a way similar to the work by Santos and Guimaraes (2015).

Carreras等人(2002)在CoNLL-2002 shared task中,通過組合幾個固定深度的小決策樹,在荷蘭語和西班牙語兩種語言中都獲得了最好的結果。第二年,在CoNLL- 2003 Shared Task中,Florian et al.(2003)綜合了四個不同分類器的輸出結果,獲得了德語的最佳成績。Qi等人(2009)後來利用神經網絡對大量無標記語料庫進行無監督學習,改進了這一方法。
之前已經有人爲NER提出了若干其他神經結構。例如,Collobert et al.(2011)在詞嵌入序列上使用CNN,並在其上疊加一個CRF層。這可以看作是我們的第一個模型去掉字符級嵌入、並把雙向LSTM替換爲CNN的版本。最近,Huang等人(2015)提出了一個類似於我們LSTM-CRF的模型,但使用了手工構造的拼寫特徵。Zhou和Xu(2015)也使用了類似的模型,並將其應用於語義角色標註任務。Lin和Wu(2009)使用了帶L2正則化的線性鏈CRF,並加入了從網絡數據中提取的短語聚類特徵和拼寫特徵。Passos等人(2014)也使用了帶拼寫特徵和地名詞典的線性鏈CRF。
與語言無關的NER模型(如我們的模型)在過去也曾被提出。Cucerzan和Yarowsky(1999; 2002)通過對字符級(詞內部)特徵和詞條級(上下文)特徵進行協同訓練,提出了用於命名實體識別的半監督自舉算法。Eisenstein等人(2011)使用貝葉斯非參數方法,在幾乎無監督的設置下構建了命名實體數據庫。Ratinov和Roth(2009)定量比較了幾種NER方法,並使用正則化的平均感知器和聚合上下文信息構建了自己的有監督模型。
最後,目前有很多工作關注使用基於字母(字符)表示的NER模型。Gillick等人(2015)將序列標註任務建模爲序列到序列的學習問題,並將基於字符的表示納入其編碼器模型。Chiu和Nichols(2015)採用了與我們類似的架構,但改用CNN學習字符級特徵,其方式類似於Santos和Guimaraes(2015)的工作。

7 Conclusion

This paper presents two neural architectures for sequence labeling that provide the best NER results ever reported in standard evaluation settings, even compared with models that use external resources, such as gazetteers.
A key aspect of our models is that they model output label dependencies, either via a simple CRF architecture, or using a transition-based algorithm to explicitly construct and label chunks of the input. Word representations are also crucially important for success: we use both pre-trained word representations and “character-based” representations that capture morphological and orthographic information. To prevent the learner from depending too heavily on one representation class, dropout is used.

本文提出了兩種用於序列標記的神經結構,即使與使用外部資源(如地名錶)的模型相比,它們也能在標準評估設置中提供有史以來最好的NER結果。
我們的模型的一個關鍵方面是,它們對輸出標籤依賴關係建模,要麼通過簡單的CRF架構,要麼使用基於轉換的算法顯式地構造和標記輸入塊。單詞表示對於成功也至關重要:我們既使用預先訓練的單詞表示,也使用“基於字符”的表示,以捕獲形態學和正字法信息。爲了防止學習者過於依賴一個表示類,使用了dropout。

Acknowledgments

This work was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O) under the Low Resource Languages for Emergent Incidents (LORELEI) program issued by DARPA/I2O under Contract No. HR0011-15-C-0114. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).

這項工作部分由美國國防高級研究計劃局(DARPA)信息創新辦公室(I2O)在低資源語言緊急事件(LORELEI)項目下資助,該項目由DARPA/I2O根據合同編號HR0011-15-C-0114發佈。Miguel Ballesteros由歐盟委員會根據合同編號FP7-ICT-610411(項目MULTISENSOR)和H2020-RIA-645012(項目KRISTINA)資助。

References

[Ando and Zhang2005a] Rie Kubota Ando and Tong Zhang. 2005a. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853.
[Ando and Zhang2005b] Rie Kubota Ando and Tong Zhang. 2005b. Learning predictive structures. JMLR, 6:1817–1853.
[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based dependency parsing by modeling characters instead of words with LSTMs. In Proceedings of EMNLP.
[Ballesteros et al.2016] Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with Exploration Improves a Greedy Stack-LSTM Parser. In arXiv:1603.03793.
[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.
[Biemann et al.2007] Chris Biemann, Gerhard Heyer, Uwe Quasthoff, and Matthias Richter. 2007. The leipzig corpora collection-monolingual corpora of standard size. Proceedings of Corpus Linguistic.
[Callison-Burch et al.2010] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.
[Carreras et al.2002] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning, pages 1–4.
[Chiu and Nichols2015] Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
[Cucerzan and Yarowsky1999] Silviu Cucerzan and David Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, pages 90–99.
[Cucerzan and Yarowsky2002] Silviu Cucerzan and David Yarowsky. 2002. Language independent ner using a unified model of internal and contextual evidence. In proceedings of the 6th conference on Natural language learning-Volume 20, pages 1–4. Association for Computational Linguistics.
[Dai et al.2015] Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics, 7(Suppl 1):S14.
[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL.
[Eisenstein et al.2011] Jacob Eisenstein, Tae Yano, William W Cohen, Noah A Smith, and Eric P Xing. 2011. Structured databases of named entities from bayesian nonparametrics. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 2–12. Association for Computational Linguistics.
[Florian et al.2003] Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 168–171. Association for Computational Linguistics.
[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[Graff2011] David Graff. 2011. Spanish gigaword third edition (ldc2011t12). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.
[Graves and Schmidhuber2005] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Proc. IJCNN.
[Hinton et al.2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[Hoffart et al.2011] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
[Kim et al.2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.
[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.
[Lin and Wu2009] Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1030–1038. Association for Computational Linguistics.
[Ling et al.2015a] Wang Ling, Lin Chu-Cheng, Yulia Tsvetkov, Silvio Amir, Ramón Fernandez Astudillo, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015a. Not all contexts are created equal: Better word representations with variable attention. In Proc. EMNLP.
[Ling et al.2015b] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015b. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
[Luo et al.2015] Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. EMNLP.
[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proc. NIPS.
[Nivre2004] Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.
[Nothman et al.2013] Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R Curran. 2013. Learning multilingual named entity recognition from wikipedia. Artificial Intelligence, 194:151–175.
[Parker et al.2009] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English gigaword fourth edition (ldc2009t13). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.
[Passos et al.2014] Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.
[Qi et al.2009] Yanjun Qi, Ronan Collobert, Pavel Kuksa, Koray Kavukcuoglu, and Jason Weston. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 1737–1740. ACM.
[Ratinov and Roth2009] Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.
[Santos and Guimaraes2015] Cícero Nogueira dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008.
[Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proc. CoNLL.
[Tjong Kim Sang2002] Erik F. Tjong Kim Sang. 2002. Introduction to the conll-2002 shared task: Language-independent named entity recognition. In Proc. CoNLL.
[Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.
[Zeiler2012] Matthew D Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
[Zhang and Clark2011] Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1).
[Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
[Zhou and Xu2015] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
