Neural Architectures for Named Entity Recognition (Full Translation)

## Preface
Original paper: https://arxiv.org/pdf/1603.01360.pdf
Machine translation tool: http://fanyi.youdao.com/
Manual revision: https://blog.csdn.net/qq_41837900
This translation was produced mainly with Youdao Translate and then revised by hand, aiming to be faithful, fluent, and readable.

Main text:

Neural Architectures for Named Entity Recognition

Guillaume Lample♠ Miguel Ballesteros♣♠
Sandeep Subramanian♠ Kazuya Kawakami♠ Chris Dyer♠
♠Carnegie Mellon University ♣NLP Group, Pompeu Fabra University

{glample,sandeeps,kkawakam,cdyer}@cs.cmu.edu,
[email protected]

Abstract

State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn
effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures—one
based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based
approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.¹


¹ The code of the LSTM-CRF and Stack-LSTM NER systems are available at https://github.com/glample/tagger and https://github.com/clab/stack-lstm-ner

1 Introduction

Named entity recognition (NER) is a challenging learning problem. On the one hand, in most languages and domains, there is only a very small amount of supervised training data available. On the other, there are few constraints on the kinds of words that can be names, so generalizing from this small sample of data is difficult. As a result, carefully constructed orthographic features and language-specific knowledge resources, such as gazetteers, are widely used for solving this task. Unfortunately, language-specific resources and features are costly to develop in new languages and new domains, making NER a challenge to adapt. Unsupervised learning from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features (Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers).


In this paper, we present neural architectures for NER that use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. Our models are designed to capture two intuitions. First, since names often consist of multiple tokens, reasoning jointly over tagging decisions for each token is important. We compare two models here, (i) a bidirectional LSTM with a sequential conditional random layer above it (LSTM-CRF; §2), and (ii) a new model that constructs and labels chunks of input sentences using an algorithm inspired by transition-based parsing with states represented by stack LSTMs (S-LSTM; §3). Second, token-level evidence for "being a name" includes both orthographic evidence (what does the word being tagged as a name look like?) and distributional evidence (where does the word being tagged tend to occur in a corpus?). To capture orthographic sensitivity, we use a character-based word representation model (Ling et al., 2015b); to capture distributional sensitivity, we combine these representations with distributional representations (Mikolov et al., 2013b). Our word representations combine both of these, and dropout training is used to encourage the model to learn to trust both sources of evidence (§4).

Translator's note: dropout randomly deactivates some hidden units during training; the dropped units are temporarily excluded from the network, but their weights are kept (just not updated) and may become active again for the next sample. When training data is limited, dropout is a common trick to reduce overfitting.

Experiments in English, Dutch, German, and Spanish show that we are able to obtain state-of-the-art NER performance with the LSTM-CRF model in Dutch, German, and Spanish, and very near the state-of-the-art in English without any hand-engineered features or gazetteers (§5). The transition-based algorithm likewise surpasses the best previously published results in several languages, although it performs less well than the LSTM-CRF model.


2 LSTM-CRF Model

We provide a brief description of LSTMs and CRFs, and present a hybrid tagging architecture. This architecture is similar to the ones presented by Collobert et al. (2011) and Huang et al. (2015).


2.1 LSTM

Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ and return another sequence $(\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n)$ that represents some information about the sequence at every step in the input. Although RNNs can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards their most recent inputs in the sequence (Bengio et al., 1994). Long Short-term Memory Networks (LSTMs) have been designed to combat this issue by incorporating a memory-cell and have been shown to capture long-range dependencies. They do so using several gates that control the proportion of the input to give to the memory cell, and the proportion from the previous state to forget (Hochreiter and Schmidhuber, 1997). We use the following implementation:


$$\begin{aligned}
\mathbf{i}_{t} &= \sigma\left(\mathbf{W}_{xi} \mathbf{x}_{t}+\mathbf{W}_{hi} \mathbf{h}_{t-1}+\mathbf{W}_{ci} \mathbf{c}_{t-1}+\mathbf{b}_{i}\right) \\
\mathbf{c}_{t} &= \left(1-\mathbf{i}_{t}\right) \odot \mathbf{c}_{t-1}+\mathbf{i}_{t} \odot \tanh \left(\mathbf{W}_{xc} \mathbf{x}_{t}+\mathbf{W}_{hc} \mathbf{h}_{t-1}+\mathbf{b}_{c}\right) \\
\mathbf{o}_{t} &= \sigma\left(\mathbf{W}_{xo} \mathbf{x}_{t}+\mathbf{W}_{ho} \mathbf{h}_{t-1}+\mathbf{W}_{co} \mathbf{c}_{t}+\mathbf{b}_{o}\right) \\
\mathbf{h}_{t} &= \mathbf{o}_{t} \odot \tanh \left(\mathbf{c}_{t}\right)
\end{aligned}$$

where $\sigma$ is the element-wise sigmoid function, and $\odot$ is the element-wise product.
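As a concrete illustration, the following is a minimal NumPy sketch of one step of this LSTM variant (note the coupled gate: the forget gate is replaced by $1-\mathbf{i}_t$). The parameter names, toy dimensions, and random initialization are illustrative only, not the authors' implementation; the peephole weights $\mathbf{W}_{ci}$ and $\mathbf{W}_{co}$ are treated here as ordinary matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One step of the LSTM variant above; the forget gate is tied to 1 - i_t."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] @ c_prev + W["bi"])
    c_t = (1 - i_t) * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + W["bc"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] @ c_t + W["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# toy dimensions: input size d = 4, hidden size k = 3
d, k = 4, 3
rng = np.random.default_rng(0)
W = {name: rng.normal(scale=0.1, size=shape) for name, shape in [
    ("xi", (k, d)), ("hi", (k, k)), ("ci", (k, k)), ("bi", (k,)),
    ("xc", (k, d)), ("hc", (k, k)), ("bc", (k,)),
    ("xo", (k, d)), ("ho", (k, k)), ("co", (k, k)), ("bo", (k,)),
]}
h, c = np.zeros(k), np.zeros(k)
for x in rng.normal(size=(5, d)):   # a toy sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W)
print(h)                            # final hidden state
```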


For a given sentence $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ containing $n$ words, each represented as a $d$-dimensional vector, an LSTM computes a representation $\overrightarrow{\mathbf{h}_t}$ of the left context of the sentence at every word $t$. Naturally, generating a representation of the right context $\overleftarrow{\mathbf{h}_t}$ as well should add useful information. This can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the forward LSTM and the latter as the backward LSTM. These are two distinct networks with different parameters. This forward and backward LSTM pair is referred to as a bidirectional LSTM (Graves and Schmidhuber, 2005).


The representation of a word using this model is obtained by concatenating its left and right context representations, $\mathbf{h}_t = [\overrightarrow{\mathbf{h}_t}; \overleftarrow{\mathbf{h}_t}]$. These representations effectively include a representation of a word in context, which is useful for numerous tagging applications.


2.2 CRF Tagging Models

A very simple—but surprisingly effective—tagging model is to use the $\mathbf{h}_t$'s as features to make independent tagging decisions for each output $y_t$ (Ling et al., 2015b). Despite this model's success in simple problems like POS tagging, its independent classification decisions are limiting when there are strong dependencies across output labels. NER is one such task, since the "grammar" that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC; see §2.4 for details) that would be impossible to model with independence assumptions.


Therefore, instead of modeling tagging decisions independently, we model them jointly using a conditional random field (Lafferty et al., 2001). For an input sentence


$$\mathbf{X}=\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n}\right),$$

we consider $\mathbf{P}$ to be the matrix of scores output by the bidirectional LSTM network. $\mathbf{P}$ is of size $n \times k$, where $k$ is the number of distinct tags, and $P_{i,j}$ corresponds to the score of the $j^{th}$ tag of the $i^{th}$ word in a sentence. For a sequence of predictions


$$\mathbf{y}=\left(y_{1}, y_{2}, \ldots, y_{n}\right),$$

we define its score to be


$$s(\mathbf{X}, \mathbf{y})=\sum_{i=0}^{n} A_{y_{i}, y_{i+1}}+\sum_{i=1}^{n} P_{i, y_{i}}$$

where $A$ is a matrix of transition scores such that $A_{i,j}$ represents the score of a transition from the tag $i$ to tag $j$. $y_0$ and $y_n$ are the start and end tags of a sentence, that we add to the set of possible tags. $A$ is therefore a square matrix of size $k+2$.

Translator's note: the start and end tags are added to the set of possible tags, hence the size $k+2$.
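As an illustration, here is a small NumPy sketch of the score $s(\mathbf{X}, \mathbf{y})$ defined above, assuming $P$ is the $n \times k$ emission matrix produced by the bidirectional LSTM and $A$ is the $(k+2) \times (k+2)$ transition matrix whose last two indices play the role of the added start and end tags; all names and toy values are hypothetical.

```python
import numpy as np

def crf_score(P, A, y, start, end):
    """Score s(X, y): emission scores P[i, y_i] plus transition scores
    A[y_i, y_{i+1}], with start/end tags padded around the sequence."""
    tags = [start] + list(y) + [end]
    trans = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    emit = sum(P[i, t] for i, t in enumerate(y))
    return trans + emit

# toy example: n = 3 words, k = 2 real tags, indices 2/3 are the start/end tags
rng = np.random.default_rng(0)
n, k = 3, 2
P = rng.normal(size=(n, k))          # emission scores from the BiLSTM
A = rng.normal(size=(k + 2, k + 2))  # transition scores incl. start/end
print(crf_score(P, A, [0, 1, 0], start=2, end=3))
```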

A softmax over all possible tag sequences yields a probability for the sequence $\mathbf{y}$:


$$p(\mathbf{y} \mid \mathbf{X})=\frac{e^{s(\mathbf{X}, \mathbf{y})}}{\sum_{\widetilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}} e^{s(\mathbf{X}, \widetilde{\mathbf{y}})}}$$

During training, we maximize the log-probability of the correct tag sequence:


$$\begin{aligned} \log (p(\mathbf{y} \mid \mathbf{X})) &= s(\mathbf{X}, \mathbf{y})-\log \left(\sum_{\widetilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}} e^{s(\mathbf{X}, \widetilde{\mathbf{y}})}\right) \\ &= s(\mathbf{X}, \mathbf{y})-\underset{\widetilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}}{\operatorname{logadd}}\, s(\mathbf{X}, \widetilde{\mathbf{y}}) \qquad (1) \end{aligned}$$

where $\mathbf{Y}_{\mathbf{X}}$ represents all possible tag sequences (even those that do not verify the IOB format) for a sentence $\mathbf{X}$. From the formulation above, it is evident that we encourage our network to produce a valid sequence of output labels. While decoding, we predict the output sequence that obtains the maximum score given by:


$$\mathbf{y}^{*}=\underset{\widetilde{\mathbf{y}} \in \mathbf{Y}_{\mathbf{X}}}{\operatorname{argmax}}\; s(\mathbf{X}, \widetilde{\mathbf{y}}) \qquad (2)$$

Since we are only modeling bigram interactions between outputs, both the summation in Eq. 1 and the maximum a posteriori sequence $\mathbf{y}^{*}$ in Eq. 2 can be computed using dynamic programming.

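For illustration, a hedged NumPy sketch of that dynamic program follows: the same recursion over tag bigrams gives both the log-sum-exp over all sequences needed in Eq. 1 and the Viterbi maximization of Eq. 2. The layout of $P$ and $A$ (and the start/end indices) follows the toy convention of the previous sketch and is not the authors' code.

```python
import numpy as np

def forward_logsumexp(P, A, start, end):
    """log of the sum over all tag sequences of exp(s(X, y)), by dynamic programming."""
    n, k = P.shape
    alpha = A[start, :k] + P[0]                     # scores of length-1 prefixes
    for i in range(1, n):
        # alpha[cur] = logsumexp_prev(alpha[prev] + A[prev, cur]) + P[i, cur]
        alpha = np.logaddexp.reduce(alpha[:, None] + A[:k, :k], axis=0) + P[i]
    return np.logaddexp.reduce(alpha + A[:k, end])

def viterbi(P, A, start, end):
    """Highest-scoring tag sequence y* of Eq. 2, recovered with backpointers."""
    n, k = P.shape
    score, back = A[start, :k] + P[0], []
    for i in range(1, n):
        cand = score[:, None] + A[:k, :k]           # cand[prev, cur]
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0) + P[i]
    score = score + A[:k, end]
    best = [int(score.argmax())]
    for bp in reversed(back):
        best.append(int(bp[best[-1]]))
    return best[::-1]

# toy usage: n = 4 words, k = 3 tags, indices 3/4 are the start/end tags
rng = np.random.default_rng(0)
P, A = rng.normal(size=(4, 3)), rng.normal(size=(5, 5))
print(viterbi(P, A, start=3, end=4), forward_logsumexp(P, A, start=3, end=4))
```

The training objective of Eq. 1 is then the score of the gold sequence minus `forward_logsumexp`.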

2.3 Parameterization and Training

The scores associated with each tagging decision for each token (i.e., the $P_{i,y}$'s) are defined to be the dot product between the embedding of a word-in-context computed with a bidirectional LSTM—exactly the same as the POS tagging model of Ling et al. (2015b) and these are combined with bigram compatibility scores (i.e., the $A_{y,y'}$'s). This architecture is shown in figure 1. Circles represent observed variables, diamonds are deterministic functions of their parents, and double circles are random variables.




Figure 1: Main architecture of the network. Word embeddings are given to a bidirectional LSTM. $l_i$ represents the word $i$ and its left context, $r_i$ represents the word $i$ and its right context. Concatenating these two vectors yields a representation of the word $i$ in its context, $c_i$.



The parameters of this model are thus the matrix of bigram compatibility scores $A$, and the parameters that give rise to the matrix $\mathbf{P}$, namely the parameters of the bidirectional LSTM, the linear feature weights, and the word embeddings. As in part 2.2, let $\mathbf{x}_i$ denote the sequence of word embeddings for every word in a sentence, and $y_i$ be their associated tags. We return to a discussion of how the embeddings $\mathbf{x}_i$ are modeled in Section 4. The sequence of word embeddings is given as input to a bidirectional LSTM, which returns a representation of the left and right context for each word as explained in 2.1.


These representations are concatenated ($c_i$) and linearly projected onto a layer whose size is equal to the number of distinct tags. Instead of using the softmax output from this layer, we use a CRF as previously described to take into account neighboring tags, yielding the final predictions for every word $y_i$. Additionally, we observed that adding a hidden layer between $c_i$ and the CRF layer marginally improved our results. All results reported with this model incorporate this extra layer. The parameters are trained to maximize Eq. 1 of observed sequences of NER tags in an annotated corpus, given the observed words.

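A small sketch of that projection step may help: each word-in-context vector $c_i$ is passed through the optional hidden layer and then linearly mapped to $k$ tag scores, producing the emission matrix $P$ consumed by the CRF. The sizes, the tanh nonlinearity, and the initialization here are illustrative assumptions.

```python
import numpy as np

def emission_scores(C, W_hidden, b_hidden, W_out, b_out):
    """Map word-in-context vectors c_1..c_n to the n x k score matrix P:
    an (optional) hidden layer followed by a linear projection to tag space."""
    H = np.tanh(C @ W_hidden + b_hidden)
    return H @ W_out + b_out

rng = np.random.default_rng(0)
n, d, h, k = 4, 200, 100, 17            # toy sizes: 4 words, 17 distinct tags
C = rng.normal(size=(n, d))             # concatenated BiLSTM outputs c_i
P = emission_scores(C,
                    rng.normal(scale=0.1, size=(d, h)), np.zeros(h),
                    rng.normal(scale=0.1, size=(h, k)), np.zeros(k))
print(P.shape)                          # (4, 17)
```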

2.4 Tagging Schemes

The task of named entity recognition is to assign a named entity label to every word in a sentence. A single named entity could span several tokens within a sentence. Sentences are usually represented in the IOB format (Inside, Outside, Beginning) where every token is labeled as B-label if the token is the beginning of a named entity, I-label if it is inside a named entity but not the first token within the named entity, or O otherwise. However, we decided to use the IOBES tagging scheme, a variant of IOB commonly used for named entity recognition, which encodes information about singleton entities (S) and explicitly marks the end of named entities (E). Using this scheme, tagging a word as I-label with high-confidence narrows down the choices for the subsequent word to I-label or E-label, however, the IOB scheme is only capable of determining that the subsequent word cannot be the interior of another label. Ratinov and Roth (2009) and Dai et al. (2015) showed that using a more expressive tagging scheme like IOBES improves model performance marginally. However, we did not observe a significant improvement over the IOB tagging scheme.

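To make the two schemes concrete, a small conversion sketch follows (assuming standard IOB2 input such as B-PER I-PER O; purely illustrative, not part of the paper's code):

```python
def iob_to_iobes(tags):
    """Convert IOB tags to IOBES: single-token entities become S-, and
    entity-final tokens become E-; everything else is unchanged."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
        elif tag.startswith("B-"):
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        elif tag.startswith("I-"):
            out.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
    return out

print(iob_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'E-PER', 'O', 'S-LOC']
```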

3 Transition-Based Chunking Model

As an alternative to the LSTM-CRF discussed in the previous section, we explore a new architecture that chunks and labels a sequence of inputs using an algorithm similar to transition-based dependency parsing. This model directly constructs representations of the multi-token names (e.g., the name Mark Watney is composed into a single representation).


This model relies on a stack data structure to incrementally construct chunks of the input. To obtain representations of this stack used for predicting subsequent actions, we use the Stack-LSTM presented by Dyer et al. (2015), in which the LSTM is augmented with a “stack pointer.” While sequential LSTMs model sequences from left to right, stack LSTMs permit embedding of a stack of objects that are both added to (using a push operation) and removed from (using a pop operation). This allows the Stack-LSTM to work like a stack that maintains a “summary embedding” of its contents. We refer to this model as Stack-LSTM or S-LSTM model for simplicity.


Finally, we refer interested readers to the original paper (Dyer et al., 2015) for details about the StackLSTM model since in this paper we merely use the same architecture through a new transition-based algorithm presented in the following Section.


3.1 Chunking Algorithm

We designed a transition inventory which is given in Figure 2 that is inspired by transition-based parsers, in particular the arc-standard parser of Nivre (2004). In this algorithm, we make use of two stacks (designated output and stack representing, respectively, completed chunks and scratch space) and a buffer that contains the words that have yet to be processed. The transition inventory contains the following transitions: The SHIFT transition moves a word from the buffer to the stack, the OUT transition moves a word from the buffer directly into the output stack while the REDUCE(y) transition pops all items from the top of the stack creating a "chunk," labels this with label y, and pushes a representation of this chunk onto the output stack. The algorithm completes when the stack and buffer are both empty. The algorithm is depicted in Figure 3, which shows the sequence of operations required to process the sentence Mark Watney visited Mars.

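A minimal sketch of how these transitions chunk the example sentence is given below. The action sequence is hand-picked here; in the model it is predicted by a classifier over the Stack-LSTM states, as described next.

```python
def apply_actions(words, actions):
    """Replay a sequence of SHIFT / OUT / REDUCE-y actions over a buffer of words.
    REDUCE-y pops everything on the stack into one labeled chunk."""
    buffer, stack, output = list(words), [], []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "OUT":
            output.append((buffer.pop(0), "O"))
        elif act.startswith("REDUCE-"):
            output.append((" ".join(stack), act[len("REDUCE-"):]))
            stack.clear()
    assert not buffer and not stack, "the algorithm ends with empty stack and buffer"
    return output

words = ["Mark", "Watney", "visited", "Mars"]
actions = ["SHIFT", "SHIFT", "REDUCE-PER", "OUT", "SHIFT", "REDUCE-LOC"]
print(apply_actions(words, actions))
# [('Mark Watney', 'PER'), ('visited', 'O'), ('Mars', 'LOC')]
```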

The model is parameterized by defining a probability distribution over actions at each time step, given the current contents of the stack, buffer, and output, as well as the history of actions taken. Following Dyer et al. (2015), we use stack LSTMs to compute a fixed dimensional embedding of each of these, and take a concatenation of these to obtain the full algorithm state. This representation is used to define a distribution over the possible actions that can be taken at each time step. The model is trained to maximize the conditional probability of sequences of reference actions (extracted from a labeled training corpus) given the input sentences. To label a new input sequence at test time, the maximum probability action is chosen greedily until the algorithm reaches a termination state. Although this is not guaranteed to find a global optimum, it is effective in practice. Since each token is either moved directly to the output (1 action) or first to the stack and then the output (2 actions), the total number of actions for a sequence of length n is maximally 2n .




Figure 2: Transitions of the Stack-LSTM model indicating the action applied and the resulting state. Bold symbols indicate (learned) embeddings of words and relations, script symbols indicate the corresponding words and relations.


Figure 3: Transition sequence for Mark Watney visited Mars with the Stack-LSTM model.



It is worth noting that the nature of this algorithm model makes it agnostic to the tagging scheme used since it directly predicts labeled chunks.


3.2 Representing Labeled Chunks

When the REDUCE($y$) operation is executed, the algorithm shifts a sequence of tokens (together with their vector embeddings) from the stack to the output buffer as a single completed chunk. To compute an embedding of this sequence, we run a bidirectional LSTM over the embeddings of its constituent tokens together with a token representing the type of the chunk being identified (i.e., $y$). This function is given as $g(\mathbf{u}, \ldots, \mathbf{v}, \mathbf{r}_y)$, where $\mathbf{r}_y$ is a learned embedding of a label type. Thus, the output buffer contains a single vector representation for each labeled chunk that is generated, regardless of its length.


4 Input Word Embeddings

The input layers to both of our models are vector representations of individual words. Learning independent representations for word types from the limited NER training data is a difficult problem: there are simply too many parameters to reliably estimate. Since many languages have orthographic or morphological evidence that something is a name (or not a name), we want representations that are sensitive to the spelling of words. We therefore use a model that constructs representations of words from representations of the characters they are composed of (4.1). Our second intuition is that names, which may individually be quite varied, appear in regular contexts in large corpora. Therefore we use embeddings learned from a large corpus that are sensitive to word order (4.2). Finally, to prevent the models from depending on one representation or the other too strongly, we use dropout training and find this is crucial for good generalization performance (4.3).

Figure 4: The character embeddings of the word "Mars" are given to bidirectional LSTMs. We concatenate their last outputs to an embedding from a lookup table to obtain a representation for this word.


4.1 Character-based models of words

An important distinction of our work from most previous approaches is that we learn character-level features while training instead of hand-engineering prefix and suffix information about words. Learning character-level embeddings has the advantage of learning representations specific to the task and domain at hand. They have been found useful for morphologically rich languages and to handle the out-of-vocabulary problem for tasks like part-of-speech tagging and language modeling (Ling et al., 2015b) or dependency parsing (Ballesteros et al., 2015).


Figure 4 describes our architecture to generate a word embedding for a word from its characters. A character lookup table initialized at random contains an embedding for every character. The character embeddings corresponding to every character in a word are given in direct and reverse order to a forward and a backward LSTM. The embedding for a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional LSTM. This character-level representation is then concatenated with a word-level representation from a word lookup-table. During testing, words that do not have an embedding in the lookup table are mapped to a UNK embedding. To train the UNK embedding, we replace singletons with the UNK embedding with a probability 0.5. In all our experiments, the hidden dimension of the forward and backward character LSTMs are 25 each, which results in our character-based representation of words being of dimension 50.

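The sketch below illustrates this composition. For brevity a plain tanh RNN stands in for each character LSTM and the forward and backward readers share parameters; all names, dimensions, and the tiny vocabularies are made up, whereas a real model uses separate learned LSTMs and a full lookup table.

```python
import numpy as np

rng = np.random.default_rng(0)
CHARS = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")}
CHAR_DIM, CHAR_HID, WORD_DIM = 10, 25, 100
C = rng.normal(scale=0.1, size=(len(CHARS), CHAR_DIM))      # character lookup table
Wx = rng.normal(scale=0.1, size=(CHAR_HID, CHAR_DIM))
Wh = rng.normal(scale=0.1, size=(CHAR_HID, CHAR_HID))
word_table = {"mars": rng.normal(size=WORD_DIM), "<UNK>": rng.normal(size=WORD_DIM)}

def final_state(char_embs):
    """Final hidden state of a plain tanh RNN (a stand-in for the character LSTM)."""
    h = np.zeros(CHAR_HID)
    for e in char_embs:
        h = np.tanh(Wx @ e + Wh @ h)
    return h

def word_representation(word):
    chars = [C[CHARS[c]] for c in word if c in CHARS]
    fwd = final_state(chars)                  # characters read left to right
    bwd = final_state(chars[::-1])            # characters read right to left
    char_rep = np.concatenate([fwd, bwd])     # 2 * 25 = 50 dimensions
    word_emb = word_table.get(word.lower(), word_table["<UNK>"])   # unseen words map to UNK
    return np.concatenate([char_rep, word_emb])                    # 50 + 100 dimensions

print(word_representation("Mars").shape)      # (150,)
```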

Recurrent models like RNNs and LSTMs are capable of encoding very long sequences, however, they have a representation biased towards their most recent inputs. As a result, we expect the final representation of the forward LSTM to be an accurate representation of the suffix of the word, and the final state of the backward LSTM to be a better representation of its prefix. Alternative approaches—most notably like convolutional networks—have been proposed to learn representations of words from their characters (Zhang et al., 2015; Kim et al., 2015). However, convnets are designed to discover position-invariant features of their inputs. While this is appropriate for many problems, e.g., image recognition (a cat can appear anywhere in a picture), we argue that important information is position dependent (e.g., prefixes and suffixes encode different information than stems), making LSTMs an a priori better function class for modeling the relationship between words and their characters.


4.2 Pretrained embeddings

As in Collobert et al. (2011), we use pretrained word embeddings to initialize our lookup table. We observe significant improvements using pretrained word embeddings over randomly initialized ones. Embeddings are pretrained using skip-n-gram (Ling et al., 2015a), a variation of word2vec (Mikolov et al., 2013a) that accounts for word order. These embeddings are fine-tuned during training. Word embeddings for Spanish, Dutch, German and English are trained using the Spanish Gigaword version 3, the Leipzig corpora collection, the German monolingual training data from the 2010 Machine Translation Workshop and the English Gigaword version 4 (with the LA Times and NY Times portions removed) respectively.² We use an embedding dimension of 100 for English, 64 for other languages, a minimum word frequency cutoff of 4, and a window size of 8.


² (Graff, 2011; Biemann et al., 2007; Callison-Burch et al., 2010; Parker et al., 2009)


4.3 Dropout training

Initial experiments showed that character-level embeddings did not improve our overall performance when used in conjunction with pretrained word representations. To encourage the model to depend on both representations, we use dropout training (Hinton et al., 2012), applying a dropout mask to the final embedding layer just before the input to the bidirectional LSTM in Figure 1. We observe a significant improvement in our model’s performance after using dropout (see table 5).

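A sketch of this step, assuming the common "inverted dropout" formulation (the paper does not state the exact scaling variant), is shown below; the 150-dimensional toy vector stands for the concatenated character-based and pretrained parts of a word representation.

```python
import numpy as np

def dropout(x, rate, rng, train=True):
    """Randomly zero components of the final embedding at training time (and rescale),
    so the model cannot rely on only one part of the word representation."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
embedding = rng.normal(size=150)       # char-based (50) + pretrained (100) parts
print(dropout(embedding, rate=0.5, rng=rng)[:5])
```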

5 Experiments

This section presents the methods we use to train our models, the results we obtained on various tasks and the impact of our networks’ configuration on model performance.


5.1 Training

For both models presented, we train our networks using the back-propagation algorithm updating our parameters on every training example, one at a time, using stochastic gradient descent (SGD) with a learning rate of 0.01 and a gradient clipping of 5.0. Several methods have been proposed to enhance the performance of SGD, such as Adadelta (Zeiler, 2012) or Adam (Kingma and Ba, 2014). Although we observe faster convergence using these methods, none of them perform as well as SGD with gradient clipping.

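A hedged sketch of such an update follows. The paper reports a learning rate of 0.01 and gradient clipping of 5.0 but does not specify the clipping variant; this sketch assumes clipping of the global L2 norm, and the parameter names are made up.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01, clip=5.0):
    """One SGD update with gradient clipping: rescale the whole gradient if its
    L2 norm exceeds the threshold, then take a plain gradient step."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = clip / norm if norm > clip else 1.0
    for name, g in grads.items():
        params[name] -= lr * scale * g
    return params

# toy usage: one parameter matrix and its (deliberately large) gradient
rng = np.random.default_rng(0)
params = {"W": rng.normal(size=(3, 3))}
grads = {"W": rng.normal(size=(3, 3)) * 10.0}
sgd_step(params, grads)
```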

Our LSTM-CRF model uses a single layer for the forward and backward LSTMs whose dimensions are set to 100. Tuning this dimension did not significantly impact model performance. We set the dropout rate to 0.5. Using higher rates negatively impacted our results, while smaller rates led to longer training time.
The stack-LSTM model uses two layers each of dimension 100 for each stack.


The embeddings of the actions used in the composition functions have 16 dimensions each, and the output embedding is of dimension 20. We experimented with different dropout rates and reported the scores using the best dropout rate for each language.³ It is a greedy model that applies locally optimal actions until the entire sentence is processed; further improvements might be obtained with beam search (Zhang and Clark, 2011) or training with exploration (Ballesteros et al., 2016).



³ English (D=0.2), German, Spanish and Dutch (D=0.3)

5.2 Data Sets

We test our model on different datasets for named entity recognition. To demonstrate our model’s ability to generalize to different languages, we present results on the CoNLL-2002 and CoNLL- 2003 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) that contain independent named entity labels for English, Spanish, German and Dutch. All datasets contain four different types of named entities: locations, persons, organizations, and miscellaneous entities that do not belong in any of the three previous categories. Although POS tags were made available for all datasets, we did not include them in our models. We did not perform any dataset preprocessing, apart from replacing every digit with a zero in the English NER dataset.


5.3 Results

Table 1 presents our comparisons with other models for named entity recognition in English. To make the comparison between our model and others fair, we report the scores of other models with and without the use of external labeled data such as gazetteers and knowledge bases. Our models do not use gazetteers or any external labeled resources. The best score reported on this task is by Luo et al. (2015). They obtained a F1 of 91.2 by jointly modeling the NER and entity linking tasks (Hoffart et al., 2011). Their model uses a lot of hand-engineered features including spelling features, WordNet clusters, Brown clusters, POS tags, chunks tags, as well as stemming and external knowledge bases like Freebase and Wikipedia. Our LSTM-CRF model outperforms all other systems, including the ones using external labeled data like gazetteers. Our StackLSTM model also outperforms all previous models that do not incorporate external features, apart from the one presented by Chiu and Nichols (2015).


Tables 2, 3 and 4 present our results on NER for German, Dutch and Spanish respectively in comparison to other models. On these three languages, the LSTM-CRF model significantly outperforms all previous methods, including the ones using external labeled data. The only exception is Dutch, where the model of Gillick et al. (2015) can perform better by leveraging the information from other NER datasets. The Stack-LSTM also consistently presents state-of-the-art (or close to) results compared to systems that do not use external data.


As we can see in the tables, the Stack-LSTM model is more dependent on character-based representations to achieve competitive performance; we hypothesize that the LSTM-CRF model requires less orthographic information since it gets more contextual information out of the bidirectional LSTMs; however, the Stack-LSTM model consumes the words one by one and it just relies on the word representations when it chunks words.


[Tables 1-4: NER results (F1) for English, German, Dutch and Spanish compared to other models; table images not reproduced here.]

5.4 Network architectures

Our models have several components that we can tweak to understand their impact on overall performance. We explored the impact that the CRF, the character-level representations, pretraining of the word embeddings, and dropout have on the LSTM-CRF model. We observed that pretraining the word embeddings gave us the largest improvement in overall performance, +7.31 in F1. The CRF layer added +1.79, using dropout added +1.17, and finally learning character-level word embeddings added about +0.74. For the Stack-LSTM we performed a similar set of experiments. Results for the different architectures are given in Table 5.

Table 5: English NER results with our models, using different configurations. "pretrain" refers to models that include pretrained word embeddings, "char" refers to models that include character-based modeling of words, "dropout" refers to models that include dropout rate.


6 Related Work

In the CoNLL-2002 shared task, Carreras et al. (2002) obtained among the best results on both Dutch and Spanish by combining several small fixed-depth decision trees. Next year, in the CoNLL- 2003 Shared Task, Florian et al. (2003) obtained the best score on German by combining the output of four diverse classifiers. Qi et al. (2009) later improved on this with a neural network by doing unsupervised learning on a massive unlabeled corpus.
Several other neural architectures have previously been proposed for NER. For instance, Collobert et al. (2011) uses a CNN over a sequence of word embeddings with a CRF layer on top. This can be thought of as our first model without character-level embeddings and with the bidirectional LSTM being replaced by a CNN. More recently, Huang et al. (2015) presented a model similar to our LSTM-CRF, but using hand-crafted spelling features. Zhou and Xu (2015) also used a similar model and adapted it to the semantic role labeling task. Lin and Wu (2009) used a linear chain CRF with L2 regularization, they added phrase cluster features extracted from the web data and spelling features. Passos et al. (2014) also used a linear chain CRF with spelling features and gazetteers.
Language independent NER models like ours have also been proposed in the past. Cucerzan and Yarowsky (1999; 2002) present semi-supervised bootstrapping algorithms for named entity recognition by co-training character-level (word-internal) and token-level (context) features. Eisenstein et al. (2011) use Bayesian nonparametrics to construct a database of named entities in an almost unsupervised setting. Ratinov and Roth (2009) quantitatively compare several approaches for NER and build their own supervised model using a regularized average perceptron and aggregating context information.
Finally, there is currently a lot of interest in models for NER that use letter-based representations. Gillick et al. (2015) model the task of sequencelabeling as a sequence to sequence learning problem and incorporate character-based representations into their encoder model. Chiu and Nichols (2015) employ an architecture similar to ours, but instead use CNNs to learn character-level features, in a way similar to the work by Santos and Guimaraes (2015).


7 Conclusion

This paper presents two neural architectures for sequence labeling that provide the best NER results ever reported in standard evaluation settings, even compared with models that use external resources, such as gazetteers.
A key aspect of our models are that they model output label dependencies, either via a simple CRF architecture, or using a transition-based algorithm to explicitly construct and label chunks of the input. Word representations are also crucially important for success: we use both pre-trained word representations and “character-based” representations that capture morphological and orthographic information. To prevent the learner from depending too heavily on one representation class, dropout is used.


Acknowledgments

This work was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O) under the Low Resource Languages for Emergent Incidents (LORELEI) program issued by DARPA/I2O under Contract No. HR0011-15-C-0114. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).


References

[Ando and Zhang2005a] Rie Kubota Ando and Tong Zhang. 2005a. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853.
[Ando and Zhang2005b] Rie Kubota Ando and Tong Zhang. 2005b. Learning predictive structures. JMLR, 6:1817–1853.
[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based dependency parsing by modeling characters instead of words with LSTMs. In Proceedings of EMNLP.
[Ballesteros et al.2016] Miguel Ballesteros, Yoav Golderg, Chris Dyer, and Noah A. Smith. 2016. Training with Exploration Improves a Greedy Stack-LSTM Parser. In arXiv:1603.03793.
[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.
[Biemann et al.2007] Chris Biemann, Gerhard Heyer, Uwe Quasthoff, and Matthias Richter. 2007. The leipzig corpora collection-monolingual corpora of standard size. Proceedings of Corpus Linguistic.
[Callison-Burch et al.2010] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.
[Carreras et al.2002] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning, pages 1–4.
[Chiu and Nichols2015] Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
[Cucerzan and Yarowsky1999] Silviu Cucerzan and David Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, pages 90–99.
[Cucerzan and Yarowsky2002] Silviu Cucerzan and David Yarowsky. 2002. Language independent ner using a unified model of internal and contextual evidence. In proceedings of the 6th conference on Natural language learning-Volume 20, pages 1–4. Association for Computational Linguistics.
[Dai et al.2015] Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics, 7(Suppl 1):S14.
[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL.
[Eisenstein et al.2011] Jacob Eisenstein, Tae Yano, William W Cohen, Noah A Smith, and Eric P Xing. 2011. Structured databases of named entities from bayesian nonparametrics. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 2–12. Association for Computational Linguistics.
[Florian et al.2003] Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 168–171. Association for Computational Linguistics.
[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[Graff2011] David Graff. 2011. Spanish gigaword third edition (ldc2011t12). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.
[Graves and Schmidhuber2005] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Proc. IJCNN.
[Hinton et al.2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[Hoffart et al.2011] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
[Kim et al.2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.
[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.
[Lin and Wu2009] Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1030–1038. Association for Computational Linguistics.
[Ling et al.2015a] Wang Ling, Lin Chu-Cheng, Yulia Tsvetkov, Silvio Amir, Ramón Fernandez Astudillo, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015a. Not all contexts are created equal: Better word representations with variable attention. In Proc. EMNLP.
[Ling et al.2015b] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015b. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
[Luo et al.2015] Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. EMNLP.
[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proc. NIPS.
[Nivre2004] Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.
[Nothman et al.2013] Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R Curran. 2013. Learning multilingual named entity recognition from wikipedia. Artificial Intelligence, 194:151–175.
[Parker et al.2009] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English gigaword fourth edition (ldc2009t13). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.
[Passos et al.2014] Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.
[Qi et al.2009] Yanjun Qi, Ronan Collobert, Pavel Kuksa, Koray Kavukcuoglu, and Jason Weston. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 1737–1740. ACM.
[Ratinov and Roth2009] Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.
[Santos and Guimaraes2015] Cícero Nogueira dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008.
[Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proc. CoNLL.
[Tjong Kim Sang2002] Erik F. Tjong Kim Sang. 2002. Introduction to the conll-2002 shared task: Language-independent named entity recognition. In Proc. CoNLL.
[Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.
[Zeiler2012] Matthew D Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
[Zhang and Clark2011] Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1).
[Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
[Zhou and Xu2015] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
