BERT in Practice for Chinese: Named Entity Recognition

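I have been working on entity recognition tasks for a while. BERT has been popular for some time and I have studied it a little, so today I am writing down my basic understanding of using BERT for entity recognition, and I hope to discuss it with everyone.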

BERT's official GitHub repository is https://github.com/google-research/bert , which describes the model in detail; for more, see the original paper: https://arxiv.org/abs/1810.04805

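BERT can be understood simply as a two-stage NLP model: (1) pre-training, which plays a role similar to word embeddings and trains a model on text that carries no labels at all; (2) fine-tuning, which takes the pretrained model and, depending on the task, changes the input and modifies the output part to complete downstream tasks (such as named entity recognition, text classification, similarity computation and so on).
This post modifies the run_classifier.py provided in the official repository to perform named entity recognition.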

The model is BERT+BiLSTM-CRF; the BERT part in front is used to produce the word vectors.

Code walkthrough: a brief explanation of the main pieces of code

Data format

张 B-PER
三 I-PER
来 O
自 O
北 B-LOC
京 I-LOC

In the end we need to convert the data into the form used in the BERT paper.
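
As a rough illustration of what that form looks like for one sentence, the sketch below wraps the characters in [CLS] and [SEP], maps them to ids, and pads everything to a fixed length. The tiny vocabulary `toy_vocab`, the function name `to_bert_inputs` and the length of 10 are made up for this example; the real code uses the tokenizer and the vocab.txt shipped with the downloaded model.

```python
# Hypothetical toy vocabulary; the real ids come from vocab.txt.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "张": 4, "三": 5, "来": 6, "自": 7, "北": 8, "京": 9}


def to_bert_inputs(chars, max_seq_length):
    """Build tokens, input_ids, input_mask and segment_ids for one sentence."""
    # Leave room for the [CLS] and [SEP] markers.
    chars = chars[:max_seq_length - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks a real token
    segment_ids = [0] * len(input_ids)  # single-sentence task: all zeros
    # Pad every list with zeros up to the fixed length.
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad
    input_mask += [0] * pad
    segment_ids += [0] * pad
    return tokens, input_ids, input_mask, segment_ids


tokens, input_ids, input_mask, segment_ids = to_bert_inputs(list("张三来自北京"), 10)
print(tokens)
print(input_ids)
print(input_mask)
```

The mask lets the model tell real tokens from padding, which is why it is stored alongside the ids.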

Packing the data

The code packs all the data into TFRecord form:

    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        # Convert each example into features
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        # Serialize the Example's feature map to a compact binary string before writing
        writer.write(tf_example.SerializeToString())

Reading the data

Read the TFRecord data and group it into batches

train_input_fn = file_based_input_fn_builder(
            input_file=train_file,
            seq_length=FLAGS.max_seq_length,
            is_training=True,
            drop_remainder=True)

This, too, is done mainly through a callback function

    def input_fn(params):
        batch_size = params["batch_size"]
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)
        d = d.apply(tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder
        ))
        return d

input_file is the saved record file. d = tf.data.TFRecordDataset(input_file) reads the records, and map_and_batch then decodes them and groups them into batches.
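
To make the effect of batch_size and drop_remainder concrete, here is a plain-Python sketch of the batching step. It is not tf.data: the helper name `batches` is hypothetical, and it shuffles the whole list at once, whereas the real pipeline shuffles through a buffer of 100 records.

```python
import random


def batches(records, batch_size, drop_remainder, shuffle=False, seed=0):
    """Yield lists of up to batch_size records, mimicking batch + drop_remainder."""
    records = list(records)
    if shuffle:
        random.Random(seed).shuffle(records)
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # discard the final short batch
        yield batch


data = list(range(10))
kept = list(batches(data, batch_size=4, drop_remainder=True))
full = list(batches(data, batch_size=4, drop_remainder=False))
print(len(kept), len(full))
```

With drop_remainder=True every batch has exactly batch_size elements, at the cost of skipping the leftover records in that pass.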

The estimator wrapper

estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=FLAGS.use_tpu,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=FLAGS.train_batch_size,
        eval_batch_size=FLAGS.eval_batch_size,
        predict_batch_size=FLAGS.predict_batch_size)

With this wrapper, training, validation and testing are all convenient. model_fn here is the callback function that defines the model.

1. The main entry point

if __name__ == "__main__":
    flags.mark_flag_as_required("data_dir")
    flags.mark_flag_as_required("task_name")
    flags.mark_flag_as_required("vocab_file")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    tf.app.run()

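The main entry point marks several parameters as required:
data_dir: path of the folder that holds the input data
task_name: name of the task
vocab_file: the vocabulary; the downloaded model already includes it, named "vocab.txt"
bert_config_file: configuration parameters of the pretrained model, also in the downloaded model folder, named "bert_config.json"
output_dir: where the output files are saved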
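
The same "required parameter" idea can be tried without TensorFlow. The sketch below uses argparse from the standard library as a stand-in for the flags module used in the post; the function name `build_parser` and the directory names are hypothetical.

```python
import argparse


def build_parser():
    """Declare the five parameters that must always be supplied."""
    parser = argparse.ArgumentParser()
    for name in ["data_dir", "task_name", "vocab_file",
                 "bert_config_file", "output_dir"]:
        parser.add_argument("--" + name, required=True)
    return parser


# Hypothetical paths, for illustration only.
args = build_parser().parse_args([
    "--data_dir", "data",
    "--task_name", "ner",
    "--vocab_file", "chinese_model/vocab.txt",
    "--bert_config_file", "chinese_model/bert_config.json",
    "--output_dir", "output",
])
print(args.task_name, args.output_dir)
```

Leaving out any of the five arguments makes argparse exit with an error message, which mirrors what marking a flag as required achieves.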

2. The main(_) function

processors = {
        "ner": NerProcessor
    }
task_name = FLAGS.task_name.lower()  
processor = processors[task_name]()

In the code above, task_name is used to select the processor.
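
A minimal, self-contained sketch of this lookup is given below. `ToyNerProcessor` and `get_processor` are hypothetical names used only here; the point is that the task name is lower-cased, looked up in the dictionary, and the class found there is instantiated.

```python
class ToyNerProcessor:
    """Hypothetical stand-in for NerProcessor."""

    def get_labels(self):
        return ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]


processors = {"ner": ToyNerProcessor}


def get_processor(task_name):
    """Look up the processor class by lower-cased task name and instantiate it."""
    task_name = task_name.lower()
    if task_name not in processors:
        raise ValueError("Task not found: %s" % task_name)
    return processors[task_name]()


processor = get_processor("NER")
print(processor.get_labels())
```

Supporting another task would then only require a new processor class and one more dictionary entry.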

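processor: training or predicting with any model requires a well-defined input, and in the BERT code the processor is responsible for preparing that input. A custom processor must inherit from DataProcessor and override get_labels, which returns the labels, as well as get_train_examples, get_dev_examples and get_test_examples, which return the input examples. These are called in the FLAGS.do_train, FLAGS.do_eval and FLAGS.do_predict stages of main, respectively. The three functions are nearly identical; they differ only in the file each one reads.
The code of NerProcessor is as follows: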

class NerProcessor(DataProcessor):  # reads the data
    def get_train_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "train.txt")), "train"
        )

    def get_dev_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "dev.txt")), "dev"
        )

    def get_test_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "test.txt")), "test")

    def get_labels(self):

        # 9 BIO tags, plus X, [CLS] and [SEP]
        return ["O", "B-dizhi", "I-dizhi", "B-shouduan", "I-shouduan", "B-caiwu", "I-caiwu", "B-riqi", "I-riqi", "X",
                "[CLS]", "[SEP]"]

    def _create_example(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text = tokenization.convert_to_unicode(line[1])
            label = tokenization.convert_to_unicode(line[0])
            examples.append(InputExample(guid=guid, text=text, label=label))
        return examples

The code above mainly reads the data and inherits from the DataProcessor class. The _read_data() function is implemented in the parent class DataProcessor, as shown below:

class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_data(cls, input_file):
        """Reads a BIO data."""
        with codecs.open(input_file, 'r', encoding='utf-8') as f:
            lines = []
            words = []
            labels = []
            for line in f:
                contends = line.strip()
                # Adjust the split() delimiter to match your corpus
                tokens = contends.split()
                # print(len(tokens))
                if len(tokens) == 2:
                    word = tokens[0]
                    label = tokens[-1]
                else:
                    if len(contends) == 0:
                        l = ' '.join([label for label in labels if len(label) > 0])
                        w = ' '.join([word for word in words if len(word) > 0])
                        lines.append([l, w])
                        words = []
                        labels = []
                        continue
                if contends.startswith("-DOCSTART-"):
                    words.append('')
                    continue
                words.append(word)
                labels.append(label)

            return lines  # each entry is [labels, words]

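The _read_data() function was rewritten specifically for the NER task. It stores the characters of the input in words and the tags in labels. All characters of a sentence are joined with spaces into one string w, the tags are joined in the same way into l, and the pair of l and w is appended to lines, as shown below: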

l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
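
To try this grouping logic without the rest of the project, here is a simplified, self-contained sketch that works on a list of strings instead of a file. The function name `read_bio` is hypothetical, and it deliberately differs from the original in two ways: lines that do not have exactly two fields are skipped, and a final sentence with no trailing blank line is still emitted.

```python
def read_bio(lines):
    """Group 'char label' lines into [labels, words] pairs, one per sentence."""
    sentences, words, labels = [], [], []
    for line in lines:
        parts = line.strip().split()
        if len(parts) == 2:
            words.append(parts[0])
            labels.append(parts[1])
        elif not parts and words:
            # A blank line closes the current sentence.
            sentences.append([" ".join(labels), " ".join(words)])
            words, labels = [], []
    if words:
        # Flush a final sentence that has no trailing blank line.
        sentences.append([" ".join(labels), " ".join(words)])
    return sentences


sample = ["张 B-PER", "三 I-PER", "来 O", "自 O", "北 B-LOC", "京 I-LOC",
          "", "北 B-LOC", "京 I-LOC"]
result = read_bio(sample)
for sent_labels, sent_words in result:
    print(sent_words, "->", sent_labels)
```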

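def get_labels(self): returns the labels, adding three more on top of the original ones: "X", "[CLS]" and "[SEP]". The [CLS] marker is placed at the start of a sentence and the [SEP] marker at the end. "X" is used when an English word (for example an abbreviation) is split into several pieces by the tokenizer: every piece except the first is labeled "X".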
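
The role of "X" is easier to see on a small example. The sketch below uses a hypothetical `toy_tokenize` function that merely splits off a trailing "ing"; it stands in for the real WordPiece tokenizer and is not how BERT actually tokenizes. `align_labels` is likewise an illustrative name.

```python
def toy_tokenize(word):
    """Hypothetical tokenizer: splits off a trailing 'ing' as a '##ing' piece."""
    if len(word) > 3 and word.endswith("ing"):
        return [word[:-3], "##ing"]
    return [word]


def align_labels(words, word_labels, tokenize):
    """Give the first piece of each word its label and every later piece 'X'."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, labels


tokens, labels = align_labels(["playing", "in", "Beijing"],
                              ["O", "O", "B-LOC"], toy_tokenize)
print(tokens)
print(labels)
```

For Chinese text each character is normally a single token, so "X" rarely appears there.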

The code uses the InputExample class

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text, label=None):
        """Constructs an InputExample, i.e. one input example for the BiLSTM-CRF model.
        Args:
          guid: Unique id for the example.
          text: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text = text
        self.label = label

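As I understand it, this class wraps the input data: whatever the task, the input passes through this step so that its format is unified.
guid is an identifier built from the set type (train, dev or test) and the index of the example, for example train-0.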

3. Building the model

 model_fn = model_fn_builder(
        bert_config=bert_config,
        num_labels=len(label_list) + 1,
        init_checkpoint=FLAGS.init_checkpoint,
        learning_rate=FLAGS.learning_rate,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps,
        use_tpu=FLAGS.use_tpu,
        use_one_hot_embeddings=FLAGS.use_tpu)

    estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=FLAGS.use_tpu,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=FLAGS.train_batch_size,
        eval_batch_size=FLAGS.eval_batch_size,
        predict_batch_size=FLAGS.predict_batch_size)

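The returned model_fn is a function that defines the model together with its training and evaluation procedures. Through hooks, it also loads the parameters of the BERT model to initialize the parameters of our own model.

model_fn_builder exists to construct the model_fn that the Estimator calls by default. model_fn itself only accepts four parameters (features, labels, mode and params), so in order to pass in additional arguments, model_fn is wrapped in model_fn_builder.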
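
This wrapping is an ordinary Python closure. The sketch below shows the pattern in isolation; the returned dictionary is only a placeholder for what a real model_fn would build, and the values 13 and 5e-5 are illustrative.

```python
def model_fn_builder(num_labels, learning_rate):
    """Return a model_fn that remembers the extra arguments via closure."""

    def model_fn(features, labels, mode, params):
        # The extra arguments are visible here although they are not parameters.
        return {
            "mode": mode,
            "num_labels": num_labels,
            "learning_rate": learning_rate,
            "batch_size": params["batch_size"],
        }

    return model_fn


model_fn = model_fn_builder(num_labels=13, learning_rate=5e-5)
spec = model_fn(features=None, labels=None, mode="train",
                params={"batch_size": 32})
print(spec)
```

The caller only ever sees the four-argument signature, while the configuration travels along inside the closure.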

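This is TensorFlow's newer way of structuring code: define the model in a model_fn function, then let the Estimator API handle everything else, so that the Estimator controls training, prediction and evaluation.

init_checkpoint is the downloaded pretrained model.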

4. The train() function

    if FLAGS.do_train:
        # 1. Convert the data to TFRecord format, unless a record file is already configured
        if data_config.get('train.tf_record_path', '') == '':
            train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
            filed_based_convert_examples_to_features(
                train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
        else:
            train_file = data_config.get('train.tf_record_path')
        num_train_size = int(data_config['num_train_size'])
        # 2. Read the TFRecord data and group it into batches
        train_input_fn = file_based_input_fn_builder(
            input_file=train_file,
            seq_length=FLAGS.max_seq_length,
            is_training=True,
            drop_remainder=True)
        estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)

The conversion in step 1 is carried out by the two functions below.

def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode):
    """
    Process one example: map its characters and labels to ids, then pack the result
    into an InputFeatures object.
    """
    label_map = {}
    # Index the labels starting from 1; 0 is used for padding below
    for (i, label) in enumerate(label_list, 1):
        label_map[label] = i
    # Save the label -> index map
    with codecs.open(os.path.join(FLAGS.output_dir, 'label2id.pkl'), 'wb') as w:
        pickle.dump(label_map, w)
    textlist = example.text.split(' ')
    labellist = example.label.split(' ')
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        # Tokenize; for Chinese this splits the text into single characters
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:  # rarely reached for Chinese text
                labels.append("X")
    # tokens = tokenizer.tokenize(example.text)
    # Truncate the sequence
    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]  # -2 leaves room for the [CLS] and [SEP] markers
        labels = labels[0:(max_seq_length - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # mark the start of the sentence with [CLS]
    segment_ids.append(0)
    # append("O") or append("[CLS]") not sure!
    # Using O or [CLS] here makes no difference. O would reduce the number of labels,
    # but marking the sentence start and end with distinct labels is fine too.
    label_ids.append(label_map["[CLS]"])
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label_map[labels[i]])
    ntokens.append("[SEP]")  # mark the end of the sentence with [SEP]
    segment_ids.append(0)
    # append("O") or append("[SEP]") not sure!
    label_ids.append(label_map["[SEP]"])
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)  # convert the tokens (ntokens) to ids
    input_mask = [1] * len(input_ids)
    # label_mask = [1] * len(input_ids)
    # Pad everything up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        # the label of a padding position does not matter
        label_ids.append(0)
        ntokens.append("**NULL**")
        # label_mask.append(0)
    # print(len(input_ids))
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length
    # assert len(label_mask) == max_seq_length

    # Pack the result into an InputFeatures object
    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids,
        # label_mask = label_mask
    )
    # Only takes effect when mode='test'
    write_tokens(ntokens, mode)
    return feature


def filed_based_convert_examples_to_features(
        examples, label_list, max_seq_length, tokenizer, output_file, mode=None
):
    """
    Convert the data to TFRecord format, which the model reads as its input.
    :param examples: the examples
    :param label_list: list of labels
    :param max_seq_length: preset maximum sequence length
    :param tokenizer: the tokenizer object
    :param output_file: output path of the TFRecord file
    :param mode: passed through to convert_single_example
    :return:
    """
    writer = tf.python_io.TFRecordWriter(output_file)
    # Iterate over the examples
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        # Convert each example into features
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
    # Close the writer so that all records are flushed to disk
    writer.close()
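
Because label ids start at 1 and 0 is used for padding, the number of distinct ids is one more than the number of labels, which matches the num_labels=len(label_list) + 1 passed to model_fn_builder. The sketch below checks this on the example sentence; `encode_labels` and the shortened label list are illustrative only.

```python
# Shortened, hypothetical label list for the example sentence.
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
# Index labels from 1 so that 0 stays free for padding.
label_map = {label: i for i, label in enumerate(label_list, 1)}


def encode_labels(labels, max_seq_length):
    """Truncate, add [CLS]/[SEP] label ids, then pad with 0."""
    labels = labels[:max_seq_length - 2]
    ids = [label_map["[CLS]"]] + [label_map[l] for l in labels] + [label_map["[SEP]"]]
    return ids + [0] * (max_seq_length - len(ids))


label_ids = encode_labels(["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"], 10)
num_labels = len(label_list) + 1  # +1 accounts for the padding id 0
print(label_ids)
print(num_labels)
```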

Training the model

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps,
                        hooks=[early_stopping_hook])
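
The post does not show how early_stopping_hook is built, so the following is only a generic illustration of the idea behind early stopping, not the hook used here: training stops once the evaluation loss has failed to improve for a given number of evaluations. The function name `should_stop`, the `patience` parameter and the loss values are all hypothetical.

```python
def should_stop(eval_losses, patience):
    """Return True if the best loss is at least `patience` evaluations old."""
    if len(eval_losses) <= patience:
        return False
    best_index = eval_losses.index(min(eval_losses))
    return len(eval_losses) - 1 - best_index >= patience


history = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]
for step in range(1, len(history) + 1):
    if should_stop(history[:step], patience=3):
        print("stop after evaluation", step)
        break
```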

In summary, the overall flow is shown in the diagram from https://www.jianshu.com/p/b05e50f682dd

That is all for now; I will keep updating this post.
