BERT for Chinese in Practice: Named Entity Recognition

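I have been working on entity recognition tasks for a while. BERT has been popular for some time and I have looked into it a little, so today I am writing down my basic understanding of using BERT for entity recognition, in the hope of discussing it with everyone.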

The official BERT GitHub repository is https://github.com/google-research/bert , which describes the model in detail. For more depth, see the original paper: https://arxiv.org/abs/1810.04805

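BERT can be understood simply as a two-stage NLP model. (1) Pre-training plays a role similar to word embeddings: a model is trained on a corpus that carries no labels at all. (2) Fine-tuning takes the existing pre-trained model and, depending on the task, changes the input and modifies the output part to handle downstream tasks (such as named entity recognition, text classification, similarity computation and so on).
This article modifies the run_classifier.py provided in the official repository to perform named entity recognition.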

The model is BERT + BiLSTM-CRF; the BERT part at the front is used to produce the word vectors.

Below is a brief walkthrough of the main pieces of code.

Data format

張 B-PER
三 I-PER
來 O
自 O
北 B-LOC
京 I-LOC

In the end, the data has to be converted into the input form described in the BERT paper.
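As a rough illustration of that target layout, the sketch below wraps the example sentence in [CLS] and [SEP], builds the mask and segment ids, and pads to a fixed length. The function name `to_bert_layout`, the `[PAD]` filler token and the length of 10 are assumptions made for this illustration, not part of the original code.

```python
def to_bert_layout(chars, labels, max_seq_length=10):
    """Wrap one sentence in [CLS]/[SEP] and pad it to max_seq_length."""
    # Leave room for the two special tokens.
    chars = chars[:max_seq_length - 2]
    labels = labels[:max_seq_length - 2]

    tokens = ["[CLS]"] + chars + ["[SEP]"]
    label_seq = ["[CLS]"] + labels + ["[SEP]"]
    input_mask = [1] * len(tokens)      # 1 marks a real token
    segment_ids = [0] * max_seq_length  # single-sentence input: all zeros

    # Pad on the right; a mask value of 0 marks padding.
    pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * pad
    label_seq += ["[PAD]"] * pad
    input_mask += [0] * pad
    return tokens, label_seq, input_mask, segment_ids


chars = list("張三來自北京")
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
tokens, label_seq, input_mask, segment_ids = to_bert_layout(chars, labels)
for row in zip(tokens, label_seq, input_mask):
    print(row)
```

In the real pipeline the tokens and labels are additionally mapped to integer ids before being written out.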

Data packaging

The code packs all the data into TFRecord form:

    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        # Convert each example into padded id features.
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode)
        # print(feature.input_ids)  #
        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())  # serializes the Example's feature map to a binary string, which saves a lot of space

Reading the data

Read the TFRecord data and assemble it into batches:

train_input_fn = file_based_input_fn_builder(
            input_file=train_file,
            seq_length=FLAGS.max_seq_length,
            is_training=True,
            drop_remainder=True)

This, too, is done mainly through a callback function:

    def input_fn(params):
        batch_size = params["batch_size"]
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)
        d = d.apply(tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder
        ))
        return d

input_file is the saved TFRecord file. d = tf.data.TFRecordDataset(input_file) reads the records, and map_and_batch then decodes them and groups them into batches of batch_size.
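To make the effect of drop_remainder concrete, here is a minimal plain-Python sketch of the batching step. It is only a stand-in for what the tf.data pipeline does; the helper name `make_batches` is an assumption for this illustration.

```python
def make_batches(records, batch_size, drop_remainder):
    """Group records into lists of batch_size, optionally dropping a short tail."""
    batches = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # the last, incomplete batch is discarded
        batches.append(batch)
    return batches


records = list(range(10))
full_only = make_batches(records, batch_size=4, drop_remainder=True)
with_tail = make_batches(records, batch_size=4, drop_remainder=False)
print(full_only)
print(with_tail)
```

With drop_remainder=True every batch has exactly batch_size elements, which is why the training input function above sets it.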

The Estimator wrapper

estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=FLAGS.use_tpu,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=FLAGS.train_batch_size,
        eval_batch_size=FLAGS.eval_batch_size,
        predict_batch_size=FLAGS.predict_batch_size)

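With this wrapper, training, evaluation and prediction are all fairly convenient. The model_fn here is the callback function that defines the model.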

1. The entry point

if __name__ == "__main__":
    flags.mark_flag_as_required("data_dir")
    flags.mark_flag_as_required("task_name")
    flags.mark_flag_as_required("vocab_file")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    tf.app.run()

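The entry point marks several parameters as required:
data_dir: path to the folder that holds the input data
task_name: name of the task
vocab_file: the vocabulary; it normally ships with the downloaded model under the name "vocab.txt"
bert_config_file: configuration of the pre-trained model, also found in the downloaded model folder under the name "bert_config.json"
output_dir: where output files are saved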
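The same idea of required parameters can be sketched with the standard library's argparse. This is only an analogy to the tf.flags calls above, not the script's actual flag handling; the `build_parser` helper and the placeholder paths are assumptions.

```python
import argparse


def build_parser():
    """Declare the five parameters as required command-line options."""
    parser = argparse.ArgumentParser(description="Stand-in for the required flags")
    for name in ("data_dir", "task_name", "vocab_file", "bert_config_file", "output_dir"):
        parser.add_argument("--" + name, required=True)
    return parser


# The paths below are placeholders, not real files.
args = build_parser().parse_args([
    "--data_dir", "data",
    "--task_name", "ner",
    "--vocab_file", "model/vocab.txt",
    "--bert_config_file", "model/bert_config.json",
    "--output_dir", "output",
])
print(args.task_name, args.vocab_file)
```

Leaving out any of the five options makes the parser exit with an error, which mirrors what marking a flag as required is meant to achieve.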

2. The main(_) function

processors = {
        "ner": NerProcessor
    }
task_name = FLAGS.task_name.lower()  
processor = processors[task_name]()

The task_name in the code above is used to select the processor.
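The lookup is ordinary dictionary dispatch, which the following sketch isolates. `ToyProcessor` and `get_processor` are hypothetical names used only for this illustration; the explicit error for an unknown task is likewise an assumption rather than something shown in the original code.

```python
class ToyProcessor:
    """Hypothetical stand-in for NerProcessor."""

    def get_labels(self):
        return ["O", "B-PER", "I-PER"]


processors = {"ner": ToyProcessor}


def get_processor(task_name):
    """Look up the processor class by lower-cased task name and instantiate it."""
    task_name = task_name.lower()
    if task_name not in processors:
        raise ValueError("Task not found: %s" % task_name)
    return processors[task_name]()


processor = get_processor("NER")
print(processor.get_labels())
```

Supporting another task then only requires registering one more class in the dictionary.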

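processor: training or prediction with any model needs a well-defined input, and in the BERT code the processor is what prepares that input. A custom processor must inherit from DataProcessor and override get_labels, which returns the labels, as well as get_train_examples, get_dev_examples and get_test_examples, which return the input examples. These three are called in main when FLAGS.do_train, FLAGS.do_eval and FLAGS.do_predict are set, respectively. Their bodies are nearly identical; the only difference is the path of the file each one reads.
The code of NerProcessor is as follows: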

class NerProcessor(DataProcessor):  # reads the data
    def get_train_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "train.txt")), "train"
        )

    def get_dev_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "dev.txt")), "dev"
        )

    def get_test_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "test.txt")), "test")

    def get_labels(self):

        # 9 BIO labels, plus "X", "[CLS]" and "[SEP]"
        return ["O", "B-dizhi", "I-dizhi", "B-shouduan", "I-shouduan", "B-caiwu", "I-caiwu", "B-riqi", "I-riqi", "X",
                "[CLS]", "[SEP]"]

    def _create_example(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text = tokenization.convert_to_unicode(line[1])
            label = tokenization.convert_to_unicode(line[0])
            examples.append(InputExample(guid=guid, text=text, label=label))
        return examples

The code above mainly reads in the data and inherits from the DataProcessor class. The _read_data() function is implemented in the parent class DataProcessor; the code is shown below:

class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_data(cls, input_file):
        """Reads a BIO data."""
        with codecs.open(input_file, 'r', encoding='utf-8') as f:
            lines = []
            words = []
            labels = []
            for line in f:
                contends = line.strip()
                tokens = contends.split()  # change the split() delimiter to match your corpus
                # print(len(tokens))
                if len(tokens) == 2:
                    word = tokens[0]
                    label = tokens[-1]
                else:
                    if len(contends) == 0:
                        l = ' '.join([label for label in labels if len(label) > 0])
                        w = ' '.join([word for word in words if len(word) > 0])
                        lines.append([l, w])
                        words = []
                        labels = []
                        continue
                if contends.startswith("-DOCSTART-"):
                    words.append('')
                    continue
                words.append(word)
                labels.append(label)

            return lines  # each item is [labels, words]

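_read_data() was rewritten mainly for the NER task. It stores the characters of the input in words and the tags in labels. All the characters of one sentence are joined with spaces into a single string w, the tags are joined the same way into l, and then l and w are appended to lines. The relevant code is: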

l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
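The grouping logic can be tried out on an in-memory list of lines. The sketch below is a simplified variant, not the original function: `read_bio` is a hypothetical name, and unlike the code above it also keeps a final sentence that is not followed by a blank line.

```python
def read_bio(raw_lines):
    """Return a list of [labels, words] pairs, one per sentence."""
    sentences, words, labels = [], [], []
    for raw in raw_lines:
        parts = raw.strip().split()
        if len(parts) == 2:
            words.append(parts[0])
            labels.append(parts[1])
        elif not parts and words:
            # A blank line closes the current sentence.
            sentences.append([" ".join(labels), " ".join(words)])
            words, labels = [], []
    if words:
        # Keep a final sentence that is not followed by a blank line.
        sentences.append([" ".join(labels), " ".join(words)])
    return sentences


sample = ["張 B-PER", "三 I-PER", "來 O", "自 O", "北 B-LOC", "京 I-LOC", "",
          "北 B-LOC", "京 I-LOC"]
result = read_bio(sample)
for sentence_labels, sentence_words in result:
    print(sentence_words, "|", sentence_labels)
```

Each returned pair has the same [labels, words] order that _create_example expects, with line[0] as the label string and line[1] as the text.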

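get_labels(self) returns the labels. On top of the original tags it adds three more: "X", "[CLS]" and "[SEP]". [CLS] marks the start of a sentence and [SEP] marks its end. "X" is for English words that the tokenizer splits into several pieces: every piece except the first is tagged "X".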
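The "X" convention is easiest to see on a small example. The sketch below uses `toy_tokenize`, a hypothetical stand-in that cuts words into fixed-size chunks; it is not the real WordPiece tokenizer and only imitates its "##" continuation prefix.

```python
def toy_tokenize(word, piece_len=3):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels):
    """Give the first piece of each word its label and every later piece "X"."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = toy_tokenize(word)
        tokens.extend(pieces)
        aligned.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, aligned


tokens, aligned = align_labels(["Johnson", "in", "北"], ["B-PER", "O", "B-LOC"])
print(list(zip(tokens, aligned)))
```

A single Chinese character stays in one piece here, so it simply keeps its original tag.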

The code uses the InputExample class:

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text, label=None):
        """Constructs an InputExample: one input example for the BiLSTM-CRF model.
        Args:
          guid: Unique id for the example.
          text: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text = text
        self.label = label

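My understanding is that this is a wrapper around the input data: whatever the task, the input goes through this step so that its format is uniform.
guid is an identifier for the example; its prefix indicates whether the example comes from train, dev or test.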

3. Building the model

    model_fn = model_fn_builder(
        bert_config=bert_config,
        num_labels=len(label_list) + 1,
        init_checkpoint=FLAGS.init_checkpoint,
        learning_rate=FLAGS.learning_rate,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps,
        use_tpu=FLAGS.use_tpu,
        use_one_hot_embeddings=FLAGS.use_tpu)

    estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=FLAGS.use_tpu,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=FLAGS.train_batch_size,
        eval_batch_size=FLAGS.eval_batch_size,
        predict_batch_size=FLAGS.predict_batch_size)

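The returned model_fn is a function that defines the model together with its training and evaluation procedures. Through hook parameters, it also loads the pre-trained BERT parameters to initialize the parameters of our own model.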

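model_fn_builder exists to construct the model_fn function that the framework calls by default. model_fn itself only takes four parameters (features, labels, mode and params), so in order to pass in other parameters, model_fn is wrapped in an outer model_fn_builder.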
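This is the usual closure pattern, shown here stripped of all TensorFlow details. The sketch is a toy under stated assumptions: the returned dictionary and the specific values (13 labels, a learning rate of 5e-5, a batch size of 32) are made up for illustration, whereas a real model_fn would build the graph and return an estimator spec.

```python
def model_fn_builder(num_labels, learning_rate):
    """Capture extra settings and return a function with the fixed signature."""

    def model_fn(features, labels, mode, params):
        # num_labels and learning_rate come from the enclosing scope.
        return {
            "mode": mode,
            "num_labels": num_labels,
            "learning_rate": learning_rate,
            "batch_size": params["batch_size"],
        }

    return model_fn


model_fn = model_fn_builder(num_labels=13, learning_rate=5e-5)
# The framework would normally make this call; here it is done by hand.
spec = model_fn(features={}, labels=None, mode="train", params={"batch_size": 32})
print(spec)
```

The caller only ever sees the four-argument function, while the builder's arguments stay available inside it.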

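This is TensorFlow's newer way of structuring code: the model is defined in a model_fn function, and the Estimator API handles the rest, so the Estimator controls training, prediction, evaluation and so on.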

init_checkpoint is the downloaded pre-trained model.

4. The train() function

    if FLAGS.do_train:
        # 1. Convert the data to TFRecord format
        if data_config.get('train.tf_record_path', '') == '':
            train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
            filed_based_convert_examples_to_features(
                train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
        else:
            train_file = data_config.get('train.tf_record_path')
        num_train_size = int(data_config['num_train_size'])
        # 2. Read the TFRecord data and assemble it into batches
        train_input_fn = file_based_input_fn_builder(
            input_file=train_file,
            seq_length=FLAGS.max_seq_length,
            is_training=True,
            drop_remainder=True)
        estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)


def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode):
    """
    Processes one example: converts its characters and labels to ids and stores them in an InputFeatures object.
    """
    label_map = {}
    # Index the labels starting from 1
    for (i, label) in enumerate(label_list, 1):
        label_map[label] = i
    # Save the label -> index map
    with codecs.open(os.path.join(FLAGS.output_dir, 'label2id.pkl'), 'wb') as w:
        pickle.dump(label_map, w)
    textlist = example.text.split(' ')
    labellist = example.label.split(' ')
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        # Tokenize; for Chinese this splits the text into single characters
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:  # this branch is rarely reached
                labels.append("X")
    # tokens = tokenizer.tokenize(example.text)
    # Truncate the sequence
    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]  # -2 leaves room for the start and end markers
        labels = labels[0:(max_seq_length - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # mark the start of the sentence with [CLS]
    segment_ids.append(0)
    # append("O") or append("[CLS]") not sure!
    label_ids.append(label_map["[CLS]"])  # "O" or "[CLS]" makes no difference here; "O" would mean fewer labels, but tagging the sentence start and end with their own markers is fine too
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label_map[labels[i]])
    ntokens.append("[SEP]")  # mark the end of the sentence with [SEP]
    segment_ids.append(0)
    # append("O") or append("[SEP]") not sure!
    label_ids.append(label_map["[SEP]"])
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)  # convert the tokens in ntokens to ids
    input_mask = [1] * len(input_ids)
    # label_mask = [1] * len(input_ids)
    # Pad with zeros up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        # we are not concerned about the label ids at padded positions
        label_ids.append(0)
        ntokens.append("**NULL**")
        # label_mask.append(0)
    # print(len(input_ids))
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length
    # assert len(label_mask) == max_seq_length

    # Pack everything into an InputFeatures object
    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids,
        # label_mask = label_mask
    )
    # Only takes effect when mode == 'test'
    write_tokens(ntokens, mode)
    return feature


def filed_based_convert_examples_to_features(
        examples, label_list, max_seq_length, tokenizer, output_file, mode=None
):
    """
    Converts the data to TFRecord format, used as input to the model.
    :param examples: the examples
    :param label_list: list of labels
    :param max_seq_length: preset maximum sequence length
    :param tokenizer: tokenizer object
    :param output_file: output path of the TFRecord file
    :param mode:
    :return:
    """
    writer = tf.python_io.TFRecordWriter(output_file)
    # Iterate over the training data
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
        # Convert each training example
        feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature(feature.label_ids)
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())

print(feature.input_ids)

Training the model

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps,
                        hooks=[early_stopping_hook])

For a summary of the overall flow, see https://www.jianshu.com/p/b05e50f682dd

That is all for now; more updates will follow.
