[NLP Primer Series] Loading and Preprocessing Data: The Cornell Movie-Dialogs Corpus as an Example

Author: Yirong Chen from South China University of Technology
My CSDN Blog: https://blog.csdn.net/m0_37201243
My Homepage: http://www.yirongchen.com/

Dependencies:

  • Python: 3.6.9


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math

Example 1: The Cornell Movie-Dialogs Corpus

The Cornell Movie-Dialogs Corpus is a rich dataset of movie-character dialogues:

  • 220,579 conversational exchanges between 10,292 pairs of movie characters
  • 9,035 characters from 617 movies
  • 304,713 utterances in total

This dataset is large and diverse, with great variation in language style, time period, sentiment, and so on. We hope this diversity makes our model robust to many forms of input and query.

1. Download the dataset

### Download the dataset
import os
import requests

print("downloading Cornell Movie-Dialogs Corpus數據集")
data_url = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"

path = "./data/"
if not os.path.exists(path):
    os.makedirs(path)

res = requests.get(data_url)
with open("./data/cornell_movie_dialogs_corpus.zip", "wb") as fp:
    fp.write(res.content)
print("Cornell Movie-Dialogs Corpus數據集下載完畢!")
downloading Cornell Movie-Dialogs Corpus數據集
Cornell Movie-Dialogs Corpus數據集下載完畢!

2. Extract the dataset

import time
import zipfile

srcfile = "./data/cornell_movie_dialogs_corpus.zip"

with zipfile.ZipFile(srcfile, 'r') as zip_file:
    zip_file.extractall(path)
print('Finished extracting cornell_movie_dialogs_corpus.zip!')
print("The Cornell Movie-Dialogs Corpus contains the following files:")
corpus_file_list=os.listdir("./data/cornell movie-dialogs corpus")
print(corpus_file_list)
Finished extracting cornell_movie_dialogs_corpus.zip!
The Cornell Movie-Dialogs Corpus contains the following files:
['formatted_movie_lines.txt', 'chameleons.pdf', '.DS_Store', 'README.txt', 'movie_conversations.txt', 'movie_lines.txt', 'raw_script_urls.txt', 'movie_characters_metadata.txt', 'movie_titles_metadata.txt']

3. Inspect the first few lines of each file

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)
corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join("data", corpus_name)
corpus_file_list = os.listdir(corpus)
for file_name in corpus_file_list:
    file_dir = os.path.join(corpus, file_name)
    print(file_dir, "- first 10 lines:")
    printLines(file_dir)

The output of this step is omitted from this post.

Note: movie_lines.txt is the key data file. In practice, when we find a dataset, we can usually learn how it is organized from its homepage, its source, or the accompanying paper; at the very least, we should know which files make up the dataset.
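For reference, each line of movie_lines.txt stores a single utterance, with the fields (lineID, characterID, movieID, character name, text) separated by the marker " +++$+++ ", roughly like this:

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!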

4. Create a formatted data file

The following functions help parse the raw movie_lines.txt data file.

  • loadLines: splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
  • loadConversations: groups the lines from loadLines into conversations based on movie_conversations.txt
  • extractSentencePairs: extracts sentence pairs from the conversations
# Split each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines
# Group the lines from `loadLines` into conversations based on *movie_conversations.txt*
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            lineIds = eval(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations
# Extract pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
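As a quick sanity check of extractSentencePairs, here is a tiny hand-built conversation (my own toy example, not taken from the corpus) and the pairs it yields:

conversations_demo = [{"lines": [{"text": "Hi."}, {"text": "Hello!"}, {"text": "How are you?"}]}]
print(extractSentencePairs(conversations_demo))
# [['Hi.', 'Hello!'], ['Hello!', 'How are you?']]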

Note: the following code uses the functions defined above to create the formatted data file.

import csv
import codecs
# Define the path of the new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'

delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize the lines dict, conversations list, and field IDs
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write the new CSV file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)
Processing corpus...

Loading conversations...

Writing newly formatted file...

Sample lines from file:
b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.\n'
b"Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.\tThat's a shame.\n"
b'Gosh, if only we could find Kat a boyfriend...\tLet me see what I can do.\n'

5. Load and clean the data

# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

We create a Voc class that stores a mapping from words to indices, a reverse mapping from indices to words, a count for each word, and the total word count. The class provides methods for adding a word to the vocabulary (addWord), adding all words of a sentence (addSentence), and trimming infrequent words (trim). More cleaning is done later.

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD
    # Add all words in a sentence to the vocabulary
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)
    # Add a word to the vocabulary
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a given count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize the dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3 # Count default tokens

        for word in keep_words:
            self.addWord(word)
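A minimal usage sketch of the Voc class (the demo sentence is my own, not from the corpus): add one sentence and inspect the counters.

voc_demo = Voc("demo")
voc_demo.addSentence("hello world hello")
print(voc_demo.num_words)            # 5: the 3 default tokens plus "hello" and "world"
print(voc_demo.word2count["hello"])  # 2
print(voc_demo.index2word[3])        # hello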

We use unicodeToAscii to convert Unicode strings to ASCII. Next we convert all letters to lowercase and strip out all non-letter characters except basic punctuation (normalizeString). Finally, to help training converge, we filter out sentence pairs longer than MAX_LENGTH (filterPairs).

# Convert a Unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
# normalizeString cleans a string with regular expressions to make the data more uniform
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
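# Quick sanity check with an example sentence of my own: normalizeString
# lowercases, pads basic punctuation with spaces, and strips other characters.
print(normalizeString("Aren't   you   coming?!"))  # -> aren t you coming ? !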
MAX_LENGTH = 10  # Maximum sentence length to consider

# Initialize the Voc object and store the formatted conversation pairs in a list
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Return True only if both sentences in pair 'p' are below the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Keep only the pairs that satisfy the length condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated Voc object and a list of pairs
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs

# Load/assemble the voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs for validation
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 63446 sentence pairs
Counting words...
Counted words: 17774

pairs:
['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', ' the real you . ']

Another strategy that helps training converge faster is removing rarely used words from the vocabulary. Shrinking the feature space also makes the objective function easier for the model to learn. We do this in two steps:

  • Use voc.trim to remove words below the MIN_COUNT threshold.
  • If a sentence contains any such low-frequency word, filter out the whole pair.
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used fewer than MIN_COUNT times from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check the input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check the output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Keep only pairs where neither the input nor the output sentence contains a trimmed word
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim the voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)
keep_words 7706 / 17771 = 0.4336
Trimmed from 63446 pairs to 52456, 0.8268 of total
print("pairs類型:", type(pairs))
print("pairs的Size:", len(pairs))
print("pairs前10個元素:", pairs[0:10])
pairs類型: <class 'list'>
pairs的Size: 52456
pairs前10個元素: [['there .', 'where ?'], ['you have my word . as a gentleman', 'you re sweet .'], ['hi .', 'looks like things worked out tonight huh ?'], ['have fun tonight ?', 'tons'], ['well no . . .', 'then that s all you had to say .'], ['then that s all you had to say .', 'but'], ['but', 'you always been this selfish ?'], ['do you listen to this crap ?', 'what crap ?'], ['what good stuff ?', ' the real you . '], ['wow', 'let s go .']]

Note: in practice, in Python, once cleaning is done and before the data is converted to numbers, it almost always ends up as a list of lists:

[
[sample 1],
[sample 2],
[sample 3],
...,
[sample n],
]
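As a hedged sketch of that final conversion (indexesFromSentence here is a hypothetical helper, not defined earlier in this post), each sentence of a pair would be mapped to word indices via voc.word2index, with EOS_token appended:

# Hypothetical helper: turn a cleaned sentence into a list of word indices
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

print(indexesFromSentence(voc, pairs[0][0]))  # indices for 'there .' plus EOS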

[About the author] Chen Yirong is a PhD student at the Guangdong Engineering Technology Research Center for Human Body Data Science, School of Electronic and Information Engineering, South China University of Technology, and serves as a reviewer for IEEE Access and IEEE Photonics Journal. He is a two-time first-prize winner of the Mathematical Contest in Modeling (MCM), won first prize in the 2017 China Undergraduate Mathematical Contest in Modeling (Guangdong Division) and first prize in the 2018 Guangdong Undergraduate Electronic Design Contest, led a 2017-2019 national undergraduate innovation training project rated excellent at completion, participated in two Guangdong undergraduate science and technology innovation cultivation projects and a 2018-2019 national undergraduate innovation training project rated good at completion, has published 4 SCI papers, holds 8 granted utility model patents, and has 13 invention patent applications under examination.
My Homepage
My GitHub
My CSDN Blog
My LinkedIn
