An Algorithm for Building Semantic Trees from Sentences in the Stanford Natural Language Inference (SNLI) Dataset

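I set out to build a gun and got stuck making the bullets. I am currently implementing a computational-semantics model on top of the Stanford Natural Language Inference (SNLI) dataset, and building semantic trees from the dataset's sentences, which looked trivial, turned out to be surprisingly hard to implement. I finally got it working, so I am writing the idea down here before I forget it.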

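Problem description

The task is simply to turn a sentence in the sentence_binary_parse (binary bracketing) format into a semantic tree. With deep learning so dominant, none of the baseline models I looked at bother with structural information such as a "semantic tree"; they just feed the word vectors into a neural network in order. So I had to do it myself. Consider the following sentence: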

u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )', 

Our goal is to convert it into the following tree structure:

                Children                                 
                /        \
            waving        at
            |              \
     and - smiling         camera

This looks like a routine data-structure problem.
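Reading the bracketing itself really is routine. As a minimal sketch (the helper name `parse_binary` is hypothetical, not part of SNLI or of the code further down), a stack-based pass turns the string into nested Python lists:

```python
def parse_binary(sentence):
    """Turn a space-separated bracketed parse into nested lists (illustrative only)."""
    stack = [[]]                      # stack[-1] is the constituent being filled
    for token in sentence.split():
        if token == '(':
            stack.append([])          # open a new constituent
        elif token == ')':
            done = stack.pop()        # close it and attach it to its parent
            stack[-1].append(done)
        else:
            stack[-1].append(token)   # plain word
    return stack[0][0]

nested = parse_binary('( Children ( ( ( smiling and ) waving ) ( at camera ) ) )')
print(nested)
# ['Children', [[['smiling', 'and'], 'waving'], ['at', 'camera']]]
```

This only recovers the bracketing. It says nothing about which word is the parent of which, and that is the part that causes trouble.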

Analysis of the difficulties

On closer inspection, once you think about the implementation, several difficulties show up:

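  • A parent node does not necessarily appear before its child nodes (so this differs from the usual problem of rebuilding a binary tree from pre-order/post-order/in-order traversals);
  • A single node on a given level may contain more than one word;
  • This is not a binary tree: the number of children per node is not fixed.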

As a result, the approaches I first considered hit a wall during implementation:

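  • Sequential traversal: needs an extra table recording which nodes sit on each level, and still runs into the problem of deciding which parent a child belongs to;
  • Path recording: somewhat clearer. Each top-down trajectory from the root to a leaf is defined as a path. But merging the paths is complicated, and it still cannot get around parents and children appearing out of order.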

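Final approach

In the end I noticed a subtle detail. Define a bracket-counting variable $N_{\text{shift}}$, call it the "bracket drift count", and update it for each token $s$ as follows:

$$
\begin{cases}
N_{\text{shift}} = N_{\text{shift}} + 1, & s = \texttt{'('},\\
N_{\text{shift}} = N_{\text{shift}} - 1, & s = \texttt{')'}.
\end{cases}
$$

That is the whole rule. When the first new word is found (that is, the first child node), define a second variable $N'_{\text{shift}}$ that records the current value of $N_{\text{shift}}$. Keep updating $N_{\text{shift}}$; whenever a word appears with $N_{\text{shift}} = N'_{\text{shift}}$, we have found another child node. This can be checked against the example sentence: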

u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )', 
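To make the check concrete, here is a small sketch that traces $N_{\text{shift}}$ over that sentence. The function name `shift_counts` is made up for this illustration, and the count starts at zero from the beginning of the sentence:

```python
def shift_counts(sentence):
    """Return (word, N_shift) for every word, updating N_shift on each bracket."""
    n_shift = 0
    result = []
    for s in sentence.split():
        if s == '(':
            n_shift += 1
        elif s == ')':
            n_shift -= 1
        else:
            result.append((s, n_shift))   # record the count at which the word appears
    return result

for word, n in shift_counts('( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'):
    print(word, n)
# Children 1
# smiling 4
# and 4
# waving 3
# at 3
# camera 3
```

In this trace `waving`, `at` and `camera` all appear at count 3, while `smiling` and `and` sit one level deeper at count 4.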

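Another potential problem is the case where several words form a single node. Let us call such a node a parallel node; grammatically that is what it is. With that, we can define a recursive algorithm for the problem: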

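class SemancticTree(object):
    """Semantic tree built from one sentence"""
    def __init__(self, sentence):
        super(SemancticTree, self).__init__()
        self.sentence = sentence
        self.ROOT     = 'EMPTY'   # root node, 'EMPTY' until it is set
        self.PRONOUN  = []

class SemancticTreeNode(object):
    """Node in a SemancticTree"""
    def __init__(self, word):
        super(SemancticTreeNode, self).__init__()
        self.word   = word      # one or more words joined by spaces
        self.next   = []        # child nodes
        self.prev   = 'EMPTY'   # parent node, 'EMPTY' when there is none
        self.end_i  = 0         # index of the node's last word in the token list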

def make_sentences_trees_pair_list(sentence):
    """
    Resolve a sentence into a semantic tree, then make it into a formula tree;
        u'( They ( are ( smiling ( at ( their parents ) ) ) ) )'

                                                   They
                                                   |
                                                   are
                                                   |
                                                   smiling
                                                   |
                                                   at
                                                    \
                                                    their-parents
    """
    brackets = ['(', ')']
    sentence_arr = sentence.split(' ')
    TREE         = []
    START_INDEX  = 0
    NEXT_IS_WORD = 0    # number of extra words that belong to the root node
    NOW_NODE     = ''
    for i in range(len(sentence_arr)):
        if sentence_arr[i] not in brackets:
            # the first word of the sentence starts the root node
            START_INDEX = i
            NOW_NODE    = sentence_arr[START_INDEX]
            while (i + NEXT_IS_WORD + 1 < len(sentence_arr)
                   and sentence_arr[i + NEXT_IS_WORD + 1] not in brackets):
                NEXT_IS_WORD += 1
            break
    # append the remaining words of a multi-word root
    for i in range(1, NEXT_IS_WORD + 1):
        NOW_NODE += ' ' + sentence_arr[START_INDEX + i]
    TREE.append(SemancticTreeNode(NOW_NODE))
    TREE[-1].end_i = START_INDEX + NEXT_IS_WORD
    return make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, START_INDEX + NEXT_IS_WORD)

def complete_multi_strings_node(sentence_arr, START_INDEX, NOW_NODE):
    """Append the words that directly follow START_INDEX to NOW_NODE (a parallel node)."""
    brackets = ['(', ')']
    END_INDEX = START_INDEX    # a single-word node ends where it starts
    for i in range(START_INDEX + 1, len(sentence_arr)):
        if sentence_arr[i] in brackets:
            break
        NOW_NODE += ' ' + sentence_arr[i]
        END_INDEX = i
    return NOW_NODE, END_INDEX

def make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, START_INDEX, PARENT=None):
    """Collect the sub-nodes that follow START_INDEX, then recurse into each of them."""
    # Assumption: when no parent is passed in, the most recently added node is the parent.
    if PARENT is None:
        PARENT = TREE[-1]
    SHIFT_BRACKET_NUM = 0      # N_shift, counted relative to the parent node
    SUB_BRACKET_NUM   = None   # N'_shift, fixed when the first sub-node is found
    subnodes_of_this_root = []
    INDEX = START_INDEX + 1    # START_INDEX is the last word of the parent node
    while INDEX < len(sentence_arr):
        token = sentence_arr[INDEX]
        if token == '(':
            SHIFT_BRACKET_NUM += 1
        elif token == ')':
            SHIFT_BRACKET_NUM -= 1
        else:
            # the first word found fixes N'_shift
            if SUB_BRACKET_NUM is None:
                SUB_BRACKET_NUM = SHIFT_BRACKET_NUM
            # every word at that same count is a sub-node of this root
            if SHIFT_BRACKET_NUM == SUB_BRACKET_NUM:
                node = SemancticTreeNode(token)
                node.word, node.end_i = complete_multi_strings_node(sentence_arr, INDEX, node.word)
                node.prev = PARENT
                PARENT.next.append(node)
                TREE.append(node)
                subnodes_of_this_root.append(node)
                INDEX = node.end_i    # skip the remaining words of a multi-word node
        INDEX += 1
    for SUB_NODE in subnodes_of_this_root:
        # recursion entry point: continue from the last word of each sub-node
        TREE = make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, SUB_NODE.end_i, SUB_NODE)
    return TREE
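
The parallel-node idea can also be looked at in isolation. The following is a stand-alone sketch, not part of the listing above (the name `group_parallel_nodes` is hypothetical): it merges every run of adjacent words into one node label and ignores the tree structure entirely.

```python
def group_parallel_nodes(sentence):
    """Merge runs of adjacent words into one node label (illustrative only)."""
    nodes, current = [], []
    for token in sentence.split():
        if token in ('(', ')'):
            if current:                       # a bracket ends the current run of words
                nodes.append(' '.join(current))
                current = []
        else:
            current.append(token)
    if current:                               # flush a trailing run, if any
        nodes.append(' '.join(current))
    return nodes

print(group_parallel_nodes('( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'))
# ['Children', 'smiling and', 'waving', 'at camera']
```

Here `smiling and` and `at camera` each come out as a single label, which is the multi-word case the node-completion step has to handle.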
    