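An algorithm for building semantic trees from sentences in the Stanford Natural Language Inference (SNLI) dataset
I set out to build a gun and got stuck making the bullets. I am currently implementing a computational-semantics model based on the Stanford Natural Language Inference (SNLI) dataset, and building semantic trees from the dataset's sentences, which looked simple, turned out to be surprisingly hard to implement. I eventually got it working, so I am recording the approach here before I forget it.
Problem description
The task is to turn a sentence in the sentence_binary_parse (binary bracketed) format into a semantic tree. With deep learning so dominant, none of the baseline models bothers with structural information such as a semantic tree; they simply feed the word vectors into a neural network in order. So I had to do it myself. Consider the following sentence: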
u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )',
Our goal is to convert it into the following tree structure:
          Children
          /      \
      waving      at
        |           \
and - smiling      camera
This looks like a routine data-structure problem.
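For reference, the bracket structure of such a sentence can be read into nested lists with a short recursive parser. This is only a sketch of the input format, not the semantic tree itself; the name `parse_binary` is made up for illustration, and it assumes tokens are separated by whitespace, as they are in the example.

```python
def parse_binary(sentence):
    """Parse a binary-parse string into nested lists of words."""
    tokens = iter(sentence.split())

    def parse_group():
        # Consume tokens from the shared iterator until the matching ')'.
        items = []
        for token in tokens:
            if token == '(':
                items.append(parse_group())
            elif token == ')':
                return items
            else:
                items.append(token)
        return items

    groups = parse_group()
    # A well-formed sentence has exactly one outermost group.
    return groups[0] if len(groups) == 1 else groups


example = '( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'
print(parse_binary(example))
# ['Children', [[['smiling', 'and'], 'waving'], ['at', 'camera']]]
```

The nested lists keep the bracket nesting exactly as written, so they still have to be rearranged to obtain the tree shown above.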
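Analysis of the difficulties
But a closer look at how to implement it reveals several difficulties:
- A parent node does not necessarily appear before its child nodes (so this differs from the usual problem of reconstructing a binary tree from pre-order/post-order/in-order traversals);
- A single node at a given level may contain more than one word;
- This is not a binary tree, and the number of child nodes is not fixed;
As a result, the approaches I first considered hit bottlenecks during implementation:
- Sequential traversal: this needs an extra table recording which nodes are on each level, but it then runs into the problem of deciding which parent each child node belongs to;
- Recording by path: this is somewhat clearer, defining a path as one top-down trace from the root to a leaf; but merging the paths is complicated, and it still cannot get around parent and child nodes being out of order;
Final approach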
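The ordering difficulty can be seen by printing how deeply each word of the example is nested. This is a small illustrative sketch (the helper `word_depths` is not part of the original code): "smiling" and "and" sit at depth 4 yet come before "waving" at depth 3, which the target tree places above them.

```python
def word_depths(sentence):
    """Return (word, depth) pairs; depth is the number of open brackets around the word."""
    depth = 0
    pairs = []
    for token in sentence.split():
        if token == '(':
            depth += 1
        elif token == ')':
            depth -= 1
        else:
            pairs.append((token, depth))
    return pairs


example = '( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'
for word, depth in word_depths(example):
    print(depth, word)
# 1 Children
# 4 smiling
# 4 and
# 3 waving
# 3 at
# 3 camera
```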
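In the end I noticed a subtle detail. Define a variable that counts brackets, call it the "bracket shift count" $N_{\text{shift}}$, with the following rule:
$$N_{\text{shift}} \leftarrow \begin{cases} N_{\text{shift}} + 1, & \text{on reading '('} \\ N_{\text{shift}} - 1, & \text{on reading ')'} \end{cases}$$
It is that simple a rule. When we find the first new word (that is, the first child node), we define another variable $N'_{\text{shift}}$ that records the current value of $N_{\text{shift}}$, and then keep updating $N_{\text{shift}}$. If we later reach a word where $$N_{\text{shift}} = N'_{\text{shift}}$$ then we have found a new child node! This can be checked against the example sentence: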
u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )',
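A minimal sketch of the counting rule, assuming the count starts at zero right after the root word. The helper `shift_counts` and the variable `n_shift_first` (standing in for the recorded count at the first word) are illustrative names, not part of the original code.

```python
def shift_counts(tokens, start):
    """Return (word, n_shift) for every word after position `start`."""
    n_shift = 0
    counts = []
    for token in tokens[start + 1:]:
        if token == '(':
            n_shift += 1
        elif token == ')':
            n_shift -= 1
        else:
            counts.append((token, n_shift))
    return counts


tokens = '( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'.split()
counts = shift_counts(tokens, tokens.index('Children'))
n_shift_first = counts[0][1]  # the count recorded at the first word found
for word, n_shift in counts:
    print(word, n_shift, n_shift == n_shift_first)
# smiling 3 True
# and 3 True
# waving 2 False
# at 2 False
# camera 2 False
```

Note that "and" also matches, because it sits inside the same bracket as "smiling"; words that directly follow one another need to be merged before the comparison is meaningful.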
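The other potential problem is handling the case where several words form a single node. We may as well call such nodes parallel nodes, which is in fact what they are grammatically. Finally, we can define a recursive algorithm for the problem: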
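class SemancticTree(object):
    """Semantic tree built from one sentence."""
    def __init__(self, sentence):
        super(SemancticTree, self).__init__()
        self.sentence = sentence
        self.ROOT = 'EMPTY'   # placeholder until a root node is assigned
        self.PRONOUN = []

class SemancticTreeNode(object):
    """Node in a SemancticTree."""
    def __init__(self, word):
        super(SemancticTreeNode, self).__init__()
        self.word = word      # one word, or several words joined by spaces
        self.next = []        # child nodes
        self.prev = 'EMPTY'   # parent node; 'EMPTY' when there is none
        self.end_i = 0        # index of this node's last token in the token list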
def make_sentences_trees_pair_list(sentence):
    """
    Resolve a sentence into a semantic tree, then make it into a formula tree;
    u'( They ( are ( smiling ( at ( their parents ) ) ) ) )'
        They
         |
        are
         |
      smiling
         |
         at
          \
         their-parents
    """
    brackets = ['(', ')']
    sentence_arr = sentence.split(' ')
    TREE = []
    START_INDEX = 0    # index of the first word of the root node
    NEXT_IS_WORD = 0   # number of extra words directly following it
    for i in range(len(sentence_arr)):
        if sentence_arr[i] not in brackets:
            START_INDEX = i
            # count the words that immediately follow, without running off the end
            while (i + NEXT_IS_WORD + 1 < len(sentence_arr)
                   and sentence_arr[i + NEXT_IS_WORD + 1] not in brackets):
                NEXT_IS_WORD += 1
            break
    NOW_NODE = sentence_arr[START_INDEX]
    for i in range(1, NEXT_IS_WORD + 1):
        NOW_NODE += ' ' + sentence_arr[START_INDEX + i]
    TREE.append(SemancticTreeNode(NOW_NODE))
    TREE[-1].end_i = START_INDEX + NEXT_IS_WORD
    # continue from the last token of the root node
    return make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, START_INDEX + NEXT_IS_WORD)
def complete_multi_strings_node(sentence_arr, START_INDEX, NOW_NODE):
    """Append the words that directly follow START_INDEX to NOW_NODE.

    Returns the joined words and the index of the last word used.
    """
    brackets = ['(', ')']
    END_INDEX = START_INDEX   # a single-word node ends where it starts
    for i in range(START_INDEX + 1, len(sentence_arr)):
        if sentence_arr[i] in brackets:
            break
        NOW_NODE += ' ' + sentence_arr[i]
        END_INDEX = i
    return NOW_NODE, END_INDEX
def make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, START_INDEX):
    """Collect into TREE the nodes found after token START_INDEX (the last
    token of the parent node), then recurse after each node found.
    Nodes are only appended to TREE; prev/next links are not set here."""
    brackets = ['(', ')']
    SHIFT_BRACKET_NUM = 0    # bracket shift count, relative to START_INDEX
    SUB_BRACKET_NUM = None   # shift count recorded at the first sub-node
    subnodes_of_this_root = []
    INDEX = START_INDEX + 1
    while INDEX < len(sentence_arr):
        if sentence_arr[INDEX] == '(':
            SHIFT_BRACKET_NUM += 1
        elif sentence_arr[INDEX] == ')':
            SHIFT_BRACKET_NUM -= 1
        # the first word starts the first sub-node; a later word with the
        # same shift count starts another sub-node
        elif SUB_BRACKET_NUM is None or SUB_BRACKET_NUM == SHIFT_BRACKET_NUM:
            SUB_BRACKET_NUM = SHIFT_BRACKET_NUM
            TREE.append(SemancticTreeNode(sentence_arr[INDEX]))
            TREE[-1].word, TREE[-1].end_i = complete_multi_strings_node(sentence_arr, INDEX, TREE[-1].word)
            subnodes_of_this_root.append(TREE[-1])
            INDEX = TREE[-1].end_i   # skip the words already merged into this node
        INDEX += 1
    for SUB_NODE in subnodes_of_this_root:
        # recursion entry point: continue after the last token of each sub-node
        TREE = make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, SUB_NODE.end_i)
    return TREE
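The parallel-node idea can also be tried on its own, as a pre-processing step that merges consecutive words before any tree is built. The sketch below is one possible way to do it with `itertools.groupby`; the function name `group_parallel_words` is hypothetical, and it assumes that every run of adjacent words should become a single node, which is the same assumption `complete_multi_strings_node` makes.

```python
from itertools import groupby


def group_parallel_words(sentence):
    """Join runs of consecutive words into single node labels, keeping brackets."""
    brackets = ('(', ')')
    grouped = []
    # groupby splits the tokens into alternating runs of brackets and words.
    for is_bracket, run in groupby(sentence.split(), key=lambda t: t in brackets):
        if is_bracket:
            grouped.extend(run)            # keep every bracket as its own token
        else:
            grouped.append(' '.join(run))  # merge adjacent words into one label
    return grouped


print(group_parallel_words('( They ( are ( smiling ( at ( their parents ) ) ) ) )'))
# ['(', 'They', '(', 'are', '(', 'smiling', '(', 'at', '(', 'their parents', ')', ')', ')', ')', ')']
```

After this step every non-bracket token is a complete node label, so the bracket shift count can be compared per token without a separate multi-word lookup.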