A Semantic Tree Construction Algorithm for Sentences in the Stanford Natural Language Inference (SNLI) Dataset

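I set out to build a gun and got stuck making the bullets. I am currently implementing a computational-semantics model based on the Stanford Natural Language Inference (SNLI) dataset, and building semantic trees from the dataset's sentences, which looked simple, turned out to be surprisingly hard. I eventually got it working, so I am writing the idea down here before I forget it.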

Problem description

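The task is to turn a sentence in the sentence_binary_parse (binary bracketing) format into a semantic tree. With deep learning so dominant, the baseline models show no interest in structural information such as a semantic tree; they simply feed the word vectors into a neural network in order. So I had to do it myself. Consider the following sentence: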

u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )', 

Our goal is to convert it into the following tree structure:

                Children                                 
                /        \
            waving        at
            |              \
     and - smiling         camera

This looks like a routine data-structure algorithm problem.
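As a starting point, the parse string can be split into bracket and word tokens. The snippet below is a minimal sketch: the JSON line is a hypothetical, heavily abbreviated SNLI record (real records carry more fields), and it assumes tokens are separated by single spaces, as in the example sentence.

```python
import json

# Hypothetical, abbreviated record; only the field used in this post is kept.
line = ('{"sentence1_binary_parse": '
        '"( Children ( ( ( smiling and ) waving ) ( at camera ) ) )"}')

record = json.loads(line)

# Every bracket and every word becomes one token.
tokens = record['sentence1_binary_parse'].split(' ')
print(tokens)
```

Each opening bracket has a matching closing bracket, so the two counts in the token list are equal.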

Analysis of the difficulties

But a closer look at how to implement it reveals several difficulties:

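  • A parent node does not necessarily appear before its child nodes (which makes this different from rebuilding a binary tree from pre-order/post-order/in-order traversals);
  • A single node at one level may consist of more than one word;
  • This is not a binary tree, and the number of child nodes is not fixed;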
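The second point can be made concrete with a small sketch. The helper merge_words below is hypothetical (it is not part of the original code); it joins consecutive word tokens into a single entry, which is one way to treat several words as one node.

```python
def merge_words(tokens):
    """Join runs of consecutive word tokens into one space-separated entry."""
    merged = []
    prev_was_word = False
    for token in tokens:
        is_word = token not in ('(', ')')
        if is_word and prev_was_word:
            merged[-1] += ' ' + token   # extend the current multi-word entry
        else:
            merged.append(token)
        prev_was_word = is_word
    return merged

tokens = '( They ( are ( smiling ( at ( their parents ) ) ) ) )'.split(' ')
merged = merge_words(tokens)
print(merged)
```

Here "their" and "parents" end up in a single entry, while every other word stays on its own.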

As a result, the approaches I first considered ran into trouble during implementation:

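  • Sequential traversal: this needs an extra table recording which nodes sit on each level, and it still runs into the problem of deciding which parent each child node belongs to;
  • Path recording: this is somewhat clearer, with a path defined as one top-down trace from the root to a leaf; merging the paths is complicated, though, and it still cannot get around parents and children appearing out of order;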

Final approach

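In the end I noticed a subtle detail. Define a bracket-counting variable $N_{\text{shift}}$, which I call the "bracket shift count", and update it for each token $s$ according to the following rule: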

$$\begin{cases} N_{\text{shift}} = N_{\text{shift}}+1, & s = '(',\\ N_{\text{shift}} = N_{\text{shift}}-1, & s = ')', \end{cases}$$

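The rule really is that simple. When we reach the first new word (that is, the first child node), we store the current value of $N_{\text{shift}}$ in a second variable $N'_{\text{shift}}$ and keep updating $N_{\text{shift}}$. Whenever a later word appears while $N_{\text{shift}} = N'_{\text{shift}}$, we have found another child node. This can be checked against the example sentence: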

u'sentence1_binary_parse': u'( Children ( ( ( smiling and ) waving ) ( at camera ) ) )', 
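To check the rule by hand, the sketch below prints the bracket shift count of every word in the example sentence. The helper shift_counts is a hypothetical name, and for simplicity it counts from the start of the sentence rather than from a particular parent word.

```python
def shift_counts(parse):
    """Return (word, N_shift) pairs for a space-separated binary parse."""
    n_shift = 0
    pairs = []
    for token in parse.split(' '):
        if token == '(':
            n_shift += 1
        elif token == ')':
            n_shift -= 1
        else:
            pairs.append((token, n_shift))
    return pairs

example = '( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'
counts = shift_counts(example)
for word, n in counts:
    print(word, n)
```

With this counting, "waving" and "at" share the count 3, which is what marks them as nodes on the same level, while "smiling" and "and" share the count 4.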

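Another potential problem is a node made up of several words. We can call such a node a parallel node, which also matches the grammar. With that, we can define a recursive algorithm for the problem: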

BRACKETS = ('(', ')')

class SemancticTree(object):
    """Semantic tree"""
    def __init__(self, sentence):
        super(SemancticTree, self).__init__()
        self.sentence = sentence
        self.ROOT     = 'EMPTY'
        self.PRONOUN  = []

class SemancticTreeNode(object):
    """Node in SemancticTree"""
    def __init__(self, word):
        super(SemancticTreeNode, self).__init__()
        self.word   = word      # one or more words joined by spaces
        self.next   = []        # child nodes
        self.prev   = 'EMPTY'   # parent node, 'EMPTY' for the root
        self.end_i  = 0         # index in sentence_arr of this node's last word

def make_sentences_trees_pair_list(sentence):
    """
    Resolving a sentence into a semantic tree, then make it into a formula tree;
        u'( They ( are ( smiling ( at ( their parents ) ) ) ) )'

                                                   They
                                                   |
                                                   are
                                                   |
                                                   smiling
                                                   |
                                                   at
                                                    \\
                                                    their-parents

    Returns the list of nodes, root first.
    """
    sentence_arr = sentence.split(' ')
    TREE = []
    # The first word (plus any words directly after it) becomes the root.
    for i in range(len(sentence_arr)):
        if sentence_arr[i] not in BRACKETS:
            ROOT = SemancticTreeNode(sentence_arr[i])
            ROOT.word, ROOT.end_i = complete_multi_strings_node(sentence_arr, i, ROOT.word)
            TREE.append(ROOT)
            return make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, ROOT.end_i, ROOT)
    return TREE

def complete_multi_strings_node(sentence_arr, START_INDEX, NOW_NODE):
    """Append the words that directly follow START_INDEX to NOW_NODE (a parallel node)."""
    END_INDEX = START_INDEX
    for i in range(START_INDEX + 1, len(sentence_arr)):
        if sentence_arr[i] in BRACKETS:
            break
        NOW_NODE += ' ' + sentence_arr[i]
        END_INDEX = i
    return NOW_NODE, END_INDEX

def make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, START_INDEX, PARENT=None):
    """
    Collect the sub-nodes that follow START_INDEX, then recurse into each of them.
    The scan only moves forward and stops once the bracket holding the parent closes.
    """
    SHIFT_BRACKET_NUM = 0       # bracket shift count, relative to START_INDEX
    SUB_BRACKET_NUM = None      # shift count recorded at the first sub-node
    subnodes_of_this_root = []
    INDEX = START_INDEX + 1
    while INDEX < len(sentence_arr):
        if sentence_arr[INDEX] == '(':
            SHIFT_BRACKET_NUM += 1
        elif sentence_arr[INDEX] == ')':
            SHIFT_BRACKET_NUM -= 1
            if SHIFT_BRACKET_NUM < 0:
                break
        else:
            # the first word found fixes the shift count of the sub-nodes
            if SUB_BRACKET_NUM is None:
                SUB_BRACKET_NUM = SHIFT_BRACKET_NUM
            # a word at the same shift count is another sub-node
            if SHIFT_BRACKET_NUM == SUB_BRACKET_NUM:
                NODE = SemancticTreeNode(sentence_arr[INDEX])
                NODE.word, NODE.end_i = complete_multi_strings_node(sentence_arr, INDEX, NODE.word)
                if PARENT is not None:
                    NODE.prev = PARENT
                    PARENT.next.append(NODE)
                TREE.append(NODE)
                subnodes_of_this_root.append(NODE)
                INDEX = NODE.end_i
        INDEX += 1
    for SUB_NODE in subnodes_of_this_root:
        # recursion entry point
        TREE = make_sentences_trees_from_sentence(sentence, sentence_arr, TREE, SUB_NODE.end_i, SUB_NODE)
    return TREE
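For sentences where a head follows its children, such as "waving" after "( smiling and )", one possible alternative is to parse the brackets into nested groups first and only then pick heads. The sketch below is not the code from this post. It rests on an assumption of mine: the words sitting directly inside a bracket pair form the head of that pair, and a pair with no direct words hands its nodes up to the enclosing pair. The names Node, parse_groups, build and show are hypothetical, and the input is assumed to have balanced brackets.

```python
class Node(object):
    """A tree node; word may hold several words (a parallel node)."""
    def __init__(self, word):
        self.word = word
        self.children = []

def parse_groups(tokens):
    """Turn the token list into nested lists, one list per bracket pair."""
    stack = [[]]
    for token in tokens:
        if token == '(':
            stack.append([])
        elif token == ')':
            group = stack.pop()
            stack[-1].append(group)
        else:
            stack[-1].append(token)
    return stack[0]

def build(group):
    """Return the list of nodes that a bracket group contributes to its parent."""
    words = [item for item in group if not isinstance(item, list)]
    subnodes = []
    for item in group:
        if isinstance(item, list):
            subnodes.extend(build(item))
    if not words:
        return subnodes             # no head here: hand the sub-nodes upward
    head = Node(' '.join(words))    # direct words form one (parallel) head node
    head.children = subnodes
    return [head]

def show(node, indent=0):
    """Return the tree as indented lines, one node per line."""
    lines = [' ' * indent + node.word]
    for child in node.children:
        lines.extend(show(child, indent + 2))
    return lines

parse = '( Children ( ( ( smiling and ) waving ) ( at camera ) ) )'
roots = build(parse_groups(parse.split(' ')))
for root in roots:
    print('\n'.join(show(root)))
```

Under this assumption "( at camera )" comes out as a single parallel node "at camera" rather than "at" with a child "camera", so the result differs from the hand-drawn tree in that one place. A sentence whose outermost bracket has no direct word yields several roots instead of one.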
    