Random Text Generation: Order-k Markov Chain Text Generation

Original

baiwen1979

2020-02-20 19:28

In the code below, k = 2:

    int k = 2;
    char inputchars[5000000];
    char *word[1000000];
    int nword = 0;
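First, the algorithm scans the whole input text and records every word. The array word acts as a kind of suffix array pointing into the characters, except that its entries start only at word boundaries. The variable nword holds the number of words. The following code reads the input: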
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        /* the next word starts right after this word's terminating null */
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
This appends each word of the file to inputchars, and each word is terminated by the null character that scanf supplies.
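To make the memory layout concrete, here is a minimal sketch of the same loop. It assumes the text comes from a string (read with sscanf and %n) instead of stdin, uses much smaller buffers than above, and wraps the loop in a hypothetical helper named load_words.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXCHARS 1000   /* small illustrative sizes, not the ones used above */
#define MAXWORDS 100

char inputchars[MAXCHARS];
char *word[MAXWORDS];
int nword = 0;

/* Hypothetical helper: the same loop as above, but fed from a string
   through sscanf so the result is easy to inspect. */
void load_words(const char *text)
{
    int used = 0;

    assert(strlen(text) + 1 < sizeof inputchars);
    nword = 0;
    word[0] = inputchars;
    /* %n records how many characters sscanf consumed so far */
    while (sscanf(text, "%s%n", word[nword], &used) == 1) {
        text += used;
        word[nword + 1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
        assert(nword < MAXWORDS - 1);
    }
}
```

After load_words("of the people"), word[0], word[1] and word[2] point at "of", "the" and "people" inside inputchars, and word[1] is exactly 3 bytes past word[0] (two letters plus the null).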
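Second, after reading the input, we sort the word array so that all pointers to the same sequence of k words end up next to each other. The sort relies on the following comparison function: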
    /* Compare the first k words of p and q; 0 means they match. */
    int wordncmp(char *p, char *q)
    {
        int n = k;
        for ( ; *p == *q; p++, q++)
            if (*p == 0 && --n == 0)
                return 0;
        return *p - *q;
    }
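While the characters are equal, it keeps scanning both strings. Each time it reaches a null character it decrements the counter n, and it returns 0 (equal) once k identical words have been seen. When it finds differing characters, it returns their difference (*p - *q).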
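To see the comparison at work, here is a small self-contained sketch. It assumes the words sit back to back with null separators, as they do in inputchars; the const qualifiers and the assert on k are additions made for illustration.

```c
#include <assert.h>

int k = 2;  /* order of the chain, as in the text */

/* Equal (0) if the first k null-terminated words of p and q match. */
int wordncmp(const char *p, const char *q)
{
    int n = k;

    assert(k > 0);  /* with k < 1 the counter would never reach 0 */
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}
```

With k = 2, the buffers "the\0people\0by" and "the\0people\0for" compare equal, because only the first two words are examined. The buffers "the\0people,\0by" and "the\0people\0for" differ at the comma, so the result is positive.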

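After reading the input, we append k null characters after the last word (so the comparison function never runs past the end of the string), print the first k words of the document (to start the random output), and call the sort: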
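    for (i = 0; i < k; i++)
        word[nword][i] = 0;       /* k trailing nulls */
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);  /* the first k words start the output */
    /* sortcmp is assumed to be a qsort-compatible wrapper around wordncmp */
    qsort(word, nword, sizeof(word[0]), sortcmp);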
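The qsort call passes sortcmp, which is not shown here. A minimal sketch of what such a wrapper might look like follows, assuming it only has to unwrap the char ** elements that qsort hands to a comparator; sort_words is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>

int k = 1;

int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* qsort passes pointers to the array elements, i.e. pointers to char *,
   so dereference once before comparing the words themselves. */
int sortcmp(const void *a, const void *b)
{
    const char *p = *(char * const *)a;
    const char *q = *(char * const *)b;
    return wordncmp(p, q);
}

/* Hypothetical helper: sort an array of word pointers by their first k words. */
void sort_words(char **word, int nword)
{
    assert(nword >= 0);
    qsort(word, (size_t)nword, sizeof word[0], sortcmp);
}
```

Only the pointers move during the sort; the characters in the buffer stay where they are, which is what keeps the structure compact.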
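This space-efficient data structure now contains a great deal of information about the k-grams in the text. If k is 1 and the input text is "of the people, by the people, for the people", the word array looks like this: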
    Before sorting:
    word[0]: of the people, by the people, ...
    word[1]: the people, by the people, for ...
    word[2]: people, by the people, for the ...
    word[3]: by the people, for the people
    word[4]: the people, for the people
    word[5]: people, for the people
    word[6]: for the people
    word[7]: the people
    word[8]: people
    After sorting:
    word[0]: by the people, for the people
    word[1]: for the people
    word[2]: of the people, by the people, ...
    word[3]: people
    word[4]: people, by the people, ...
    word[5]: people, for the people
    word[6]: the people, by the people, ...
    word[7]: the people
    word[8]: the people, for the people
To find the words that can follow "the", we look it up in the suffix array and find three choices: "people," twice and "people" once.
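The lookup just described can be sketched as follows. The helpers build and followers are hypothetical names, the buffer sizes are reduced for illustration, and the input comes from a string via sscanf rather than from stdin.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXCHARS 1000
#define MAXWORDS 100

int k = 1;                    /* order 1: one word of context */
char inputchars[MAXCHARS];
char *word[MAXWORDS];
int nword = 0;

int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

int sortcmp(const void *a, const void *b)
{
    return wordncmp(*(char * const *)a, *(char * const *)b);
}

/* Hypothetical helper: read words from text, pad with k nulls, sort. */
void build(const char *text)
{
    int used = 0, i;

    assert(strlen(text) + (size_t)k + 1 < sizeof inputchars);
    nword = 0;
    word[0] = inputchars;
    while (sscanf(text, "%s%n", word[nword], &used) == 1) {
        text += used;
        word[nword + 1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
        assert(nword < MAXWORDS - 1);
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    qsort(word, (size_t)nword, sizeof word[0], sortcmp);
}

/* Hypothetical helper: store up to max followers of phrase in out[] and
   return how many occurrences of phrase the array contains. */
int followers(const char *phrase, const char **out, int max)
{
    int l = -1, u = nword, m, i, n;

    while (l + 1 != u) {              /* binary search for the first match */
        m = (l + u) / 2;
        if (wordncmp(word[m], phrase) < 0)
            l = m;
        else
            u = m;
    }
    for (i = 0; u + i < nword && wordncmp(phrase, word[u + i]) == 0; i++) {
        const char *p = word[u + i];
        for (n = k; n > 0; p++)       /* step over the k context words */
            if (*p == 0)
                n--;
        if (i < max)
            out[i] = p;               /* the word right after the match */
    }
    return i;
}
```

After build("of the people, by the people, for the people"), calling followers("the", out, 8) returns 3, and out holds "people," twice and "people" once, in whatever order the sort left them.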

Now we can generate nonsense text with the following pseudocode:
    phrase = first phrase in input array
    loop
        perform a binary search for phrase in word[0..nword-1]  // find the first occurrence of phrase
        for all phrases equal in the first k words  // scan every matching phrase and pick one at random
            select one at random, pointed to by p
        phrase = word following p
        if k-th word of phrase is length 0  // the phrase has reached the end of the document, so stop
            break
        print k-th word of phrase
The complete pseudocode implementation is:
    phrase = inputchars
    for (wordsleft = 10000; wordsleft > 0; wordsleft--)
        l = -1
        u = nword
        while l+1 != u
            m = (l + u) / 2
            if wordncmp(word[m], phrase) < 0
                l = m
            else
                u = m
        for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
            if rand() % (i+1) == 0
                p = word[u+i]
        phrase = skip(p, 1)    // skip(p, n) moves p forward past n words
        if strlen(skip(phrase, k-1)) == 0
            break
        print skip(phrase, k-1)
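
As a rough end-to-end sketch, the pseudocode above might be fleshed out in C like this. The skip function is an assumption inferred from how it is used (advance a pointer past n null-terminated words). The names build, append and generate are hypothetical helpers, the buffers are small, and the output is collected in a string rather than printed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXCHARS 1000   /* reduced sizes for illustration */
#define MAXWORDS 100

int k = 2;
char inputchars[MAXCHARS];
char *word[MAXWORDS];
int nword = 0;

int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

int sortcmp(const void *a, const void *b)
{
    return wordncmp(*(char * const *)a, *(char * const *)b);
}

/* Assumed helper: advance p past n null-terminated words. */
char *skip(char *p, int n)
{
    for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

/* Hypothetical helper: read words from text, pad with k nulls, sort. */
void build(const char *text)
{
    int used = 0, i;

    assert(strlen(text) + (size_t)k + 1 < sizeof inputchars);
    nword = 0;
    word[0] = inputchars;
    while (sscanf(text, "%s%n", word[nword], &used) == 1) {
        text += used;
        word[nword + 1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
        assert(nword < MAXWORDS - 1);
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    qsort(word, (size_t)nword, sizeof word[0], sortcmp);
}

/* Hypothetical helper: append w to out, space separated, if it fits. */
void append(char *out, size_t outsz, const char *w)
{
    size_t len = strlen(out);
    if (len + strlen(w) + 2 > outsz)
        return;
    if (len > 0)
        strcat(out, " ");
    strcat(out, w);
}

/* Hypothetical driver: first k input words, then up to maxwords more. */
void generate(char *out, size_t outsz, int maxwords)
{
    char *phrase = inputchars, *p;
    int i, l, u, m;

    assert(outsz > 0 && nword >= k);
    out[0] = '\0';
    for (i = 0; i < k; i++)
        append(out, outsz, skip(inputchars, i));
    for ( ; maxwords > 0; maxwords--) {
        l = -1;
        u = nword;
        while (l + 1 != u) {          /* first entry not less than phrase */
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        p = word[u];
        for (i = 0; u + i < nword && wordncmp(phrase, word[u + i]) == 0; i++)
            if (rand() % (i + 1) == 0)   /* keep match i with probability 1/(i+1) */
                p = word[u + i];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k - 1)) == 0)
            break;                    /* reached the null padding: end of input */
        append(out, outsz, skip(phrase, k - 1));
    }
}
```

When every k-word phrase in the input has a single possible successor, the output simply reproduces the input. Randomness only shows up where a phrase occurs more than once with different followers, such as "the people," in the example sentence with k = 2.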