int k = 2;
char inputchars[5000000];
char *word[1000000];
int nword = 0;
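First, the algorithm scans the entire input text to pick up each word. We treat the array word as a kind of suffix array pointing into the characters, except that its entries start only at word boundaries. The variable nword holds the number of words. We read the file with the following code: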
word[0] = inputchars
while scanf("%s", word[nword]) != EOF
    word[nword+1] = word[nword] + strlen(word[nword]) + 1
    nword++
Each word in the file is appended to inputchars and terminated by the null character that scanf supplies.
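Second, after reading the input, we sort the word array so that all pointers to the same sequence of k words end up next to each other. The sort relies on the following comparison: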
int wordncmp(char *p, char *q)
    n = k
    for (; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0
    return *p - *q
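While the characters are equal, the function scans along both strings. Each time it reaches a null character it decrements the counter n, and it returns 0 (equal) once k identical words have been seen. When it finds differing characters, it returns their difference (*p - *q).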
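Written out as plain C, the comparison might look like the sketch below. The const qualifiers and the assert on k are additions for illustration, not part of the original listing.

```c
#include <assert.h>

static int k = 2;   /* number of words compared, as in the text */

/* Compare the first k null-terminated words starting at p and q.
   Returns 0 if all k words match, otherwise the difference of the
   first pair of differing characters. */
int wordncmp(const char *p, const char *q)
{
    int n = k;

    assert(k > 0);   /* with k <= 0 the counter would never reach zero */
    for (; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)   /* a null ends a word; stop after k words */
            return 0;
    return *p - *q;
}
```

With k set to 2, the buffers "the\0people,\0by" and "the\0people,\0for" compare equal, because only the first two words are examined. With k set to 3 the result is negative, since 'b' sorts before 'f'.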
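After reading the input, we append k null characters after the last word (so that the comparison function never runs past the end of the string), print the first k words of the document (to start the random output), and call the sort: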
for i = [0, k)
    word[nword][i] = 0
for i = [0, k)
    print word[i]
qsort(word, nword, sizeof(word[0]), sortcmp)
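The text does not show sortcmp. Since qsort hands its callback pointers to the array elements rather than the elements themselves, sortcmp is presumably a thin adapter around wordncmp. The sketch below shows one way it could be written, assuming k = 1 and a hard-coded copy of the example sentence; the helper index_and_sort and the MAXWORDS limit are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORDS 16   /* hypothetical limit for this sketch */

static int k = 1;
/* "of the people, by the people, for the people" with one null after each
   word; the literal's own terminator supplies the k extra nulls (k = 1). */
static char inputchars[] =
    "of\0the\0people,\0by\0the\0people,\0for\0the\0people\0";
static char *word[MAXWORDS];
static int nword = 0;

static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for (; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* qsort passes pointers to array elements (char **), so unwrap one level. */
static int sortcmp(const void *a, const void *b)
{
    return wordncmp(*(char *const *)a, *(char *const *)b);
}

/* Record the start of every word, then sort the pointers by k-word phrase. */
void index_and_sort(void)
{
    char *p = inputchars;

    nword = 0;
    while (*p != 0) {               /* an empty word marks the padding */
        assert(nword < MAXWORDS);
        word[nword++] = p;
        p += strlen(p) + 1;
    }
    qsort(word, nword, sizeof(word[0]), sortcmp);
}
```

After index_and_sort() runs, the three pointers to "the" occupy the last three slots of word. Their relative order is not guaranteed, because qsort is not required to be stable.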
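This space-efficient data structure now holds a great deal of information about the k-grams in the text. If k is 1 and the input text is "of the people, by the people, for the people", the word array looks like this: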
Before sorting:
word[0]: of the people, by the people, ...
word[1]: the people, by the people, for ...
word[2]: people, by the people, for the ...
word[3]: by the people, for the people
word[4]: the people, for the people
word[5]: people, for the people
word[6]: for the people
word[7]: the people
word[8]: people
After sorting:
word[0]: by the people, for the people
word[1]: for the people
word[2]: of the people, by the people, ...
word[3]: people
word[4]: people, by the people, ...
word[5]: people, for the people
word[6]: the people, by the people, ...
word[7]: the people
word[8]: the people, for the people
To find the words that can follow "the", we look it up in the suffix array and find three choices: "people," twice and "people" once.
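The lookup for "the" can be checked against the sorted listing with a short sketch. The pointer table below is written out by hand in the sorted order shown in the listing, and the helpers first_match, count_matches, and follower are hypothetical names introduced only for this illustration (k is assumed to be 1).

```c
#include <assert.h>
#include <string.h>

static int k = 1;

/* The example text: one null after each word, k extra nulls at the end. */
static char text[] =
    "of\0the\0people,\0by\0the\0people,\0for\0the\0people\0";

/* Pointers in the sorted order of the listing (offsets into text). */
static char *word[] = {
    text + 15,  /* by the people, for the people */
    text + 30,  /* for the people */
    text + 0,   /* of the people, by the people, ... */
    text + 38,  /* people */
    text + 7,   /* people, by the people, ... */
    text + 22,  /* people, for the people */
    text + 3,   /* the people, by the people, ... */
    text + 34,  /* the people */
    text + 18,  /* the people, for the people */
};
static int nword = 9;

static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for (; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* Binary search: index of the first entry that is not less than phrase. */
int first_match(const char *phrase)
{
    int l = -1, u = nword;
    while (l + 1 != u) {
        int m = (l + u) / 2;
        if (wordncmp(word[m], phrase) < 0)
            l = m;
        else
            u = m;
    }
    return u;
}

/* Count the consecutive entries, starting at u, that equal phrase. */
int count_matches(const char *phrase, int u)
{
    int i = 0;
    assert(u >= 0 && u <= nword);
    while (u + i < nword && wordncmp(phrase, word[u + i]) == 0)
        i++;
    return i;
}

/* The word that follows entry i: step past the word and its null. */
char *follower(int i)
{
    return word[i] + strlen(word[i]) + 1;
}
```

Here first_match("the") lands on index 6 and count_matches reports a run of three entries. Their followers are "people," (twice) and "people" (once), matching the three choices described above.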
Now we can generate nonsense text with the following pseudocode:
phrase = first phrase in input array
loop
    perform a binary search for phrase in word[0..nword-1]  // find the first occurrence of phrase
    for all phrases equal in the first k words  // scan every equal phrase and pick one of them at random
        select one at random, pointed to by p
    phrase = word following p
    if k-th word of phrase is length 0  // an empty k-th word means the phrase reached the end of the document, so stop
        break
    print k-th word of phrase
The complete pseudocode implementation is:
phrase = inputchars
for (wordsleft = 10000; wordsleft > 0; wordsleft--)
    l = -1
    u = nword
    while l+1 != u
        m = (l + u) / 2
        if wordncmp(word[m], phrase) < 0
            l = m
        else
            u = m
    for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
        if rand() % (i+1) == 0
            p = word[u+i]
    phrase = skip(p, 1)
    if strlen(skip(phrase, k-1)) == 0
        break
    print skip(phrase, k-1)
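The pseudocode calls a helper skip that is not defined in the text. A plausible reading, sketched below, is that skip(p, n) advances p past n null-terminated words. The second function isolates the rand() % (i+1) test, which keeps the i-th candidate with probability 1/(i+1), so after the loop every candidate is roughly equally likely to have been chosen. The name pick_uniform is hypothetical and does not appear in the original.

```c
#include <assert.h>
#include <stdlib.h>

/* Advance p past n null-terminated words (assumed behavior of skip). */
char *skip(char *p, int n)
{
    for (; n > 0; p++)
        if (*p == 0)    /* each null marks the end of one word */
            n--;
    return p;
}

/* Choose one index in [0, count) using the same test as the pseudocode:
   candidate i replaces the current choice with probability 1/(i+1). */
int pick_uniform(int count)
{
    int chosen = 0;

    assert(count > 0);
    for (int i = 0; i < count; i++)
        if (rand() % (i + 1) == 0)
            chosen = i;
    return chosen;
}
```

Because the selection happens while scanning, the generator never needs to know in advance how many phrases match. For the buffer "of\0the\0people\0" followed by one extra null, skip(buf, 1) points at "the" and skip(buf, 3) points at the empty padding word, which is the condition the loop uses to stop.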