大部分的代碼、思路參考了《Ansi Common Lisp》P138~P141。
問題:給一篇英文文本,如何讓計算機依據此文本而生成隨機但可讀的文本。如:
|Venture|
The National Venture Capital Association estimates that wealth associated with a deal a big spending by regulations that will spend one another's main reason these projects .
這是計算機學習了Paul Graham的一些文章後生成的隨機文本。它根據Venture這個單詞向兩邊延伸成一個句子。令人驚喜的是,文本常常是可讀的。
算法:記錄每個單詞後面出現的單詞以及出現的次數,如I leave在原文中出現了5次,I want出現了3次,除此之外,其它地方沒有出現過I,所以在生成隨機文章的時候,當遇到I,有5/8的概率選擇leave爲下一個單詞。假如選擇了leave的話,則看看leave後面出現過哪些單詞,重複以上過程。
現用lisp來解決問題。
lisp裏的符號類型,即symbol,可以很好記錄各種字符串還有標點符號,所以採用它來記錄。採用內附的hashtable來建立列表:
(defparameter *words* (make-hash-table :size 10000))
那如何建立列表呢?
(let ((prev '|.|))
(defun see (sym)
(let ((pair (assoc sym (gethash prev *words*))))
(if pair
(incf (cdr pair))
(push (cons sym 1) (gethash prev *words*))))
(setf prev sym)))
以當前單詞爲keyword,以assoc-list關係列表爲該keyword下的值。
如I下有( (|leave| . 5) (|want| . 3) )。沒有單詞word的話,則push入(word . 1)。
如何隨機選一個詞呢?
(defun random-word (word ht)
(let* ((choices (gethash word ht))
(x (random (reduce #'+ choices :key #'cdr))))
(dolist (pair choices)
(decf x (cdr pair))
(if (minusp x)
(return (car pair))))))
這裏巧妙用了reduce函數。
現在再來思考,如何將給定一個詞向兩側延伸成一句話呢?
1)先將文本反向,得到一個反向的列表,也即I leave,I want變成leave I,want I。
2)將hashtable反向,得到另外一個hashtable,以後一個單詞爲關鍵字,前面可能出現的單詞及次數構成assoc-list。
3)碰運氣,從一個標點開始延續文章,直到出現給定單詞爲止。
我用了第二個方法:
(defparameter *r-words* (make-hash-table :size 10000))
(defun push-words (w1 w2 n)
(push (cons w2 n) (gethash w1 *r-words*)))
(defun get-reversed-words ();a cat -> cat a
(maphash #'(lambda (k lst)
(dolist (pair lst)
(push-words (car pair) k (cdr pair))))
*words*))
遍歷原來的hashtable,再把每一對單詞先後換個位置插入另外一個hashtable。
給出雙向延伸句子的自動生成文本代碼:
(defparameter *words* (make-hash-table :size 10000))
(defconstant maxword 100)
(defparameter nwords 0)
(defconstant debug nil)
(let ((prev '|.|))
(defun see (sym)
(incf nwords)
(let ((pair (assoc sym (gethash prev *words*))))
(if pair
(incf (cdr pair))
(push (cons sym 1) (gethash prev *words*))))
(setf prev sym)))
(defun check-punc (c);char to symbol
(case c
(#\. '|.|) (#\, '|,|)
(#\; '|;|) (#\? '|?|)
(#\: '|:|) (#\! '|!|)))
(defun read-text (pathname)
(with-open-file (str pathname :direction :input)
(let ((buf (make-string maxword))
(pos 0))
(do ((c (read-char str nil 'eof)
(read-char str nil 'eof)))
((eql c 'eof))
(if (or (alpha-char-p c)
(eql c #\'))
(progn
(setf (char buf pos) c)
(incf pos))
(progn
(unless (zerop pos)
(see (intern (subseq buf 0 pos)))
(setf pos 0))
(let ((punc (check-punc c)))
(if punc
(see punc)))))))))
(defun print-ht (ht)
(maphash #'(lambda (k v)
(format t "~A ~A~%" k v))
ht))
(defparameter *r-words* (make-hash-table :size 10000))
(defun push-words (w1 w2 n)
(push (cons w2 n) (gethash w1 *r-words*)))
(defun get-reversed-words ();a cat -> cat a
(maphash #'(lambda (k lst)
(dolist (pair lst)
(push-words (car pair) k (cdr pair))))
*words*))
(defun print-a-word (word ht)
(maphash #'(lambda (k lst)
(if (eql k word)
(format t "~A ~A~%" k lst)))
ht))
(if debug
(print-a-word '|leave| *r-words*))
(defun punc-p (sym);symbol to char,nil when fails.
(check-punc (char (symbol-name sym) 0)))
(defun random-word (word ht)
(let* ((choices (gethash word ht))
(x (random (reduce #'+ choices :key #'cdr))))
(dolist (pair choices)
(decf x (cdr pair))
(if (minusp x)
(return (car pair))))))
(defun gen-former (word str)
(let ((last (random-word word *r-words*)))
(if (not (punc-p last))
(progn
(gen-former last str)
(format str "~A " last)))))
(defun gen-latter (word str)
(let ((next (random-word word *words*)))
(format str "~A " next)
(if (not (punc-p next))
(gen-latter next str))))
;(gen-latter '|leave| t)
(defun get-a-word (ht);get a random word
(let ((x (random nwords)))
(maphash #'(lambda (k v)
(dolist (pair v)
(decf x (cdr pair))
(if (minusp x)
(return-from get-a-word (car pair)))))
ht)))
;(get-a-word *words*)
(defun gen-sentence (word str)
(gen-former word str)
(format str "~A " word)
(gen-latter word str))
(defun test ()
(setf nwords 0)
(read-text "essay.txt")
(get-reversed-words)
(let ((word (get-a-word *words*)))
(print word)
(gen-sentence word t)))
(test)
文本語料庫、lisp源代碼見:Here