[Natural Language Processing with Python, 2nd Edition] Reading Notes 1: Language Processing and Python


Source book: Natural Language Processing with Python, 2nd Edition

Preface

Broadly speaking, "natural language processing" (NLP) covers any kind of manipulation of natural language by computer.

NLTK defines an infrastructure for NLP programming in Python. It provides basic classes for representing data relevant to natural language processing, standard interfaces for tasks such as part-of-speech tagging, syntactic parsing, and text classification, and standard implementations of these tasks that can be combined to solve complex problems.

Language processing tasks, the corresponding NLTK modules, and their functionality:

| Language processing task | NLTK module | Functionality |
| --- | --- | --- |
| Accessing corpora | corpus | standardized interfaces to corpora and lexicons |
| String processing | tokenize, stem | word and sentence tokenization, stemming |
| Collocation discovery | collocations | t-test, chi-squared, pointwise mutual information (PMI) |
| Part-of-speech tagging | tag | n-gram, backoff, Brill, HMM, TnT |
| Machine learning | classify, cluster, tbl | decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | chunk | regular expressions, n-grams, named entities |
| Parsing | parse, ccg | chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | sem, inference | lambda calculus, first-order logic, model checking |
| Evaluation metrics | metrics | precision, recall, agreement coefficients |
| Probability and estimation | probability | frequency distributions, smoothed probability distributions |
| Applications | app, chat | graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | toolbox | manipulate data in SIL Toolbox format |

Language Processing and Python

I. Computing with Language: Texts and Words

1. Getting Started with NLTK

(1) Installation (nltk, nltk.book)

Install the data for nltk.book

import nltk
nltk.download()

Use nltk.download() to browse the available packages. The Collections tab in the downloader shows how the packages are grouped; select the row marked book to download all the data needed for the book's examples and exercises.


from nltk.book import *
print("text1 : ", text1)
print("text2 : ", text2)                                  

Output

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1 :  <Text: Moby Dick by Herman Melville 1851>
text2 :  <Text: Sense and Sensibility by Jane Austen 1811>

(2) Searching Text

# Note: these Text methods print their results themselves and return None,
# which is why "None" lines appear in the output below.
# Concordance: show every occurrence of "monstrous" in text1, with its context
print(text1.concordance("monstrous"))
# Words that appear in contexts similar to "monstrous" in text1
print(text1.similar("monstrous"))
# Contexts shared by two words in text2
print(text2.common_contexts(["monstrous", "very"]))
# Dispersion plot: where each word occurs across text4
print(text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]))

Output

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
None
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
None
a_pretty am_glad a_lucky is_pretty be_glad
None


(3) Counting Vocabulary

# Total number of tokens in text3
print(len(text3))
# Sorted list of distinct tokens; note that capitalized words sort before lowercase ones.
print(sorted(set(text3)))
# Number of distinct tokens
print(len(set(text3)))
# Lexical richness: distinct tokens are about 6% of all tokens, i.e. each word is used about 16 times on average
print(len(set(text3)) / len(text3))
# Count of "smote" in the text
print(text3.count("smote"))
# Percentage of text4 taken up by "a"
print(100 * text4.count('a') / len(text4))
print('--------------------'*2)

# Compute lexical richness
def lexical_diversity(text):
	return len(set(text)) / len(text)

# Compute the frequency of word in text, as a percentage
def percentage(word, text):
	return 100 * text.count(word) / len(text)

print(lexical_diversity(text3))
print(lexical_diversity(text5))
print(percentage('a', text4))

Output

44764
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', ...,  'With', 'Woman', 'Ye', 'Yea', 'Yet', 'Zaavan', 'Zaphnathpaaneah', 'Zar', 'Zarah', 'Zeboiim', 'Zeboim', 'Zebul', 'Zebulun', 'Zemarite', 'Zepho', 'Zerah', 'Zibeon', 'Zidon', 'Zillah', 'Zilpah', 'Zimran', 'Ziphion', 'Zo', 'Zoar', 'Zohar', 'Zuzims', 'a', 'abated', 'abide', 'able', 'abode', 'abomination', 'about', 'above', 'abroad', 'absent', 'abundantly', 'accept', 'accepted', 'according', 'acknowledged', 'activity', 'add', ..., 'yielded', 'yielding', 'yoke', 'yonder', 'you', 'young', 'younge', 'younger', 'youngest', 'your', 'yourselves', 'youth']
2789
0.06230453042623537
5
1.4643016433938312
----------------------------------------
0.06230453042623537
0.13477005109975562
1.4643016433938312
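The two helper functions above are plain Python and work on any list of strings, not just NLTK Text objects. A minimal sketch on a hypothetical toy token list (not drawn from the book's corpora):

```python
# Hypothetical toy token list standing in for a tokenized text
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

def lexical_diversity(text):
    # distinct tokens divided by total tokens
    return len(set(text)) / len(text)

def percentage(word, text):
    # occurrences of `word`, as a percentage of all tokens
    return 100 * text.count(word) / len(text)

print(lexical_diversity(tokens))   # 6 distinct tokens out of 7
print(percentage("the", tokens))   # "the" occurs twice in 7 tokens
```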

2. Lists and Strings

(1) List Operations

print('sent2 : ', sent2)
# Concatenation: combine several lists into one.
print('List : ', ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail'])
# Appending: add a single element to the end
print('sent1 : ', sent1)
sent1.append("Some")
print('sent1 : ', sent1)

Output

sent2 :  ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
List :  ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
sent1 :  ['Call', 'me', 'Ishmael', '.']
sent1 :  ['Call', 'me', 'Ishmael', '.', 'Some']

(2) Indexing Lists

# Get the token at a given index
print(text4[173])
# Get the index of a token's first occurrence
print(text4.index('awaken'))

# Slicing: pull arbitrary spans out of a large text, i.e. get sublists
print(text5[16715:16735])
print(text6[1600:1625])

sent = ['word1', 'word2', 'word3', 'word4', 'word5', 
		'word6', 'word7', 'word8', 'word9', 'word10']
print(sent[5:8])  # sent[5], sent[6], sent[7]
print(sent[0])
print(sent[9])

sent[0] = 'First'
sent[9] = 'Last'
# Replace a whole slice with new content
sent[1:9] = ['Second', 'Third']
print(sent)
# The list now has only four elements, so accessing an index beyond that raises an error
# print(sent[9])

Output

awaken
173
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
['word6', 'word7', 'word8']
word1
word10
['First', 'Second', 'Third', 'Last']
# Traceback (most recent call last):
#   File "/home/jie/Jie/codes/nlp/1_nltk.py", line 60, in <module>
#     print(sent[9])
# IndexError: list index out of range
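Indices can also be negative, counting back from the end of the list; that is what makes it easy to grab the last few words of a text later on. A small sketch on a hypothetical list:

```python
sent = ['word1', 'word2', 'word3', 'word4', 'word5']

print(sent[-1])    # last element
print(sent[-2:])   # last two elements
print(sent[:-1])   # everything except the last element
```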

(3) Variables

Form: variable = expression

(4) Strings

Some of the methods used to access list elements also work on individual words, i.e. strings.

name = 'Monty'

# Indexing and slicing
print(name[0])
print(name[:4])

# Repetition and concatenation
print(name * 2)
print(name + '!')

Output

M
Mont
MontyMonty
Monty!

Converting between strings and lists

print(' '.join(['Monty', 'Python']))
print('Monty Python'.split())

Output

Monty Python
['Monty', 'Python']
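join and split are near-inverses for simple space-separated text, but split() with no argument collapses any run of whitespace, so the round trip is not always byte-for-byte exact. A quick sketch:

```python
words = ['Monty', 'Python']
joined = ' '.join(words)

print(joined)           # the list glued back into a string
print(joined.split())   # back to the original list

# split() with no argument also swallows extra whitespace and newlines
print('Monty    Python\n'.split())
```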

II. Computing with Language: Simple Statistics

1. Frequency Distributions

A frequency distribution records the frequency of each vocabulary item in a text.

# Frequency distribution: how the text's word tokens are distributed across vocabulary items
from nltk import FreqDist

fdist1 = FreqDist(text1)
print(fdist1)
print(fdist1.most_common(50))
print(fdist1['whale'])		# frequency of "whale"
# Cumulative frequency plot of the 50 most common words in Moby Dick:
# these words account for nearly half of all tokens.
fdist1.plot(50, cumulative=True)
# Words that occur only once (hapaxes)
print(fdist1.hapaxes())

Output

<FreqDist with 19317 samples and 260819 outcomes>
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
906
['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly', 'flags', 'mortality', 'signification', 'HACKLUYT', 'Sw',...'suction', 'closing', 'Ixion', 'Till', 'liberated', 'Buoyed', 'dirgelike', 'padlocks', 'sheathed', 'retracing', 'orphan']

Cumulative frequency plot of the 50 most common words in Moby Dick: these words account for nearly half of all tokens.
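FreqDist behaves much like the standard library's collections.Counter; a minimal Counter-based sketch of the same operations, on a hypothetical token list:

```python
from collections import Counter

tokens = ['the', 'whale', 'the', 'sea', 'the', 'whale', 'ship']
fdist = Counter(tokens)

print(fdist.most_common(2))   # the two most frequent tokens, with counts
print(fdist['whale'])         # count of a single token
# hapaxes: tokens that occur exactly once
print([w for w, n in fdist.items() if n == 1])
```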

2. Fine-Grained Selection of Words

(1) Words longer than 15 characters

V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words), '\n')

Output

['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly'] 

(2) Frequently occurring long words

# All words longer than 7 characters that occur more than 7 times
fdist5 = FreqDist(text5)
long_words1 = [w for w in set(text5) if len(w) > 7 and fdist5[w] > 7]
print(long_words1, '\n')

Output

['remember', '((((((((((', 'listening', '#talkcity_adults', 'actually', 'football', 'seriously', 'something', 'innocent', 'everyone', 'Question', 'watching', '#14-19teens', 'anything', 'computer', 'tomorrow', 'together', '........', 'cute.-ass'] 

(3) Extracting word pairs (bigrams)

from nltk import bigrams

bigrams_words = bigrams(['more', 'is', 'said', 'than', 'done'])
print(list(bigrams_words), '\n')

Output

[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')] 
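Under the hood, a bigram is just each word paired with its successor, so the same result can be had with plain zip:

```python
words = ['more', 'is', 'said', 'than', 'done']
# zip the list with itself shifted by one position
bigram_list = list(zip(words, words[1:]))
print(bigram_list)
```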

(4) Extracting frequent collocations from a text

collocations(): prints word pairs that occur together unusually often

print(text4.collocations(), '\n')

Output

United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
None 

3. Counting Other Things

(1) Distribution of word lengths in a text

# Frequency of each word length in the text
fdist = FreqDist(len(w) for w in text1)
print(fdist) 
print(fdist.most_common())
print(fdist.max())
# Count of words of length 3
print(fdist[3])
# Relative frequency of words of length 3
print(fdist.freq(3))	

Output

<FreqDist with 19 samples and 260819 outcomes>
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
3
50223
0.19255882431878046

Analysis: the most frequent word length is 3, and there are over 50,000 three-letter words (about 20% of all tokens in the book).
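The same word-length distribution can be computed with collections.Counter on any token list; a sketch with a toy list:

```python
from collections import Counter

tokens = ['the', 'cat', 'sat', 'on', 'a', 'mat', '.']
length_dist = Counter(len(w) for w in tokens)

print(length_dist.most_common())       # (length, count) pairs, most frequent first
print(length_dist[3])                  # number of 3-character tokens
print(length_dist[3] / len(tokens))    # their relative frequency, like fdist.freq(3)
```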

(2) [w for w in text if condition]

# Words ending in "ableness"
print(sorted(w for w in set(text1) if w.endswith('ableness')))

# Words containing "gnt"
print(sorted(term for term in set(text4) if 'gnt' in term))

# Words starting with a capital letter (titlecase)
print(sorted(item for item in set(text6) if item.istitle()))

# Tokens consisting entirely of digits
print(sorted(item for item in set(sent7) if item.isdigit()))

# Tokens that are not entirely lowercase
print(sorted(w for w in set(sent7) if not w.islower()))

# Convert every word to uppercase
print([w.upper() for w in text1])

# Keep only alphabetic tokens of text1, lowercase them, deduplicate, then count
print(len(set(word.lower() for word in text1 if word.isalpha())))

Output

['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
['Sovereignty', 'sovereignties', 'sovereignty']
['A', 'Aaaaaaaaah', ... , 'Woa', 'Wood', 'Would', 'Y', 'Yapping', 'Yay', 'Yeaaah', 'Yeaah', 'Yeah', 'Yes', 'You', 'Your', 'Yup', 'Zoot']
['29', '61']
[',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', '(', 'SUPPLIED', 'BY', 'SHARP', 'BLEAK', 'CORNER', ',', 'WHERE', ... , 'WILD', 'OATS', 'IN', 'ALL', 'FOUR', 'OCEANS', '.', 'THEY', 'HAD', 'MADE', 'A', 'HARP
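The predicates used in these comprehensions are ordinary str methods and work on any list of strings; a small sketch with hypothetical tokens:

```python
words = ['The', 'cat', 'SAT', 'on', 'mat2', '42', '.']

print([w for w in words if w.istitle()])      # initial capital, rest lowercase
print([w for w in words if w.isdigit()])      # digits only
print([w for w in words if not w.islower()])  # not entirely lowercase
print([w.upper() for w in words])             # everything uppercased
```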

(3) Conditionals and loops

Example 1:

for token in sent1:
	if token.islower():
		print(token, 'is a lowercase word')
	elif token.istitle():
		print(token, 'is a titlecase word')
	else:
		print(token, 'is punctuation')

Output

Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation

Example 2:

tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
for word in tricky:
	# print without a newline between words
	print(word, end=' ')

Output

ancient ceiling conceit conceited conceive conscience conscientious conscientiously deceitful deceive deceived deceiving deficiencies deficiency deficient delicacies excellencies fancied insufficiency insufficient legacies perceive perceived perceiving prescience prophecies receipt receive received receiving society species sufficient 

III. Understanding Natural Language

Key points: information extraction, inference, and summarization

IV. Exercises

1. What is the difference between the following two lines? Which one yields a larger value? Does the same hold for other texts?

sorted(set([w.lower() for w in text1]))
sorted([w.lower() for w in set(text1)])

The second is larger. The first lowercases every word before applying set, so forms that differ only in case collapse into a single element. The second applies set first, so case variants of the same word survive as distinct elements; lowercasing them afterwards leaves duplicate lowercase entries in the resulting list.
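A toy list (hypothetical tokens, not drawn from text1) makes the difference concrete:

```python
tokens = ['This', 'this', 'THIS', 'dog']

a = sorted(set(w.lower() for w in tokens))   # lowercase first, then deduplicate
b = sorted(w.lower() for w in set(tokens))   # deduplicate first, then lowercase

print(a)   # case variants collapse into one entry
print(b)   # duplicate lowercase entries survive
```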

2. What is the difference between the tests w.isupper() and not w.islower()?

w.isupper() — True only if every cased character in w is uppercase (and w contains at least one cased character)
not w.islower() — True whenever w is not entirely lowercase; this also covers strings with no cased characters at all, such as digits and punctuation
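A few probes show the asymmetry, including the no-cased-characters edge case:

```python
for w in ['HELLO', 'Hello', 'hello', '42']:
    print(w, w.isupper(), not w.islower())
# '42' has no cased characters, so both isupper() and islower() are False
```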

3. Write a slice expression that extracts the last two words of text2.

text2[-2:]
# ['THE', 'END']

4. Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.

fdist = FreqDist([w for w in text5 if len(w)==4])
print(fdist.most_common())

Output

[('JOIN', 1021), ('PART', 1016), ('that', 274), ('what', 183), ('here', 181), ('....', 170), ('have', 164), ('like', 156), ('with', 152), ('chat', 142), ('your', 137), ('good', 130), ('just', 125), ('lmao', 107), ..., ('brwn', 1), ('hurr', 1), ('Were', 1)]

5. Write expressions to find all words in text6 that meet the following conditions. The result should be a list of words: ['word1', 'word2', ...].

  • ending in "ize"
  • containing the letter "z"
  • containing the letter sequence "pt"
  • all lowercase except for an initial capital (i.e. titlecase)
print([w for w in text6 if w.endswith('ize')])
print([w for w in text6 if 'z' in w])
print([w for w in text6 if 'pt' in w])
print([w for w in text6 if w.istitle()])