Python 3 word frequency counting

This mainly exercises regular expressions, the built-in string methods, and the Counter class from the collections module.


Regular expressions

http://www.runoob.com/python3/python3-reg-expressions.html

re.split      The split function cuts a string at every substring that matches the pattern and returns the pieces as a list. Its signature is:

re.split(pattern, string[, maxsplit=0, flags=0])
>>> import re
>>> txt = "The little prince crossed the desert and met with only one flower."
>>> strings = re.split(r'\W+', txt)
>>> print(strings)
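['The', 'little', 'prince', 'crossed', 'the', 'desert', 'and', 'met', 'with', 'only', 'one', 'flower', '']
# Note the trailing empty string (the text ends with a non-word character, the period), and that case is preserved: 'The' and 'the' are different entries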
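
The empty string at the end of that list comes from the final period: re.split still returns whatever follows the last match, even when nothing is left. A minimal sketch of two ways to avoid it, plus the maxsplit argument from the signature above (the variable names are only illustrative):

```python
import re

txt = "The little prince crossed the desert and met with only one flower."

# Option 1: split as before, then drop any empty pieces
words = [w for w in re.split(r'\W+', txt) if w]

# Option 2: match the words directly instead of the separators
found = re.findall(r'\w+', txt)

# maxsplit limits the number of splits; the remainder stays in one piece
first_two = re.split(r'\W+', txt, maxsplit=2)

print(words)
print(first_two)
```

Both options give the same word list here. re.findall is often the simpler choice when only the words matter.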

Python's built-in string methods

Reference: http://www.runoob.com/python3/python3-string.html

replace(old, new [, max])

Returns a copy of the string with occurrences of old replaced by new. If max is given, at most max replacements are made.

lower()

Returns a copy of the string with all uppercase characters converted to lowercase.

>>> txt = "The little prince crossed the desert and met with only one flower."
>>> new_txt = txt.replace('\n', ' ').lower()
>>> print(new_txt)
the little prince crossed the desert and met with only one flower.
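
A small sketch of the optional third argument to replace, which the example above does not use. The tutorial calls it max; Python's own documentation names it count, and it is passed positionally here. The sample strings are made up for illustration:

```python
line = "one\ntwo\nthree\nfour"

# Replace only the first two newlines; the third one is left alone
limited = line.replace('\n', ' ', 2)

# Strings are immutable: replace() and lower() return new strings
title = "The Little PRINCE"
lowered = title.lower()

print(repr(limited))
print(lowered, '|', title)
```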


The Counter class in the collections module

http://www.pythoner.com/205.html

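The Counter class is meant for tracking how many times each value occurs. It is a container type (a dict subclass) that stores its data as dictionary key-value pairs, with each element as the key and its count as the value. A count can be any integer, including zero and negative numbers. Counter is very similar to the bags or multisets found in other languages.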
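
A short sketch of the behaviour described above, using a made-up list of letters. It shows counting, looking up a missing key, and a count going negative:

```python
from collections import Counter

counts = Counter(['a', 'b', 'a', 'c', 'a', 'b'])

# The two most frequent elements, as (element, count) pairs
top_two = counts.most_common(2)

# A missing key returns 0 instead of raising KeyError, and is not added
missing = counts['z']
has_z = 'z' in counts

# subtract() can push a count below zero
counts.subtract(['c', 'c'])
c_count = counts['c']

print(top_two, missing, has_z, c_count)
```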


Putting it together

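# test.py
import re
from collections import Counter

# Read the whole file; the with-block closes it afterwards
with open('test.txt') as f:
    txt = f.read()

# Turn newlines into spaces and lowercase, so 'The' and 'the' count as one word
new_txt = txt.replace('\n', ' ').lower()
# Split on runs of non-word characters (raw string avoids an invalid-escape warning)
strings = re.split(r'\W+', new_txt)
result = Counter(strings)
# Number of occurrences of every word
print(result)
# The 10 most common words
print(result.most_common(10))
# Occurrences of one specific word (出現的次數 means "number of occurrences")
print("flower 出現的次數:%d" % result["flower"])
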
$ cat test.txt
The little prince crossed the desert and met with only one flower. It was a flower with three petals, a flower of no account at all.
"Good morning," said the little prince.
"Good morning," said the flower.
"Where are the men?" the little prince asked, politely.
The flower had once seen a caravan passing.
"Men?" she echoed. "I think there are six or seven of them in existence. I saw them, several years ago. But one never knows where to find them. The wind blows them away. They have no roots, and that makes their life very difficult."
"Goodbye," said the little prince.
"Goodbye," said the flower.

$ python3 test.py
Counter({'the': 10, 'flower': 6, 'them': 4, 'little': 4, 'said': 4, 'prince': 4, 'a': 3, 'morning': 2, 'men': 2, 'one': 2, 'i': 2, 'good': 2, 'goodbye': 2, 'of': 2, 'where': 2, 'no': 2, 'with': 2, 'and': 2, 'are': 2, 'seen': 1, '': 1, 'ago': 1, 'met': 1, 'several': 1, 'they': 1, 'or': 1, 'all': 1, 'makes': 1, 'knows': 1, 'their': 1, 'echoed': 1, 'asked': 1, 'never': 1, 'six': 1, 'saw': 1, 'had': 1, 'petals': 1, 'seven': 1, 'caravan': 1, 'passing': 1, 'to': 1, 'blows': 1, 'roots': 1, 'but': 1, 'difficult': 1, 'in': 1, 'have': 1, 'only': 1, 'at': 1, 'find': 1, 'was': 1, 'think': 1, 'once': 1, 'life': 1, 'existence': 1, 'years': 1, 'politely': 1, 'she': 1, 'very': 1, 'there': 1, 'three': 1, 'crossed': 1, 'it': 1, 'that': 1, 'away': 1, 'desert': 1, 'wind': 1, 'account': 1})
[('the', 10), ('flower', 6), ('them', 4), ('little', 4), ('said', 4), ('prince', 4), ('a', 3), ('morning', 2), ('men', 2), ('one', 2)]
flower 出現的次數:6
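
The output above includes an entry '': 1. That is the empty string re.split leaves behind when the text ends with punctuation. One possible variant that avoids it is sketched below; count_words is a hypothetical helper name, not part of the original script, and the sample text is two lines taken from test.txt:

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text and count runs of word characters (hypothetical helper)."""
    # findall matches the words themselves, so no empty strings are produced
    return Counter(re.findall(r'\w+', text.lower()))


sample = (
    "The little prince crossed the desert and met with only one flower.\n"
    '"Good morning," said the flower.'
)
counts = count_words(sample)

print(counts.most_common(2))
print('' in counts)
```

Because \w+ also matches across newlines, the separate replace('\n', ' ') step is not needed in this variant.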




