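Python 3 word frequency counting

This is mainly an exercise in using regular expressions, the built-in string methods, and the Counter class from the collections module.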


Regular expressions

http://www.runoob.com/python3/python3-reg-expressions.html

re.split      The split function cuts the string at every substring that matches the pattern and returns the pieces as a list. Its signature is:

re.split(pattern, string[, maxsplit=0, flags=0])
>>> import re
>>> txt = "The little prince crossed the desert and met with only one flower."
>>> strings = re.split(r'\W+', txt)
>>> print(strings)
['The', 'little', 'prince', 'crossed', 'the', 'desert', 'and', 'met', 'with', 'only', 'one', 'flower', '']  
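# Note the extra empty string at the end (the text ends with a separator, the final period), and that case is preserved ('The' and 'the' are different strings)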
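
The trailing empty string appears because re.split finds a separator at the very end of the text and returns the empty piece after it. The following is a minimal sketch of two ways to avoid it, plus the maxsplit argument from the signature above. The variable names words, found, and head are illustrative only.

```python
import re

txt = "The little prince crossed the desert and met with only one flower."

# Drop the empty pieces left by leading or trailing separators.
words = [w for w in re.split(r'\W+', txt) if w]

# Alternative: match the words themselves instead of splitting on the gaps.
found = re.findall(r'\w+', txt)

# maxsplit caps the number of splits; the rest of the string stays intact.
head = re.split(r'\W+', txt, maxsplit=2)

print(words[-1])       # flower
print(found == words)  # True
print(head)            # ['The', 'little', 'prince crossed the desert and met with only one flower.']
```

Both approaches give the same word list here; re.findall is often the shorter choice when only the words are needed.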

Python's built-in string methods

Reference: http://www.runoob.com/python3/python3-string.html

replace(old, new [, max])

Replaces occurrences of old in the string with new. If max is given, no more than max replacements are made.

lower()

Converts all uppercase characters in the string to lowercase.

>>> txt = "The little prince crossed the desert and met with only one flower."
>>> new_txt = txt.replace('\n', '').lower()
>>> print(new_txt)
the little prince crossed the desert and met with only one flower.
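
As a small illustration of the optional third argument of replace and of how the two methods combine, here is a sketch on a made-up sentence (the text in line is not from the test file). Both methods return a new string and leave the original unchanged.

```python
line = "Good morning, good morning, GOOD morning"

# The optional third argument caps the number of replacements.
once = line.replace("morning", "evening", 1)

# replace() is case-sensitive, so lowercase first to catch every variant.
normalized = line.lower().replace("good", "fine")

print(once)        # Good evening, good morning, GOOD morning
print(normalized)  # fine morning, fine morning, fine morning
print(line)        # Good morning, good morning, GOOD morning
```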


The Counter class in the collections module

http://www.pythoner.com/205.html

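The Counter class keeps track of how many times each value occurs. It is a dict subclass: elements are stored as keys and their counts as values. Like a plain dict, it is not sorted by count. Counts can be any integer, including zero and negative numbers. Counter is similar to the bags or multisets found in other languages.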
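
A short sketch of the behaviour described above, using a made-up word list: counting, looking up a missing key, and a count that drops below zero.

```python
from collections import Counter

words = ["the", "flower", "the", "prince", "the", "flower"]
counts = Counter(words)

print(counts["the"])          # 3
print(counts["desert"])       # 0, a missing key does not raise KeyError
print(counts.most_common(2))  # [('the', 3), ('flower', 2)]

# update() adds counts; subtract() removes them and may go below zero.
counts.update(["prince"])
counts.subtract(["flower", "flower", "flower"])
print(counts["prince"])       # 2
print(counts["flower"])       # -1
```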


Putting it together

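import re
from collections import Counter

# Read the whole file; the with block closes it afterwards
with open('test.txt') as f:
    txt = f.read()
# Turn newlines into spaces and lowercase everything
new_txt = txt.replace('\n', ' ').lower()
# Split on runs of non-word characters (the raw string avoids invalid-escape warnings)
# Note: the text ends with a separator, so one empty string '' ends up in the list and is counted too
strings = re.split(r'\W+', new_txt)
result = Counter(strings)
# Number of occurrences of every word
print(result)
# The 10 most frequent words
print(result.most_common(10))
# Number of occurrences of one particular word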
print("Occurrences of flower: %d" % result["flower"])

$ cat test.txt
The little prince crossed the desert and met with only one flower. It was a flower with three petals, a flower of no account at all.
"Good morning," said the little prince.
"Good morning," said the flower.
"Where are the men?" the little prince asked, politely.
The flower had once seen a caravan passing.
"Men?" she echoed. "I think there are six or seven of them in existence. I saw them, several years ago. But one never knows where to find them. The wind blows them away. They have no roots, and that makes their life very difficult."
"Goodbye," said the little prince.
"Goodbye," said the flower.

$ python3 test.py
Counter({'the': 10, 'flower': 6, 'them': 4, 'little': 4, 'said': 4, 'prince': 4, 'a': 3, 'morning': 2, 'men': 2, 'one': 2, 'i': 2, 'good': 2, 'goodbye': 2, 'of': 2, 'where': 2, 'no': 2, 'with': 2, 'and': 2, 'are': 2, 'seen': 1, '': 1, 'ago': 1, 'met': 1, 'several': 1, 'they': 1, 'or': 1, 'all': 1, 'makes': 1, 'knows': 1, 'their': 1, 'echoed': 1, 'asked': 1, 'never': 1, 'six': 1, 'saw': 1, 'had': 1, 'petals': 1, 'seven': 1, 'caravan': 1, 'passing': 1, 'to': 1, 'blows': 1, 'roots': 1, 'but': 1, 'difficult': 1, 'in': 1, 'have': 1, 'only': 1, 'at': 1, 'find': 1, 'was': 1, 'think': 1, 'once': 1, 'life': 1, 'existence': 1, 'years': 1, 'politely': 1, 'she': 1, 'very': 1, 'there': 1, 'three': 1, 'crossed': 1, 'it': 1, 'that': 1, 'away': 1, 'desert': 1, 'wind': 1, 'account': 1})
[('the', 10), ('flower', 6), ('them', 4), ('little', 4), ('said', 4), ('prince', 4), ('a', 3), ('morning', 2), ('men', 2), ('one', 2)]
Occurrences of flower: 6
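
The Counter output above includes an entry for the empty string '', a side effect of splitting on separators. One possible variant, sketched here under the assumption that only runs of word characters matter, wraps the steps in a function and uses re.findall so that no empty string is produced. The function name count_words and the sample text are illustrative and not part of the original script.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter of the lowercase words found in text."""
    # \w+ matches runs of word characters, so no empty strings are produced
    # and newlines need no special handling.
    return Counter(re.findall(r'\w+', text.lower()))


sample = '"Good morning," said the little prince.\n"Good morning," said the flower.\n'
result = count_words(sample)

print(result["good"])  # 2
print('' in result)    # False
```

Note that \w also matches digits and underscores, and that an apostrophe splits a word such as "don't" into two pieces, so the pattern may need adjusting for other texts.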




