Python 3 word frequency counting

Original

wozaiyizhideng

2018-09-05 02:53

This post is mainly an exercise in applying regular expressions, built-in string methods, and the Counter class from the collections module.

Regular expressions

http://www.runoob.com/python3/python3-reg-expressions.html

re.split: the split function divides a string at every substring that matches the pattern and returns the pieces as a list. It is used as follows:

re.split(pattern, string[, maxsplit=0, flags=0])

>>> import re
>>> txt = "The little prince crossed the desert and met with only one flower."
>>> strings = re.split(r'\W+', txt)
>>> print(strings)
['The', 'little', 'prince', 'crossed', 'the', 'desert', 'and', 'met', 'with', 'only', 'one', 'flower', '']
# Note: the list ends with an empty string because the text ends with '.', and case is preserved ('The' and 'the' are distinct).
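The trailing empty string in the output above appears because the text ends with a non-word character. A minimal sketch of two common ways to avoid it, assuming the same sample sentence (the variable names are illustrative only), plus a quick look at what maxsplit does:

```python
import re

txt = "The little prince crossed the desert and met with only one flower."

# Option 1: split as before, then drop any empty strings.
words_split = [w for w in re.split(r'\W+', txt) if w]

# Option 2: match the words directly instead of splitting on the gaps.
words_found = re.findall(r'\w+', txt)

# maxsplit caps the number of splits; the remainder stays in one piece.
first_two = re.split(r'\W+', txt, maxsplit=2)

print(words_split)
print(words_found == words_split)
print(first_two)
```

Both options give the same word list here; re.findall is often the shorter choice when only the words themselves are needed.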

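Python's built-in string methods

Reference: http://www.runoob.com/python3/python3-string.html

replace(old, new [, max])	Replaces occurrences of old in the string with new; if max is given, at most max replacements are made.
lower()	Converts all uppercase characters in the string to lowercase.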

>>> txt = "The little prince crossed the desert and met with only one flower."
>>> new_txt = txt.replace('\n', '').lower()
>>> print(new_txt)
the little prince crossed the desert and met with only one flower.
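The sample sentence above contains no newline, so the replace call has no visible effect there. A minimal sketch with a multi-line string, assuming made-up sample text, shows what replace and lower actually do, including the optional count argument:

```python
txt = "The Little Prince\nmet the flower.\nThe flower said hello."

# Replace every newline with a space so words on adjacent lines stay separate.
one_line = txt.replace('\n', ' ')

# The optional third argument caps the number of replacements.
first_only = txt.replace('\n', ' ', 1)

# lower() returns a new string; the original string is unchanged.
lowered = one_line.lower()

print(one_line)
print(repr(first_only))
print(lowered)
```

Replacing the newline with an empty string instead of a space would glue "Prince" and "met" into one token, which is why a space is the safer replacement before splitting into words.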

The Counter class in the collections module

http://www.pythoner.com/205.html

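The Counter class tracks how many times each value occurs. It is a dict subclass that stores data as key-value pairs, with the elements as keys and their counts as values. A count can be any integer, including zero and negative numbers. Counter is similar to the bags or multisets found in other languages.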
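A minimal sketch of the behaviour described above, assuming a small hand-written word list (the data is illustrative only):

```python
from collections import Counter

words = ['the', 'flower', 'the', 'prince', 'the', 'flower']
counts = Counter(words)

print(counts['the'])          # count of an existing key
print(counts['desert'])       # missing keys report 0 instead of raising KeyError
print(counts.most_common(2))  # the two most frequent elements with their counts

# Counts can be decreased and may drop to zero or below.
counts.subtract(['prince', 'prince'])
print(counts['prince'])

# Adding two counters sums the counts key by key.
total = Counter(['a', 'b']) + Counter(['a'])
print(total)
```

Looking up a missing key returns 0 without inserting the key, so it is safe to query words that never appeared.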

Putting it together

import re
from collections import Counter

# Read the whole file; the with block closes it afterwards.
with open('test.txt') as f:
    txt = f.read()
new_txt = txt.replace('\n', ' ').lower()
strings = re.split(r'\W+', new_txt)
result = Counter(strings)
# Number of occurrences of each word
print(result)
# The 10 most frequent words
print(result.most_common(10))
# Number of occurrences of one particular word
print("Occurrences of flower: %d" % result["flower"])

$ cat test.txt
The little prince crossed the desert and met with only one flower. It was a flower with three petals, a flower of no account at all.
"Good morning," said the little prince.
"Good morning," said the flower.
"Where are the men?" the little prince asked, politely.
The flower had once seen a caravan passing.
"Men?" she echoed. "I think there are six or seven of them in existence. I saw them, several years ago. But one never knows where to find them. The wind blows them away. They have no roots, and that makes their life very difficult."
"Goodbye," said the little prince.
"Goodbye," said the flower.

$ python3 test.py
Counter({'the': 10, 'flower': 6, 'them': 4, 'little': 4, 'said': 4, 'prince': 4, 'a': 3, 'morning': 2, 'men': 2, 'one': 2, 'i': 2, 'good': 2, 'goodbye': 2, 'of': 2, 'where': 2, 'no': 2, 'with': 2, 'and': 2, 'are': 2, 'seen': 1, '': 1, 'ago': 1, 'met': 1, 'several': 1, 'they': 1, 'or': 1, 'all': 1, 'makes': 1, 'knows': 1, 'their': 1, 'echoed': 1, 'asked': 1, 'never': 1, 'six': 1, 'saw': 1, 'had': 1, 'petals': 1, 'seven': 1, 'caravan': 1, 'passing': 1, 'to': 1, 'blows': 1, 'roots': 1, 'but': 1, 'difficult': 1, 'in': 1, 'have': 1, 'only': 1, 'at': 1, 'find': 1, 'was': 1, 'think': 1, 'once': 1, 'life': 1, 'existence': 1, 'years': 1, 'politely': 1, 'she': 1, 'very': 1, 'there': 1, 'three': 1, 'crossed': 1, 'it': 1, 'that': 1, 'away': 1, 'desert': 1, 'wind': 1, 'account': 1})
[('the', 10), ('flower', 6), ('them', 4), ('little', 4), ('said', 4), ('prince', 4), ('a', 3), ('morning', 2), ('men', 2), ('one', 2)]
Occurrences of flower: 6
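The script above counts the empty string as a word (the '': 1 entry in its output) and reads from a fixed file name. A minimal sketch of one way to wrap the same steps in a reusable function, assuming a hypothetical helper name word_frequencies and swapping re.split for re.findall so that no empty tokens are produced:

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter of the lowercase words in text (hypothetical helper)."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # and newlines never become tokens and no empty string is counted.
    return Counter(re.findall(r'\w+', text.lower()))


# A short excerpt of test.txt, inlined so the sketch runs without the file.
sample = (
    'The little prince crossed the desert and met with only one flower.\n'
    '"Good morning," said the little prince.\n'
    '"Good morning," said the flower.\n'
)
freq = word_frequencies(sample)
print(freq.most_common(1))
print(freq['flower'])
```

To count a whole file, pass in its contents, for example the string read from test.txt inside a with open(...) block.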
