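I just finished reading 秦吏 and wanted to know which character, apart from 黑夫, appears most often, so I ran a simple analysis with Python's jieba library.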
import jieba
import wordcloud
# Read the whole novel; the with-block closes the file afterwards
with open("秦吏.txt", "r", encoding="utf-8") as f:
    txt = f.read()
excludes = {"他们","自己","一个","没有","就是","虽然","还是","不是","知道","已经","继续","什么","有些","只是","因为","众人","还有","如此","眼下","如今","所以", "那些",
"将军","起来","这些","不过","开始","可以","只能","郡守","甚至","便是","这个","不能","这是","最后","出来","作为","说道","看着","于是","一样","过去","地方",
"以为","时候","觉得","兵卒","为了","可能","立刻","而是","现在","之后","今日","发现","不知","二人","不会","这样","除了","这种","如何","这么","只有","真是",
"不少","官府","大军","恐怕","依然","看到","一直","都尉","的话","离开","不敢","不同","几个","一起","却是","十分","郡尉","需要","时间","下来","这时候","为何",
"一边","有人","抵达","记住","一些","两个","当年","明白","得到","此事","一般", "听说", "南方", "后世", "一点", "看来", "无法", "心里", "com", "阅读网", "www",
"mayitxt","只要","才能","东西","希望","想要","一次","的确","必须","士卒","粮食","一切","过来","战争","商贾","这场","摇头","朝廷","就算","楚人","这里","回来",
"律令","官吏","不必","当然","认为","秦朝","秦人","匈奴","秦军","咸阳","天下","秦国","胶东","南郡","楚国","安陆","关中","秦吏","百姓","中原","蚂蚁","楚军",
"直接",}
# Segment the text in jieba's default (precise) mode
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        # Skip single-character tokens
        continue
    elif word == "亭长" or word == "武忠侯":
        rword = "黑夫"
    elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
        rword = "始皇"
    elif word == "公子":
        rword = "扶苏"
    elif word == "东门":
        rword = "东门豹"
    elif word == "大胡子" or word == "美髯公":
        rword = "刘季"
    else:
        rword = word
    # Count under the merged name so that aliases add to the same total
    counts[rword] = counts.get(rword, 0) + 1
# Drop excluded words; pop() ignores words that never occurred
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the 15 most frequent entries
for word, count in items[:15]:
    print("{0:<15}{1:>5}".format(word, count))
filtered = " ".join(words)
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")
The idea is to segment the text with jieba in precise mode:
words = jieba.lcut(txt)
Then drop single-character words:
if len(word) == 1:
    continue
Merge the different words that refer to the same character into one name:
elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
    rword = "始皇"
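The chain of `elif` comparisons can also be written as a lookup table. The sketch below is one possible arrangement, not the code that produced the results in this post; the names `ALIASES` and `count_names` are made up for illustration, and the alias list is simply the one from the script.

```python
from collections import Counter

# Hypothetical alias table: each key is folded into the name on the right.
ALIASES = {
    "亭长": "黑夫", "武忠侯": "黑夫",
    "嬴政": "始皇", "皇帝": "始皇", "陛下": "始皇", "始皇帝": "始皇", "秦王": "始皇",
    "公子": "扶苏",
    "东门": "东门豹",
    "大胡子": "刘季", "美髯公": "刘季",
}

def count_names(words):
    """Count tokens of length >= 2 after folding aliases into one name."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # skip single-character tokens
        counts[ALIASES.get(word, word)] += 1
    return counts

sample = ["黑夫", "亭长", "的", "陛下", "始皇", "武忠侯"]
print(count_names(sample))
```

With a table like this, adding another alias only means adding one entry, and every alias is counted under the merged name.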
Sort by number of appearances:
items.sort(key=lambda x:x[1],reverse=True)
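If the counts are kept in a `collections.Counter` instead of a plain dict, sorting and taking the top entries can be done in one call. A minimal sketch with placeholder data (the words and numbers below are invented, not results from the novel):

```python
from collections import Counter

counts = Counter({"甲": 5, "乙": 9, "丙": 2})  # placeholder data
top = counts.most_common(2)  # entries with the highest counts first
for word, count in top:
    print("{0:<15}{1:>5}".format(word, count))
```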
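On the first run I found a problem: some words I did not want, such as 他们 and 自己, were among the most frequent, so I went through several rounds of filtering with an exclusion set: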
excludes = {"他们","自己","一个","没有","就是","虽然","还是","不是","知道","已经","继续","什么","有些","只是","因为","众人","还有","如此","眼下","如今","所以", "那些",
"将军","起来","这些","不过","开始","可以","只能","郡守","甚至","便是","这个","不能","这是","最后","出来","作为","说道","看着","于是","一样","过去","地方",
"以为","时候","觉得","兵卒","为了","可能","立刻","而是","现在","之后","今日","发现","不知","二人","不会","这样","除了","这种","如何","这么","只有","真是",
"不少","官府","大军","恐怕","依然","看到","一直","都尉","的话","离开","不敢","不同","几个","一起","却是","十分","郡尉","需要","时间","下来","这时候","为何",
"一边","有人","抵达","记住","一些","两个","当年","明白","得到","此事","一般", "听说", "南方", "后世", "一点", "看来", "无法", "心里", "com", "阅读网", "www",
"mayitxt","只要","才能","东西","希望","想要","一次","的确","必须","士卒","粮食","一切","过来","战争","商贾","这场","摇头","朝廷","就算","楚人","这里","回来",
"律令","官吏","不必","当然","认为","秦朝","秦人","匈奴","秦军","咸阳","天下","秦国","胶东","南郡","楚国","安陆","关中","秦吏","百姓","中原","蚂蚁","楚军",
"直接",}
for word in excludes:
    counts.pop(word, None)
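Deleting with `del` raises `KeyError` when an excluded word never occurred in the text, so it helps to filter in a way that tolerates missing keys. One option, sketched here with a hypothetical helper `drop_excluded` and placeholder numbers, is to build a new dict:

```python
def drop_excluded(counts, excludes):
    """Return a new dict without the excluded words; absent words are ignored."""
    return {w: c for w, c in counts.items() if w not in excludes}

counts = {"他们": 40, "黑夫": 12, "自己": 30}  # placeholder numbers
excludes = {"他们", "自己", "一个"}           # "一个" does not occur in counts
kept = drop_excluded(counts, excludes)
print(kept)
```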
The final output was:
黑夫 16914
秦始皇 2705
扶苏 1940
陈平 1427
韩信 1398
李斯 1105
赵高 923
季婴 899
李由 810
刘季 772
李信 726
王贲 702
张苍 641
萧何 612
项籍 583
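I did not expect 始皇 and 扶苏 to come in second and third.
Of course, the code still has plenty of problems, and there are many factors it does not take into account.
Finally, a word cloud for visualization: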
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")
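Note that `filtered` is just every token joined by spaces, so the cloud is drawn from the unfiltered text. If the cloud should reflect the merged and filtered counts instead, `WordCloud.generate_from_frequencies` accepts a word-to-count mapping. A sketch under that assumption; `top_frequencies` and `render_cloud` are hypothetical helper names, and the demo data is a placeholder:

```python
def top_frequencies(counts, n=100):
    """Keep the n most frequent words as a plain {word: count} dict."""
    ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    return dict(ranked[:n])

def render_cloud(freqs, out_path, font_path="msyh.ttc"):
    """Draw a cloud from a {word: count} dict (needs wordcloud and a CJK font)."""
    import wordcloud  # imported here so top_frequencies works without it
    w = wordcloud.WordCloud(font_path=font_path, width=1000, height=700)
    w.generate_from_frequencies(freqs)
    w.to_file(out_path)

demo = top_frequencies({"甲": 5, "乙": 9, "丙": 2}, n=2)  # placeholder data
print(demo)
# With the real counts: render_cloud(top_frequencies(counts), "秦吏.png")
```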