2018.5.27 (Python) Example: word-frequency analysis of text (one English text, one Chinese text) and using the list sort() method

Original post

2018-08-30 11:33

Source code

def getText():
    # Read the whole file; the with-block closes it afterwards
    with open("hmlt.txt","r") as f:
        txt=f.read()
    txt=txt.lower()
    for ch in '`!@#~$%^&*()_+-=*/{}[];,./?<>':
        txt=txt.replace(ch," ")
    return txt
hmltTxt=getText()
words=hmltTxt.split()
counts={}
for word in words:
    counts[word]=counts.get(word,0)+1
items=list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
# Slicing is safe even if the text has fewer than 100 distinct words
for word,count in items[:100]:
    print("{0:<10}{1:>5}".format(word,count))

Annotated version

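def getText():
    with open("hmlt.txt","r") as f:  # open the file; the with-block closes it afterwards
        txt=f.read()                 # read the whole text
    txt=txt.lower()                  # lowercase everything so that case does not split the counts
    for ch in '`!@#~$%^&*()_+-=*/{}[];,./?<>': # go through the special characters
        txt=txt.replace(ch," ")   # replace each one with a space, i.e. remove it
    return txt
hmltTxt=getText()    # read and normalize the file
words=hmltTxt.split()
# The words are now separated only by whitespace, so split() breaks the text on whitespace and returns a list
counts={} # create a dictionary
for word in words:
    counts[word]=counts.get(word,0)+1
    # Look the word up as a key: if it is present, take its count and add 1;
    # if it is absent, get() returns 0, so the count starts at 1
items=list(counts.items())
# counts.items() returns an iterable view of (key, value) tuples; list() turns it into a list
items.sort(key=lambda x:x[1],reverse=True)
# list.sort() sorts in place, so the list itself is modified
for word,count in items[:100]:   # slicing is safe even with fewer than 100 distinct words
    print("{0:<10}{1:>5}".format(word,count))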
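To make the counting step easier to try without hmlt.txt, here is a minimal sketch on a short in-memory string. The sample sentence and the reduced punctuation list are assumptions for illustration only; the counting and sorting calls are the same ones used in the program.

```python
# Hypothetical in-memory sample so the sketch runs without hmlt.txt
sample = "To be, or not to be: that is the question."

text = sample.lower()
for ch in ",.:":                 # only the punctuation that occurs in the sample
    text = text.replace(ch, " ")

counts = {}
for word in text.split():
    # get() returns 0 for a word not seen yet, so there is no KeyError
    counts[word] = counts.get(word, 0) + 1

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
print(items[:2])   # [('to', 2), ('be', 2)]
```

Words with equal counts keep the order in which they were first seen, because list.sort() is stable and dictionaries preserve insertion order (Python 3.7+).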

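# In the Hamlet output, most of the high-frequency words are articles, pronouns, conjunctions and the like, which do not reflect what the text is about.
# As a next step, a set named excludes can hold the words to ignore, and those words are removed from the results before printing.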

excludes={"the","and","of","you","a","with","but","as","be","in","or","are"}

def getText():
    # Read the whole file; the with-block closes it afterwards
    with open("hmlt.txt","r") as f:
        txt=f.read()
    txt=txt.lower()
    for ch in '`!@#~$%^&*()_+-=*/{}[];,./?<>':
        txt=txt.replace(ch," ")
    return txt
hmltTxt=getText()
words=hmltTxt.split()
counts={}
for word in words:
    counts[word]=counts.get(word,0)+1
for word in excludes:
    # pop() with a default does not raise KeyError if the word never occurred
    counts.pop(word,None)
items=list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for word,count in items[:10]:
    print("{0:<10}{1:>5}".format(word,count))
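One detail worth checking when removing excluded words: deleting a key that is not in the dictionary with del raises KeyError. The sketch below uses a small made-up counts dictionary (not real Hamlet data) to show one way to remove words safely.

```python
# Made-up counts for illustration; not taken from the real text
counts = {"the": 5, "hamlet": 3, "and": 4, "lord": 2}
excludes = {"the", "and", "of"}    # "of" is deliberately missing from counts

for word in excludes:
    # del counts[word] would raise KeyError for "of";
    # pop() with a default simply ignores missing keys
    counts.pop(word, None)

print(counts)   # {'hamlet': 3, 'lord': 2}
```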

/************** Chinese text ********************/

import jieba
# Read the whole file; the with-block closes it afterwards
with open("threekingdoms.txt","r",encoding='utf-8') as f:
    txt=f.read()
words=jieba.lcut(txt)   # segment the text into a list of words
counts={}
for word in words:
    if len(word)==1:  # skip segmentation results that are a single character
        continue
    else:
        counts[word]=counts.get(word,0)+1
items=list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for word,count in items[:15]:
    print("{0:<10}{1:>5}".format(word,count))

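# In the output, both "玄德" and "玄德曰" appear. They refer to the same person, but jieba treats them as two separate words, so cases like this need to be merged.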

excludes={"將軍","卻說","二人","不可","荊州","不能","如此"}
import jieba
# Read the whole file; the with-block closes it afterwards
with open("threekingdoms.txt","r",encoding='utf-8') as f:
    txt=f.read()
words=jieba.lcut(txt)
counts={}
for word in words:
    if len(word)==1:  # skip segmentation results that are a single character
        continue

    elif word=="諸葛亮" or word=="孔明曰":
        rword="孔明"

    elif word=="關公" or word=="雲長":
        rword="關羽"

    elif word=="玄德" or word=="玄德曰":
        rword="劉備"

    elif word=="孟德" or word=="丞相":
        rword="曹操"

    else:
        rword=word
    # Count under the merged name rword, for every branch above
    counts[rword]=counts.get(rword,0)+1
for word in excludes:
    # pop() with a default does not raise KeyError if the word never occurred
    counts.pop(word,None)
items=list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for word,count in items[:15]:
    print("{0:<10}{1:>5}".format(word,count))
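As the list of merged names grows, the elif chain gets long. One possible alternative, sketched here, is to keep the name pairs in a dictionary. The aliases dictionary and the pre-segmented words list are assumptions for illustration; the list stands in for the result of jieba.lcut(txt) so the sketch runs without jieba or the text file.

```python
# Hypothetical pre-segmented input standing in for jieba.lcut(txt)
words = ["玄德", "曰", "孔明", "玄德曰", "諸葛亮", "雲長", "關公", "玄德"]

# alias -> merged name (the same pairs as in the elif chain)
aliases = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}

counts = {}
for word in words:
    if len(word) == 1:                # skip single-character results
        continue
    rword = aliases.get(word, word)   # fall back to the word itself
    counts[rword] = counts.get(rword, 0) + 1

print(counts)   # {'劉備': 3, '孔明': 2, '關羽': 2}
```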

Using sort() in the code above

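1. The sort method sorts a list in place. In-place sorting means the original list is modified so that its elements are in order; no sorted copy of the list is returned.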

x = [4, 6, 2, 1, 7, 9]
x.sort()
print(x)
# [1, 2, 4, 6, 7, 9]

What if you need a sorted copy while leaving the original list unchanged?

①

>>> x = [4, 6, 2, 1, 7, 9]
>>> y=x[ : ]
>>> y.sort()
>>> print(y)
[1, 2, 4, 6, 7, 9]
>>> print(x)
[4, 6, 2, 1, 7, 9]

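Note: y = x[:] uses slicing to copy all the elements of list x into a new list y. With a plain assignment, y = x, both y and x still refer to the same list and no new copy is created.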
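The difference between plain assignment and a slice copy is easy to see in a short check. This is a minimal sketch; the names alias and snapshot are made up for illustration.

```python
x = [4, 6, 2, 1, 7, 9]
alias = x          # a second name for the same list object
snapshot = x[:]    # a new list holding the same elements

alias.sort()       # sorts the one list that both x and alias refer to

print(x)           # [1, 2, 4, 6, 7, 9]
print(snapshot)    # [4, 6, 2, 1, 7, 9]
print(alias is x, snapshot is x)   # True False
```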

②

>>> x = [4, 6, 2, 1, 7, 9]
>>> y=x.copy()
>>> y.sort()
>>> print(y)
[1, 2, 4, 6, 7, 9]
>>> print(x)
[4, 6, 2, 1, 7, 9]

This first makes a copy and assigns it to y, then sorts y.
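Besides the two approaches above, the built-in sorted() function offers a third way that the post does not cover: it returns a new sorted list and leaves the original alone. The sketch also shows a common pitfall, namely that sort() itself returns None.

```python
x = [4, 6, 2, 1, 7, 9]

y = sorted(x)      # new sorted list; x is not modified
print(y)           # [1, 2, 4, 6, 7, 9]
print(x)           # [4, 6, 2, 1, 7, 9]

z = list(x)
result = z.sort()  # sort() works in place and returns None
print(result)      # None
```

Because sort() returns None, writing y = x.sort() does not give a sorted copy; it sorts x and leaves y set to None.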

Advanced sorting

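The sort method accepts two optional parameters: key and reverse. They are specified by name, that is, as keyword arguments (in Python 3 they can only be passed this way).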

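The key parameter plays a role similar to the cmp parameter (which exists only in Python 2; Python 3 removed it from sort): you set it to a function used for sorting. However, this function is not used directly to decide whether one element is smaller than another. Instead, it is used to compute a key for each element, and the elements are then sorted according to those keys.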
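A small sketch of how key works, using made-up sample lists: the function is called once per element, and the list is ordered by the returned values rather than by the elements themselves.

```python
fruits = ["banana", "fig", "cherry", "kiwi"]
fruits.sort(key=len)    # the key for each element is its length
print(fruits)           # ['fig', 'kiwi', 'banana', 'cherry']

pairs = [("a", 3), ("b", 1), ("c", 2)]
pairs.sort(key=lambda x: x[1])   # the key is the second item of each tuple
print(pairs)            # [('b', 1), ('c', 2), ('a', 3)]
```

"banana" stays ahead of "cherry" because both have length 6 and the sort is stable. The second call is the same pattern the word-frequency code uses to order (word, count) tuples by count.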

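reverse --> simply set it to a truth value (True or False) to indicate whether the list should be sorted in reverse order.

True: descending order. False: ascending order (the default).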
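A short sketch of reverse on its own and combined with key, using made-up sample data.

```python
nums = [4, 6, 2, 1, 7, 9]
nums.sort(reverse=True)     # descending order
print(nums)                 # [9, 7, 6, 4, 2, 1]

pairs = [("or", 1), ("to", 2), ("be", 2)]
pairs.sort(key=lambda x: x[1], reverse=True)   # highest count first
print(pairs)                # [('to', 2), ('be', 2), ('or', 1)]
```

Elements with equal keys ("to" and "be" here) keep their original relative order even when reverse=True, since the sort stays stable.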
