Word segmentation in step 1 is handled by jieba, and the results look quite good.
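For reference, a minimal jieba call looks like this (the sentence is jieba's own demo example; jieba.cut returns a generator of tokens):

import jieba

# Accurate mode (cut_all=False, the default).
print('/'.join(jieba.cut('我来到北京清华大学')))
# -> 我/来到/北京/清华大学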
Step 2. Counting Word Frequencies
Counting word frequencies is comparatively simple; it is mainly a small extension of Python's built-in Counter class. One thing worth noting is that stop words need to be removed. Stop words are words that occur so frequently (commas, periods, and the like) that they carry no discriminative power. Stop word lists are easy to find online; I converted one to a binary (pickled) format ahead of time and stored it.
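As an illustration of that one-off conversion, a stop word list downloaded as plain text (one word per line) could be pickled like this; the input file name stop_words.txt is an assumption, and the output path matches the one loaded later by WordCounter:

import pickle

# Read one stop word per line, drop blanks, and pickle the list
# so it can be loaded quickly at startup.
with open('stop_words.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f if line.strip()]
with open('./static/stop_words.pkl', 'wb') as f:
    pickle.dump(stop_words, f)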
2.1 MulCounter
MulCounter does the job of counting word frequencies from a list of words.
It is a subclass of Counter. The reason for not using Counter directly is that although Counter can count frequencies, it cannot filter. MulCounter adds two methods, larger_than and less_than, which filter out words that occur too rarely or too frequently.
from collections import Counter
from operator import itemgetter as _itemgetter


class MulCounter(Counter):
    # A class extending collections.Counter with two filtering
    # methods: larger_than and less_than.
    def __init__(self, element_list):
        super().__init__(element_list)

    def larger_than(self, minvalue, ret='list'):
        # Sort by frequency in descending order, then binary-search for
        # the end of the prefix whose counts are >= minvalue.
        temp = sorted(self.items(), key=_itemgetter(1), reverse=True)
        low = 0
        high = len(temp)
        while high - low > 1:
            mid = (low + high) >> 1
            if temp[mid][1] >= minvalue:
                low = mid
            else:
                high = mid
        if not temp or temp[low][1] < minvalue:
            return {} if ret == 'dict' else []
        if ret == 'dict':
            return dict(temp[:high])
        return temp[:high]

    def less_than(self, maxvalue, ret='list'):
        # Sort by frequency in ascending order, then binary-search for
        # the end of the prefix whose counts are <= maxvalue.
        temp = sorted(self.items(), key=_itemgetter(1))
        low = 0
        high = len(temp)
        while high - low > 1:
            mid = (low + high) >> 1
            if temp[mid][1] <= maxvalue:
                low = mid
            else:
                high = mid
        if not temp or temp[low][1] > maxvalue:
            return {} if ret == 'dict' else []
        if ret == 'dict':
            return dict(temp[:high])
        return temp[:high]
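A quick sanity check of the two filters (the words and counts here are arbitrary):

mc = MulCounter(['a', 'a', 'a', 'b', 'b', 'c'])
print(mc.larger_than(2))             # [('a', 3), ('b', 2)]
print(mc.less_than(1, ret='dict'))   # {'c': 1}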
2.2 WordCounter
WordCounter does the job of counting word frequencies from raw text. More precisely, it segments the full text, filters out stop words, and then hands the preprocessed word list to MulCounter for counting.
import jieba


class WordCounter:
    # Calculates the frequency of words in a list of texts.
    # For example:
    # >>> data = ['Merge multiple sorted inputs into a single sorted output',
    #             'The API below differs from textbook heap algorithms in two aspects']
    # >>> wc = WordCounter(data)
    # >>> print(wc.count_res)
    # MulCounter({' ': 18, 'sorted': 2, 'single': 1, 'below': 1, 'inputs': 1, 'The': 1,
    #             'into': 1, 'textbook': 1, 'API': 1, 'algorithms': 1, 'in': 1, 'output': 1,
    #             'heap': 1, 'differs': 1, 'two': 1, 'from': 1, 'aspects': 1, 'multiple': 1,
    #             'a': 1, 'Merge': 1})
    def __init__(self, text_list):
        self.text_list = text_list
        self.stop_word = self.Get_Stop_Words()
        self.count_res = None
        self.Word_Count(self.text_list)

    def Get_Stop_Words(self):
        # FI is the project's file-I/O helper; it unpickles the
        # pre-built stop word list.
        return FI.load_pickle('./static/stop_words.pkl')

    def Word_Count(self, text_list, cut_all=False):
        filtered_word_list = []
        for i, line in enumerate(text_list):
            # Segment each line with jieba and replace the raw text
            # with its token list in place.
            res = list(jieba.cut(line, cut_all=cut_all))
            text_list[i] = res
            filtered_word_list += res
        self.count_res = MulCounter(filtered_word_list)
        # Remove stop words from the counting result.
        for word in self.stop_word:
            try:
                self.count_res.pop(word)
            except KeyError:
                pass
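Putting the two classes together, a usage sketch (the sample sentences are arbitrary, ./static/stop_words.pkl is assumed to exist as built earlier, and the exact counts depend on that stop word list):

docs = ['我来到北京清华大学', '清华大学是著名学府']
wc = WordCounter(docs)
# docs has now been mutated: each string is replaced by its token list.
print(wc.count_res.larger_than(2, ret='dict'))

Note that Word_Count mutates text_list in place, replacing each raw line with its token list; pass a copy if the original strings are still needed.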