Using Python to count high-frequency words and fuzzy-match non-standard company names

Original

奇妙探险家

2020-06-23 01:01

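How it works:

1. Segment the names with jieba and identify the unimportant high-frequency words ('股份', '有限', '公司', etc.). Stripping them simplifies the company names to be matched and keeps these generic words from distorting the similarity score.

2. Use FuzzyWuzzy to compute the similarity between each company name to be processed and the standard company names, and take the most similar one (based on the Levenshtein edit distance).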
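
To see why the generic words matter, here is a minimal sketch that uses only the standard library. It assumes `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` (FuzzyWuzzy falls back to it when python-Levenshtein is not installed; with python-Levenshtein the scores may differ slightly). The helper names `similarity` and `strip_generic` and the short word list are illustrative and not part of the original code.

```python
from difflib import SequenceMatcher

# Illustrative subset of generic words (assumption); the article's full list is longer.
GENERIC_WORDS = ['融資', '租賃', '有限', '公司']

def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def strip_generic(name):
    """Remove every generic word from a company name."""
    for word in GENERIC_WORDS:
        name = name.replace(word, '')
    return name

# Two different companies that share a long generic suffix
a = '廣州越秀融資租賃有限公司'
b = '廣州國邦融資租賃有限公司'

print(similarity(a, b))                                # 83: the shared suffix inflates the score
print(similarity(strip_generic(a), strip_generic(b)))  # 50: only '廣州' is still shared
```

Once the generic suffix is stripped, the two names are clearly distinguishable, which is the effect step 1 aims for.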

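import jieba
import pandas as pd

# jieba.dt is jieba's default tokenizer instance
tokenizer = jieba.dt
# Sheet '待處理公司名' holds the company names to be processed
data=pd.read_excel(r"C:\Users\data\單詞相似度.xlsx",sheet_name='待處理公司名')

# Segment every distinct company name and collect all tokens
m=[]
for x in set(data['待處理字段']):
    for y in tokenizer.cut(x):
        m.append(y)

# Count the tokens, most frequent first; the top entries are stopword candidates
print(pd.Series(m).value_counts())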
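
The frequency table still has to be turned into a filter list by hand. As a rough sketch of that step, the snippet below keeps tokens that appear in at least a given share of the names. It uses `collections.Counter` instead of pandas, and the hand-segmented sample names, the function `frequent_tokens`, and the `min_share` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, hand-segmented names standing in for real tokenizer output.
tokenized_names = [
    ['廣州', '越秀', '融資', '租賃', '有限', '公司'],
    ['廣州', '國邦', '融資', '租賃', '有限', '公司'],
    ['越秀', '集團', '股份', '有限', '公司'],
]

def frequent_tokens(tokenized, min_share=0.6):
    """Return the tokens that occur in at least `min_share` of the names."""
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per name
    threshold = min_share * len(tokenized)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)

print(frequent_tokens(tokenized_names, min_share=1.0))  # ['公司', '有限']
print(frequent_tokens(tokenized_names, min_share=0.6))  # also includes '廣州' and '越秀'
```

At the lower threshold, place names and brand names such as '廣州' and '越秀' show up next to the generic words. These are exactly the parts that distinguish one company from another, so the candidates should be reviewed manually before they go into the filter list.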

import pandas as pd
from fuzzywuzzy import fuzz

data1=pd.read_excel(r"C:\Users\data\單詞相似度.xlsx",sheet_name='待處理公司名')
data2=pd.read_excel(r"C:\Users\data\單詞相似度.xlsx",sheet_name='標準公司名')

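# Build the filter list of stopwords from the high-frequency tokens counted after segmentation,
# then use it to simplify the names and to normalize synonyms
def sinple(x):
    temp=x
    for stopword in ['有限','責任','公司','投資','證券','集團','（','）','管理','基金','私募','資產','控股','股份','資本','金控','國際','科技','建設','開發','發展','融資','租賃']:
        temp=temp.replace(stopword,'')
    # Map the long form to its common abbreviation so both spellings compare as equal
    temp=temp.replace('農村商業銀行','農商行')
    return temp
# print(sinple('廣州越秀融資租賃有限公司'))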

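# Given one company name to process, find the closest string in the list of standard company names
def findWord(word,candidates):
    maxRatio=0
    target=''
    simpleWord=sinple(word)
    for x in candidates:
        # Integer part: similarity of the simplified names;
        # fractional part: similarity of the unsimplified names, which breaks ties
        ratio=fuzz.ratio(simpleWord,sinple(x))+fuzz.ratio(word,x)/100
        if ratio>maxRatio:
            maxRatio=ratio
            target=x
    return word,target,maxRatio
# findWord('廣州國邦融資租賃有限公司 ',data2['company_nm'])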

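# Given a list of names to process, find the closest standard name and its similarity for each one; returns a DataFrame
def findlist(targetList,fromList):
    # Collect all rows first and build the DataFrame once instead of growing it row by row with .loc
    rows=[findWord(x,fromList) for x in set(targetList)]
    # Keep the 1-based index that the row-by-row version produced
    return pd.DataFrame(rows,columns=['target','result','ratio'],index=range(1,len(rows)+1))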

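data=findlist(data1['item_client'],data2['company_nm'])
# utf_8_sig writes a BOM so that Excel recognizes the file as UTF-8 and displays the Chinese text correctly
data.to_csv(r'c:\Users\data\單詞相似度.csv',encoding='utf_8_sig')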
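
The ranking rule in findWord (compare the simplified names first, then use the full names to break ties) can also be written with a tuple as the sort key, since Python compares tuples element by element. The following is only a sketch under several assumptions: `difflib.SequenceMatcher` stands in for `fuzz.ratio`, the stopword list is a small subset, the names `similarity`, `simplify`, `score` and `best_match` are hypothetical, and the query plus the second candidate are made up for the example.

```python
from difflib import SequenceMatcher

# Small subset of the article's stopword list, for illustration only.
STOPWORDS = ['有限', '公司', '融資', '租賃', '集團', '股份']

def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def simplify(name):
    """Strip the stopwords from a company name."""
    for stopword in STOPWORDS:
        name = name.replace(stopword, '')
    return name

def score(word, candidate):
    """Return (simplified score, full score); tuples compare left to right."""
    return (similarity(simplify(word), simplify(candidate)),
            similarity(word, candidate))

def best_match(word, candidates):
    """Return the candidate with the highest score tuple."""
    return max(candidates, key=lambda c: score(word, c))

word = '廣州越秀租賃公司'            # made-up query
candidates = [
    '廣州越秀融資租賃有限公司',
    '廣州越秀集團股份有限公司',      # made-up candidate
    '廣州國邦融資租賃有限公司',
]

print(score(word, candidates[0]))   # (100, 80)
print(score(word, candidates[1]))   # (100, 60)
print(best_match(word, candidates)) # 廣州越秀融資租賃有限公司
```

The first two candidates tie on the simplified score because both reduce to '廣州越秀'; the full-name score then decides between them, which is the role the fractional part plays in findWord.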
