P.S.: Reference code. Original post: https://blog.csdn.net/Q_QuanTing/article/details/82698229
Why am I doing this?
Some time after starting my job I was assigned NLP-related work. While collecting corpora, I found that the Sogou input method offers many word banks that are already sorted by category, so I worked out a way to crawl them.
Function Introduction
- Crawl the `.scel` files under the specified categories and save them in the `scel_bank` folder.
```python
import re
import urllib.parse

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm


def get_cate_list(res_html):
    """
    Get the subcategory hyperlinks under
    https://pinyin.sogou.com/dict/cate/index/132/default/:
    "Basic Medicine (39), Western Pharmacology (52), Traditional Chinese
    Medicine (71), Chinese Materia Medica (42), Acupuncture (2), Diseases (18),
    Ultrasound Medicine (5), Otorhinolaryngology (3), Forensic Medicine (2),
    Nursing (4), Anatomy (12), Stomatology (9), Cosmetic Surgery (11),
    Dermatology (8), Veterinary Medicine (5), Medical Devices (19),
    Medical Imaging (5), Tumor Morphology (1), Laboratory Medicine (3),
    Medical Care (32), Surgery (8), Other (41)".
    """
    # Collect the second-level subcategory links
    dict_cate_dict = {}
    soup = BeautifulSoup(res_html, "lxml")
    dict_td_lists = soup.find_all("div", class_="cate_no_child no_select")
    # Parse the type-1 entries: map category name -> absolute URL
    for dict_td_list in dict_td_lists:
        dict_td_url = "https://pinyin.sogou.com" + dict_td_list.a['href']
        dict_cate_dict[dict_td_list.get_text().replace("\n", "")] = dict_td_url
    return dict_cate_dict
```
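The absolute URL built by string concatenation above can also be derived with `urllib.parse.urljoin`, which handles relative paths without worrying about slashes; the `href` value below is a made-up example, not taken from the live site:

```python
from urllib.parse import urljoin

# Hypothetical relative href as it might appear in a subcategory link
href = "/dict/cate/index/396/default"

# urljoin resolves the relative href against the site root, matching
# the manual "https://pinyin.sogou.com" + href concatenation
url = urljoin("https://pinyin.sogou.com", href)
print(url)  # https://pinyin.sogou.com/dict/cate/index/396/default
```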
```python
def get_page(res_html):
    """
    Get the number of pages in a subcategory.
    """
    soup = BeautifulSoup(res_html, "html.parser")
    dict_div_lists = soup.find("div", id="dict_page_list")
    dict_td_lists = dict_div_lists.find_all("a")
    if dict_td_lists == []:
        # No pager links means a single page
        return 1
    else:
        # The second-to-last pager link holds the last page number
        page = dict_td_lists[-2].string
        return int(page)
```
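`get_page` only returns the page count; the later pages are fetched by appending `/default/<n>` to the subcategory URL. That pattern is an assumption read out of the main loop of this script, and the helper below just makes the construction explicit:

```python
def page_url(cate_url, page):
    # Page 1 is the subcategory page itself; later pages live at
    # <cate_url>/default/<n> (assumed from the URL pattern used below)
    if page == 1:
        return cate_url
    return cate_url + '/default/' + str(page)

print(page_url("https://pinyin.sogou.com/dict/cate/index/396", 3))
# https://pinyin.sogou.com/dict/cate/index/396/default/3
```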
```python
def get_download_list(res_html):
    """Get the download links on the current page."""
    dict_dl_dict = {}
    pattern = re.compile(r'name=(.*)')
    soup = BeautifulSoup(res_html, "html.parser")
    dict_dl_lists = soup.find_all("div", class_="dict_dl_btn")
    for dict_dl_list in dict_dl_lists:
        dict_dl_url = dict_dl_list.a['href']
        # The dictionary name is the percent-encoded "name=" query parameter
        dict_name = pattern.findall(dict_dl_url)[0]
        # Decode it and replace characters that are illegal in file names
        dict_ch_name = urllib.parse.unquote(dict_name, 'utf-8').replace("/", "-").replace(",", "-").replace("|", "-")\
            .replace("\\", "-").replace("'", "-")
        dict_dl_dict[dict_ch_name] = dict_dl_url
    return dict_dl_dict
```
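The `name=` extraction and file-name sanitization can be exercised on their own; the download URL below is a fabricated example with a percent-encoded Chinese name, not a real Sogou link:

```python
import re
import urllib.parse

# Fabricated download URL with a percent-encoded "name=" parameter
dl_url = ("https://pinyin.sogou.com/d/dict/download_cell.php"
          "?id=15117&name=%E4%B8%AD%E9%86%AB/%E9%87%9D%E7%81%B8")

# Grab everything after "name=" (same regex as in get_download_list)
name = re.compile(r'name=(.*)').findall(dl_url)[0]

# Decode the percent-encoding, then replace characters that are
# invalid or awkward in file names with "-"
ch_name = urllib.parse.unquote(name, 'utf-8')
for bad in "/,|\\'":
    ch_name = ch_name.replace(bad, '-')
print(ch_name)  # 中醫-針灸
```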
```python
def download_dict(dl_url, path):
    """Download one .scel file to path."""
    res = requests.get(dl_url, timeout=5)
    with open(path, "wb") as fw:
        fw.write(res.content)
```
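`download_dict` makes a single attempt with a 5-second timeout, so one transient network error aborts the whole run. A minimal retry sketch, not part of the original script: the `fetch` callable and the attempt count are illustrative choices, and `fetch` is injected so the retry logic itself needs no network:

```python
def download_with_retry(fetch, path, attempts=3):
    """Call fetch() up to `attempts` times and write its bytes to path.

    fetch is any zero-argument callable returning the file content,
    e.g. lambda: requests.get(dl_url, timeout=5).content
    """
    last_err = None
    for _ in range(attempts):
        try:
            data = fetch()
        except Exception as err:   # e.g. a requests timeout
            last_err = err
            continue
        with open(path, "wb") as fw:
            fw.write(data)
        return True
    # All attempts failed; surface the last error
    raise last_err
```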
```python
def get_html(res):
    """Fetch a page and return its HTML text."""
    r = requests.get(res)
    return r.text
```
```python
if __name__ == "__main__":
    res = 'https://pinyin.sogou.com/dict/cate/index/132/default/'  # top-level category URL
    content = get_html(res)
    address = get_cate_list(content)
    downloadlist = []
    for ad in tqdm(address):
        print("Get {} ".format(ad))
        res = address[ad]       # subcategory URL
        c = get_html(res)       # subcategory page HTML
        pages = get_page(c)     # number of pages in the subcategory
        for i in range(pages):
            if i + 1 == 1:
                d = get_download_list(c)
            else:
                # Fetch page i+1 of the subcategory (fixed: the page must be
                # fetched; the original concatenated HTML with the URL suffix)
                d = get_download_list(get_html(res + '/default/' + str(i + 1)))
            downloadlist.append(d)
    print(downloadlist)  # download links of all word banks
    print("Downloading...")
    s = 0  # running count of word banks
    scel_path = "Your path\\scel_bank"
    for j in range(len(downloadlist)):
        for sub_d in tqdm(downloadlist[j]):
            s = s + 1
            print(s)
            # Fixed: join the directory and file name with a separator
            download_dict(downloadlist[j][sub_d], scel_path + "\\" + sub_d + '.scel')
```
- Batch-convert the files in `scel_bank` to `.txt` format and save them in the `txt_bank` folder.

I won't paste all of that code here; for details see https://github.com/complicatedlee/Chinese-medical-words-bank
Summary
This was my first little crawler, written two months ago; recording it here for posterity. Python really is magical.