[Original] [Web Scraping Study, Part 2] Scraping Nuclide Data from NNDC
Goal: scrape the neutron and proton separation energies, S(n) and S(p), for all nuclides from the NNDC website.
The steps are as follows:
1) Scrape the names and mass numbers of all nuclides and write them to nucleus.txt.
2) Remove duplicate lines from nucleus.txt, producing nucleus_new.txt.
3) Read the nuclide entries in nucleus_new.txt line by line, construct a URL request for each one, scrape its S(n) and S(p) data from the NNDC website, and write the results to nucleusSnSp.csv.
Step 1
First, take a look at the NNDC search page:
https://www.nndc.bnl.gov/nudat2/indx_sigma.jsp
This brings up the page shown above. Clicking the Search button on the page yields:
The number in the upper-left corner of each element is its mass number. Inspecting the circled elements shows that their information is fairly easy to scrape. The code is as follows:
from selenium import webdriver

co = webdriver.ChromeOptions()
co.headless = False  # whether to show the browser window
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
url = 'https://www.nndc.bnl.gov/nudat2/indx_sigma.jsp'
browser.get(url)
form = browser.find_element_by_tag_name('form')
p = form.find_element_by_css_selector('p:nth-child(2)')
# simulate clicking the Search button
p.find_element_by_tag_name('input').click()
# wait up to 30 seconds for the page to finish loading
browser.implicitly_wait(30)
tbody = browser.find_element_by_tag_name('tbody')
trs = browser.find_elements_by_tag_name('tr')
with open('nucleus.txt', 'w', encoding='utf-8') as f:
    for i in range(len(trs)):
        if i == 0:
            continue
        elif i % 2 == 1:
            continue
        else:
            nuc_td_num = trs[i].find_element_by_css_selector('td:first-child')
            nuc_td_name = trs[i].find_element_by_css_selector('td:nth-child(2)')
            nuc_info = nuc_td_num.text + '\n' + nuc_td_name.text
            nuc_result = nuc_info.split('\n')
            # skip invalid entries whose mass-number field contains 'm' (isomers)
            if 'm' in nuc_result[0]:
                print('error')
            else:
                # write the nuclide's mass number and name to the txt file
                f.write(nuc_result[0] + nuc_result[2] + '\n')
                print(nuc_result[0] + nuc_result[2])
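The mass-number filter in the loop above (skipping entries that contain 'm', i.e. isomeric states) can be pulled out into a small helper for clarity. `is_ground_state` is a name introduced here for illustration, not part of the original script:

```python
def is_ground_state(mass_field: str) -> bool:
    """Return True when the mass-number field has no 'm' marker,
    i.e. the entry is a ground state rather than an isomer."""
    return 'm' not in mass_field

# ground states pass; isomeric entries such as '24m1' are rejected
print(is_ground_state('24'))    # True
print(is_ground_state('24m1'))  # False
```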
For installation and a brief introduction to webdriver, see the previous post: [Original] [Web Scraping Study, Part 1] Scraping Fund Return Rankings from the Tiantian Fund website.
After the scrape finishes, nucleus.txt contains the following:
Step 2
As the screenshot shows, the txt file contains duplicate lines, which need to be removed. The code is as follows:
readPath = 'nucleus.txt'
writePath = 'nucleus_new.txt'
lines_seen = set()
# open in 'w' (not 'a+') mode so that reruns overwrite rather than append;
# the with-statement ensures both files are closed
with open(writePath, 'w', encoding='utf-8') as outfile, \
     open(readPath, 'r', encoding='utf-8') as f:
    for line in f:
        if line not in lines_seen:
            outfile.write(line)
            lines_seen.add(line)
With deduplication done, we have nucleus_new.txt.
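As a side note, since Python 3.7 dictionaries preserve insertion order, so the same order-preserving deduplication can be written in one line with `dict.fromkeys`. A sketch on sample data, not the original script:

```python
# sample lines as they might be read from nucleus.txt
lines = ['2H\n', '3H\n', '2H\n', '20N\n', '3H\n']

# dict keys are unique and keep first-seen order, so this dedups in order
unique_lines = list(dict.fromkeys(lines))
print(unique_lines)  # ['2H\n', '3H\n', '20N\n']
```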
Step 3
The S(n) and S(p) data for the nuclide 2H are available at the following URL:
https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=2H&unc=nds
and for 20N at:
https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=20N&unc=nds
Comparing these URLs, only the nucleus parameter changes, and its value is exactly one line of the nucleus_new.txt file produced in step 2. By varying this parameter we can construct the URL for every nuclide's detail page. The page for 2H looks like this:
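The URL construction described above can be expressed as a small helper; `build_nuclide_url` is a name introduced here for illustration:

```python
BASE_URL = 'https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp'

def build_nuclide_url(nucleus: str) -> str:
    """Build the NuDat classic-dataset URL for one nuclide, stripping
    the trailing newline that comes from reading nucleus_new.txt."""
    return BASE_URL + '?nucleus=' + nucleus.strip() + '&unc=nds'

print(build_nuclide_url('2H\n'))
# https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=2H&unc=nds
```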
Inspecting the circled elements shows that S(n) and S(p) are also straightforward to scrape. The code is as follows:
import csv
from selenium import webdriver

# CSV header row
header = ['nuclide', 'S(n)(keV)', 'S(p)(keV)']

co = webdriver.ChromeOptions()
co.headless = False  # whether to show the browser window
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
url = 'https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?unc=nds'
blank = 'toBeDone'  # placeholder for pages that failed to load

with open('nucleusSnSp.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for nuc in open('nucleus_new.txt', 'r', encoding='utf-8'):
        url_new = '&nucleus=' + nuc
        url_new = url_new[:-1]  # drop the trailing newline
        total_url = url + url_new
        browser.set_page_load_timeout(60)
        try:
            browser.get(total_url)
        except Exception:
            print('!!!time out after 60 seconds when loading page')
            writer.writerow([nuc[:-1], blank, blank])
            continue
        try:
            body = browser.find_element_by_tag_name('body')
            table = body.find_element_by_tag_name('table')
        except Exception:
            print('empty dataset')
        else:
            tbody = table.find_element_by_tag_name('tbody')
            tr = tbody.find_element_by_tag_name('tr')
            tds = tr.find_elements_by_tag_name('td')
            Sn_result = ''
            Sp_result = ''
            for td in tds:
                if 'S(n)' in td.text:
                    # cell text looks like 'S(n)= 2224.5 keV';
                    # slice off the leading 'S(n)=' to keep the value
                    Sn_result = td.text.split('keV')[0][5:]
                if 'S(p)' in td.text:
                    # same pattern for the proton separation energy
                    Sp_result = td.text.split('keV')[0][5:]
            writer.writerow([nuc[:-1], Sn_result, Sp_result])
            print(nuc[:-1] + ',' + Sn_result + ',' + Sp_result)
Some nuclides on NNDC have no data at all (e.g. 24P, 27O, 23C), so the "empty dataset" case needs exception handling. Likewise, when the page fails to load for network reasons, the subsequent parsing would fail, so that case is handled as well.
After the scrape finishes, nucleusSnSp.csv looks like this:
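One fragile spot in the parsing above is the slice `split('keV')[0][5:]`, which assumes the cell text is laid out exactly like `S(n)= 2224.5 keV`. A regular expression tolerates variable spacing; the sketch below rests on that same format assumption and is not part of the original script:

```python
import re

def extract_separation_energy(cell_text: str, kind: str) -> str:
    """Pull the numeric value (in keV) following 'S(n)=' or 'S(p)='
    out of a table cell's text; return '' when the pattern is absent."""
    pattern = re.escape('S(%s)' % kind) + r'\s*=\s*([0-9.]+)\s*keV'
    m = re.search(pattern, cell_text)
    return m.group(1) if m else ''

print(extract_separation_energy('S(n)= 2224.5 keV', 'n'))  # 2224.5
print(extract_separation_energy('no data here', 'p'))      # (empty string)
```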