[Original] [Web Scraping Study, Part 2] Scraping Nuclide Data from NNDC
Goal: scrape the neutron and proton separation energies, S(n) and S(p), for all nuclides from the NNDC website.
The steps are as follows:
1) Scrape the names and mass numbers of all nuclides and write them to nucleus.txt.
2) Remove duplicate lines from nucleus.txt, producing nucleus_new.txt.
3) Read the nuclide entries in nucleus_new.txt line by line, construct a URL request for each one, scrape its S(n) and S(p) data from the NNDC website, and write the results to nucleusSnSp.csv.
Step 1
First, take a look at the NNDC search page:
https://www.nndc.bnl.gov/nudat2/indx_sigma.jsp
This brings up the page shown above. Clicking the Search button on the page yields:
The number in the upper-left corner of each element is its mass number. Inspecting the circled elements shows that their information is fairly easy to scrape. The code is as follows:
from selenium import webdriver

co = webdriver.ChromeOptions()
co.headless = False  # whether to show the browser window
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
url = 'https://www.nndc.bnl.gov/nudat2/indx_sigma.jsp'
browser.get(url)
form = browser.find_element_by_tag_name('form')
p = form.find_element_by_css_selector('p:nth-child(2)')
# simulate clicking the Search button
p.find_element_by_tag_name('input').click()
# wait up to 30 seconds for the page to finish loading
browser.implicitly_wait(30)
tbody = browser.find_element_by_tag_name('tbody')
trs = browser.find_elements_by_tag_name('tr')
with open('nucleus.txt', 'w', encoding='utf-8') as f:
    for i in range(len(trs)):
        if i == 0:
            continue
        elif i % 2 == 1:
            continue
        else:
            nuc_td_num = trs[i].find_element_by_css_selector('td:first-child')
            nuc_td_name = trs[i].find_element_by_css_selector('td:nth-child(2)')
            nuc_info = nuc_td_num.text + '\n' + nuc_td_name.text
            nuc_result = nuc_info.split('\n')
            # skip invalid entries whose mass-number field contains 'm' (isomers)
            if 'm' in nuc_result[0]:
                print('error')
            else:
                # write the nuclide's mass number and name to the txt file
                f.write(nuc_result[0] + nuc_result[2] + '\n')
                print(nuc_result[0] + nuc_result[2])
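The mass-number filter in the loop above (skipping entries that contain 'm', i.e. isomeric states) can be pulled out into a small helper for clarity. `is_ground_state` is a name introduced here for illustration, not part of the original script:

```python
def is_ground_state(mass_field: str) -> bool:
    """Return True when the mass-number field has no 'm' marker,
    i.e. the entry is a ground state rather than an isomer."""
    return 'm' not in mass_field

# ground states pass; isomeric entries such as '24m1' are rejected
print(is_ground_state('24'))    # True
print(is_ground_state('24m1'))  # False
```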
For installation and a brief introduction to webdriver, see the previous post: [Original] [Web Scraping Study, Part 1] Scraping Fund Return Rankings from the Tiantian Fund website.
After the scrape finishes, nucleus.txt contains the following:
Step 2
As the screenshot shows, the txt file contains duplicate lines, which need to be removed. The code is as follows:
readPath = 'nucleus.txt'
writePath = 'nucleus_new.txt'
lines_seen = set()
# open in 'w' (not 'a+') mode so that reruns overwrite rather than append;
# the with-statement ensures both files are closed
with open(writePath, 'w', encoding='utf-8') as outfile, \
     open(readPath, 'r', encoding='utf-8') as f:
    for line in f:
        if line not in lines_seen:
            outfile.write(line)
            lines_seen.add(line)
With deduplication done, we have nucleus_new.txt.
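As a side note, since Python 3.7 dictionaries preserve insertion order, so the same order-preserving deduplication can be written in one line with `dict.fromkeys`. A sketch on sample data, not the original script:

```python
# sample lines as they might be read from nucleus.txt
lines = ['2H\n', '3H\n', '2H\n', '20N\n', '3H\n']

# dict keys are unique and keep first-seen order, so this dedups in order
unique_lines = list(dict.fromkeys(lines))
print(unique_lines)  # ['2H\n', '3H\n', '20N\n']
```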
Step 3
The S(n) and S(p) data for the nuclide 2H are available at the following URL:
https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=2H&unc=nds
and for 20N at:
https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=20N&unc=nds
Comparing these URLs, only the nucleus parameter changes, and its value is exactly one line of the nucleus_new.txt file produced in step 2. By varying this parameter we can construct the URL for every nuclide's detail page. The page for 2H looks like this:
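The URL construction described above can be expressed as a small helper; `build_nuclide_url` is a name introduced here for illustration:

```python
BASE_URL = 'https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp'

def build_nuclide_url(nucleus: str) -> str:
    """Build the NuDat classic-dataset URL for one nuclide, stripping
    the trailing newline that comes from reading nucleus_new.txt."""
    return BASE_URL + '?nucleus=' + nucleus.strip() + '&unc=nds'

print(build_nuclide_url('2H\n'))
# https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=2H&unc=nds
```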
Inspecting the circled elements shows that S(n) and S(p) are also straightforward to scrape. The code is as follows:
import csv
from selenium import webdriver

# CSV header row
header = ['nuclide', 'S(n)(keV)', 'S(p)(keV)']

co = webdriver.ChromeOptions()
co.headless = False  # whether to show the browser window
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
url = 'https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?unc=nds'
blank = 'toBeDone'  # placeholder for pages that failed to load

with open('nucleusSnSp.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for nuc in open('nucleus_new.txt', 'r', encoding='utf-8'):
        url_new = '&nucleus=' + nuc
        url_new = url_new[:-1]  # drop the trailing newline
        total_url = url + url_new
        browser.set_page_load_timeout(60)
        try:
            browser.get(total_url)
        except Exception:
            print('!!!time out after 60 seconds when loading page')
            writer.writerow([nuc[:-1], blank, blank])
            continue
        try:
            body = browser.find_element_by_tag_name('body')
            table = body.find_element_by_tag_name('table')
        except Exception:
            print('empty dataset')
        else:
            tbody = table.find_element_by_tag_name('tbody')
            tr = tbody.find_element_by_tag_name('tr')
            tds = tr.find_elements_by_tag_name('td')
            Sn_result = ''
            Sp_result = ''
            for td in tds:
                if 'S(n)' in td.text:
                    # cell text looks like 'S(n)= 2224.5 keV';
                    # slice off the leading 'S(n)=' to keep the value
                    Sn_result = td.text.split('keV')[0][5:]
                if 'S(p)' in td.text:
                    # same pattern for the proton separation energy
                    Sp_result = td.text.split('keV')[0][5:]
            writer.writerow([nuc[:-1], Sn_result, Sp_result])
            print(nuc[:-1] + ',' + Sn_result + ',' + Sp_result)
Some nuclides on NNDC have no data at all (e.g. 24P, 27O, 23C), so the "empty dataset" case needs exception handling. Likewise, when the page fails to load for network reasons, the subsequent parsing would fail, so that case is handled as well.
After the scrape finishes, nucleusSnSp.csv looks like this:
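One fragile spot in the parsing above is the slice `split('keV')[0][5:]`, which assumes the cell text is laid out exactly like `S(n)= 2224.5 keV`. A regular expression tolerates variable spacing; the sketch below rests on that same format assumption and is not part of the original script:

```python
import re

def extract_separation_energy(cell_text: str, kind: str) -> str:
    """Pull the numeric value (in keV) following 'S(n)=' or 'S(p)='
    out of a table cell's text; return '' when the pattern is absent."""
    pattern = re.escape('S(%s)' % kind) + r'\s*=\s*([0-9.]+)\s*keV'
    m = re.search(pattern, cell_text)
    return m.group(1) if m else ''

print(extract_separation_energy('S(n)= 2224.5 keV', 'n'))  # 2224.5
print(extract_separation_energy('no data here', 'p'))      # (empty string)
```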