[Original][Crawler Study, Part 1] Scraping the Fund Return Rankings from Tiantian Fund (天天基金網)

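I have been learning web scraping recently and tried out a few simple demos; here are some notes.

First, open the fund return ranking page on Tiantian Fund to get familiar with the page we are going to scrape. The URL is: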

http://fund.eastmoney.com/trade/hh.html?spm=001.1.swh#zwf_,sc_1n,st_desc

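The goal is to scrape the code, name, daily growth rate, and one-week and one-month growth rates of every fund on that page (the columns marked with a red box in the original screenshot) and save them to an Excel file. The approach is as follows.

(1) Set up the format of the Excel file.

1) Import the xlwt library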

import xlwt

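2) Add a sheet named FundSheet and write the header row with worksheet.write(row, col, label), where row is the row index in the sheet, col is the column index, and label is the cell content. Both indices start at 0.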

workbook = xlwt.Workbook(encoding='utf-8')
worksheet = workbook.add_sheet('FundSheet')
worksheet.write(0, 0, label='基金代碼')  # fund code
worksheet.write(0, 1, label='基金名稱')  # fund name
worksheet.write(0, 2, label='日增長率')  # daily growth rate
worksheet.write(0, 3, label='周增長率')  # weekly growth rate
worksheet.write(0, 4, label='月增長率')  # monthly growth rate

These lines write the five column headers into the first row of the sheet.
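The five write calls can also be expressed as a loop over a list of headers. The sketch below is only an illustration: write_header and RecordingSheet are hypothetical names, and RecordingSheet is a stand-in that merely records write() calls so the example runs without xlwt installed.

```python
class RecordingSheet:
    """Hypothetical stand-in for an xlwt worksheet; it only records write() calls."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, label=''):
        self.cells[(row, col)] = label


def write_header(worksheet, headers, row=0):
    """Write each header into consecutive columns of the given row."""
    for col, title in enumerate(headers):
        worksheet.write(row, col, label=title)


# Same five headers as in the snippet above.
HEADERS = ['基金代碼', '基金名稱', '日增長率', '周增長率', '月增長率']

sheet = RecordingSheet()
write_header(sheet, HEADERS)
```

With xlwt, the object returned by workbook.add_sheet('FundSheet') would be passed in place of RecordingSheet, since it offers the same write(row, col, label) call used here.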

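(2) Analyze the page and scrape its content.

We use Selenium as the scraping tool; it drives a real browser to request the URL. If it is not installed yet, run the following in cmd: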

pip install selenium

to install selenium, and then import webdriver in the Python file with

from selenium import webdriver

Next, download chromedriver.exe. In Chrome, enter chrome://version/ in the address bar to check the browser version.

Then download the driver that matches that version:

http://chromedriver.storage.googleapis.com/index.html

Put the downloaded chromedriver.exe in a suitable location, for example:

D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe

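Next, the statement co.headless = False makes the browser window visible for debugging; if it is set to True, no browser window is shown while the crawler runs.

Then declare the path to chromedriver.exe (the location chosen in the previous step) and the URL of the fund ranking page, and open the URL with the browser object's get method.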

co = webdriver.ChromeOptions()
# Show a browser window? False: yes; True: no
co.headless = False
# Path to chromedriver
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
# URL of the fund ranking page
url = 'http://fund.eastmoney.com/trade/hh.html?spm=001.1.swh#zwf_,sc_1n,st_desc'
browser.get(url)
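The snippet above uses the Selenium 3 style of setup. In newer Selenium 4 releases the executable_path argument and the headless attribute are no longer available, so here is a minimal sketch of what the same setup might look like there. make_browser is a hypothetical helper name, and the imports sit inside the function only so the sketch can be loaded on a machine without Selenium.

```python
def make_browser(driver_path=None, headless=False):
    """Create a Chrome WebDriver using the Selenium 4 style API (sketch)."""
    # Imported lazily so that defining this function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    if headless:
        # Chrome command-line switch that hides the browser window.
        options.add_argument('--headless')

    if driver_path is None:
        # Selenium 4.6+ includes Selenium Manager, which looks up a driver itself.
        return webdriver.Chrome(options=options)
    # Otherwise point Selenium at an explicitly downloaded chromedriver.
    return webdriver.Chrome(service=Service(executable_path=driver_path), options=options)
```

Calling make_browser(driver_path=chrome_driver) and then browser.get(url) would correspond to the code above. Also note that the download index given earlier only hosts older driver versions (up to Chrome 114, as far as I know); for newer browsers, letting Selenium find the driver is probably the easier route.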

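In the browser window that opens, press F12 to open the developer tools and inspect the page.

Inspection shows that all the fund information sits in a table element with class mainTb. Each fund is described by one tr element inside the table's tbody, and the daily, weekly, and monthly growth rates are td elements inside that tr. The hierarchy is: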

<table>
    <tbody>
        <tr><!-- fund 1 -->
            <td><!-- daily growth rate of fund 1 --></td>
            <td><!-- weekly growth rate of fund 1 --></td>
            ...
        </tr>
        <tr><!-- fund 2 -->
            <td><!-- daily growth rate of fund 2 --></td>
            <td><!-- weekly growth rate of fund 2 --></td>
            ...
        </tr>
        ...
    </tbody>
</table>
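To get a feel for this hierarchy without opening a browser, the following sketch walks a small table with Python's built-in html.parser and groups the td texts by their tr. SAMPLE_HTML and every value in it are made up for illustration; only the class name mainTb and the tr/td nesting come from the page.

```python
from html.parser import HTMLParser

# Made-up sample rows that mimic the nesting described above.
SAMPLE_HTML = """
<table class="mainTb">
  <tbody>
    <tr><td>000001</td><td>Fund A</td><td>1.0000</td><td>0.50%</td><td>1.20%</td><td>3.40%</td></tr>
    <tr><td>000002</td><td>Fund B</td><td>2.0000</td><td>-0.10%</td><td>0.80%</td><td>2.10%</td></tr>
  </tbody>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every td, grouped by the tr that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per tr
        self._in_td = False   # True while the parser is inside a td

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            self._in_td = True
            self.rows[-1].append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data


collector = RowCollector()
collector.feed(SAMPLE_HTML)
for cells in collector.rows:
    # Indexes 3, 4 and 5 are the 4th, 5th and 6th td of the row.
    print(cells[0], cells[3], cells[4], cells[5])
```

This is not how the demo itself reads the page (it uses Selenium); it only shows the "one tr per fund, one td per field" structure in isolation.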

Note that the daily, weekly, and monthly growth rates are the 4th, 5th, and 6th td elements of each tr, so we pick them with a CSS selector of the form

'p:nth-child(n)'

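where p is the tag name and n is a position: the selector matches a p element that is the n-th child of its parent, counting all child elements and not only p tags. In the hierarchy above every child of a tr is a td, so the fourth td can be selected in Python with: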

day_increase = fund.find_element_by_css_selector('td:nth-child(4)')
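The position counted by :nth-child is easy to confuse with "the n-th element of that tag", which is what :nth-of-type does. The sketch below models both rules on plain lists of child tag names; nth_child, nth_of_type, and the two example rows are hypothetical and only meant to show the difference.

```python
def nth_child(children, tag, n):
    """Index matched by 'tag:nth-child(n)', or None if the n-th child is another tag."""
    if 1 <= n <= len(children) and children[n - 1] == tag:
        return n - 1
    return None


def nth_of_type(children, tag, n):
    """Index matched by 'tag:nth-of-type(n)', or None if fewer than n such tags exist."""
    seen = 0
    for index, name in enumerate(children):
        if name == tag:
            seen += 1
            if seen == n:
                return index
    return None


mixed_row = ['th', 'td', 'td', 'td']  # hypothetical row that starts with a th
fund_row = ['td'] * 6                 # a row made only of td cells, as in the hierarchy above

print(nth_child(mixed_row, 'td', 1))    # the first child is a th, so nothing matches
print(nth_of_type(mixed_row, 'td', 1))  # the first td is the second child
print(nth_child(fund_row, 'td', 4), nth_of_type(fund_row, 'td', 4))
```

For a row that contains only td cells the two rules agree, which is why td:nth-child(4) works for this table.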

Putting it all together, the crawler code is:

mainTb = browser.find_element_by_class_name('mainTb')
tbody = mainTb.find_element_by_tag_name('tbody')
funds = tbody.find_elements_by_tag_name('tr')
# Row index in the Excel file (row 0 holds the headers)
row = 1
# Column index in the Excel file
col = 0
for fund in funds:
    # The first td of the row holds the fund code
    fund_code = fund.find_element_by_tag_name('td')
    fund_name = fund.find_element_by_class_name('fname')
    a_fund_name = fund_name.find_element_by_tag_name('a')
    # Use a CSS selector to pick the n-th td
    day_increase = fund.find_element_by_css_selector('td:nth-child(4)')
    week_increase = fund.find_element_by_css_selector('td:nth-child(5)')
    month_increase = fund.find_element_by_css_selector('td:nth-child(6)')
    worksheet.write(row, col, label=fund_code.text)
    worksheet.write(row, col+1, label=a_fund_name.text)
    worksheet.write(row, col+2, label=day_increase.text)
    worksheet.write(row, col+3, label=week_increase.text)
    worksheet.write(row, col+4, label=month_increase.text)
    row += 1
workbook.save('Fund_Excel_test.xls')

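To explain: the browser's find-by-class-name method locates the table element with class mainTb; we then find the tbody under that table and, under the tbody, the collection of tr elements. Note the difference between find_elements, which returns a list of every match, and find_element, which returns only the first match.

We then go through each fund in the collection, extract its code, name, and growth rates, write them into the Excel file, and finally save the file.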
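The find_element_by_* helpers used in this post belong to the Selenium 3 API and were removed in later Selenium 4 releases. As a rough sketch of the same lookup in the newer style, the function below uses find_element(By..., ...) and adds an explicit wait. scrape_fund_rows, the timeout of 10 seconds, and the combined selectors '.mainTb tbody tr' and '.fname a' are my own assumptions, not something taken from the site's documentation.

```python
def scrape_fund_rows(browser, timeout=10):
    """Return (code, name, day, week, month) text tuples from the mainTb table (sketch)."""
    # Imported lazily so that defining this function does not require Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    row_selector = '.mainTb tbody tr'
    # Wait until at least one row is present, in case the table is filled in by JavaScript.
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, row_selector))
    )

    rows = []
    for fund in browser.find_elements(By.CSS_SELECTOR, row_selector):
        code = fund.find_element(By.TAG_NAME, 'td').text         # first td: fund code
        name = fund.find_element(By.CSS_SELECTOR, '.fname a').text
        day = fund.find_element(By.CSS_SELECTOR, 'td:nth-child(4)').text
        week = fund.find_element(By.CSS_SELECTOR, 'td:nth-child(5)').text
        month = fund.find_element(By.CSS_SELECTOR, 'td:nth-child(6)').text
        rows.append((code, name, day, week, month))
    return rows
```

Returning plain tuples keeps the scraping step separate from the Excel step: the result can be written out afterwards with worksheet.write in a loop over enumerate(rows, start=1).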

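After the crawl finishes, Fund_Excel_test.xls is generated in the project directory; opening it shows the header row followed by one row per scraped fund.

Scraping succeeded!

The complete code of the demo: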

import xlwt
from selenium import webdriver
workbook = xlwt.Workbook(encoding='utf-8')
worksheet = workbook.add_sheet('FundSheet')
worksheet.write(0, 0, label='基金代碼')  # fund code
worksheet.write(0, 1, label='基金名稱')  # fund name
worksheet.write(0, 2, label='日增長率')  # daily growth rate
worksheet.write(0, 3, label='周增長率')  # weekly growth rate
worksheet.write(0, 4, label='月增長率')  # monthly growth rate
co = webdriver.ChromeOptions()
# Show a browser window? False: yes; True: no
co.headless = False
# Path to chromedriver
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
# URL of the fund ranking page
url = 'http://fund.eastmoney.com/trade/hh.html?spm=001.1.swh#zwf_,sc_1n,st_desc'
browser.get(url)
mainTb = browser.find_element_by_class_name('mainTb')
tbody = mainTb.find_element_by_tag_name('tbody')
funds = tbody.find_elements_by_tag_name('tr')
# Row index in the Excel file (row 0 holds the headers)
row = 1
# Column index in the Excel file
col = 0
for fund in funds:
    # The first td of the row holds the fund code
    fund_code = fund.find_element_by_tag_name('td')
    fund_name = fund.find_element_by_class_name('fname')
    a_fund_name = fund_name.find_element_by_tag_name('a')
    # Use a CSS selector to pick the n-th td
    day_increase = fund.find_element_by_css_selector('td:nth-child(4)')
    week_increase = fund.find_element_by_css_selector('td:nth-child(5)')
    month_increase = fund.find_element_by_css_selector('td:nth-child(6)')
    worksheet.write(row, col, label=fund_code.text)
    worksheet.write(row, col+1, label=a_fund_name.text)
    worksheet.write(row, col+2, label=day_increase.text)
    worksheet.write(row, col+3, label=week_increase.text)
    worksheet.write(row, col+4, label=month_increase.text)
    row += 1
workbook.save('Fund_Excel_test.xls')
