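This is the second project I have worked on, and it went fairly smoothly. It was also a process of continuous improvement that went through three versions:

The first version is a plain crawl: for a given agricultural product keyword, it fetches all of that product's records. My network connection was poor, so the Python script eventually raised an error. Each product has more than eight hundred pages, so starting over would waste a lot of time with no guarantee that it would not fail again. That led to the second version.

The second version crawls a chosen page range for a product. It makes up for the first version's failures by resuming from the page where the previous run stopped.

The third version is fully automated. In the second version the browser window kept popping up and getting in the way of my other work, so this version hides the browser. It also types a random captcha, which makes the whole run automatic.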

URL: http://nc.mofcom.gov.cn/channel/jghq2017/price_list.shtml

Contents

1. Preliminary work

2. Version 1: plain crawl

3. Version 2: page-range crawl

4. Version 3: fully automated crawl

4.1 Three problems in full automation

4.2 Full code

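1. Preliminary work

1.1 Analyzing the page

The page can be thought of as a simple query interface: you have to enter a captcha and click search before any data is returned. After a brief analysis I could not find a data API behind it, so I decided to scrape it with selenium.

1.2 Analyzing the captcha

On many pages with captchas, if the captcha is needed only a few times you can simply type it by hand; if it keeps appearing, you need to recognize it automatically. The page crawled here asks for very few captchas, only one per product, so manual entry would be enough. The key point, though, is that in my testing the captcha on this page has no real effect: any digits are accepted. That paves the way for my third version!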

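1.3 Analyzing the URL

Although the URL matters little for the selenium crawl later on, this site's URL carries a lot of information:

The default date range is only three months; to see more records, change the dates yourself.
During the crawl only the page number changes; all other parameters stay the same.
Note: you cannot put a start page other than 1 directly in the URL, because after the captcha is entered the data always starts from page 1.
When using selenium, it is best not to include the captcha or the page number in the URL.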
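
The points above can be sketched in code. This is a minimal example that rebuilds the query URL used in the code later in this article from its parts, so the date range can be changed without editing the string by hand. The function name build_url is hypothetical, and treating par_craft_index and craft_index as product category codes is an assumption based on the URL alone; the default values are simply the ones that appear in the article's URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://nc.mofcom.gov.cn/channel/jghq2017/price_list.shtml"


def build_url(start_date, end_date, par_craft_index="13075", craft_index="15641"):
    """Return the price-list URL for one product and one date range.

    No captcha or page number is added, as recommended above.
    """
    params = {
        "par_craft_index": par_craft_index,  # assumed: parent category code
        "craft_index": craft_index,          # assumed: product code
        "par_p_index": "",
        "p_index": "",
        "startTime": start_date,             # dates are written as YYYY-MM-DD
        "endTime": end_date,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_url("2019-10-04", "2020-01-02")
print(url)
```

Widening the date range is then a matter of calling build_url with different dates, for example build_url("2019-01-01", "2020-01-02").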

2. Version 1: plain crawl

2.1 Full code

from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import etree
import time
import xlwt

startTime = time.time()  # record the start time
all_page = []  # rows scraped from every page
url = 'http://nc.mofcom.gov.cn/channel/jghq2017/price_list.shtml?par_craft_index=13075&craft_index=15641&par_p_index=&p_index=&startTime=2019-10-04&endTime=2020-01-02'
chrome_option = webdriver.ChromeOptions()
# use a proxy IP (the flag is --proxy-server; replace the address with a working proxy or delete this line)
chrome_option.add_argument('--proxy-server=112.84.55.122:9999')
driver = webdriver.Chrome(options=chrome_option)  # options only take effect if passed to the driver
driver.implicitly_wait(5)  # wait up to 5 seconds whenever an element is looked up
driver.get(url)  # open the page
time.sleep(10)  # time to click the pop-up and type the captcha by hand

def next_page():
    for page in range(1, 816):  # crawl pages 1 through 815
        print("~~~~~~~~~~Crawling page %s of 815~~~~~~~~~" % page)
        if page != 1:  # page 1 is already displayed, so only later pages need a click
            # click "next page"; it is the last <a> tag in the pager
            driver.find_element(By.XPATH, '/html/body/section/div/div[1]/div[4]/a[last()]').click()
        get_html()

def get_html():
    time.sleep(1)  # give the page a moment to load (implicitly_wait does not pause the script)
    source = driver.page_source  # get the page source
    html = etree.HTML(source)  # parse the page with lxml
    spider(html)

def spider(html):
    for tr in html.xpath('/html/body/section/div/div[1]/table/tbody/tr'):
        date = tr.xpath('./td[1]/text()')  # date (renamed so it does not shadow the time module)
        if len(date) != 0:
            goods = tr.xpath('./td[2]/span/text()')  # product
            price = tr.xpath('./td[3]/span/text()')  # price
            unit = tr.xpath('./td[3]/text()')  # unit
            market = tr.xpath('./td[4]/a/text()')  # market
            link = 'http://nc.mofcom.gov.cn/' + tr.xpath('./td[4]/a/@href')[0]  # detail link
            # xpath returns lists; join each one into a plain string for Excel
            page = [''.join(v).strip() for v in (date, goods, price, unit, market)] + [link]
            all_page.append(page)
    saveData()  # save once per page so progress survives a crash

def saveData():
    book = xlwt.Workbook(encoding='utf-8')  # create the workbook
    sheet = book.add_sheet('Ginger', cell_overwrite_ok=True)  # create the sheet; cell_overwrite_ok=True allows a cell to be rewritten
    head = ['Date', 'Product', 'Price', 'Unit', 'Market', 'Link']  # header, i.e. the first row in Excel
    for h in range(len(head)):
        sheet.write(0, h, head[h])  # write the header

    for j, row in enumerate(all_page, start=1):  # data starts on the row below the header
        for k, value in enumerate(row):
            sheet.write(j, k, value)  # write each column of the row
    book.save('D:\\agricultural_products_ginger.xls')

if __name__ == '__main__':
    next_page()
    endTime = time.time()
    useTime = (endTime - startTime) / 60
    driver.quit()  # quit and close the browser
    print("This run took %s minutes in total" % useTime)

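2.2 After starting the code, click OK by hand and quickly type the captcha

The browser is now under the script's control, but it still accepts manual input. When the browser opens, the page first shows a pop-up, and the captcha can only be entered after you click it away. Any four digits will do as the captcha. If you are worried about not being fast enough, lengthen the pause in the code so that the script runs correctly!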

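2.3 Errors

If the code fails in your environment, check whether chromedriver.exe is set up on your computer. Installing selenium alone is not enough; the browser driver must be configured as well, and for Chrome you need to download the version that matches your browser. If you have not done this, see my other post,
"Installation problems for python selenium beginners", which describes a method that references the driver directly without configuring the environment.

If pip is already set up in your environment, the short version is:

Download the chromedriver.exe that matches your browser from http://npm.taobao.org/mirrors/chromedriver/ (I use Chrome).
Copy the downloaded chromedriver.exe into the Scripts folder of your Python installation; that completes the configuration.
If necessary, restart the computer so the environment variables are reloaded.
Run result: [screenshot]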
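
To check the chromedriver configuration described in section 2.3 before launching a long crawl, a small sketch like the following can help. It only uses the standard library; the helper name find_chromedriver is hypothetical, and it assumes the driver was placed in a folder that is on PATH, such as Python's Scripts folder.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # On Windows, shutil.which also tries the usual extensions such as .exe.
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on PATH; copy it into Python's Scripts folder")
else:
    print("chromedriver found at", path)
```

If the helper returns None even though the file was copied, restarting the terminal or the computer, as suggested above, is worth trying first.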

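3. Version 2: page-range crawl

Version 1 fails when the network is poor, and re-running it from the beginning is not practical. The best approach is to carry on from where the previous run stopped, so the crawl has to start from the page that failed.

3.1 Analyzing the page

The page offers a quick jump to a given page number, so this feature can be used to crawl a chosen range of pages.

The steps are:

Locate the input box node
Type the page number
Click OK
Click next page

So the page number I enter is first reduced by one, which lands on the page before the one I want. The earlier functions can then be reused: click next page to reach the target page, fetch the source, move on to the next page, and so on.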
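
The "jump to the previous page, then click next" idea can be sketched without a browser. The helper below, plan_crawl, is hypothetical and not part of the original code; it only computes which number to type into the jump box and which pages will be scraped. It also treats page 1 as a special case, on the assumption that there is no page 0 to jump to.

```python
def plan_crawl(start_page, end_page):
    """Return (jump_to, pages) for a page-range crawl.

    jump_to is the number to type into the jump box, or None when no jump
    is needed; pages is the list of page numbers that will be scraped.
    """
    if start_page < 1 or end_page < start_page:
        raise ValueError("need 1 <= start_page <= end_page")
    # Jump to the page before the start so that one click on "next page"
    # lands on start_page. Page 1 is already shown, so it needs no jump.
    jump_to = start_page - 1 if start_page > 1 else None
    pages = list(range(start_page, end_page + 1))
    return jump_to, pages


print(plan_crawl(120, 123))  # (119, [120, 121, 122, 123])
print(plan_crawl(1, 3))      # (None, [1, 2, 3])
```

If a run fails on page 120, calling plan_crawl(120, 815) gives the jump target and the remaining pages for the next run.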

3.2 Full code

from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import etree
import time
import xlwt

startTime = time.time()
all_page = []
url = 'http://nc.mofcom.gov.cn/channel/jghq2017/price_list.shtml?par_craft_index=13075&craft_index=15641&par_p_index=&p_index=&startTime=2019-10-04&endTime=2020-01-02'
chrome_option = webdriver.ChromeOptions()
# use a proxy IP (the flag is --proxy-server; replace the address with a working proxy or delete this line)
chrome_option.add_argument('--proxy-server=112.84.55.122:9999')
driver = webdriver.Chrome(options=chrome_option)  # options only take effect if passed to the driver
driver.implicitly_wait(5)  # wait up to 5 seconds whenever an element is looked up
driver.get(url)  # open the page
time.sleep(5)  # time to click the pop-up and type the captcha by hand
gotopage = int(input("Enter the start page: "))
gotopage = gotopage - 1  # go to the previous page first, so that clicking next page lands on the start page
endpage = int(input("Enter the end page: "))

def choose_page():
    if gotopage >= 1:  # starting from page 1 needs no jump
        search = driver.find_element(By.XPATH, '//*[@id="gotopage"]')  # locate the jump input box
        search.clear()  # clear the box
        search.send_keys(str(gotopage))  # type the page number
        driver.find_element(By.XPATH, '/html/body/section/div/div[1]/div[4]/input[2]').click()  # click OK
    next_page()

def next_page():
    for page in range(gotopage + 1, endpage + 1):
        print("~~~~~~~~~~Crawling page %s of %s~~~~~~~~~" % (page, endpage))
        if page != 1:  # page 1 is already displayed, so only later pages need a click
            driver.find_element(By.XPATH, '/html/body/section/div/div[1]/div[4]/a[last()]').click()
        get_html()

def get_html():
    time.sleep(1)  # give the page a moment to load (implicitly_wait does not pause the script)
    source = driver.page_source  # get the page source
    html = etree.HTML(source)
    spider(html)

def spider(html):
    for tr in html.xpath('/html/body/section/div/div[1]/table/tbody/tr'):
        date = tr.xpath('./td[1]/text()')  # renamed so it does not shadow the time module
        if len(date) != 0:
            goods = tr.xpath('./td[2]/span/text()')
            price = tr.xpath('./td[3]/span/text()')
            unit = tr.xpath('./td[3]/text()')
            market = tr.xpath('./td[4]/a/text()')
            link = 'http://nc.mofcom.gov.cn/' + tr.xpath('./td[4]/a/@href')[0]
            # xpath returns lists; join each one into a plain string for Excel
            page = [''.join(v).strip() for v in (date, goods, price, unit, market)] + [link]
            all_page.append(page)
    saveData()  # save once per page so progress survives a crash

def saveData():
    book = xlwt.Workbook(encoding='utf-8')  # create the workbook
    sheet = book.add_sheet('Ginger', cell_overwrite_ok=True)  # create the sheet; cell_overwrite_ok=True allows a cell to be rewritten
    head = ['Date', 'Product', 'Price', 'Unit', 'Market', 'Link']  # header, i.e. the first row in Excel
    for h in range(len(head)):
        sheet.write(0, h, head[h])  # write the header

    for j, row in enumerate(all_page, start=1):  # data starts on the row below the header
        for k, value in enumerate(row):
            sheet.write(j, k, value)  # write each column of the row
    book.save('D:\\agricultural_products_ginger.xls')

if __name__ == '__main__':
    choose_page()
    endTime = time.time()
    useTime = (endTime - startTime) / 60
    driver.quit()  # quit and close the browser
    print("This run took %s minutes in total" % useTime)

Run result: [screenshot]

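4. Version 3: fully automated crawl

The browser window that pops up keeps interfering with my other work, so I want to hide it and carry on with other things while the code runs. That raises the following problems:

4.1 Three problems in full automation

4.1.1 Clicking the pop-up automatically

selenium provides driver.switch_to.alert (the older switch_to_alert() method is deprecated) to capture a pop-up dialog; it works for alert, confirm and prompt dialogs.

driver.switch_to.alert -- locate the pop-up dialog
text -- get the dialog's text (a property, not a method)
accept() -- equivalent to clicking "OK"
dismiss() -- equivalent to clicking "Cancel"
send_keys() -- type a value (alert and confirm dialogs have no input field, so this only works with prompt)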

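1. Handling an alert dialog

# get the alert dialog
dig_alert = driver.switch_to.alert

# print the text of the alert
print(dig_alert.text)

# an alert is a warning dialog, so all we can do is accept it
dig_alert.accept()

2. Handling a confirm dialog

# get the confirm dialog
dig_confirm = driver.switch_to.alert

# print the text of the dialog
print(dig_confirm.text)

# click the "OK" button
dig_confirm.accept()

# or click the "Cancel" button instead (use one or the other; the dialog is gone after the first call)
# dig_confirm.dismiss()

3. Handling a prompt dialog

# get the prompt dialog
dig_prompt = driver.switch_to.alert

# print the text of the dialog
print(dig_prompt.text)

# type text into the dialog
dig_prompt.send_keys("Loading")

# click the "OK" button to submit the input
dig_prompt.accept()

The pop-up on this page is of the first kind: it only has an OK button!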

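4.1.2 Typing a random captcha automatically

Generate a random four-digit captcha:

import random
random.randint(1000, 9999)  # 1000 <= random number <= 9999 (randrange(1000, 9999) would never return 9999)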
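
As a small runnable sketch of this step: the standard library's random.randrange excludes its upper bound, while random.randint includes both ends, so randint is the simpler way to cover every four-digit number. The helper name random_captcha is hypothetical; it returns a string because the value is later typed into an input box.

```python
import random


def random_captcha(digits=4):
    """Return a random number with exactly `digits` digits, as a string."""
    low = 10 ** (digits - 1)   # 1000 when digits is 4
    high = 10 ** digits - 1    # 9999 when digits is 4
    return str(random.randint(low, high))  # randint includes both ends


code = random_captcha()
print(code)
```

This only works here because, as noted in section 1.2, the site accepts any digits as the captcha.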

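4.1.3 Quitting the browser

Because the browser is hidden, it has to be closed explicitly once the code has finished running:

driver.quit()  # quit and close the browser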
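
With a hidden browser, a crash in the middle of a crawl can leave the Chrome process running unnoticed. One way to guard against that, shown here as a sketch rather than as part of the original code, is a small context manager that always calls quit(). The names quitting and FakeDriver are hypothetical; FakeDriver stands in for webdriver.Chrome so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        driver.quit()  # runs on normal exit and when an exception is raised


class FakeDriver:
    """Stand-in for webdriver.Chrome so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with quitting(fake) as d:
        raise RuntimeError("simulated network failure in the middle of a crawl")
except RuntimeError:
    pass
print(fake.closed)  # True
```

In the real script the same pattern would wrap the crawl, with a webdriver.Chrome instance passed in place of FakeDriver.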

4.2 Full code

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import etree
import time
import xlwt
import random

startTime = time.time()  # record the start time
all_page = []
url = 'http://nc.mofcom.gov.cn/channel/jghq2017/price_list.shtml?par_craft_index=13075&craft_index=15641&par_p_index=&p_index=&startTime=2019-10-04&endTime=2020-01-02'
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')  # the three lines above stop Chrome from opening a window (headless crawling)
# use a proxy IP (the flag is --proxy-server; replace the address with a working proxy or delete this line)
chrome_options.add_argument('--proxy-server=112.84.55.122:9999')
driver = webdriver.Chrome(options=chrome_options)
driver.implicitly_wait(6)  # wait up to 6 seconds whenever an element is looked up

driver.get(url)  # open the page

WebDriverWait(driver, 10).until(EC.alert_is_present())  # wait until the pop-up has appeared
alert = driver.switch_to.alert  # switch to the alert
# print('alert text : ' + alert.text)  # print the alert text
alert.accept()  # click the alert's OK button

gotopage = int(input("Enter the start page: "))
gotopage = gotopage - 1  # go to the previous page first, so that clicking next page lands on the start page
endpage = int(input("Enter the end page: "))
randomNumber = random.randint(1000, 9999)  # random four-digit captcha
def choose_page():
    # type the captcha automatically
    validity = driver.find_element(By.XPATH, '//*[@id="formJghqIndex"]/div/input[1]')
    validity.clear()  # clear the box
    validity.send_keys(str(randomNumber))  # type the random captcha
    driver.find_element(By.XPATH, '//*[@id="formJghqIndex"]/div/input[2]').click()  # click search
    time.sleep(3)  # wait three seconds for the results to load
    # choose the start page
    if gotopage >= 1:  # starting from page 1 needs no jump
        search = driver.find_element(By.XPATH, '//*[@id="gotopage"]')  # locate the jump input box
        search.clear()  # clear the box
        search.send_keys(str(gotopage))  # type the page number
        driver.find_element(By.XPATH, '/html/body/section/div/div[1]/div[4]/input[2]').click()  # click OK
        time.sleep(4)
    next_page()

# click next page
def next_page():
    for page in range(gotopage + 1, endpage + 1):
        print("~~~~~~~~~~Crawling page %s of %s~~~~~~~~~" % (page, endpage))
        if page != 1:  # page 1 is already displayed, so only later pages need a click
            driver.find_element(By.XPATH, '/html/body/section/div/div[1]/div[4]/a[last()]').click()
        get_html()

# get the page source and parse it
def get_html():
    time.sleep(1)  # give the page a moment to load (implicitly_wait does not pause the script)
    source = driver.page_source  # get the page source
    html = etree.HTML(source)  # parse the page with lxml
    spider(html)

# extract the data
def spider(html):
    for tr in html.xpath('/html/body/section/div/div[1]/table/tbody/tr'):
        date = tr.xpath('./td[1]/text()')  # renamed so it does not shadow the time module
        if len(date) != 0:
            goods = tr.xpath('./td[2]/span/text()')
            price = tr.xpath('./td[3]/span/text()')
            unit = tr.xpath('./td[3]/text()')
            market = tr.xpath('./td[4]/a/text()')
            link = 'http://nc.mofcom.gov.cn/' + tr.xpath('./td[4]/a/@href')[0]
            # xpath returns lists; join each one into a plain string for Excel
            page = [''.join(v).strip() for v in (date, goods, price, unit, market)] + [link]
            all_page.append(page)
    saveData()  # save once per page so progress survives a crash

def saveData():
    book = xlwt.Workbook(encoding='utf-8')  # create the workbook
    sheet = book.add_sheet('Ginger', cell_overwrite_ok=True)  # create the sheet; cell_overwrite_ok=True allows a cell to be rewritten
    head = ['Date', 'Product', 'Price', 'Unit', 'Market', 'Link']  # header, i.e. the first row in Excel
    for h in range(len(head)):
        sheet.write(0, h, head[h])  # write the header

    for j, row in enumerate(all_page, start=1):  # data starts on the row below the header
        for k, value in enumerate(row):
            sheet.write(j, k, value)  # write each column of the row
    book.save('D:\\agricultural_products_ginger.xls')

if __name__ == '__main__':
    choose_page()
    endTime = time.time()  # record the end time
    useTime = (endTime - startTime) / 60
    driver.quit()  # quit and close the browser
    print("This run took %s minutes in total" % useTime)

Output in the editor: [screenshot]

Excel result: [screenshot]

