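Preface: this was my first formal project with a teacher from the graduate school. The start was a bit casual, but the second round went quite well! As an undergraduate, I am lucky to be able to learn along the way while lending a hand to a teacher in my own college. My first task was handed over together with senior student 陳瑞. That first target, 《貴州農經網》, went reasonably well, but it was put on hold because the data needed to be categorised.
This time the target is 《中國農藥網》, a portal for the pesticide industry: a specialised e-commerce platform that combines news, pesticide information and trading services. My job is to collect the details of a given category of pesticide, such as name, brand, production licence, control target and pesticide registration number.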

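1. Analysing the target

At first glance this project looks a bit tricky, but it is not that hard. What sets it apart from many sites is that the product descriptions are not standardised: the details are written by the manufacturers themselves, so the only option is to match keywords with regular expressions and then pull out the values.

1.1 Approach

The site can be split into list pages (the "home pages" below) and product detail pages. Write the code for each separately, then merge the two. This makes each page type easier to analyse and speeds up development.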

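1.2 Breaking it down

I first need to collect the link of every product from the list page, then request the page source behind each link and match the relevant information.
The list page looks like this:

Here the information is displayed as a list and is very regular, so it is easy to get.
Once I have a product link from this page, I can open its detail page, as shown:

Comparing several detail pages shows that the layout follows no pattern and the tags are inconsistent, because the text is whatever the user typed in. The keywords, however, hardly change, so regular expressions can be used to match the relevant information.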

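2. Matching product information with regular expressions

2.1 Requesting the page source

This site has no anti-crawler measures, so I request the page source directly with requests, without any disguise. Getting the source is no trouble at all.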

import requests
url = "http://www.agrichem.cn/u850386/2019/02/22/ny1535604683.shtml"
html = requests.get(url).text
print(html)

After getting the page source, do not rush into extracting data. First check that the result actually contains the information we need:
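The check described above can also be done in code rather than by eye. The following is only a minimal sketch: `missing_keywords` and the `sample` string are hypothetical names introduced for illustration, and `sample` stands in for the text returned by `requests.get(url).text`.

```python
def missing_keywords(html, keywords):
    """Return the keywords that do not occur anywhere in the page source."""
    return [word for word in keywords if word not in html]


# Hypothetical sample standing in for the text returned by requests.get(url).text
sample = "<p>品牌:&nbsp諾爾特</br>毒性:&nbsp低毒</br>農藥登記證號:&nbspPD20132201</p>"

# Any keyword printed here is absent from the source and needs a closer look
print(missing_keywords(sample, ["品牌", "毒性", "登記證號", "防治對象"]))
```

An empty list means every keyword was found; a non-empty list points at the fields that need a different keyword or a look at the punctuation.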

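2.2 Analysing the page

Why analyse the page? The point is to choose a suitable method so that the extracted data is accurate, and for an irregular site like this one a careful analysis is essential. Use Ctrl+F to search the source. Sometimes the text is there but the search finds nothing. In that case, check whether the punctuation differs between its Chinese (full-width) and English (half-width) forms, or whether a space is missing. Keep the search keyword short!

2.3 Matching the page content

Regular expressions are the best fit here. As a brief introduction to how they are used, take this fragment I cut from the page: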

html = """
<div class="product-content-txt">
        <p>品牌:&nbsp諾爾特</br>成分含量:&nbsp1%－30%</br>包裝規格:&nbsp25毫升+2包</br>助劑淨重:&nbsp0.02kg</br>毒性:&nbsp低毒</br>劑型:&nbsp乳油</br>農藥成分:&nbsp烯草酮</br>農藥類型:&nbsp有機農藥</br>農藥登記證號:&nbspPD20132201</p>
         <p style="text-align:center;margin-top:50px;">
          <img src="http://tradepic.jinnong.cn/userfiles/850386/images/npriceProduct/npriceProduct/2019/02/xct8.jpg" > 
         </p>         
       </div> 
"""

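Looking at the fragment above, every value we want sits between two clear delimiters, namely ":" and "</br>". Some of the content is noise, such as &nbsp, which has to be removed. To keep the program reusable, remove it first and extract afterwards.

2.3.1 Preprocessing the page

(1) Remove the noise entity "&nbsp"
(2) Replace the English : with the Chinese ：

# str.replace returns a new string, so the result has to be stored
cleaned = html.replace("&nbsp","").replace(":","：")
print(cleaned)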

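The result:

2.3.2 Testing the match and refining the method

The text still contains \n and spaces, but they no longer get in the way of matching. For simple use of regular expressions, mastering .*? is enough to get what you want.

2.3.2.1 Testing a plain match

Pattern: .*? for the part to skip + key marker + (.*?) for the value to extract + key marker, as in the following examples.
Finding the brand: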

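import re
# "&nbsp" is stripped first; otherwise it would be captured as part of the value
re.findall('.*?品牌:(.*?)</br>.*?',html.replace("&nbsp",""))  # brand

# Result
['諾爾特']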

Finding the toxicity:

Finding the image link:

import re
re.findall('.*?src="(.*?)".*?',html)
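The same `(.*?)` idea can pull out every label/value pair in one pass instead of one keyword at a time. This is only a sketch under assumptions: `snippet` is a shortened copy of the fragment above with "&nbsp" already removed, and the `PAIR` pattern is an illustration, not the pattern used in the rest of this article.

```python
import re

# Shortened copy of the fragment above, with "&nbsp" already removed
snippet = "<p>品牌:諾爾特</br>毒性:低毒</br>劑型:乳油</br>農藥登記證號:PD20132201</p>"

# (?<=>)      the label must start right after a tag, which skips "http:" inside attributes
# ([^<>:：]+) the label itself
# [:：]       either form of the colon
# ([^<]*)</   the value, up to the next closing tag
PAIR = re.compile(r'(?<=>)([^<>:：]+)[:：]([^<]*)</')

fields = dict(PAIR.findall(snippet))
print(fields)
```

A dictionary like this makes it easy to see which labels a given seller actually used before deciding on keywords.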

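2.3.2.2 Why unifying Chinese and English punctuation matters

Convenient, isn't it? But here is the catch: the colon after a keyword may be typed in its Chinese (full-width) form on one page and in its English (half-width) form on another, and a pattern written for one form will not match the other. For example:

Key point: after getting the page source, always replace the English : with the Chinese ： first, so that the characters are uniform and the values are easy to locate.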
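One caveat with replacing every ":" is that the colon inside URLs such as "http://" is rewritten too. That matters if a link is later extracted from the same string. Two possible workarounds are sketched below. `normalize_colons` and `brand` are hypothetical helper names, and the sample strings are made up for illustration.

```python
import re


def normalize_colons(html):
    """Turn ASCII colons into full-width ones, except the colon in '://' of a URL."""
    return re.sub(r':(?!//)', '：', html.replace("&nbsp", ""))


def brand(html):
    """Match the brand whether the label is followed by ':' or '：'."""
    return re.findall(r'品牌[:：](.*?)</', html)


half = '<p>品牌:&nbsp諾爾特</br><img src="http://example.com/a.jpg"></p>'
full = '<p>品牌：諾爾特</br></p>'

print(normalize_colons(half))                      # the URL keeps its ASCII colon
print(brand(half.replace("&nbsp", "")), brand(full))
```

The character class `[:：]` avoids modifying the source at all, while the negative lookahead keeps the replace-first workflow and only protects URLs.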

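2.3.2.3 Choosing the end marker

So far the end marker for every text field has been </br>, and swapping in a different keyword seemed enough to match anything we wanted. But what if the value is the last one in the paragraph? Then it is followed by </p>, and a pattern ending in </br> finds nothing. Take "農藥登記證號：" in html: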

import re
html = """</br>農藥登記證號：PD20132201</p>"""
re.findall('.*?農藥登記證號：(.*?)</br>.*?',html)

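# Output:
[]

Improved version:

re.findall('.*?農藥登記證號：(.*?)</.*?',html)

Key point: the end marker must be one that every field shares, namely </. That way the match works whether the value is followed by </br> or by </p>.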
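To make the point concrete, here is a small sketch that wraps the improved pattern in a function and runs it against both kinds of ending. `extract` is a hypothetical helper name and the two sample strings are made up for illustration.

```python
import re


def extract(html, keyword):
    """Return everything between `keyword` and the next closing tag of any kind."""
    return re.findall(re.escape(keyword) + r'(.*?)</', html)


middle = "<p>毒性：低毒</br>劑型：乳油</p>"      # value followed by </br>
last = "</br>農藥登記證號：PD20132201</p>"       # value followed by </p>

print(extract(middle, "毒性："))
print(extract(last, "農藥登記證號："))
```

Because the pattern stops at "</" rather than at a specific tag name, the same function serves every field in the paragraph.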

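2.3.2.4 Fuzzy keyword matching

Why is fuzzy matching needed as well?
Mainly because the descriptions uploaded by users come in all shapes. Fortunately the keyword is still in there somewhere, even when the label around it differs or the keyword is not at the end of the label.
Case 1:
Some users write "產品登記證號" instead of "農藥登記證號：", so the only safe keyword is the shared part, "登記證號：".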

import re
html = """</br>**登記證號：PD20132201</p>"""
re.findall('.*?登記證號：(.*?)</.*?',html)

# Result:
['PD20132201']

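Case 2:
"生產許可證" is called "產品標準號" in some places, so both have to be tried. The ： must not be part of the keyword either, because the keyword may sit in the middle of a longer label and would then fail to match. Instead, split the matched text on ： afterwards. As shown:

Extraction: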

import re
html = """農藥生產許可證/批准文號：HNP32224-D3889</p>"""
standard = re.findall('.*?生產許可證(.*?)</.*?',html)  # production licence number
if len(standard) == 0:
    standard = re.findall('.*?產品標準號(.*?)</.*?',html)  # fall back to the product standard number
if len(standard) != 0:
    standard = str(standard[0]).split('：')[-1]  # [-1] keeps everything after the last ：
print(standard)

Output:

HNP32224-D3889
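When several fields each have their own list of alternative keywords, the if/else chain can be replaced by a small table. This is a sketch only: `FIELD_KEYWORDS` and `extract_field` are hypothetical names, and the sample string combines the two cases above.

```python
import re

# Field name -> keywords to try in order (taken from the two cases above)
FIELD_KEYWORDS = {
    "standard": ["生產許可證", "產品標準號"],
    "register": ["登記證號"],
}


def extract_field(html, keywords):
    """Try each keyword in turn; keep the text after the last '：' of the first hit."""
    for keyword in keywords:
        found = re.findall(re.escape(keyword) + r'(.*?)</', html)
        if found:
            return found[0].split('：')[-1]
    return ""  # none of the keywords matched


html = "<p>農藥生產許可證/批准文號：HNP32224-D3889</br>產品登記證號：PD20132201</p>"
result = {name: extract_field(html, words) for name, words in FIELD_KEYWORDS.items()}
print(result)
```

Adding a newly discovered label variant then only means appending one string to the table.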

2.4 Full source for matching all fields

import requests,re,time
from lxml import etree
start = time.time()

url = "http://www.agrichem.cn/u850386/2019/02/22/ny1535604683.shtml"
html = requests.get(url).text
etrees = etree.HTML(html)

good_type = etrees.xpath('/html/body/div[3]/div[1]/a[last()-1]/text()')  # input type
input_name = etrees.xpath('/html/body/div[3]/div[1]/a[last()]/text()')  # input name

html = html.replace("&nbsp","").replace(":","：")  # unify punctuation before matching
brand = re.findall('.*?品牌：(.*?)</.*?',html)  # brand
if len(brand) == 0:
    brand = re.findall('.*?名稱：(.*?)</.*?',html)  # fall back to the name

standard = re.findall('.*?生產許可證(.*?)</.*?',html)  # production licence number
if len(standard) == 0:
    standard = re.findall('.*?產品標準號(.*?)</.*?',html)  # fall back to the product standard number
if len(standard) != 0:
    standard = str(standard[0]).split('：')[-1]

prevention = re.findall('.*?防治對象：(.*?)</.*?',html)  # control target

toxicity = re.findall('.*?毒性：(.*?)</.*?',html)  # toxicity

register = re.findall('.*?登記證號(.*?)</.*?',html)  # pesticide registration number
if len(register) != 0:
    register = str(register[0]).split('：')[-1]

print (good_type,input_name,brand,standard,prevention,toxicity,register)
end = time.time()
use_time = (end-start)/60
print ("您所獲獲取的信息一共使用%s分鐘"%use_time)

Output:

['除草劑'] ['烯草酮'] ['諾爾特'] [] [] ['低毒'] PD20132201
您所獲獲取的信息一共使用0.14976612329483033分鐘

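3. Scraping the list page with BS4

At first I assumed this page would be easy to scrape, since the source itself is not protected. Extracting the data turned out to be heavily restricted, though: it is as if the site hands you the HTML but hides its data as soon as you try to filter it. It was the first time I had seen this, and it took me a while to sort out.

3.1 Requesting the data with requests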

import requests
index = "http://www.agrichem.cn/nylistpc/%E5%86%9C%E8%B5%84-%E5%86%9C%E8%8D%AF-%E6%9D%80%E8%8F%8C%E5%89%82-----1-.htm?type=&isvip=&personreal=&companyreal="
indexHtml = requests.get(index).text
print(indexHtml)

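The data comes back fine:

3.2 Two failed attempts at extraction

I then went straight to parsing the page with lxml as usual, but after many tests every attempt failed. I will not go into that here.
Next I tried regular expressions. First, a look at the page source:
In short, there is far too much noise, so start by stripping the useless characters:

indexHtml = indexHtml.replace("\r\n","").replace("\t","")

Strangely, when I scraped the page yesterday I got the source, but as soon as I used replace the data I needed was hidden. Today the data is visible, so let's carry on.

Extracting all the data with a regular expression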

import requests,re
index = "http://www.agrichem.cn/nylistpc/%E5%86%9C%E8%B5%84-%E5%86%9C%E8%8D%AF-%E9%99%A4%E8%8D%89%E5%89%82------.htm?type=&isvip=&personreal=&companyreal="
indexHtml = requests.get(index).text
indexHtml = indexHtml.replace("\r\n","").replace("\t","")
r = re.compile('.*class="first-td">.*?href="(.*?)".*?list-yin-a">.*?src="(.*?)".*?class="small-grey-font">(.*?)</.*?</div></td><td>*(.*?)</td>.*?')
name= re.findall(r,indexHtml)
print (name)

Result:

[('http://www.agrichem.cn/u850386/2019/02/21/ny4036860408.shtml', 'http://tradepic.jinnong.cn/userfiles/850386/images/npriceProduct/npriceProduct/2019/02/ys9.jpg', '金農網農藥商城', '黑龍江')]

It fetched a single record and then stopped!
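A likely explanation for the single record, offered here as an assumption rather than something verified against the live page: the pattern begins with a greedy ".*", and once the line breaks have been removed the whole page is one line. The greedy prefix then swallows everything up to the last row, so findall has nothing left to match afterwards. The sketch below reproduces the effect on a made-up, simplified string (`rows` is hypothetical).

```python
import re

# Hypothetical, simplified stand-in for the one-line list page (three rows)
rows = ('<tr><a href="u1.shtml">A</a></tr>'
        '<tr><a href="u2.shtml">B</a></tr>'
        '<tr><a href="u3.shtml">C</a></tr>')

greedy = re.findall(r'.*href="(.*?)"', rows)   # leading .* is greedy
lazy = re.findall(r'href="(.*?)"', rows)       # no leading wildcard

print(greedy)   # only the last row survives
print(lazy)     # one entry per row
```

With findall there is no need for a leading wildcard at all, because the search already scans forward from each position.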

3.3 Extraction succeeds

Later I turned to BS4, and it finally worked!

import requests
from bs4 import BeautifulSoup

def get_html(index):
    indexHtml = requests.get(index).text
    remove(indexHtml)
    
def remove(indexHtml):
    soup = BeautifulSoup(indexHtml, "html.parser")
    for tr in soup.find_all('tr'):
        try:
            company = tr.find(attrs={"class":"small-grey-font"}).get_text()  # company name
            good_link = tr.find_all('a')[0].get('href')  # product link
            address = tr.find_all('td')[3].get_text()  # place of origin
            good_pic = tr.find_all('img')[0].get('src')  # image URL
            print (company,good_link,address,good_pic)
        except (AttributeError, IndexError):  # header row, or a row with missing tags
            print ("這是標題，沒有找到數據")
    
if __name__ == '__main__':
    index = "http://www.agrichem.cn/nylistpc/農資-農藥-殺菌劑-----4-.htm?type=&isvip=&personreal=&companyreal="
    get_html(index)

Extraction result:

這是標題，沒有找到數據
濰坊奧豐作物病害防治有限公司 http://www.agrichem.cn/u462913/2018/03/06/ny5758621412.shtml 山東 http://tradepic.jinnong.cn/userfiles/462913/images/npriceProduct/npriceProduct/2018/03/%E5%BE%AE%E4%BF%A1%E5%9B%BE%E7%89%87_20180207123707_%E5%89%AF%E6%9C%AC_%E5%89%AF%E6%9C%AC.jpg
濰坊奧豐作物病害防治有限公司 http://www.agrichem.cn/u462913/2018/03/19/ny2819540869.shtml 山東 http://tradepic.jinnong.cn/userfiles/462913/images/npriceProduct/npriceProduct/2018/11/C3A4BEC77F701DE5AFCC18B2831353B5.jpg
濰坊奧豐作物病害防治有限公司 http://www.agrichem.cn/u832227/2018/10/10/ny1423363037.shtml 山東 http://tradepic.jinnong.cn/userfiles/832227/images/npriceProduct/npriceProduct/2018/10/43b1OOOPICe7%20(1)_%E5%89%AF%E6%9C%AC22.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/08/ny5108840639.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/_thumbs/images/npriceProduct/npriceProduct/2018/01/%E5%BE%AE%E4%BF%A1%E5%9B%BE%E7%89%87_20180106183557.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/03/ny3604101956.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/_thumbs/images/npriceProduct/npriceProduct/2018/01/1-1G1141J10O48.jpg
濰坊奧豐作物病害防治有限公司 http://www.agrichem.cn/u462913/2018/03/19/ny4721304356.shtml 山東 http://tradepic.jinnong.cn/userfiles/462913/images/npriceProduct/npriceProduct/2018/11/LGICJ9%7B%25S3V%5BO9%40F)7L9%24OA_%E5%89%AF%E6%9C%AC.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/07/ny4744431710.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/_thumbs/images/npriceProduct/npriceProduct/2018/01/%E5%BE%AE%E4%BF%A1%E5%9B%BE%E7%89%87_20180106183544.jpg
濰坊奧豐作物病害防治有限公司 http://www.agrichem.cn/u839662/2018/08/18/ny4431648810.shtml 江西 http://tradepic.jinnong.cn/userfiles/839662/images/npriceProduct/npriceProduct/2018/08/%E6%9E%9D%E5%B9%B2%E6%BA%83%E8%85%90%E7%81%B5.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/03/ny0402204907.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/images/npriceProduct/npriceProduct/2018/01/1-1G1141J53S19.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/03/ny4136385189.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/_thumbs/images/npriceProduct/npriceProduct/2018/01/1-1G1141JAU02.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/03/ny2515692603.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/_thumbs/images/npriceProduct/npriceProduct/2018/01/1-1G1141K445D6.jpg
河南卓美農業科技有限公司 http://www.agrichem.cn/u819902/2018/01/03/ny4340308460.shtml 河南 http://tradepic.jinnong.cn/userfiles/819902/_thumbs/images/npriceProduct/npriceProduct/2018/01/1-1G1141K044W7.jpg

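Note:

Why use try: here?
Partly to keep a missing tag or missing value from raising an error, but the main purpose is to skip the header row, because it uses the same tr tag as the data rows.

Both sit inside tr tags, but the header cells are th, so the lookups come back empty ([ ]). try: skips these empty rows; a conditional check would work as well.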
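The conditional alternative mentioned above might look like the following sketch. The `table` string is a hypothetical miniature of the list page, not the real markup, and the check simply skips any row that has no td cell.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the list table: one header row and one data row
table = """
<table>
  <tr><th>商品</th><th>公司</th></tr>
  <tr><td><a href="u1.shtml">烯草酮</a></td><td class="small-grey-font">示例公司</td></tr>
</table>
"""

soup = BeautifulSoup(table, "html.parser")
links = []
for tr in soup.find_all("tr"):
    if not tr.find("td"):      # header rows contain only <th>, so skip them
        continue
    links.append(tr.find("a").get("href"))

print(links)
```

Compared with try:, the explicit check skips only header rows, so a genuinely malformed data row still raises an error instead of being silently treated as a header.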

3.4 Building all list-page URLs

for page in range(1,5,1):
    index = "http://www.agrichem.cn/nylistpc/農資-農藥-除草劑-----%s-.htm?type=&isvip=&personreal=&companyreal="%page
    print ("正在爬取第%s個主頁的信息"%page)
    print(index)

Result:

正在爬取第1個主頁的信息
http://www.agrichem.cn/nylistpc/農資-農藥-除草劑-----1-.htm?type=&isvip=&personreal=&companyreal=
正在爬取第2個主頁的信息
http://www.agrichem.cn/nylistpc/農資-農藥-除草劑-----2-.htm?type=&isvip=&personreal=&companyreal=
正在爬取第3個主頁的信息
http://www.agrichem.cn/nylistpc/農資-農藥-除草劑-----3-.htm?type=&isvip=&personreal=&companyreal=
正在爬取第4個主頁的信息
http://www.agrichem.cn/nylistpc/農資-農藥-除草劑-----4-.htm?type=&isvip=&personreal=&companyreal=
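The list-page URLs in sections 3.1 and 3.2 are percent-encoded, and decoding them gives the category written in simplified characters (农资-农药-…). The loop above spells the category in traditional characters. Whether the site accepts both spellings is not confirmed here, so the sketch below rebuilds the URLs from the encoded form instead. `TEMPLATE` and `page_urls` are hypothetical names; the path and query string are copied from the URLs in this article.

```python
from urllib.parse import quote, unquote

# Path and query string copied from the list-page URLs used in this article
TEMPLATE = "http://www.agrichem.cn/nylistpc/%s-----%d-.htm?type=&isvip=&personreal=&companyreal="


def page_urls(category, first, last):
    """Build percent-encoded list-page URLs for pages first..last (inclusive)."""
    return [TEMPLATE % (quote(category), page) for page in range(first, last + 1)]


# Decode the category segment of the encoded URL from section 3.2
category = unquote("%E5%86%9C%E8%B5%84-%E5%86%9C%E8%8D%AF-%E9%99%A4%E8%8D%89%E5%89%82")

for url in page_urls(category, 1, 4):
    print(url)
```

requests percent-encodes non-ASCII characters in a URL by itself, so the explicit `quote` call is mainly useful for logging and for comparing against URLs copied from the browser.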

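3.5 Simulating different clients at random

To simulate different clients, fake_useragent can generate a random User-Agent. It is not needed for this site, but here is a quick look at how it works:

Generating 5 User-Agent strings for random browsers: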

from fake_useragent import UserAgent
ua = UserAgent()  # create the generator once and reuse it
for i in range(5):
    print(ua.random)

Generated result:

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36
Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10
Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; Media Center PC 4.0; SLCC1; .NET CLR 3.0.04320)
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/19.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36

Generating 5 User-Agent strings for Chrome:

from fake_useragent import UserAgent
ua = UserAgent()  # create the generator once and reuse it
for i in range(5):
    print(ua.chrome)

Generated result:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.517 Safari/537.36
Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36
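If installing fake_useragent is not an option, a rough stand-in is to pick at random from a small hand-made pool using only the standard library. This is a different, simpler technique than what fake_useragent does. `USER_AGENTS` and `random_headers` are hypothetical names, and the strings are copied from the sample output above.

```python
import random

# Small hand-picked pool; the strings are copied from the sample output above
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/19.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
]


def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers)
# With requests installed, this could be passed as: requests.get(url, headers=headers)
```

Either way, the chosen string only takes effect once it is sent in the headers of the request.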

4. Full source code

import csv
import re
import time

import requests
from lxml import etree
from bs4 import BeautifulSoup

LIST_URL = "http://www.agrichem.cn/nylistpc/農資-農藥-除草劑-----%s-.htm?type=&isvip=&personreal=&companyreal="
# CSV header row
HEADER = ('生產廠商','商品鏈接','投入品類型','投入品名稱','品牌','生產許可證','預防對象','毒性','農藥登記號','地址','圖片鏈接')

def first(items):
    """Return the first element of a list, or '' when the list is empty."""
    return items[0] if items else ''

def find_value(html, *keywords):
    """Try each keyword in turn; return the text after the last ： of the first hit, else ''."""
    for keyword in keywords:
        found = re.findall(re.escape(keyword) + '(.*?)</', html)
        if found:
            return found[0].split('：')[-1]
    return ''

def parse_detail(good_link):
    """Request one product page and extract its fields; a missing field becomes ''."""
    html = requests.get(good_link).text
    etrees = etree.HTML(html)
    good_type = first(etrees.xpath('/html/body/div[3]/div[1]/a[last()-1]/text()'))  # input type
    input_name = first(etrees.xpath('/html/body/div[3]/div[1]/a[last()]/text()'))  # input name
    html = html.replace("&nbsp","").replace(":","：")  # unify punctuation before matching
    brand = find_value(html, '品牌：', '名稱：')  # brand, falling back to the name
    standard = find_value(html, '生產許可證', '產品標準號')  # production licence number
    prevention = find_value(html, '防治對象：')  # control target
    toxicity = find_value(html, '毒性：')  # toxicity
    register = find_value(html, '登記證號')  # pesticide registration number
    return good_type, input_name, brand, standard, prevention, toxicity, register

def get_html(index, writer):
    """Parse one list page and write one CSV row per product."""
    soup = BeautifulSoup(requests.get(index).text, "html.parser")
    for tr in soup.find_all('tr'):
        try:
            company = tr.find(attrs={"class":"small-grey-font"}).get_text()  # company name
            address = tr.find_all('td')[3].get_text()  # place of origin
            good_pic = tr.find_all('img')[0].get('src')  # image URL
            good_link = tr.find_all('a')[0].get('href')  # product link
        except (AttributeError, IndexError):
            print ("這是標題，沒有找到數據")  # header row: nothing to extract
            continue
        try:
            detail = parse_detail(good_link)
        except (requests.RequestException, ValueError, AttributeError):
            detail = ('',) * 7  # detail page failed: keep the row, leave its fields empty
        position = (company, good_link) + detail + (address, good_pic)
        print (position)
        writer.writerow(position)  # write one row

def main():
    start = time.time()
    with open('D:\\中國農藥網.csv','a',newline='',encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(HEADER)
        for page in range(1,8):
            print ("正在爬取第%s個主頁的信息"%page)
            get_html(LIST_URL%page, writer)
    use_time = (time.time()-start)/60
    print ("您所獲獲取的信息一共使用%s分鐘"%use_time)

if __name__ == '__main__':
    main()
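Because the CSV file is opened in append mode, every run of the script writes the header row again. One way around this, sketched here under assumptions, is to write the header only when the file is new or empty. `append_rows` and the shortened `HEADER` are hypothetical, and the demo writes to a temporary directory rather than to D:\.

```python
import csv
import os
import tempfile

HEADER = ("company", "link")  # hypothetical, shortened header


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fp:
        writer = csv.writer(fp)
        if is_new:
            writer.writerow(HEADER)
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows(path, [("A", "u1.shtml")])
append_rows(path, [("B", "u2.shtml")])   # second run: no duplicate header

with open(path, newline="", encoding="utf-8") as fp:
    print(list(csv.reader(fp)))
```

If each run should start from scratch instead, opening the file with mode 'w' is the simpler choice.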

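Screenshot of the run in the editor:

Screenshot of the CSV result:

Summary: a beginner scraping a site will run into plenty of pages that look simple but turn out to be complicated. Don't be put off: break the problem into small pieces, finish them one step at a time, and try different libraries for parsing the page. You will eventually find what you overlooked, and that is how your skills keep improving!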

Python crawler in practice: scraping 中國農藥網
