Python crawler: scraping JD.com product information page by page (phones as an example)

Original post

轩辕剣

2020-06-02 22:04

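1. I recently started learning Python and wrote a crawler for practice. The main skill, I found, is using the browser's developer tools (F12) to inspect a site's structure and markup. Anti-crawling measures also come into play; they differ from site to site, and so does the format of the JSON data you get back. Some basic web knowledge is a prerequisite.

Video walkthrough: https://www.bilibili.com/video/av54287470/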

2. The code

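# re is in the standard library; used for regular expressions
import re
import time
import urllib.request
# lxml is a third-party package; etree parses HTML into a tree that XPath can query
from lxml import etree

"""
There are two ways to scrape data, depending on how the target site binds its data:
    1. From the page source (data embedded in HTML tags)
    2. From an interface that returns JSON

JD product information:
    1. Find the target URL (browser developer tools) and scrape from the page source.
       Search page and product detail pages: collect every product on the search page
       (class="gl-i-wrap"), then fetch each product's detail page (second-level crawl).
    2. Send the request and read the response (mind anti-crawling measures).
    3. Process the data.
"""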
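# Scrape phone listings page by page
def jdPhone_spider(url, beginPage, endPage):
    for page in range(beginPage, endPage + 1):
        # JD's page parameter advances in odd numbers (1, 3, 5, ...), which also hinders crawlers
        pn = page * 2 - 1
        print("Scraping page " + str(page))
        fullurl = url + "&page=" + str(pn)  # build the URL with the page parameter
        # Pause between requests to avoid getting blocked
        time.sleep(2)
        # Load the search results page
        load_page(fullurl)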

# Load one search results page and visit every product it lists
def load_page(url):
    # Request headers; the header name must be "User-Agent" (hyphen, not underscore)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    # Attach the headers to the request
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request).read()
    # Parse into a tree so elements can be matched with XPath (tag matching, a bit like regex)
    content = etree.HTML(html)
    # Detail-page links of all products; in XPath, @ selects an attribute
    content_list = content.xpath('//div[@class="gl-i-wrap"]/div[@class="p-img"]/a/@href')
    for href in content_list:
        # Strip any leading scheme with a regex (the r prefix marks a raw string),
        # so "//item.jd.com/..." and "https://item.jd.com/..." are handled alike
        path = re.sub(r"^https?:", "", href)
        new_url = "https:" + path  # full URL of the detail page, read next
        load_link_page(new_url, headers)

# Read information from a product detail page
def load_link_page(url, headers):
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request).read()
    content = etree.HTML(html)  # parse into a tree for XPath matching
    # Extract the phone's brand
    brand = content.xpath('//div[@class="p-parameter"]/ul[@id="parameter-brand"]/li/@title')
    print(brand)


if __name__ == '__main__':
    beginPage = int(input("Enter the first page: "))
    endPage = int(input("Enter the last page: "))
    # Target URL; %E6%89%8B%E6%9C%BA is the percent-encoded UTF-8 form of "手机" (phone)
    url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8"
    jdPhone_spider(url, beginPage, endPage)
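
The URL handling in the script can be tried out without touching the network. The following is a minimal sketch using only the standard library, not part of the original script: the helper names `build_search_url`, `absolute_url`, and `make_request` are made up for illustration, and the product ID in the usage section is a placeholder. It assumes the same page mapping as above (result page n maps to page parameter 2n - 1) and that product links are either absolute or protocol-relative.

```python
import urllib.request
from urllib.parse import urlencode, urljoin

SEARCH_BASE = "https://search.jd.com/Search"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
)


def build_search_url(keyword, page):
    """Return the search URL for the 1-based result page `page`."""
    # Same mapping as the script: result page n -> page parameter 2n - 1.
    # urlencode percent-encodes the keyword as UTF-8.
    query = urlencode({"keyword": keyword, "enc": "utf-8", "page": 2 * page - 1})
    return SEARCH_BASE + "?" + query


def absolute_url(base_url, href):
    """Resolve an absolute or protocol-relative href against the page URL."""
    return urljoin(base_url, href)


def make_request(url):
    """Build (but do not send) a request carrying a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    for page in range(1, 4):
        print(build_search_url("手机", page))
    # Placeholder product ID, for illustration only
    print(absolute_url(SEARCH_BASE, "//item.jd.com/123456.html"))
    # Request stores header names capitalized, hence the "User-agent" lookup key
    print(make_request(SEARCH_BASE).get_header("User-agent"))
```

Using `urljoin` instead of string concatenation means a link that already carries a scheme passes through unchanged, while a protocol-relative one inherits `https` from the search page.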

3. Partial results
