Web Crawler in Practice 4-2: A Focused Crawler for Taobao Price Comparison

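The screenshots in this post all come from the China University MOOC course Python網絡爬蟲與信息提取 (Python Web Crawling and Information Extraction); this post is only my personal study notes.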

Course link: https://www.icourse163.org/learn/BIT-1001870001?tid=1450316449#/learn/content?type=detail&id=1214620493&cid=1218397635&replay=true
 


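Functional description:

  • Fetch Taobao search result pages and extract each product's name and price so that prices can be compared.

Key points to work out:

  • How to call Taobao's search interface
  • How to handle pagination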

First results page for the keyword 鼠標 (mouse):

https://s.taobao.com/search?q=鼠標&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

Second page:

https://s.taobao.com/search?q=鼠標&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44

Third page:

https://s.taobao.com/search?q=鼠標&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88
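
Comparing the three URLs above, the s parameter goes from absent to 44 to 88, which suggests it is the offset of the first item on the page. The sketch below builds page URLs on that assumption. The helper name page_url and the constant ITEMS_PER_PAGE are hypothetical, and only the q and s parameters are kept; the other query parameters in the URLs above are assumed to be optional.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # inferred from s=44 and s=88 in the URLs above


def page_url(keyword, page):
    """Return the search URL for a zero-based page index (hypothetical helper)."""
    params = {"q": keyword, "s": page * ITEMS_PER_PAGE}
    # urlencode percent-encodes the keyword, so non-ASCII text is safe in the URL
    return "https://s.taobao.com/search?" + urlencode(params)


for page in range(3):
    print(page_url("鼠標", page))
```

Percent-encoding the keyword with urlencode avoids relying on the HTTP library to encode non-ASCII characters in the query string.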

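Notes:

  • Taobao shows exactly 44 products per page, so the s parameter advances by 44 from one page to the next (s=44 and s=88 in the URLs above) and the URL of any page can be derived.
  • We also need to check whether Taobao permits crawling this information. The robots.txt excerpt below disallows Baidu's spider (Baiduspider) from the entire site: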
https://www.taobao.com/robots.txt

User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /
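
As a minimal sketch, the excerpt above can be checked offline with the standard library's urllib.robotparser. Only the quoted lines are parsed here, so the result says nothing about rules that may exist elsewhere in the live file. The agent name MyCrawler is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Only the lines quoted above; the live robots.txt may contain more rules.
ROBOTS_EXCERPT = """\
User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_EXCERPT.splitlines())

url = "https://www.taobao.com/"
print(parser.can_fetch("Baiduspider", url))  # False: "Disallow: /" applies
print(parser.can_fetch("MyCrawler", url))    # True: no rule in the excerpt names this agent
```

Passing the check does not grant permission by itself; it only shows that the quoted rules do not mention the agent.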

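Program design steps:

  • Step 1: Submit the product search request and fetch the result pages in a loop.
  • Step 2: For each page, extract the product names and prices.
  • Step 3: Print the names and prices to the screen.

Skeleton: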

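# Taobao price-comparison crawler: skeleton
import requests
import re

def getHTML(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        headers = {
            "user-agent": "Chrome/10",
            "cookie": "your own cookie"
        }
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillList(ls, html):
    """Append a [price, title] pair to ls for every product found in html."""
    plt = re.findall(r'"view_price":"[\d.]*"', html)
    tlt = re.findall(r'"raw_title":".*?"', html)
    for p, t in zip(plt, tlt):
        # Each match looks like "key":"value". Split on the first colon only,
        # so titles that contain a colon survive, then drop the quotes.
        price = p.split(":", 1)[1].strip('"')
        title = t.split(":", 1)[1].strip('"')
        ls.append([price, title])

def printList(ls):
    """Print the collected products as an aligned table."""
    # chr(12288) is the full-width space, used as the fill for the title column.
    tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"
    # Column headers: No., price, product name
    print(tplt.format("序號", "價格", "商品名稱", chr(12288)))
    count = 0
    for g in ls:
        count = count + 1
        print(tplt.format(count, g[0], g[1], chr(12288)))

def main():
    depth = 2           # number of result pages to crawl
    goodsname = "書包"   # search keyword (schoolbag)
    infoList = []
    start = 'https://s.taobao.com/search?q='
    base = start + goodsname
    for i in range(depth):
        try:
            # Build each page URL from base so the offsets do not pile up.
            url = base + '&s=' + str(i * 44)
            html = getHTML(url)
            fillList(infoList, html)
        except Exception:
            continue
    printList(infoList)

main()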
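
The two regular expressions used in fillList can be tried offline before touching the network. The sketch below runs them on a hand-written fragment that imitates the "raw_title" and "view_price" keys; the fragment and the helper name extract_goods are hypothetical, not real Taobao output. It uses capture groups, a small variation on the course code, so that re.findall returns the bare values and the matches need no further processing.

```python
import re

# Hypothetical fragment imitating the key/value pairs embedded in a results page
sample_html = (
    '"raw_title":"無線鼠標 靜音","view_price":"59.00",'
    '"raw_title":"遊戲鼠標","view_price":"129.90"'
)


def extract_goods(html):
    """Return [price, title] pairs; the parentheses capture only the values."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    # zip stops at the shorter list, so a missing title cannot raise IndexError
    return [[price, title] for price, title in zip(prices, titles)]


print(extract_goods(sample_html))
# [['59.00', '無線鼠標 靜音'], ['129.90', '遊戲鼠標']]
```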

Complete code:

import requests
import re

def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
            "cookie": "the cookie you looked up yourself"
        }
        r = requests.get(url, timeout=30, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def parsePage(ilt, html):
    """Append a [price, title] pair to ilt for every product found in html."""
    plt = re.findall(r'"view_price":"[\d.]*"', html)
    tlt = re.findall(r'"raw_title":".*?"', html)
    for p, t in zip(plt, tlt):
        # Each match looks like "key":"value". Split on the first colon only,
        # so titles that contain a colon survive, then drop the quotes.
        price = p.split(":", 1)[1].strip('"')
        title = t.split(":", 1)[1].strip('"')
        ilt.append([price, title])

def printGoodsList(ilt):
    """Print the collected products as an aligned table."""
    # The template sets the alignment; chr(12288) (full-width space) fills the title column.
    tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"
    # Column headers: No., price, product name
    print(tplt.format("序號", "價格", "商品名稱", chr(12288)))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1], chr(12288)))

def main():
    goods = '鼠標'   # search keyword (mouse)
    depth = 2       # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)

main()
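
The output template "{0:^4}\t{1:<8}\t{2:{3}<16}" nests one replacement field inside another: argument 3 supplies the fill character for argument 2. The sketch below shows the effect on one row. The product name and price are made up for illustration, and the name FULLWIDTH_SPACE is not part of the original code.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, normally rendered as wide as a CJK character

tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"

# Hypothetical row: number 1, price "59.00", title "無線鼠標"
row = tplt.format(1, "59.00", "無線鼠標", FULLWIDTH_SPACE)

title_field = row.split("\t")[2]
print(len(title_field))                       # 16
print(title_field.endswith(FULLWIDTH_SPACE))  # True
```

Padding Chinese titles with the full-width space rather than the ASCII space keeps the column edges lined up in fonts where CJK characters are wider than Latin ones.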

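How to find your cookie: log in to Taobao, search for 鼠標, and press F12 on the first results page to open the browser's developer tools.

Lookup result: [screenshot]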

 
