Web Crawlers in Practice 4-2: A Focused Crawler for Taobao Product Price Comparison

The screenshots in this article come from the China University MOOC course "Python Web Crawler and Information Extraction"; this post is only my personal study notes.

Course link: https://www.icourse163.org/learn/BIT-1001870001?tid=1450316449#/learn/content?type=detail&id=1214620493&cid=1218397635&replay=true
 


Functional description:

  • Fetch Taobao search result pages and extract the product names and prices from them for comparison

Key ideas:

  • Work out Taobao's search interface (the URL of a search request)
  • Handle pagination

First page of a search for 鼠标 (mouse):

https://s.taobao.com/search?q=鼠标&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

Page 2:

https://s.taobao.com/search?q=鼠标&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44

Page 3:

https://s.taobao.com/search?q=鼠标&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88
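
Comparing the three URLs, only the trailing s parameter changes: it is absent on page 1 and becomes 44 on page 2 and 88 on page 3, i.e. the offset of the first item on the page. A minimal sketch of how the page URLs can be generated (build_page_urls is a hypothetical helper, not part of the course code):

# Hypothetical helper: generate the search URL for each result page.
# Taobao shows 44 items per page, so page i starts at s = 44 * i.
def build_page_urls(keyword, depth):
    start_url = 'https://s.taobao.com/search?q=' + keyword
    return [start_url + '&s=' + str(44 * i) for i in range(depth)]

print(build_page_urls('鼠标', 3))
# ['https://s.taobao.com/search?q=鼠标&s=0',
#  'https://s.taobao.com/search?q=鼠标&s=44',
#  'https://s.taobao.com/search?q=鼠标&s=88']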

Notes:

  • Taobao shows exactly 44 items per page, so the URL of every result page can be derived from the s offset (as in the sketch above)
  • We also need to check whether Taobao allows us to crawl this content: its robots.txt forbids Baidu's spider, for example (a programmatic check is sketched after the excerpt below)
https://www.taobao.com/robots.txt

User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /
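
The same check can be done programmatically with the standard library's robotparser. A minimal sketch, assuming Taobao's robots.txt can be fetched with urllib's default settings; the agent names below are only examples:

from urllib.robotparser import RobotFileParser

# Download Taobao's robots.txt and ask whether a given user-agent may fetch a URL.
rp = RobotFileParser('https://www.taobao.com/robots.txt')
rp.read()

url = 'https://s.taobao.com/search?q=鼠标'
for agent in ('Baiduspider', 'MyCrawler'):
    print(agent, rp.can_fetch(agent, url))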

Program design steps:

  • Step 1: submit the product search request and fetch the result pages in a loop
  • Step 2: for each page, extract the product names and price information
  • Step 3: print the names and prices to the screen

Framework:

# Framework of the Taobao price-comparison crawler
import requests
import re

def getHTML(url):
    # Fetch a page and return its text, or "" on any failure.
    # Taobao's search pages require a logged-in cookie (see the end of this post).
    try:
        headers = {
            "user-agent": "Chrome/10",
            "cookie": "your own cookie"
        }
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillList(ls, html):
    # Pull price/title pairs out of the page source with regular expressions.
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(":")[1])   # eval strips the surrounding quotes
            title = eval(tlt[i].split(":")[1])
            ls.append([price, title])
    except:
        print("")

def printList(ls):
    # Print the collected items as an aligned table.
    tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"
    print(tplt.format("序号", "价格", "商品名称", chr(12288)))
    count = 0
    for g in ls:
        count = count + 1
        print(tplt.format(count, g[0], g[1], chr(12288)))

def main():
    depth = 2                  # number of result pages to crawl
    goodsname = "书包"
    infoList = []
    start_url = 'https://s.taobao.com/search?q=' + goodsname
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(i*44)   # 44 items per page
            html = getHTML(url)
            fillList(infoList, html)
        except:
            continue
    printList(infoList)

main()
The complete version, with real request headers and a cookie filled in:

import requests
import re

def getHTMLText(url):
    try:
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
            "cookie": "your own cookie"   # paste the cookie copied from the browser here
        }
        r=requests.get(url,timeout=30, headers=headers)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except:
        return ""
def parsePage(ilt,html):
    try:
        plt=re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)
        tlt=re.findall(r'\"raw_title\"\:\".*?\"',html)
        for i in range(len(plt)):
            price=eval(plt[i].split(":")[1])
            title=eval(tlt[i].split(":")[1])
            ilt.append([price,title])
    except:
        print("")
          
def printGoodsList(ilt):
    tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"   # chr(12288) (a full-width space) is the fill character so the Chinese titles line up
    print(tplt.format("序号","价格","商品名称", chr(12288)))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1], chr(12288)))
 
def main():
    goods = '鼠标'
    depth = 2
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)   # page offset: 44 items per page

            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)
 
main()
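
To see what parsePage is doing: every regex match has the shape "view_price":"129.00"; splitting on the colon and passing the second half to eval just strips the surrounding quotes, leaving a plain string. A quick illustration on a made-up fragment (the values are invented for the example):

import re

# A made-up fragment in the same shape as the JSON-like data embedded in the search page.
sample = '"view_price":"129.00","raw_title":"无线鼠标 轻音便携"'

match = re.findall(r'\"view_price\"\:\"[\d\.]*\"', sample)[0]
print(match)                      # "view_price":"129.00"
print(match.split(':')[1])        # "129.00"  (still wrapped in quotes)
print(eval(match.split(':')[1]))  # 129.00    (quotes stripped; it is still a string)

A safer equivalent of the eval trick is match.split(':')[1].strip('"'), which removes the quotes without evaluating scraped text.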

How to get the cookie: log in to Taobao, search for 鼠标, and press F12 on the first result page (one way to wire the copied cookie into the script is sketched at the end of this post).

Lookup result:

 
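
Rather than hard-coding the cookie into the script, it can also be read from an environment variable. A small sketch; the variable name TAOBAO_COOKIE and the test request are my own choices, not part of the course code:

import os
import requests

# Read the cookie copied from the browser out of an environment variable,
# so it never has to appear in the source file.
cookie = os.environ.get('TAOBAO_COOKIE', '')
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "cookie": cookie,
}
r = requests.get('https://s.taobao.com/search?q=鼠标', headers=headers, timeout=30)
print(r.status_code)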
