Web Crawlers in Practice 4-2: A Focused Crawler for Taobao Product Price Comparison

The screenshots in this article come from the China University MOOC course "Python Web Crawler and Information Extraction"; this post is only my personal study notes.

Course link: https://www.icourse163.org/learn/BIT-1001870001?tid=1450316449#/learn/content?type=detail&id=1214620493&cid=1218397635&replay=true
 


Functional description:

  • Fetch Taobao search result pages and extract the product names and prices from them for comparison

Key ideas:

  • Work out Taobao's search interface (the URL of a search request)
  • Handle pagination

First page of a search for 鼠标 (mouse):

https://s.taobao.com/search?q=鼠标&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

Page 2:

https://s.taobao.com/search?q=鼠标&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44

Page 3:

https://s.taobao.com/search?q=鼠标&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88
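
Comparing the three URLs, only the trailing s parameter changes: it is absent on page 1 and becomes 44 on page 2 and 88 on page 3, i.e. the offset of the first item on the page. A minimal sketch of how the page URLs can be generated (build_page_urls is a hypothetical helper, not part of the course code):

# Hypothetical helper: generate the search URL for each result page.
# Taobao shows 44 items per page, so page i starts at s = 44 * i.
def build_page_urls(keyword, depth):
    start_url = 'https://s.taobao.com/search?q=' + keyword
    return [start_url + '&s=' + str(44 * i) for i in range(depth)]

print(build_page_urls('鼠标', 3))
# ['https://s.taobao.com/search?q=鼠标&s=0',
#  'https://s.taobao.com/search?q=鼠标&s=44',
#  'https://s.taobao.com/search?q=鼠标&s=88']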

Notes:

  • Taobao shows exactly 44 items per page, so the URL of every result page can be derived from the s offset (as in the sketch above)
  • We also need to check whether Taobao allows us to crawl this content: its robots.txt forbids Baidu's spider, for example (a programmatic check is sketched after the excerpt below)
https://www.taobao.com/robots.txt

User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /
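
The same check can be done programmatically with the standard library's robotparser. A minimal sketch, assuming Taobao's robots.txt can be fetched with urllib's default settings; the agent names below are only examples:

from urllib.robotparser import RobotFileParser

# Download Taobao's robots.txt and ask whether a given user-agent may fetch a URL.
rp = RobotFileParser('https://www.taobao.com/robots.txt')
rp.read()

url = 'https://s.taobao.com/search?q=鼠标'
for agent in ('Baiduspider', 'MyCrawler'):
    print(agent, rp.can_fetch(agent, url))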

Program design steps:

  • Step 1: submit the product search request and fetch the result pages in a loop
  • Step 2: for each page, extract the product names and price information
  • Step 3: print the names and prices to the screen

Framework:

# Framework of the Taobao price-comparison crawler
import requests
import re

def getHTML(url):
    # Fetch a page and return its text, or "" on any failure.
    # Taobao's search pages require a logged-in cookie (see the end of this post).
    try:
        headers = {
            "user-agent": "Chrome/10",
            "cookie": "your own cookie"
        }
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillList(ls, html):
    # Pull price/title pairs out of the page source with regular expressions.
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(":")[1])   # eval strips the surrounding quotes
            title = eval(tlt[i].split(":")[1])
            ls.append([price, title])
    except:
        print("")

def printList(ls):
    # Print the collected items as an aligned table.
    tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"
    print(tplt.format("序号", "价格", "商品名称", chr(12288)))
    count = 0
    for g in ls:
        count = count + 1
        print(tplt.format(count, g[0], g[1], chr(12288)))

def main():
    depth = 2                  # number of result pages to crawl
    goodsname = "书包"
    infoList = []
    start_url = 'https://s.taobao.com/search?q=' + goodsname
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(i*44)   # 44 items per page
            html = getHTML(url)
            fillList(infoList, html)
        except:
            continue
    printList(infoList)

main()
The complete version, with real request headers and a cookie filled in:

import requests
import re

def getHTMLText(url):
    try:
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
            "cookie": "your own cookie"   # paste the cookie copied from the browser here
        }
        r=requests.get(url,timeout=30, headers=headers)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except:
        return ""
def parsePage(ilt,html):
    try:
        plt=re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)
        tlt=re.findall(r'\"raw_title\"\:\".*?\"',html)
        for i in range(len(plt)):
            price=eval(plt[i].split(":")[1])
            title=eval(tlt[i].split(":")[1])
            ilt.append([price,title])
    except:
        print("")
          
def printGoodsList(ilt):
    tplt = "{0:^4}\t{1:<8}\t{2:{3}<16}"   # chr(12288) (a full-width space) is the fill character so the Chinese titles line up
    print(tplt.format("序号","价格","商品名称", chr(12288)))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1], chr(12288)))
 
def main():
    goods = '鼠标'
    depth = 2
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)   # page offset: 44 items per page

            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)
 
main()
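
To see what parsePage is doing: every regex match has the shape "view_price":"129.00"; splitting on the colon and passing the second half to eval just strips the surrounding quotes, leaving a plain string. A quick illustration on a made-up fragment (the values are invented for the example):

import re

# A made-up fragment in the same shape as the JSON-like data embedded in the search page.
sample = '"view_price":"129.00","raw_title":"无线鼠标 轻音便携"'

match = re.findall(r'\"view_price\"\:\"[\d\.]*\"', sample)[0]
print(match)                      # "view_price":"129.00"
print(match.split(':')[1])        # "129.00"  (still wrapped in quotes)
print(eval(match.split(':')[1]))  # 129.00    (quotes stripped; it is still a string)

A safer equivalent of the eval trick is match.split(':')[1].strip('"'), which removes the quotes without evaluating scraped text.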

How to get the cookie: log in to Taobao, search for 鼠标, and press F12 on the first result page (one way to wire the copied cookie into the script is sketched at the end of this post).

Lookup result:

 
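
Rather than hard-coding the cookie into the script, it can also be read from an environment variable. A small sketch; the variable name TAOBAO_COOKIE and the test request are my own choices, not part of the course code:

import os
import requests

# Read the cookie copied from the browser out of an environment variable,
# so it never has to appear in the source file.
cookie = os.environ.get('TAOBAO_COOKIE', '')
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "cookie": cookie,
}
r = requests.get('https://s.taobao.com/search?q=鼠标', headers=headers, timeout=30)
print(r.status_code)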
