Installing and Using Scrapy

Scrapy documentation (Chinese translation):

https://scrapy-chs.readthedocs.io/zh_CN/0.24/

On Windows, pypiwin32 must be installed as well:

pip install scrapy
pip install pypiwin32

On Windows 10, if the installation fails with an error, you need to install Twisted from a prebuilt wheel, e.g. Twisted-20.3.0-cp38-cp38-win32.whl.

Part of the error output looks like this:
ERROR: Command errored out with exit status 1:
command: 'c:\users\15870\appdata\local\programs\python\python37-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\15870\AppData\Local\Temp\pip-install-2wcyweho\wordcloud\setup.py'"'"'; __file__='"'"'C:\Users\15870\AppData\Local\Temp\pip-install-2wcyweho\wordcloud\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\15870\AppData\Local\Temp\pip-record-9qx3thr5\install-record.txt' --single-version-externally-managed --compile

Twisted-20.3.0-cp38-cp38-win32.whl can be downloaded from https://www.lfd.uci.edu/~gohlke/pythonlibs/

Install it:

 pip install ./Twisted-20.3.0-cp38-cp38-win32.whl
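
To verify that the installation succeeded, you can print the version from Python (a quick sanity check):

import scrapy
print(scrapy.__version__)  # should print the installed Scrapy version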


On Ubuntu, install the dependencies before installing Scrapy:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Then install Scrapy:

pip install scrapy

Creating a project

To create a project with the Scrapy framework:

scrapy startproject [project name]
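
Running this generates a project skeleton that looks roughly like the following (shown for the project name bdbk used later in this post):

bdbk/
    scrapy.cfg            # deployment configuration
    bdbk/
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py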

Create a spider with this command:

scrapy genspider [spider name] [domain]

Run the project:
scrapy crawl [spider name]
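
Besides the command line, a spider can also be run from a plain Python script with Scrapy's CrawlerProcess. A minimal sketch, assuming the qcwxjs spider and bdbk project shown below:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bdbk.spiders.qcwxjs import QcwxjsSpider  # module path assumed from the default project layout

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl(QcwxjsSpider)
process.start()  # blocks here until the crawl is finished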

Crawling the content

# -*- coding: utf-8 -*-
import scrapy
import bdbk.items
from copy import deepcopy  # deep copy (not used below)
import re

class QcwxjsSpider(scrapy.Spider):
    name = 'qcwxjs'
    allowed_domains = ['www.qcwxjs.com']
    start_urls = ['https://www.autohome.com.cn/tech/3/#liststart']

    def parse(self, response):
        # print(response.body.decode(response.encoding))
        aa = re.findall('<script type="text/javascript">(.*?)</script>', response.body.decode(response.encoding))
        print(aa)
        li_list = response.xpath("//div[@id='auto-channel-lazyload-article']/ul/li")
        for li in li_list:
            item = li.xpath(".//a/@href").extract_first()
            url = item.split('.html')
            urls = "https:" + url[0] + '-all.html' + url[1]
            yield scrapy.Request(urls, callback=self.getInfo, dont_filter=True)
        # get the next page
        next_page_href = response.xpath('//div[@id="channelPage"]//a[contains(text(),"下一頁")]/@href').extract_first()
        if next_page_href is not None:
            next_page_url = "https://www.autohome.com.cn/" + next_page_href
            yield scrapy.Request(next_page_url, callback=self.parse, dont_filter=True)

    def getInfo(self, response):
        item = bdbk.items.BdbkItem()
        item['title'] = response.xpath("//div[@class='container article']//h1/text()").extract()
        item['yyr'] = response.xpath("//div[@class='container article']//span[@class='time']/text()").extract_first()
        item['action'] = response.xpath("//div[@class='container article']//div[@id='articleContent']//text()").extract()
        item['imgUrl'] = response.xpath("//div[@class='container article']//div[@id='articleContent']//img/@src").extract()
        # print(item['action'])
        # print(response.meta['item'])
        yield item

Configure the following in settings.py:

LOG_LEVEL = "WARNING"  # minimum log level to output
# LOG_FILE = "./log.txt"  # write the log to a local file instead
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
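
For the pipeline in the next section to actually run, it also has to be enabled in settings.py; the class path below assumes the project is named bdbk:

ITEM_PIPELINES = {
    "bdbk.pipelines.BdbkPipeline": 300,  # lower numbers run earlier (valid range 0-1000)
}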

Process the crawled data in pipelines.py:

import re

class BdbkPipeline(object):
    def open_spider(self, spider):  # runs exactly once, when the spider starts -- a good place to open a database connection
        print('spider started')

    def close_spider(self, spider):  # runs exactly once, after the spider finishes
        print('spider finished')

    def process_item(self, item, spider):
        item['title'] = self.process_content(item['title'])
        item['yyr'] = ''.join(self.process_content(item['yyr']))
        item['action'] = ''.join(self.process_content(item['action']))
        print(item)
        return item

    def process_content(self, content):
        content = [re.sub(r"\r|\n|\s", "", i) for i in content]
        return content
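
The open_spider / close_spider hooks above are the natural place to open and close an external resource. A minimal sketch that writes every item to a JSON Lines file instead of a database (the file name items.jl is an assumption):

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # runs once when the spider starts: open the output file
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # runs once when the spider closes: release the file handle
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item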

Define the fields to be crawled in items.py:

import scrapy
class BdbkItem(scrapy.Item):
    title = scrapy.Field()
    yyr = scrapy.Field()
    action = scrapy.Field()
    imgUrl = scrapy.Field()

Using CrawlSpider in Scrapy

Command to generate a CrawlSpider:

scrapy genspider -t crawl [spider name] [domain to crawl]

For example:
scrapy genspider -t crawl qczj autohome.com.cn

LinkExtractor: the link extractor.
Main parameters:
1. allow: URLs matching the regular expression(s) are extracted; if empty, everything matches.
2. deny: URLs matching the regular expression(s) are excluded and never extracted.
3. allow_domains: domains whose links will be extracted.
4. deny_domains: domains whose links will never be extracted.
5. restrict_xpaths: an XPath expression used together with allow to filter links.
LinkExtractor(
    allow=r'Items/',        # URLs matching the regex are extracted; if empty, everything matches
    deny=xxx,               # URLs matching the regex are not extracted
    restrict_xpaths=xxx,    # only links inside regions matched by the XPath are extracted
    restrict_css=xxx,       # only links inside regions matched by the CSS selector are extracted
    deny_domains=xxx,       # domains whose links will not be extracted
)

rules

rules contains one or more Rule objects.
Each Rule defines a particular behavior for crawling the site.
If multiple rules match the same link, the first one, in the order they are defined in this collection, is used.

Rule parameters
link_extractor: a LinkExtractor object that defines which links need to be extracted.

callback: for every link extracted by link_extractor, the method named by this parameter is called as the callback, receiving the response as its first argument.
Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse breaks the spider.

follow: a boolean specifying whether links should be followed from the responses extracted by this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: the name of a function in the spider that is called with the list of links obtained from link_extractor; mainly used to filter links (see the sketch below).

process_request: the name of a function in the spider that is called for every request extracted by this rule (used to filter requests).
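
A sketch of a Rule wiring these parameters together; the regex and the filtering logic are only illustrations:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def drop_unwanted(links):
    # process_links: receives the extracted link list and may filter it
    return [link for link in links if 'pvareaid' in link.url]

rule = Rule(
    LinkExtractor(allow=r'/tech/\d+\.html'),  # which links to extract
    callback='parse_item',        # spider method name, given as a string
    follow=True,                  # keep following links found on matched pages
    process_links=drop_unwanted,
)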

Using CrawlSpider

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QczjSpider(CrawlSpider):
    name = 'qczj'
    allowed_domains = ['www.autohome.com.cn']
    start_urls = ['https://www.autohome.com.cn/tech/1/#liststart']

    # allow: regex to match; callback: callback method name; follow: whether to keep following links
    rules = (
        Rule(LinkExtractor(allow=r'/tech/202004/\d+\.html#pvareaid=102624'), callback='parse_item'),  # detail pages
        Rule(LinkExtractor(allow=r'/tech/\d+/#liststart'), follow=True),  # pagination
    )

    def parse_item(self, response):
        channel = response.xpath("//div[@id='articlewrap']/h1/text()").extract_first()
        print(channel)

Using cookies in Scrapy

Example: crawling 139 Mail

Get the cookie:

# -*- coding: utf-8 -*-
import scrapy
import re

class QqemalSpider(scrapy.Spider):
    name = 'qqemal'
    allowed_domains = ['appmail.mail.10086.cn']
    start_urls = ['https://appmail.mail.10086.cn/m6/html/index.html?sid=00U4Njc0MTQwNjAwMTkyMTAy000C58C3000004&rnd=612&tab=&comefrom=54&v=&k=9667&cguid=0930000138494&mtime=56&h=1']
    # override start_requests so cookies can be attached to the first request
    def start_requests(self):
        cookie_str = "_139_index_isSimLogin=0; UUIDToken=4afc50aa-6607-446a-b345-5431eee12a19; _139_index_login=15867411864850926030579662; _139_index_isSmsLogin=1; pwdKey=46b91fdbdfecd0a97b6aec8eb0c31de9; sid=sid9107791ca56aa7c95f266dc9f60619dd; umckey=c743ccb52fd8bf0eb7d9d7ad7e5b23a4; PICTUREUIN=z3TxlBa82BTORKtplViHiw==; PICTURELOGIN=NGMyYTY5YWQ0MzM5OTNjMDQ3NDg4NjQ0ZDNmYTNiYnwxMzYwNjk2Njd8MTU4Njc0MTQwNjMxOHxSSUNISU5GTzg4OA==; agentid=311e3dc3-0aec-4d40-a4ba-67cf66fbe8b8; RMKEY=b0553db038fc1fe3; Os_SSo_Sid=00U4Njc0MTQwNjAwMTkyMTAy000C58C3000004; cookiepartid9667=12; ut9667=2; cookiepartid=12; Login_UserNumber=15178866572; UserData={}; SkinPath29667=; rmUin9667=208808947; provCode9667=31; areaCode9667=3102; _139_login_version=60; welcome=s%3ACbkjj8o1dNMHGtBEGuBZogkW0ZaVfMXL.nSZ7FTOywqyQLRdjv6xXrEl9UKVvvbshedhVKDBkppY; loginProcessFlag="
        # split only on the first '=' so cookie values containing '=' are not truncated
        cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookie_str.split("; ")}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        print(response.body.decode(response.encoding))
        aa = re.findall('首頁', response.body.decode(response.encoding))
        print(aa)
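
To confirm the cookies are actually being sent, Scrapy can log the cookie headers; add this to settings.py:

COOKIES_ENABLED = True   # the default; Scrapy keeps cookies across requests
COOKIES_DEBUG = True     # log the Cookie / Set-Cookie headers of every request and response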

Downloader middleware

Configure multiple user agents (USER_AGENT_LIST in settings.py):

USER_AGENT_LIST = [
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)", #QQ瀏覽器
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)" #360
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)", #傲遊
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)", #IE 8.0
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0", #IE 9.0
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko", #IE 11
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
]

Set the User-Agent in the downloader middleware (this snippet assumes import random at the top of middlewares.py):

    def process_request(self, request, spider):
        proxy = random.choice(spider.settings.get('PROXIES'))
        # set a proxy IP (optional)
        # request.meta["proxy"] = proxy
        us = random.choice(spider.settings.get('USER_AGENT_LIST'))
        request.headers["User-Agent"] = us
        return None

    def process_response(self, request, response, spider):
        print(request.headers["User-Agent"])
        return response
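
For this middleware to take effect it must be registered in settings.py, and the PROXIES list it reads has to be defined there as well. A sketch in which the middleware class name and the proxy addresses are assumptions:

DOWNLOADER_MIDDLEWARES = {
    "bdbk.middlewares.RandomUserAgentMiddleware": 543,  # class name assumed; lower numbers run closer to the engine
}

PROXIES = [
    "http://127.0.0.1:8888",  # placeholder proxy addresses
    "http://127.0.0.1:8889",
]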

Sending a POST request with Scrapy

import scrapy
class GiteeSpider(scrapy.Spider):
    name = 'gitee'
    allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/login']

    def parse(self, response):
        urlList = response.xpath("//body").extract()
        print(urlList)
        input_tokin = response.xpath("//form//input[@name='authenticity_token']/@value").extract_first()
        commit = response.xpath("//form//input[@name='commit']/@value").extract_first()
        ga_id = response.xpath("//form//input[@name='ga_id']/@value").extract_first()
        webauthnsupport = response.xpath("//form//input[@name='webauthn-support']/@value").extract_first()
        webauthniuvpaasupport = response.xpath("//form//input[@name='webauthn-iuvpaa-support']/@value").extract_first()
        return_to = response.xpath("//form//input[@name='return_to']/@value").extract_first()
        timestamp = response.xpath("//form//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//form//input[@name='timestamp_secret']/@value").extract_first()
        post_data = {
            "commit": commit,
            "authenticity_token": input_tokin,
            "ga_id": ga_id,
            "login" : "[email protected]",
            "password" : "yubo@0128",
            "webauthn-support": webauthnsupport,
            "webauthn-iuvpaa-support": webauthniuvpaasupport,
            "return_to": return_to,
            "required_field_9ef7": "",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret
        }
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata = post_data,
            callback= self.login
        )

    def login(self, response):
        print(response)
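
When the form has this many hidden fields, FormRequest.from_response can pre-fill them from the page automatically, so only the credentials need to be supplied. A minimal sketch with placeholder credentials:

    def parse(self, response):
        # from_response copies the form's hidden inputs and resolves the form's action URL
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"login": "[email protected]", "password": "your-password"},  # placeholders
            callback=self.login,
        )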
