Simulating a GitHub Login with Scrapy

Original

萧风_2016

2019-03-03 20:34

1. Create the project

scrapy startproject GitHub

2. Create the spider

scrapy genspider github github.com

3. Edit github.py:

# -*- coding: utf-8 -*-

import scrapy
from scrapy import Request, FormRequest


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']

    # Request headers sent with the login POST
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Connection': 'keep-alive',
        'Referer': 'https://github.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0',
        'Content-Type': 'application/x-www-form-urlencoded'
    }

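    def start_requests(self):
        # Override start_requests so the crawl begins at the login page
        urls = ['https://github.com/login']
        for url in urls:
            # Pass the special cookiejar key in through meta; the response for
            # each URL is handed to the github_login callback.
            # meta: a dict of metadata attached to the request.
            # cookiejar: a special meta key that lets one spider keep several
            # independent sessions against the same site. Label the cookie
            # jars 1, 2, 3, 4, ... and Scrapy keeps one session per label.
            yield Request(url, meta={'cookiejar': 1}, callback=self.github_login)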

    def github_login(self, response):
        # First extract the authenticity_token value from the page source
        authenticity_token = response.xpath(".//*[@id='login']/form/input[2]/@value").extract_first()
        return FormRequest.from_response(
            response,
            url='https://github.com/session',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                'authenticity_token': authenticity_token,
                'commit': 'Sign in',
                'login': '[email protected]',  # your GitHub username or email
                'password': 'aaqqfu1017463614',  # replace with your own password
                'utf8': '✓'
            },
            callback=self.github_after,
            # With dont_click=True the form data is submitted without clicking any element
            dont_click=True
        )

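    def github_after(self, response):
        # Collect the text of the button shown on the page after a successful login.
        # extract() returns a list, so the test below needs an exact match on one item.
        home_page = response.xpath(".//*[@class='btn btn-outline mt-2']/text()").extract()
        if 'Explore GitHub' in home_page:
            # "Explore GitHub" is present, so log that the login succeeded
            self.logger.info('Login succeeded')
        else:
            self.logger.error('Login failed')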

4. Create a debug script, debug.py:

# -*- coding: utf-8 -*-

from scrapy import cmdline

cmdline.execute('scrapy crawl github'.split())

5. Edit the settings.py configuration file:

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

# False means the spider does not obey the site's robots.txt rules
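Because the login depends on session cookies, it can help to watch them while debugging. The entries below are an optional sketch and not part of the original configuration. COOKIES_ENABLED already defaults to True in Scrapy and is listed only to make the dependency explicit. COOKIES_DEBUG makes Scrapy log the cookies sent in requests and received in responses.

```python
# settings.py (optional additions for debugging the login flow)

# Cookie handling is on by default; shown explicitly because the login relies on it.
COOKIES_ENABLED = True

# Log the cookies sent in requests and received in responses.
COOKIES_DEBUG = True
```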

6. Run the spider

scrapy crawl github
