Scraping Qiushibaike jokes with Scrapy, part one

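For now this is only a simple crawl, and the extracted data is incomplete. I am writing it mainly as a personal note, so treat it as a rough reference.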

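1. Create the crawler project

scrapy startproject qiushi

startproject	# subcommand that creates a new project
qiushi		# project name; the later steps (qiushi/qiushi/spiders, BOT_NAME = 'qiushi') assume it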

2. Generate the spider

cd  qiushi/qiushi/spiders  &&  scrapy  genspider ITtest  www.qiushibaike.com/text/page/1

cat ITtest.py

A short look at the URL used above:
1) Visit the second page of the listing.

2) Then change the 2 in the URL path to 1 and try again.
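
The two visits suggest that every listing page shares one URL pattern in which only the page number changes. Here is a small sketch of that pattern, assuming the path /text/page/N/ holds for all pages. The helper name page_url is made up for illustration and is not part of the project.

```python
# Assumed URL pattern, taken from the address passed to genspider above.
BASE = "https://www.qiushibaike.com/text/page/{}/"


def page_url(n: int) -> str:
    """Return the listing URL for page n (1-based)."""
    if n < 1:
        raise ValueError("page numbers start at 1")
    return BASE.format(n)


if __name__ == "__main__":
    for n in (1, 2):
        print(page_url(n))
```

The spider itself does not need this helper, because it follows the "next page" link found in each response.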

3. Configure the crawler settings
vim settings.py

BOT_NAME = 'qiushi'
SPIDER_MODULES = ['qiushi.spiders']
NEWSPIDER_MODULE = 'qiushi.spiders'
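ROBOTSTXT_OBEY = False   # do not obey robots.txt
CONCURRENT_REQUESTS = 3  # at most 3 requests in flight at once; this caps concurrency and is not a 3-second delay (DOWNLOAD_DELAY sets the pause between requests)
COOKIES_ENABLED = False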
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'    # present the crawler as a regular desktop browser
}
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300, # priority: pipelines with lower numbers run first
}
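
If the goal is to pause about three seconds between requests, the Scrapy setting for that is DOWNLOAD_DELAY. CONCURRENT_REQUESTS only limits how many requests run in parallel. Below is a minimal sketch of lines that could be added to settings.py; the value 3 is an assumption taken from the intent described above, not a tested recommendation.

```python
# Hypothetical additions to settings.py (values are assumptions).
DOWNLOAD_DELAY = 3               # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy default: vary the wait between 0.5x and 1.5x of DOWNLOAD_DELAY
```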

4. Define the item fields

vim  items.py

5. Write the spider

vim ITtest.py

import scrapy
from qiushi.items import QiushiItem   # the QiushiItem class defined in the project's items.py
from scrapy.http.response.html import HtmlResponse   # type of the response passed to parse(); for reference only, not used below
from scrapy.selector.unified   import SelectorList   # type returned by response.xpath(); for reference only, not used below

class IttestSpider(scrapy.Spider):
    name = 'ITtest'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = "https://www.qiushibaike.com"

    def parse(self, response):
        # each child div of the main column holds one joke
        body = response.xpath('//div[@class="col1 old-style-col1"]/div')
        for duanzhi in body:
            touxiang = duanzhi.xpath('.//div//@src').get()   # avatar image URL
            neirong = duanzhi.xpath('.//div[@class="content"]//text()').getall()
            neirong = "".join(neirong).strip()   # joke text
            # default='' avoids calling strip() on None when a block has no <h2>
            zuozhe  = duanzhi.xpath('.//div//h2/text()').get(default='').strip()   # author
            item = QiushiItem(頭像=touxiang,作者=zuozhe,內容=neirong)
            yield item
        # the last <li> of the pagination bar links to the next page
        next_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
        if not next_url:
            return
        else:
            yield  scrapy.Request(self.base_domain+next_url,callback=self.parse)
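
The spider builds the next-page address by concatenating the site root with the relative href. A slightly more robust alternative is to resolve the href against the current page URL, which also works when the link is already absolute. The sketch below shows the idea with the standard library only. The sample href is an assumed value, and build_next_url is a hypothetical helper. Inside a spider, response.urljoin(next_url) performs the same resolution.

```python
from urllib.parse import urljoin


def build_next_url(page_url, href):
    """Resolve a pagination href against the current page URL.

    Returns None when there is no href, i.e. no further page to follow.
    """
    if not href:
        return None
    return urljoin(page_url, href)


if __name__ == "__main__":
    current = "https://www.qiushibaike.com/text/page/1/"
    # Assumed example of what the pagination XPath might return.
    print(build_next_url(current, "/text/page/2/"))
```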

6. Check the spider

 scrapy check ITtest

7. Run the spider

scrapy  crawl  ITtest

8. Process and store the data

vim pipelines.py

from itemadapter import ItemAdapter
import json

class QiushiPipeline:
    def __init__(self):
        # mode "w" creates the file if it is missing and truncates it if it already exists;
        # utf-8 keeps the Chinese text intact
        self.fp = open("qiushi.json","w",encoding='utf-8')

    def open_spider(self, spider):
        # called once when the spider starts
        print("spider started")

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters by default;
        # ensure_ascii=False writes the Chinese characters as-is.
        # dict(item) is needed because a scrapy Item is not directly JSON serializable.
        item_json = json.dumps(dict(item),ensure_ascii=False)
        self.fp.write(item_json+'\n')   # one JSON object per line
        return item

    def close_spider(self,spider):
        # close the file handle
        self.fp.close()
        print("spider finished")
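
To see what ensure_ascii changes, the following standalone sketch serializes the same record both ways. The record is made-up sample data, not real crawl output.

```python
import json

# Made-up sample item with the same kind of keys the spider produces.
record = {"作者": "小明", "內容": "哈哈"}

escaped = json.dumps(record)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as written

# Both strings decode back to the same data; only the on-disk form differs.
assert json.loads(escaped) == json.loads(readable) == record
```

The escaped form is still valid JSON. The readable form is simply easier to inspect with tools such as cat.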

9. Run it again

scrapy crawl ITtest

cat qiushi.json  -n
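
Because the pipeline writes one JSON object per line, the file as a whole is not a single JSON document and json.load on it would fail. Here is a minimal sketch for reading it back line by line; load_items is a hypothetical helper name.

```python
import json


def load_items(path="qiushi.json"):
    """Read a file with one JSON object per line and return them as a list of dicts."""
    items = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

After a crawl, len(load_items()) gives the number of stored jokes.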

Pick any avatar URL from the output and open it in a browser to check that it works.

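Future plans: this is only an entry-level crawler and still needs tuning. Later I plan to use MongoDB and Redis for storage and caching, and once there are more spider scripts, to package them as Docker images and run them in k8s to keep the service stable.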
