Python 3 Web Crawling (5): Scrapy Basics

Original

Asia-Lee

2020-06-28 22:26

1. Common Scrapy commands

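scrapy startproject QuotesSpider  # create a project named QuotesSpider
scrapy crawl XX  # run the spider named XX
scrapy shell http://www.scrapyd.cn  # open an interactive shell to debug against http://www.scrapyd.cn
scrapy version  # show the Scrapy version
scrapy list  # list the spiders in the current project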

2. Scraping content with Scrapy

# Use Scrapy to scrape the quote text, author, and tags from http://lab.scrapyd.cn

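import scrapy


class QuoteSpider(scrapy.Spider):  # a spider must subclass scrapy.Spider
    name = 'mingyan'  # spider name, used as "scrapy crawl mingyan"
    # URLs to crawl
    # start_urls = ['http://lab.scrapyd.cn/page/1/',]
    start = 1
    end = 50
    # Pages start to end - 1, because range() excludes its upper bound
    start_urls = ['http://lab.scrapyd.cn/page/' + str(i) + '/' for i in range(start, end)]

    # Called with each downloaded page; defines the rules for extracting data
    def parse(self, response):
        mingyan = response.css('div.quote')  # all quote blocks on the current page

        for v in mingyan:  # for each quote, get its text, author, and tags
            text = v.css('.text::text').extract_first()  # quote text
            author = v.css('.author::text').extract_first()  # author
            tags = v.css('.tags .tag::text').extract()  # list of tags
            tags = ','.join(tags)  # join the list into one string
            if text is None or author is None:
                continue  # skip incomplete blocks; f.write(None) would raise TypeError

            # Write to a file: each author's quotes go into their own .txt file
            fileName = '%s-語錄.txt' % author  # e.g. 木心-語錄.txt ("語錄" means "quotations")

            # 'a+' appends, so quotes by the same author accumulate in one file
            with open(fileName, 'a+', encoding='utf-8') as f:
                f.write(text)
                f.write('\n')  # newline
                f.write('標籤：' + tags)  # "標籤" means "tags"
                f.write('\n-------\n')
                # no f.close() needed: the with block closes the file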
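
The file-writing step of the spider above can be tried on its own, without running a crawl. The following is a minimal sketch using only the standard library; the helper name append_quote, its parameters, and the 'unknown' fallback for a missing author are assumptions made for illustration and are not part of the original spider.

```python
import os
import tempfile


def append_quote(directory, author, text, tags):
    """Append one quote to '<author>-語錄.txt' inside directory and return the file path."""
    # Hypothetical fallback: the original spider does not handle a missing author.
    author = author or 'unknown'
    path = os.path.join(directory, '%s-語錄.txt' % author)
    # Mode 'a' appends, so repeated calls for the same author accumulate in one file.
    with open(path, 'a', encoding='utf-8') as f:
        f.write((text or '') + '\n')
        f.write('標籤：' + ','.join(tags))
        f.write('\n-------\n')
    return path


if __name__ == '__main__':
    out_dir = tempfile.mkdtemp()  # throwaway directory for the demo
    print(append_quote(out_dir, '木心', 'sample quote', ['tag1', 'tag2']))
```

Inside parse, such a helper could be called once per quote block with the extracted text, author, and tag list.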

3. Extracting data with Scrapy

XPath selectors
CSS selectors

4. Scraping the names of mainland Chinese celebrities with Scrapy

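import os

import scrapy


# Scrape the names of mainland Chinese celebrities
class itemSpider(scrapy.Spider):
    name = 'starname'
    start_urls = ['http://www.manmankan.com/dy2013/mingxing/neidi/#']

    def parse(self, response):
        os.makedirs('data', exist_ok=True)  # open() does not create the directory
        blocks = response.css('div.i_cont_s')
        # Use blocks 1 to 26 (index 0 is skipped); slicing avoids an IndexError if fewer exist
        with open('data/name_list', 'a+', encoding='utf-8') as f:
            for block in blocks[1:27]:
                # The title attribute of each link holds a name
                for name in block.xpath('./a/@title').extract():
                    f.write(str(name) + '\n')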

5. Scraping 12306 railway station names

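import re

import requests


def getStation():
    # URL of the 12306 JS file that holds station names and codes
    url = 'https://kyfw.12306.cn/otn/resources/js/framework/station_name.js?station_version=1.9018'
    # verify=False disables TLS certificate verification
    r = requests.get(url, verify=False)
    # A run of Chinese characters (station name), then "|", then an uppercase code.
    # A raw string avoids the invalid "\|" escape warning in a normal string literal.
    pattern = r'([\u4e00-\u9fa5]+)\|([A-Z]+)'
    result = re.findall(pattern, r.text)
    station = dict(result)  # {station name: station code}
    return station


stations = getStation()
stations_list = []
# Write one station name per line; the data directory must already exist
with open('data/railway_station', 'w', encoding='utf-8') as file:
    for key, value in stations.items():
        stations_list.append(key)
        file.write(str(key) + '\n')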
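
The regular expression can be checked offline against a short sample string. The sample below is hypothetical: it assumes the JS file stores entries in an "@abbr|name|CODE|pinyin|abbr|index" layout, which should be confirmed against the real file before relying on it.

```python
import re

# Hypothetical sample; the real station_name.js layout is assumed, not verified here
sample = "var station_names ='@bjb|北京北|VAP|beijingbei|bjb|0@bjd|北京东|BOP|beijingdong|bjd|1';"

# Same pattern as in getStation(): Chinese name, "|", uppercase code
pattern = r'([\u4e00-\u9fa5]+)\|([A-Z]+)'

pairs = re.findall(pattern, sample)  # list of (name, code) tuples
stations = dict(pairs)               # name -> code
code_to_name = {code: name for name, code in stations.items()}  # reverse lookup

print(stations)
```

Lowercase fields such as the pinyin are skipped because the pattern only accepts uppercase letters after the separator.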
