[Crawler] Scraping All Text from a Static Web Page with Python Scrapy

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject URLCrawler
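
This creates a project skeleton along these lines (the exact set of files varies slightly across Scrapy versions):

URLCrawler/
    scrapy.cfg            # deploy configuration file
    URLCrawler/           # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # the directory where your spiders live
            __init__.py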

Our first Spider

This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:

import scrapy
 
class MySpider(scrapy.Spider):
    name = "my_spider"
 
    def start_requests(self):
        urls = [
            'http://www.4g.haval.com.cn/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
 
    def parse(self, response):
        # e.g. 'http://www.4g.haval.com.cn/'.split("/") -> ['http:', '', 'www.4g.haval.com.cn', '']
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        # When calling open() to write a file, pass 'w' to write a text file or 'wb' to write a binary file
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl my_spider

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can simply define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

import scrapy
 
class MySpider(scrapy.Spider):
    name = "my_spider"
    
    start_urls = [
        'http://www.4g.haval.com.cn/',
    ]
 
    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)

Extracting data

The best way to learn how to extract data with Scrapy is by trying selectors in the Scrapy shell. Run:

scrapy shell "http://www.4g.haval.com.cn/"

XPath: a brief intro

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data='<title>哈弗SUV官網移動版</title>'>]

In [2]: response.xpath('//title/text()').extract_first()
Out[2]: '哈弗SUV官網移動版'
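
Note that extract_first() returns the first match as a string (or None when nothing matches), while extract() returns every match as a list. Continuing the same shell session, the list form of the query above would look like this:

In [3]: response.xpath('//title/text()').extract()
Out[3]: ['哈弗SUV官網移動版']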

Extracting data and saving it to a file

textlist = response.selector.xpath('//text()').extract()
texts = []
# with open(...), there is no need to call close() explicitly
with open('filename', 'w', encoding='utf-8') as f:
    for item in textlist:
        text = item.strip()
        if text != '':
            texts.append(text)
            f.write(text + '\n')

We find that the JavaScript code gets extracted too. If you don't want it, the XPath below keeps only text nodes whose parent is neither a script nor a style element, and the normalize-space() predicate drops whitespace-only nodes:

textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for text in textlist_no_scripts:
        f.write(text.strip() + '\n')
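
To do the same extraction inside a spider rather than in the shell, the parse() method can apply the same XPath and write the text out directly. A minimal sketch (the spider name text_spider and the .txt output filename are my own choices for illustration, not from the original post):

import scrapy

class TextSpider(scrapy.Spider):
    name = "text_spider"  # hypothetical name, pick your own
    start_urls = ['http://www.4g.haval.com.cn/']

    def parse(self, response):
        # same script/style-filtering XPath as used in the shell above
        textlist = response.xpath(
            '//*[not(self::script or self::style)]/text()[normalize-space(.)]'
        ).extract()
        domain = response.url.split("/")[-2]
        with open('%s.txt' % domain, 'w', encoding='utf-8') as f:
            for text in textlist:
                f.write(text.strip() + '\n')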

At this point the page's text content has been successfully saved to a file, and the next step could be word segmentation, clustering, and so on. One problem surfaced, though: this method does not work for pages whose content is loaded dynamically. For example, the pages downloaded by the following code are not the same as what a browser shows when visiting those URLs:

# write a script to download the html
import scrapy
class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
        'http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2',
    ]
    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)

# run from the project's top-level directory
scrapy crawl my_spider

Much less text gets downloaded as well:

# explore the data in the Scrapy shell
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
response.selector.xpath('//text()').extract()
response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()

# download the text to a file from the Scrapy shell
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for text in textlist_no_scripts:
        f.write(text.strip() + '\n')

For the underlying reason, see: https://chenqx.github.io/2014/12/23/Spider-Advanced-for-Dynamic-Website-Crawling/

For one way to solve the problem, see: https://blog.csdn.net/sinat_40431164/article/details/81200207
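
The usual workaround is to let a real browser engine execute the page's JavaScript before you grab the HTML. Below is a minimal sketch using Selenium with headless Chrome; Selenium and a matching chromedriver are my assumption here for illustration, not something this post's references prescribe:

# Sketch: render a JavaScript-driven page with Selenium + headless Chrome.
# Assumes `pip install selenium` and a matching chromedriver install.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without opening a browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get('http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2')
    # page_source holds the DOM after JavaScript has run
    with open('rendered.html', 'w', encoding='utf-8') as f:
        f.write(driver.page_source)
finally:
    driver.quit()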
