Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject URLCrawler
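This generates a project scaffold; the exact files vary by Scrapy version, but the layout looks roughly like:

```
URLCrawler/
    scrapy.cfg            # deploy configuration
    URLCrawler/           # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/          # spiders go here
            __init__.py
```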
Our first Spider
This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        urls = [
            'http://www.4g.haval.com.cn/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        # When calling open() to write a file, pass mode 'w' to write a
        # text file or 'wb' to write a binary file
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
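To see what the filename derivation in parse() produces, the same steps can be run in plain Python (no Scrapy needed):

```python
# Splitting the URL on "/" and taking the second-to-last element
# yields the host name, which becomes the output filename.
url = 'http://www.4g.haval.com.cn/'
parts = url.split("/")      # ['http:', '', 'www.4g.haval.com.cn', '']
domain = parts[-2]          # 'www.4g.haval.com.cn'
filename = '%s.html' % domain
print(filename)             # → www.4g.haval.com.cn.html
```

Note this relies on the URL ending with a trailing slash; for other URL shapes, urllib.parse.urlparse is a more robust way to get the host.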
How to run our spider
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl my_spider
A shortcut to the start_requests method
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can simply define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
    ]

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell "http://www.4g.haval.com.cn/"
XPath: a brief intro
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data='<title>哈弗SUV官網移動版</title>'>]
In [2]: response.xpath('//title/text()').extract_first()
Out[2]: '哈弗SUV官網移動版'
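The same idea, selecting the title node and then its text, can be checked without a running Scrapy shell using the standard library's xml.etree.ElementTree, which supports a limited XPath subset (a sketch on a hard-coded snippet, not Scrapy's selector API):

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for the page's <head>; ElementTree's find()
# accepts a subset of XPath such as './/tag'.
html = "<html><head><title>哈弗SUV官網移動版</title></head><body/></html>"
root = ET.fromstring(html)
title = root.find(".//title")   # like response.xpath('//title')
print(title.text)               # → 哈弗SUV官網移動版
```

Full XPath expressions like //title/text() need a real XPath engine (Scrapy uses parsel/lxml underneath), but for quick structural checks the stdlib subset is often enough.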
Extracting data and saving it to a file
textlist = response.selector.xpath('//text()').extract()
texts = []
# with open, the file is closed automatically; no explicit close() needed
with open('filename', 'w', encoding='utf-8') as f:
    for item in textlist:
        text = item.strip()
        if text != '':
            texts.append(text)
            f.write(text + '\n')
It turns out that JavaScript code gets extracted as well. If you do not need that part:
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for item in textlist_no_scripts:
        text = item.strip()
        f.write(text + '\n')
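The XPath above excludes text inside script and style elements. The same filtering can be sketched with the standard library's html.parser, handy for sanity-checking results on a small snippet (a sketch of the filtering idea, not Scrapy's implementation):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect non-empty text nodes, skipping <script>/<style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped elements we are inside
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.texts.append(text)

parser = TextCollector()
parser.feed("<body><p>hello</p><script>var x = 1;</script></body>")
print(parser.texts)   # → ['hello']
```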
At this point the page's text content has been downloaded to a file, and the next step could be word segmentation, clustering, and so on. One problem surfaced, though: this method does not work for pages whose content is loaded dynamically. For example, the pages downloaded by the following code are not the same as what a browser shows when visiting the same URLs:
# write a script to download the html
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
        'http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2',
    ]

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
# execute in prompt
scrapy crawl my_spider
The amount of text downloaded is also much smaller:
# explore the data in prompt
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
response.selector.xpath('//text()').extract()
response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
# download text to file in prompt
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for item in textlist_no_scripts:
        text = item.strip()
        f.write(text + '\n')
For the cause of this, see: https://chenqx.github.io/2014/12/23/Spider-Advanced-for-Dynamic-Website-Crawling/
For a way to solve it, see: https://blog.csdn.net/sinat_40431164/article/details/81200207