Getting Started with Scrapy Crawlers

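1. Scrapy project layout

Scrapy is a crawling framework that is run from the command line.

scrapy startproject projectname  # create a Scrapy project

Running the command generates a projectname folder. The project package inside it contains a spiders folder plus several Python files, including items.py, pipelines.py, middlewares.py and settings.py. The spiders folder holds the individual spider files. items.py defines the containers for the data to be scraped. pipelines.py defines further processing of that data, such as storing it in a database or downloading images. middlewares.py holds custom middlewares, typically used for custom URL de-duplication, adding request headers to imitate a browser, or defining a proxy IP pool. settings.py is the crawler's settings file.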

2. Defining items

import scrapy
class AppsSpiderItem(scrapy.Item):
    app_id = scrapy.Field()
    platform = scrapy.Field()
    ...

Each structured data field is declared in the following form:

field_name = scrapy.Field()

Every piece of data to be extracted needs its own field declared this way.

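3. Defining a spider

Spider is the base class for spiders in Scrapy; every spider must inherit from it (scrapy.Spider).

Inside a project, the genspider command creates a spider in the spiders folder automatically: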

scrapy genspider name domain
scrapy genspider test www.baidu.com

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass

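The generated spider only overrides parse. At run time Scrapy calls the default start_requests(), which issues a request for each URL in start_urls. When a response arrives, parse is called by default, so the actual parsing logic goes there.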

4. Sending POST requests

Generated requests use GET by default. To send a POST request instead, override start_requests and pass formdata to a FormRequest:

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        url = self.start_urls[0]
        formdata = {'name':'mike'}
        yield scrapy.FormRequest(url=url, formdata=formdata, callback=self.parse)

    def parse(self, response):
        pass

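For requests you build yourself, it is clearer to set callback explicitly. If it is omitted, Scrapy falls back to the spider's parse method.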
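To make the formdata argument more concrete, here is a small standard-library sketch of the body a form POST carries (application/x-www-form-urlencoded). It does not use Scrapy, and the helper name encode_form and the extra city field are made up for illustration. As far as I know, FormRequest produces a body of the same shape from formdata.

```python
from urllib.parse import urlencode, parse_qs


def encode_form(formdata):
    """Encode a dict of form fields as an application/x-www-form-urlencoded body."""
    return urlencode(formdata).encode("utf-8")


# The 'city' field is a made-up extra to show how spaces are encoded.
body = encode_form({"name": "mike", "city": "New York"})
print(body)                      # b'name=mike&city=New+York'
print(parse_qs(body.decode()))   # {'name': ['mike'], 'city': ['New York']}
```

Decoding the body with parse_qs gives back the original fields, which is a quick way to check what the server will see.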

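5. Running the spider

By default a spider is run from the command line:

scrapy runspider test.py   # by spider file name
scrapy crawl test          # by spider name

To run a Scrapy spider from an IDE such as IntelliJ IDEA, a Python file can stand in for the command line.

Create a run.py file containing: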

from scrapy import cmdline

name = 'google_play'
cmd = 'scrapy crawl {0}'.format(name) + ' --nolog'
cmdline.execute(cmd.split())

A spider started this way can be debugged inside the IDE, so this approach is recommended.

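6. Parsing pages with BeautifulSoup

Pages are usually parsed with regular expressions, XPath, or BeautifulSoup. This section briefly covers the most common BeautifulSoup operations.

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds: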

Tag
NavigableString
BeautifulSoup
Comment

1). Tag

A Tag corresponds to a tag in the HTML. Once you have a node, .tagname returns the first tag of that name beneath it, for example:

soup.div
soup.a

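This shortcut only returns the first match, so use it when there is just one tag of that type. When there are several, .contents returns the list of all direct children of a node: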

soup.contents

.attrs returns all attributes of a tag:

print(soup.p.attrs)
#{'class': ['title'], 'name': 'dromouse'}

Get a single attribute:

print(soup.p['class'])
#['title']
print(soup.p.get('class'))
#['title']

2). NavigableString

The text inside a tag is available through .string:

print(soup.p.string)
#The Dormouse's story

3). BeautifulSoup

The BeautifulSoup object represents the whole document.

4). Comment (rarely needed)

Common usage:

description = soup.find_all('div', attrs={'class': 'W4P4ne'})  # list of div tags with the given class
description[0].meta.attrs['content']                           # 'content' attribute of the meta tag inside the first div

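7. Imitating a browser with a custom user-agent

In practice the user agent needs to be rotated continually to lower the risk of being banned. By default Scrapy sends the same default User-Agent with every request. To customise it, set the headers attribute of each request object from a custom downloader middleware.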

import random

class AppsSpiderDownloaderMiddleware(object):

    def __init__(self, user_agent_list):
        self.user_agent = user_agent_list

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Read the user-agent list defined in settings.py
        middleware = cls(crawler.settings.get('MY_USER_AGENT'))
        return middleware

    def process_request(self, request, spider):
        # Pick a random user agent for this request
        request.headers['user-agent'] = random.choice(self.user_agent)

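In the example above we defined a custom downloader middleware, so every request passes through it before being handed to the downloader. When the middleware is created, it receives the list of user agents defined in settings.py: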

MY_USER_AGENT = ["Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+5.1)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/28.0.1500.95+Safari/537.36+SE+2.X+MetaSr+1.0",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/50.0.2657.3+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/51.0.2704.106+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/47.0.2526.108+Safari/537.36+2345Explorer/7.1.0.12633",
    "Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_4)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/49.0.2623.110+Safari/537.36",
    "Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_9_5)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/42.0.2311.152+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/42.0.2311.152+Safari/537.36",
    "Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_10_2)+AppleWebKit/600.3.18+(KHTML,+like+Gecko)+Version/8.0.3+Safari/600.3.18",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/49.0.2623.22+Safari/537.36+SE+2.X+MetaSr+1.0",
    "Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_4)+AppleWebKit/601.5.17+(KHTML,+like+Gecko)+Version/9.1+Safari/601.5.17",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/48.0.2564.103+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/47.0.2526.80+Safari/537.36+Core/1.47.640.400+QQBrowser/9.4.8309.400",
    "Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_10_5)+AppleWebKit/600.8.9+(KHTML,+like+Gecko)+Version/8.0.8+Safari/600.8.9",
    "Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/39.0.2171.99+Safari/537.36+2345Explorer/6.4.0.10356",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.87+Safari/537.36+QQBrowser/9.2.5584.400",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/47.0.2526.111+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/33.0.1750.146+BIDUBrowser/6.x+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/39.0.2171.99+Safari/537.36+2345Explorer/6.5.0.11018",
    "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/42.0.2311.154+Safari/537.36+LBBROWSER"]

Then enable the middleware in settings.py:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'apps_spider.middlewares.AppsSpiderDownloaderMiddleware': 543,
}
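The rotation logic can be checked without starting a crawl. The sketch below is a simplified copy of the middleware, driven by a stand-in request object. The names RandomUserAgentMiddleware and fake_request and the placeholder agent strings are assumptions for illustration, not part of the project above.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    """Simplified copy of the middleware above: sets a random User-Agent."""

    def __init__(self, user_agent_list):
        self.user_agents = list(user_agent_list)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets the request continue through the remaining
        # middlewares and on to the downloader.
        return None


# Hypothetical stand-in for scrapy.Request: only a headers mapping is needed.
fake_request = SimpleNamespace(headers={})

agents = ["agent-a/1.0", "agent-b/2.0", "agent-c/3.0"]  # placeholder strings
mw = RandomUserAgentMiddleware(agents)

seen = set()
for _ in range(200):
    mw.process_request(fake_request, spider=None)
    seen.add(fake_request.headers["User-Agent"])

print(sorted(seen))  # the distinct agents that were picked
```

Every value that ends up in the header comes from the configured list, which is all the middleware needs to guarantee.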

8. Setting up a proxy IP pool

Define the list of proxy addresses in settings.py:

IPPOOL=[
	{"ipaddr":"61.129.70.131:8080"},
	{"ipaddr":"61.152.81.193:9100"},
	{"ipaddr":"120.204.85.29:3128"},
	{"ipaddr":"219.228.126.86:8123"},
	{"ipaddr":"61.152.81.193:9100"},
	{"ipaddr":"218.82.33.225:53853"},
	{"ipaddr":"223.167.190.17:42789"}
]

Then write a custom middleware that assigns a random proxy address to each request:

import random
from myproxies.settings import IPPOOL

class ProxiesSpiderMiddleware(object):

    def process_request(self, request, spider):
        thisip = random.choice(IPPOOL)
        print("this is ip:" + thisip["ipaddr"])
        request.meta["proxy"] = "http://" + thisip["ipaddr"]

Enable this middleware in settings.py:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'apps_spider.middlewares.ProxiesSpiderMiddleware': 543,
}

Proxy addresses can be taken from free proxy lists, or found by searching GitHub for "proxy ip".
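Free proxy lists often contain repeated entries, so it can help to de-duplicate the pool before choosing from it. The following is a minimal sketch under that assumption. The class name RandomProxyMiddleware, the helper unique_proxies and the stand-in request are hypothetical, and the addresses are documentation-range placeholders rather than working proxies.

```python
import random
from types import SimpleNamespace

# Same shape as the IPPOOL setting; placeholder addresses only.
IPPOOL = [
    {"ipaddr": "192.0.2.10:8080"},
    {"ipaddr": "192.0.2.11:3128"},
    {"ipaddr": "192.0.2.10:8080"},  # duplicate entry, removed below
]


def unique_proxies(pool):
    """Return proxy URLs without duplicates, preserving the original order."""
    seen = []
    for entry in pool:
        url = "http://" + entry["ipaddr"]
        if url not in seen:
            seen.append(url)
    return seen


class RandomProxyMiddleware:
    """Sketch of a downloader middleware that sets request.meta['proxy']."""

    def __init__(self, pool):
        self.proxies = unique_proxies(pool)

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)


fake_request = SimpleNamespace(meta={})  # hypothetical stand-in for scrapy.Request
mw = RandomProxyMiddleware(IPPOOL)
mw.process_request(fake_request, spider=None)
print(mw.proxies)   # ['http://192.0.2.10:8080', 'http://192.0.2.11:3128']
print(fake_request.meta["proxy"])
```

Passing the pool into the constructor, rather than importing it from a settings module, keeps the class easy to exercise on its own.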

9. Saving data to JSON

The logic for processing items goes in pipelines.py:

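import codecs
import json
import os
import urllib.request

class AppsSpiderPipeline(object):
    icon_path = "./icons"

    def __init__(self):
        # Make sure the icon directory exists before anything is downloaded
        os.makedirs(self.icon_path, exist_ok=True)
        self.file = codecs.open("./apps.json", "wb", encoding="utf-8")

    def process_item(self, item, spider):
        d = dict(item)
        icon_url = d.pop('icon_url')

        # Download the icon (urlretrieve blocks until the download finishes)
        local_path = self.icon_path + '/' + d['app_id'] + '.jpg'
        urllib.request.urlretrieve(icon_url, filename=local_path)

        # Persist the remaining fields as one JSON object per line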
        i = json.dumps(d, ensure_ascii=False)
        line = i + '\n'
        self.file.write(line)

        return item

    def close_spider(self, spider):
        self.file.close()

Enable the pipeline in settings.py:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'apps_spider.pipelines.AppsSpiderPipeline': 300,
}
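The JSON part of the pipeline can be tried on its own, without a crawl or the icon download. The sketch below assumes a hypothetical JsonLinesPipeline class that opens its file in open_spider instead of __init__, and it is driven by hand with plain dicts standing in for items. The file location and sample values are made up.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Sketch: write each item as one JSON object per line (no icon download)."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Opened when the spider starts, so no file is created at import time.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand with plain dicts standing in for items.
out_path = os.path.join(tempfile.mkdtemp(), "apps.json")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"app_id": "com.example.app", "platform": "android"}, spider=None)
pipeline.process_item({"app_id": "com.example.cafe", "name": "caf\u00e9"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

One object per line means the file can be read back line by line, even if a crawl stops partway through.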

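10. Handling AJAX-loaded content

Some pages return only part of their content on the first request and load more through JavaScript once the user scrolls to the bottom. For such pages it is enough to locate the URL of the next request in each response and follow it recursively until the whole page has been crawled. The Google Play list page is one example.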
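The follow-the-next-URL idea can be sketched in plain Python. Everything here is hypothetical: the URLs, the page contents, and the names FAKE_PAGES, fetch and crawl_all. It does not reflect the real structure of any site. In a Scrapy spider the same step is usually written by yielding a new request for the next URL from parse, with parse itself as the callback.

```python
from urllib.parse import urljoin

# Hypothetical responses keyed by URL: each page lists its items and the
# relative URL of the next chunk (None on the last page).
FAKE_PAGES = {
    "https://example.com/list": {"items": ["a", "b"], "next": "/list?page=2"},
    "https://example.com/list?page=2": {"items": ["c"], "next": "/list?page=3"},
    "https://example.com/list?page=3": {"items": ["d", "e"], "next": None},
}


def fetch(url):
    """Stand-in for a real HTTP request."""
    return FAKE_PAGES[url]


def crawl_all(start_url, fetch_page=fetch, max_pages=100):
    """Yield every item, following 'next' links until none is left."""
    url, visited = start_url, set()
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch_page(url)
        yield from page["items"]
        # Resolve the relative link against the current URL, as a browser would.
        url = urljoin(url, page["next"]) if page["next"] else None


print(list(crawl_all("https://example.com/list")))  # ['a', 'b', 'c', 'd', 'e']
```

The visited set and the max_pages cap guard against pages that link back to themselves, which would otherwise loop forever.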
