Scraping data with Scrapy and storing it in a database (Example 1)

Original

Sunrise0929

2020-02-20 19:14

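I. Example 1: use Scrapy to scrape the book titles and links on the first page of Douban's "programming" tag and store them in a database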

Reference article: http://tech.sina.com.cn/s/s/2008-12-24/09322685698.shtml

1. The fields to scrape are defined in items.py. Here we scrape each book's title and link.

2. The spider file under spiders/

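# -*- coding: utf-8 -*-
# The coding line is required on Python 2 because the URL below is non-ASCII.
# BaseSpider and HtmlXPathSelector belong to the legacy Scrapy API;
# current releases use scrapy.Spider and response.xpath() instead.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from second.items import bbs


class bbsSpider(BaseSpider):
    name = "boat"
    # The attribute is allowed_domains, and it takes bare domain names, not URLs
    allowed_domains = ["book.douban.com"]
    start_urls = ["http://book.douban.com/tag/編程?type=S"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = bbs()
        # Each XPath returns one entry per book on the page,
        # so the single item holds two parallel lists.
        item['title'] = hxs.select('//ul/li[position()>0]/div[2]/h2/a/@title').extract()
        item['link'] = hxs.select('//ul/li[position()>0]/div[2]/h2/a/@href').extract()
        items.append(item)
        return items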
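To see what the two XPath expressions pick out, here is a rough standard-library sketch. It uses xml.etree.ElementTree instead of Scrapy's selector, and the SAMPLE markup is a hypothetical, simplified stand-in for the listing page (the real Douban markup differs), so treat it only as an illustration of the "second div, then h2, then a" path:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <li> has a picture div followed by an info div.
SAMPLE = """
<html><body><ul>
  <li>
    <div class="pic"><img src="a.jpg"/></div>
    <div class="info"><h2><a href="http://example.com/book/1" title="Book A">Book A</a></h2></div>
  </li>
  <li>
    <div class="pic"><img src="b.jpg"/></div>
    <div class="info"><h2><a href="http://example.com/book/2" title="Book B">Book B</a></h2></div>
  </li>
</ul></body></html>
"""


def extract(markup):
    """Return a dict of two parallel lists, shaped like the spider's item."""
    root = ET.fromstring(markup)
    # div[2] selects the second <div> child of each <li>, as in the spider's XPath
    anchors = root.findall(".//ul/li/div[2]/h2/a")
    return {
        "title": [a.get("title") for a in anchors],
        "link": [a.get("href") for a in anchors],
    }


item = extract(SAMPLE)
print(item)
```

ElementTree supports only a subset of XPath and needs well-formed XML, so it is not a replacement for the Scrapy selector on real pages.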

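3. The pipelines file. For how Scrapy saves items to a database, see the Twisted documentation (the pipeline uses twisted.enterprise.adbapi).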

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html

from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class MySQLStorePipeline(object):
    def __init__(self):
        # Connect to the database through Twisted's asynchronous connection pool
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb',
            db='test',
            user='root',
            passwd='root',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False,
        )

    # Called by Scrapy for every item the spider returns
    def process_item(self, item, spider):
        # runInteraction runs _conditional_insert in a pool thread inside a
        # transaction and returns a Deferred; no errback is attached here
        # to handle a failed insert.
        self.dbpool.runInteraction(self._conditional_insert, item)
        return item

    # Write one row per (title, link) pair
    def _conditional_insert(self, tx, item):
        if item.get('title'):
            for i in range(len(item['title'])):
                tx.execute('insert into book values (%s, %s)',
                           (item['title'][i], item['link'][i]))

4. Register the pipeline in settings.py:

ITEM_PIPELINES = ['second.pipelines.MySQLStorePipeline']
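The list form above matches the legacy Scrapy release this example targets. Newer Scrapy releases expect ITEM_PIPELINES to be a dict that maps each pipeline path to an integer order. A sketch of the equivalent setting, where 300 is an arbitrary order value chosen for illustration:

```python
# settings.py (sketch for newer Scrapy releases)
# Pipelines run in ascending order of their value; values are
# conventionally chosen in the 0-1000 range.
ITEM_PIPELINES = {
    "second.pipelines.MySQLStorePipeline": 300,
}
```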

5. The test database and the book table must be created in MySQL beforehand.

Create the database with create database <name>. So that MySQL displays Chinese text correctly, create it with the following statement:

CREATE DATABASE test DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Create the table: create table book (title char(15) not null, link varchar(50) COLLATE utf8_general_ci DEFAULT NULL);

6. The spider produces the following result:

{'link': [u'http://book.douban.com/subject/1885170/',
          u'http://book.douban.com/subject/1477390/',
          ……
          u'http://book.douban.com/subject/3288908/'],
 'title': [u'\u7b97\u6cd5\u5bfc\u8bba',
           ……
           u'\u96c6\u4f53\u667a\u6167\u7f16\u7a0b']}

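As the result above shows, the scraped item is a dictionary whose values are lists. When writing to the database, item['title'] in the for loop is therefore the whole list stored under the 'title' key, len(item['title']) is the length of that list, and the index i pairs the i-th title with the i-th link. Note: if the database write is incorrect, the table stays empty.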

for i in range(len(item['title'])):
    tx.execute('insert into book values (%s, %s)', (item['title'][i], item['link'][i]))
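The pairing logic can be tried without MySQL or Scrapy. The sketch below is an illustration only: it swaps in the standard-library sqlite3 module with an in-memory database as a stand-in for the MySQL test database, uses placeholder titles and links instead of real scraped data, and adds a hypothetical store() helper that is not part of the original pipeline:

```python
import sqlite3

# Item shaped like the spider's result: a dict of two parallel lists.
# The values are placeholders, not real scraped data.
item = {
    "title": ["Book A", "Book B", "Book C"],
    "link": [
        "http://example.com/book/1",
        "http://example.com/book/2",
        "http://example.com/book/3",
    ],
}


def store(conn, item):
    """Insert one row per (title, link) pair; return the number of rows written."""
    titles = item.get("title") or []
    links = item.get("link") or []
    if len(titles) != len(links):
        raise ValueError("title and link lists differ in length")
    rows = list(zip(titles, links))  # pair the i-th title with the i-th link
    with conn:  # commits on success, rolls back if an insert fails
        # sqlite3 uses ? placeholders; MySQLdb uses %s as in the pipeline
        conn.executemany("INSERT INTO book VALUES (?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL test database
conn.execute("CREATE TABLE book (title TEXT NOT NULL, link TEXT)")
written = store(conn, item)
stored = conn.execute("SELECT title, link FROM book ORDER BY rowid").fetchall()
print(written, stored)
```

Using zip avoids the index arithmetic of the original loop, and the explicit length check surfaces a mismatch between the two lists instead of failing partway through.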
