Storing Scrapy-Scraped Data in a Database (Example 1)

Original

Sunrise0929

2020-02-20 19:14

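I. Example 1: Use Scrapy to scrape the book titles and links on the first page of Douban's programming category and store them in a database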

Reference article: http://tech.sina.com.cn/s/s/2008-12-24/09322685698.shtml

1. The fields to scrape are defined in items.py. We want each book's title and link.

2. The file under spiders

# -*- coding: utf-8 -*-
# Note: BaseSpider and HtmlXPathSelector are legacy Scrapy names; newer
# releases provide scrapy.Spider and response.xpath() instead.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from second.items import bbs


class bbsSpider(BaseSpider):
    name = "boat"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["book.douban.com"]
    start_urls = ["http://book.douban.com/tag/编程?type=S"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = bbs()
        # Each field holds a list with one entry per book on the page
        item['title'] = hxs.select('//ul/li[position()>0]/div[2]/h2/a/@title').extract()
        item['link'] = hxs.select('//ul/li[position()>0]/div[2]/h2/a/@href').extract()
        items.append(item)
        return items
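
The XPath in the spider walks from each li under a ul to its second div, then to the a inside h2, and reads the title and href attributes. The sketch below reproduces that walk with the standard library's xml.etree.ElementTree instead of Scrapy's selector, on a made-up, well-formed fragment. The SAMPLE markup, the extract_books name, and the placeholder titles are assumptions for illustration, not the real Douban page. ElementTree supports only a subset of XPath, so the always-true [position()>0] predicate is dropped and the attributes are read with get().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing markup (not the real page).
SAMPLE = """
<html><body><ul>
  <li>
    <div class="pic"/>
    <div class="info"><h2><a href="http://book.douban.com/subject/1885170/" title="Book A">Book A</a></h2></div>
  </li>
  <li>
    <div class="pic"/>
    <div class="info"><h2><a href="http://book.douban.com/subject/1477390/" title="Book B">Book B</a></h2></div>
  </li>
</ul></body></html>
"""


def extract_books(markup):
    """Collect title and href attributes into a dict of parallel lists."""
    root = ET.fromstring(markup)
    # div[2] selects the second div child of each li, as in the spider's XPath
    anchors = root.findall(".//ul/li/div[2]/h2/a")
    return {
        "title": [a.get("title") for a in anchors],
        "link": [a.get("href") for a in anchors],
    }


result = extract_books(SAMPLE)
print(result)
```

The returned value has the same shape as the spider's item: one dict whose 'title' and 'link' values are lists of equal length.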

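3. The pipelines file. For background on how Scrapy saves to a database, see the Twisted documentation (the pipeline uses twisted.enterprise.adbapi).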

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html

from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


# Connect to the database
class MySQLStorePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb',
            db='test',
            user='root',
            passwd='root',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    # Called by Scrapy for every item the spider returns
    def process_item(self, item, spider):
        # runInteraction runs the insert in a pooled thread and returns a Deferred
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        return item

    # Write each (title, link) pair to the database as one row
    def _conditional_insert(self, tx, item):
        if item.get('title'):
            for i in range(len(item['title'])):
                tx.execute('insert into book values (%s, %s)', (item['title'][i], item['link'][i]))

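4. Add the pipeline in settings.py:

ITEM_PIPELINES = ['second.pipelines.MySQLStorePipeline']

Newer Scrapy releases expect a dict that maps each pipeline path to an order number, for example {'second.pipelines.MySQLStorePipeline': 300}.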

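5. Create the test database and the book table in MySQL beforehand.

Create the database with create database <name>. So that MySQL displays Chinese text correctly, use the following statement when creating it:

CREATE DATABASE test DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Create the table: create table book (title char(15) not null, link varchar(50) COLLATE utf8_general_ci DEFAULT NULL);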
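
The table definition limits title to 15 characters and link to 50. A minimal sketch for checking scraped values against those widths before inserting them. The oversized_values helper and the sample titles are hypothetical names and data chosen for illustration, and the widths are copied from the create table statement above.

```python
# Column widths copied from the create table statement.
COLUMN_LIMITS = {"title": 15, "link": 50}


def oversized_values(item, limits=COLUMN_LIMITS):
    """Return (field, value) pairs whose length exceeds the column width."""
    too_long = []
    for field, limit in limits.items():
        for value in item.get(field, []):
            if len(value) > limit:
                too_long.append((field, value))
    return too_long


# Hypothetical sample in the same dict-of-lists shape the spider produces.
sample = {
    "title": ["Short title", "A title that is longer than fifteen characters"],
    "link": ["http://book.douban.com/subject/1885170/"],
}
problems = oversized_values(sample)
print(problems)
```

If the check reports values, widening the columns is one option; how MySQL treats over-long values depends on its SQL mode.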

6. The crawl result looks like this:

{'link': [u'http://book.douban.com/subject/1885170/',
          u'http://book.douban.com/subject/1477390/',
          ……
          u'http://book.douban.com/subject/3288908/'],
 'title': [u'\u7b97\u6cd5\u5bfc\u8bba',
           ……
           u'\u96c6\u4f53\u667a\u6167\u7f16\u7a0b']}

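As the output above shows, the scraped result is a dictionary whose values are lists. So when writing to the database, item['title'] in the for loop is the list stored under the 'title' key, and len(item['title']) is the length of that list. Note: if the database write is not done correctly, the table stays empty.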

for i in range(len(item['title'])):
    tx.execute('insert into book values (%s, %s)', (item['title'][i], item['link'][i]))
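
The loop pairs the i-th title with the i-th link and inserts one row per pair. A minimal sketch of the same idea, assuming the standard library's sqlite3 with an in-memory database as a stand-in for MySQL so that it runs without a server. Here zip() builds the (title, link) rows and executemany() inserts them. The store function name and the sample titles are hypothetical, and sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3

# Hypothetical item in the same dict-of-lists shape as the crawl result.
item = {
    "title": ["Book A", "Book B"],
    "link": [
        "http://book.douban.com/subject/1885170/",
        "http://book.douban.com/subject/1477390/",
    ],
}


def store(conn, item):
    """Insert one row per (title, link) pair and return the row count."""
    rows = list(zip(item.get("title", []), item.get("link", [])))
    with conn:  # commits on success, rolls back on error
        conn.executemany("insert into book values (?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("create table book (title char(15) not null, link varchar(50) default null)")
inserted = store(conn, item)
stored = conn.execute("select title, link from book order by title").fetchall()
print(inserted, stored)
```

One difference from the index-based loop: zip() stops at the shorter list, whereas indexing item['link'][i] raises IndexError if there are fewer links than titles.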
