Background:
Python environment: Anaconda3
Database: MongoDB
Crawler framework: Scrapy
IDE: PyCharm
Foreword:
We have already installed and configured MongoDB; now let's get to know MongoDB better through programming.
MongoDB installation tutorial: Python Web Crawler (6): MongoDB Installation and Usage (Windows)
I. Installing a Visualization Tool and Connecting to the Database:
Viewing MongoDB data from the command line is not especially convenient, so it is best to choose a GUI tool; here we pick RoboMongo.
Here is the download link for RoboMongo:
https://robomongo.org/download
The installation is simple: just click Next all the way through, or customize the installation path if you like:
After installation, first start MongoDB and then open RoboMongo.
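If MongoDB is not installed as a Windows service, you can start it manually from a command prompt before opening RoboMongo (a minimal example; the data directory below is an assumption, substitute your own):
mongod --dbpath F:\MongoDB\data\db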
Connecting RoboMongo to the database:
(1) Create a new MongoDB connection:
(2) Customize the connection name:
Name is the connection name and can be anything you like. Address is the IP address of the database host; since the database is local here, it is localhost, followed by the port number. MongoDB's default port is 27017.
I have not set a username or password for my MongoDB database; if you have set them, configure them under Authentication:
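For reference, connecting from Python with authentication looks roughly like this (a minimal sketch; the user, password, and authSource values are placeholders, not values from this setup):

from pymongo import MongoClient

# user, password, and authSource below are hypothetical placeholders
client = MongoClient("mongodb://user:password@127.0.0.1:27017/?authSource=admin")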
When finished, you can see our database's information in the left-hand workspace panel.
II. A First Try:
We previously used Scrapy to crawl a novel, Zhe Tian, from the Biquge site:
Python Web Crawler (5): Scrapy Framework Installation, Introduction, and Practice
Last time we saved the novel to a txt file; this time we will store the downloaded novel in a MongoDB database.
1. How to connect to the MongoDB database in a program:
from pymongo import MongoClient

# Get a client connection
client = MongoClient("mongodb://127.0.0.1:27017")
# Connect to the sdust database
db = client.sdust
# Connect to the zhetian collection
zhetian = db.zhetian
Here sdust is a database I created in MongoDB, and zhetian is a collection in the sdust database. Once connected, you can perform all kinds of operations on the database.
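As a quick sanity check (a minimal sketch; the chapter values below are made up for illustration), you can insert a document into the collection and read it back:

# Insert a test document and read it back by its generated _id
doc_id = zhetian.insert_one({"dir_name": "第一章", "content": "..."}).inserted_id
print(zhetian.find_one({"_id": doc_id}))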
2. The complete project:
Project view:
The project code is posted here:
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class ZhetianItem(scrapy.Item):
    # define the fields for your item here like:
    # name of each chapter
    dir_name = scrapy.Field()
    # link of each chapter
    link_url = scrapy.Field()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for zhetian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'zhetian'
SPIDER_MODULES = ['zhetian.spiders']
NEWSPIDER_MODULE = 'zhetian.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zhetian (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
'zhetian.pipelines.ZhetianPipeline': 1,
}
# Save directory used by the earlier txt version of this project
TEXT_STORE='F:/爬取的文件/遮天'
# Cookies switch; cookies are disabled here
COOKIES_ENABLED = False
# Download delay; a 250 ms delay is used here
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
pipelines.py
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib import request
from pymongo import MongoClient


class ZhetianPipeline(object):
    def process_item(self, item, spider):
        # Only process items that carry a chapter link
        if "link_url" in item:
            # Get a client connection
            client = MongoClient("mongodb://127.0.0.1:27017")
            # Connect to the sdust database
            db = client.sdust
            # Connect to the zhetian collection
            zhetian = db.zhetian
            # Download the chapter page
            req = request.Request(url=item['link_url'])
            download_response = request.urlopen(req)
            download_html = download_response.read().decode('gbk', 'ignore')
            # Extract the chapter body from the div with id="content" and class "showtxt"
            soup_texts = BeautifulSoup(download_html, 'lxml')
            texts = soup_texts.find_all(id='content', class_='showtxt')
            soup_text = BeautifulSoup(str(texts), 'lxml')
            write_flag = True
            string = ''
            # Clean the crawled text character by character
            for each in soup_text.div.text.replace('\xa0', ''):
                # The site appends its URL at the end of each chapter;
                # stop copying at the leading 'h' of that "http" link
                if each == 'h':
                    write_flag = False
                # Drop spaces and turn carriage returns into newlines
                if write_flag == True and each != ' ':
                    string += each
                if write_flag == True and each == '\r':
                    string += '\n'
            # Store the chapter in the zhetian collection
            zhetian.insert_one({"dir_name": item['dir_name'], "dir_url": item['link_url'], "content": string})
        return item
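Note that this pipeline opens a new MongoClient for every item, which works but is wasteful; a common refinement is to create the client once in the pipeline's open_spider method and close it in close_spider.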
3. Running:
Open the Anaconda Prompt, change to the project directory, and enter the following command to run the project.
scrapy crawl biqukan
The project starts running:
The downloaded novel can now be seen in RoboMongo:
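You can also verify the result from Python instead of RoboMongo; a minimal check (using the same sdust database and zhetian collection names as above, and pymongo 3.7+ for count_documents) counts the stored chapters and prints the first few chapter names:

from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017")
zhetian = client.sdust.zhetian
# Count the stored chapters and print the first few chapter names
print(zhetian.count_documents({}))
for doc in zhetian.find({}, {"dir_name": 1}).limit(5):
    print(doc["dir_name"])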