My First Web Crawler, Learned on imooc

Original

SQTTTTTTT

2019-06-10 13:07

I am learning Python web crawling on imooc; for the full course, see the lectures there: https://www.imooc.com/learn/1017

My environment
macOS Mojave
python 2.7
Scrapy 1.6.0
MySQL Version: 5.6.21

Environment used in the video
python 3.7
MongoDB
There are two differences: the Python version and the choice of database.

Project overview
Uses the Scrapy framework to crawl the Douban Top 250 movies.

The full code has been uploaded to coding, so you can follow along with the video.
https://git.dev.tencent.com/shaoqt/study_python_scrapy.git

Sample project download
https://download.csdn.net/download/shaoqianting/11218160

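Here I only cover the few places where version differences make my code differ from the video.
1. Choice of database
In the video the instructor uses MongoDB. My machine already had MySQL installed, so I used MySQL directly, following an approach I found online. The overall pipeline is structured much as in the video, but the MongoDB and MySQL client libraries expose different methods. Remember to create the table in the database in advance.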
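
Since the table has to exist before the pipeline runs, it may help to see one possible way to create it. The sketch below is an assumption rather than the schema used in this project: the table and column names come from the insert statement in the pipeline, but the column types and the helper name create_table are guesses for illustration. It accepts any DB-API connection, such as the one returned by pymysql.connect.

```python
# -*- coding: utf-8 -*-
# Hypothetical schema: column names match the pipeline's INSERT statement,
# but the types and lengths are assumptions, not the original table.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS tb_pythonscrapy (
    serial_num VARCHAR(16),
    movie_name VARCHAR(255),
    introduct  TEXT,
    star       VARCHAR(16),
    evaluta    VARCHAR(64)
)
"""


def create_table(connection):
    """Create the target table once, before the first crawl."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
    finally:
        cursor.close()
    connection.commit()
```

Call it once with an open connection before starting the crawl; running it again is harmless because of IF NOT EXISTS.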

The pipeline code is attached below.

# -*- coding: utf-8 -*-
import pymysql

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanPipeline(object):
    # Parameterized INSERT statement; the driver escapes the values
    quotesInsert = '''insert into tb_pythonscrapy(serial_num,movie_name,introduct,star,evaluta)
                            values(%s,%s,%s,%s,%s)'''

    # Keep a reference to the Scrapy settings
    def __init__(self, settings):
        self.settings = settings

    # Core processing: insert one item into MySQL
    def process_item(self, item, spider):
        params = (
            item['serial_num'],
            item['movie_name'],
            item['introduct'],
            item['star'],
            item['evaluta'],
        )
        # mogrify returns the exact statement that execute will send
        spider.log(self.cursor.mogrify(self.quotesInsert, params))
        self.cursor.execute(self.quotesInsert, params)
        return item

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def open_spider(self, spider):
        # Connect to the database
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            db=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            passwd=self.settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True)

        # Run inserts, deletes, queries and updates through the cursor
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()

Reference

https://blog.csdn.net/u010151698/article/details/79371234
Add the following to settings.py

# ITEM_PIPELINES already exists in settings.py; just uncomment it
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

MYSQL_HOST="localhost"
MYSQL_PORT=3306
MYSQL_DBNAME="test"
MYSQL_USER="root"
MYSQL_PASSWD=""
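
To make the link between these setting names and the pipeline clearer, here is a small sketch of how the MYSQL_* values can be turned into connection arguments. The helper name mysql_connect_kwargs and the fallback defaults are hypothetical; a plain dict stands in for Scrapy's settings object, which also offers get().

```python
# -*- coding: utf-8 -*-
def mysql_connect_kwargs(settings):
    """Map the MYSQL_* settings to keyword arguments for pymysql.connect.

    `settings` only needs a dict-like get(); the defaults are assumptions.
    """
    return {
        'host': settings.get('MYSQL_HOST', 'localhost'),
        'port': int(settings.get('MYSQL_PORT', 3306)),
        'db': settings.get('MYSQL_DBNAME'),
        'user': settings.get('MYSQL_USER'),
        'passwd': settings.get('MYSQL_PASSWD', ''),
        'charset': 'utf8',
        'use_unicode': True,
    }


# A plain dict stands in for the values defined in settings.py.
example_settings = {
    'MYSQL_HOST': 'localhost',
    'MYSQL_PORT': 3306,
    'MYSQL_DBNAME': 'test',
    'MYSQL_USER': 'root',
    'MYSQL_PASSWD': '',
}
kwargs = mysql_connect_kwargs(example_settings)
```

Converting the port with int() guards against the value arriving as a string, for example when it is overridden on the command line.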

2. Error: 'Selector' object has no attribute 'split'

  File "/Users/shaoqianting/douban/douban/spiders/douban_spider.py", line 26, in parse
    content_s="".join(i_content.split())
AttributeError: 'Selector' object has no attribute 'split'

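This error occurs because extract() was not called, so split() runs on a Selector object rather than a string. The code below shows where extract() belongs.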

content = i_item.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
for i_content in content:
    content_s = "".join(i_content.split())
    douban_items['introduct'] = content_s
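
The reason extract() matters is that split() is a string method. As a small illustration that does not need Scrapy, the sketch below applies the same whitespace-stripping idiom to plain strings like those extract() returns; the helper name and the sample text are made up for the example.

```python
# -*- coding: utf-8 -*-
def strip_all_whitespace(text):
    """Remove every run of whitespace, as ''.join(text.split()) does."""
    return "".join(text.split())


# Made-up sample resembling the text nodes that extract() returns.
extracted = [
    "\n        1994 / USA / Crime Drama\n    ",
    "\n    ",
]
cleaned = [strip_all_whitespace(part) for part in extracted]
```

Note that this idiom also removes the spaces between words, which suits the compact introduction field but may not suit every field.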

3. Error: No module named scrapy
Running main.py directly inside PyCharm fails and keeps reporting No module named scrapy, yet running main.py from the terminal works.

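Before this, PyCharm reported another error. I have forgotten the exact cause, but roughly it said that a package could not be found after it had been imported. I handled it as shown here, but this only treats the symptom: No module named scrapy still appears.
Where I made the change

Tick these two checkboxes. I also searched Baidu a lot at the time.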
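
One possible explanation (an assumption, not a confirmed diagnosis) is that PyCharm runs main.py with a different Python interpreter than the terminal, one where Scrapy is not installed. The sketch below prints which interpreter is running and whether a module can be imported; module_available is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
import sys


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# Run this both in PyCharm and in the terminal and compare the output.
print("interpreter: %s" % sys.executable)
print("scrapy importable: %s" % module_available("scrapy"))
```

If the two interpreter paths differ, pointing PyCharm's project interpreter at the one the terminal uses would be a reasonable next step.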