Python Web Scraping in Practice: Douban Movie Top 250

Original

2020-06-27 05:33

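This is my first blog post. I'm marking the occasion with the web scraper I learned to write yesterday; it is purely a record of my own learning.

Enough talk, here is the code!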

#!/usr/bin/python
# -*- encoding:utf-8 -*-

"""
@author : kelvin
@file : douban_movie
@time : 2017/2/22 23:04
@description : 

"""
import sys
import requests
import re
from bs4 import BeautifulSoup
import csv

reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 defaults to ASCII for implicit str/unicode conversion; set UTF-8 here, otherwise a UnicodeError is raised

# Create the CSV file first and write the header row
with open("douban_top250_movies.csv", 'w') as filed:  # 'a+' appends, 'w' truncates and rewrites
    csv_writer = csv.DictWriter(filed, [
        u'片名',
        u'評分',
        u'評分人數',
        u'一句話描述',
        u'豆瓣鏈接',
    ])
    csv_writer.writeheader()


def get_mov_info(response):
    mov_info = {}
    soup = BeautifulSoup(response.text, "lxml")
    movies = soup.find_all('div', class_="info")

    for info in movies:
        # Get the movie's Chinese title
        mov_info['mov_name'] = info.find('span', class_='title').text  # find() returns only the first match, as a single Tag

        # Get the movie's link on Douban
        mov_info['mov_link'] = info.find('a').get('href')

        # Find the rating and the number of people who rated
        rating_num = info.find(class_='rating_num')
        mov_info['rating_score'] = rating_num.text
        comment = rating_num.find_next_sibling().find_next_sibling()
        # Pull the digits out of the rating-count text
        comment_num = re.findall(r'\d+', comment.text)
        mov_info['comment_nums'] = comment_num[0] if comment_num else u''    # re has no find(); findall() returns its matches as a list

        # Get the one-line description
        comment_one = info.find('span', class_='inq')
        if comment_one is None:
            mov_info['inq_comment'] = u' '
        else:
            mov_info['inq_comment'] = comment_one.text
        print mov_info

        # Append the records to the CSV file one at a time
        write_csv(mov_info)


def write_csv(info_dict):
    with open("douban_top250_movies.csv", 'a+') as f:
        csv_write = csv.DictWriter(f, [
            u'片名',
            u'評分',
            u'評分人數',
            u'一句話描述',
            u'豆瓣鏈接',
        ])
        csv_write.writerow({                   # writerow() writes one row, writerows() writes several; there is only one row here, and writerows() raised an error
            u'片名': info_dict['mov_name'],
            u'評分': info_dict['rating_score'],
            u'評分人數': info_dict['comment_nums'],
            u'一句話描述': info_dict['inq_comment'],
            u'豆瓣鏈接': info_dict['mov_link']
        })

for num in xrange(0, 10):
    page = num * 25
    response = requests.get("https://movie.douban.com/top250?start=%d&filter=" % page)
    print response
    get_mov_info(response)

Screenshot of the results:

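The script relies mainly on BeautifulSoup, a powerful library for searching the nodes of a parsed document, so the implementation is fairly simple.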
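Two of the smaller steps in the listing, building the paginated URLs and pulling the rating count out of a text fragment, can be tried without any network access. The following is only a sketch in Python 3 syntax (the listing itself is Python 2); the helper names `page_urls` and `extract_vote_count` are hypothetical and not part of the original script.

```python
import re

# URL pattern taken from the listing; %d is the start offset of the page.
BASE_URL = "https://movie.douban.com/top250?start=%d&filter="


def page_urls(pages=10, page_size=25):
    """Return one URL per result page; the start offsets are 0, 25, ..., 225."""
    return [BASE_URL % (n * page_size) for n in range(pages)]


def extract_vote_count(text):
    """Return the first run of digits in text, or '' if it contains none."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else ""
```

For instance, `page_urls()[1]` ends in `start=25&filter=`, and `extract_vote_count` returns only the leading digits of a string such as "1234567人评价" (a made-up sample value).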

A few points are worth noting:

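1. Encoding

Different environments use different default encodings, so declare the source encoding at the top of the file. In Python 2 you also need reload(sys) followed by sys.setdefaultencoding('utf-8').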
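`sys.setdefaultencoding` is a Python 2 mechanism. For readers on Python 3 (an assumption, since the listing targets Python 2), a common alternative is to pass the encoding explicitly when opening the file. The sketch below is not the script's own method; the file name and the sample row are made up, and only two of the listing's column names are used.

```python
import csv
import os
import tempfile

FIELDS = ["片名", "評分"]  # two of the column names used in the listing


def write_rows(path, rows):
    """Write dict rows as UTF-8 CSV; newline='' leaves line endings to the csv module."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the CSV back as a list of dicts, decoding it as UTF-8."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical sample data, written to a temporary file.
sample = [{"片名": "示例電影", "評分": "9.0"}]
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(tmp_path, sample)
round_trip = read_rows(tmp_path)
```

Because the encoding is stated at the point where bytes are written and read, the result does not depend on the platform's default.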

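2. find() versus find_all()

find() returns the first matching element under the node it is called on, as a single Tag object (or None if nothing matches); find_all() returns every matching element as a list.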
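The difference is easy to see on a small fragment. This is a minimal sketch in Python 3 syntax, assuming `bs4` is installed; the HTML is made up for illustration (it is not copied from Douban), and it uses the built-in `html.parser` rather than the `lxml` parser from the listing.

```python
from bs4 import BeautifulSoup

# Made-up fragment loosely modeled on one movie entry.
HTML = """
<div class="info">
  <a href="https://example.com/movie/1">
    <span class="title">First Title</span>
    <span class="title">Second Title</span>
  </a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")  # built-in parser, so lxml is not required

first = soup.find("span", class_="title")      # a single Tag, or None if nothing matches
every = soup.find_all("span", class_="title")  # a list-like ResultSet, possibly empty
missing = soup.find("span", class_="inq")      # no such element in this fragment

first_text = first.text
all_texts = [tag.text for tag in every]
```

Since find() can return None, checking the result before reading `.text` (as the listing does for the `inq` span) avoids an AttributeError.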

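3. Writing the CSV file

With csv.DictWriter(), note the difference between writerows() and writerow(). The former writes multiple rows and expects a sequence of dicts; when I passed it the data for a single row, it raised an error (Error: Key0). The latter writes a single row from a dict that maps each field name to the value to write.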
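A short sketch of the two calls, in Python 3 syntax and writing to an in-memory buffer instead of a file. The column names and the row are hypothetical. The exact exception raised when a bare dict is passed to writerows() differs between Python versions, so the sketch only records that the call fails.

```python
import csv
import io

FIELDS = ["title", "score"]  # hypothetical column names for the demo

row = {"title": "A", "score": "9.0"}

buf = io.StringIO()
writer = csv.DictWriter(buf, FIELDS)
writer.writeheader()
writer.writerow(row)           # one dict -> one line
writer.writerows([row, row])   # an iterable of dicts -> one line each
output = buf.getvalue()

# Passing a bare dict to writerows() iterates over its keys (strings),
# which are not row dicts, so the call fails.
try:
    csv.DictWriter(io.StringIO(), FIELDS).writerows(row)
    failed = False
except Exception:
    failed = True
```

Wrapping a single row in a list, as in `writer.writerows([row])`, is equivalent to calling `writer.writerow(row)`.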
