[Python] Notes on writing my first Python crawler (Maoyan Movies TOP 10)

Original

zytjasper

2020-06-07 12:15

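The target is the Maoyan Movies TOP 10 board. Website: http://maoyan.com/board

The approach and the program follow Lesson 14: Requests + regular expressions to scrape Maoyan Movies.

First, here is the scraped result:

And here is the code: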

import json
import requests
from requests.exceptions import RequestException
import re

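def get_one_page(url):
    """Fetch one page and return its HTML, or None on any failure."""
    # A browser-like User-Agent is required; without it Maoyan returns
    # its access-control page instead of the board.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    }
    try:
        # The timeout keeps the script from hanging on a stalled connection.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None  # non-200 response
    except RequestException:
        return None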

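def parse_one_page(html):
    # Raw strings keep \d from being read as a string escape.
    # Groups: rank, title, cast, release date, integer and fractional part of the score.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>'
                         r'.*?<p.*?"name"><.*?title="(.*?)"'
                         r'.*?"star">(.*?)</p>'
                         r'.*?"releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
                         re.S)
    res = re.findall(pattern, html)

    for item in res:
        yield {
            'index': item[0],
            'title': item[1],
            'actor': item[2].strip()[3:],  # drop the 3-character label prefix
            'time': item[3].strip()[5:],   # drop the 5-character label prefix
            'score': item[4] + item[5]
        }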

def write_to_file(content):
    # Append one JSON object per line; the with block closes the file itself.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

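def main():
    url = 'http://maoyan.com/board'
    html = get_one_page(url)
    if html is None:
        # The request failed or was rejected, so there is nothing to parse.
        print('Failed to fetch', url)
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)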


if __name__ == '__main__':
    main()
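
To check the regular expression without requesting the site, it can be run against a hand-written fragment. This is only a sketch: SAMPLE_HTML and parse_sample are hypothetical, and the fragment is modeled on what the pattern expects rather than copied from the real page.

```python
import re

# Same pattern as in parse_one_page, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>'
    r'.*?<p.*?"name"><.*?title="(.*?)"'
    r'.*?"star">(.*?)</p>'
    r'.*?"releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Hypothetical fragment (an assumption, not the real page source).
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：2020-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_sample(html):
    """Yield one dict per <dd> entry, mirroring parse_one_page."""
    for index, title, star, release, integer, fraction in PATTERN.findall(html):
        yield {
            'index': index,
            'title': title,
            'actor': star.strip()[3:],    # drop the 3-character label
            'time': release.strip()[5:],  # drop the 5-character label
            'score': integer + fraction,
        }


items = list(parse_sample(SAMPLE_HTML))
print(items)
```

The fixed-width slices ([3:] and [5:]) only work while the labels keep the same length; if the page changes its labels, the slices have to change too.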

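Problems and solutions:

(1) IndentationError: unexpected indent

Analysis: I had two problems here, a misspelling (exceptions, note the trailing s) and wrong indentation. I ran into indentation errors many times; pressing Enter at the end of the line above the reported one shows the correct indent level.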
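
A small way to reproduce this error safely is to compile a deliberately broken snippet and read where Python reports the problem. The snippet in `source` is made up for illustration.

```python
# Two lines of code where the second is indented for no reason.
source = "x = 1\n    y = 2\n"

try:
    compile(source, '<example>', 'exec')
except IndentationError as exc:
    message = exc.msg   # short description of the problem
    line_no = exc.lineno  # line on which Python noticed it

print(message, 'on line', line_no)
```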

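(2) takes 0 positional arguments but 1 was given

Analysis: the function was defined without parameters but called with an argument. The original program scraped a board of 100 entries and had to page through it, so its main function took a parameter (offset). My program needs no paging (Maoyan now only has a TOP 10), so main takes no parameter and must be called without one.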
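
The mismatch can be reproduced in a few lines. The function names and return strings below are hypothetical and only stand in for the two versions of main.

```python
def main_single_page():
    """Version without paging: takes no arguments."""
    return 'single page'


def main_with_offset(offset):
    """Version with paging: the caller passes an offset."""
    return 'page at offset ' + str(offset)


# Calling the parameterless version with an argument raises TypeError.
try:
    main_single_page(0)
except TypeError as exc:
    message = str(exc)

print(message)
print(main_single_page())
print(main_with_offset(10))
```

Either the definition gains a parameter or the call drops its argument; the two just have to agree.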

(3) 'yield' outside function

Analysis: yield can only be used inside a function body, never at module level. Check that the line is indented under a def.
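
As a sketch of the difference, the generator below is valid because yield sits inside a def, while the same keyword at module level fails to compile. The function `numbered` and its input are made up for illustration.

```python
def numbered(titles):
    """Generator: yield is legal here because it is inside a function."""
    for index, title in enumerate(titles, start=1):
        yield {'index': index, 'title': title}


items = list(numbered(['A', 'B']))
print(items)

# The same keyword at module level is rejected when the code is compiled.
try:
    compile('yield 1', '<example>', 'exec')
except SyntaxError as exc:
    error = exc.msg

print(error)
```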

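(4) <title>貓眼訪問控制</title>

<h3><span class="icon">⛔️</span>很抱歉，您的訪問被禁止了</h3>

Analysis: the response was Maoyan's access-control page ("Sorry, your access has been blocked") instead of the board. The request has to look like it comes from a browser, so define a headers dictionary with a 'User-Agent' entry: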

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

and pass it with the request:

response = requests.get(url,headers=headers)
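
Because the blocked page can arrive as an ordinary HTML response, it may help to check for it before parsing. The helper below is a hypothetical sketch: looks_blocked and BLOCK_MARKERS are not part of the original program, and the marker text is copied from the output quoted in this post, so the live site may word or encode it differently.

```python
# Marker text taken from the blocked response quoted above (an assumption).
BLOCK_MARKERS = ('貓眼訪問控制', '您的訪問被禁止了')


def looks_blocked(html, markers=BLOCK_MARKERS):
    """Return True if html is missing or contains an access-control marker."""
    if not html:
        return True
    return any(marker in html for marker in markers)


blocked_sample = '<title>貓眼訪問控制</title>'
normal_sample = '<dd><i class="board-index">1</i></dd>'

print(looks_blocked(blocked_sample))
print(looks_blocked(normal_sample))
```

A check like this could run on the result of get_one_page before parse_one_page is called.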

My Prey Is Near.

Partly based on: https://blog.csdn.net/wenboyu/article/details/78166713

https://blog.csdn.net/u013205877/article/details/70332612
