python_爬蟲_豆瓣TOP250_url

本文僅供學習使用,如有侵權,聯繫刪除。

獲得豆瓣top 250書單的url

import lxml
import requests
import re
import csv
from requests.exceptions import RequestException

# Module-level accumulator for scraped book URLs: get_book_url_list
# appends to it page by page, and write_csv dumps it at the end.
url_lt = []

def get_one_page(url):
    """Download one page and return its HTML text, or None on failure.

    Parameters
    ----------
    url : str
        The page URL to fetch.

    Returns
    -------
    str or None
        The response body when the server answers 200; None for any
        other status code or any requests-level error (timeout, DNS,
        connection refused, ...).
    """
    try:
        headers = {
            # Fix: the HTTP header name is "User-Agent" (hyphen); the
            # original "User_Agent" was sent as an unknown header, so
            # the UA spoofing never took effect.
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
        }
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # Treat every network-level failure as "no page"; callers
        # already handle a None result.
        return None


def get_book_url_list(html):
    """Extract every book detail-page URL from a Top-250 list page and
    append them to the module-level ``url_lt``.

    Fixes two defects in the original:
    - ``BeautifulSoup`` was used without ever being imported, raising
      ``NameError`` on the first call; this version uses only ``re``.
    - The old regex captured the quotes around the href value, so the
      CSV ended up containing ``"url"`` with literal quote characters.

    Parameters
    ----------
    html : str or None
        Page source from get_one_page. ``None`` (a failed download) is
        silently skipped instead of crashing.
    """
    if html is None:
        # get_one_page returns None on any network failure.
        return
    # Each book title cell carries class "pl2"; its first <a> holds the
    # detail-page URL in the href attribute.
    pattern = re.compile(r'class="pl2"[^>]*>\s*<a[^>]*href="([^"]+)"', re.S)
    for href in pattern.findall(html):
        url_lt.append(href.strip())


def main(offset):
    """Scrape one page of the Top-250 list.

    Builds the paged list URL from *offset*, downloads it, harvests the
    book URLs into the global ``url_lt``, and prints the running total
    as lightweight progress output.
    """
    page_url = f'https://book.douban.com/top250?start={offset}'
    get_book_url_list(get_one_page(page_url))
    print(len(url_lt))


def write_csv(file, url_list):
    """Write the collected URLs to *file* as a two-column CSV.

    Columns are "rank" (1-based position within *url_list*) and
    "book_url".

    Parameters
    ----------
    file : str
        Output CSV path.
    url_list : list of str
        Book URLs in rank order.
    """
    # 'w' instead of the original 'a': the header is written on every
    # call, so append mode would duplicate header rows across runs.
    with open(file, 'w', encoding='utf-8', newline='') as csvfile:
        fieldnames = ["rank", "book_url"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        # enumerate replaces the un-idiomatic range(len(...)) loop.
        for rank, book_url in enumerate(url_list, start=1):
            writer.writerow({"rank": rank, "book_url": book_url})



if __name__ == '__main__':
    # Douban's Top-250 list is paged 25 books per page, addressed by
    # ?start=0,25,...,225.  The original passed start=0..9, which just
    # re-fetched overlapping slices of the first page instead of
    # walking all ten pages.
    for page in range(10):
        main(page * 25)
    write_csv("douban_TOP250_data.csv", url_lt)
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章