[ Python ] Learning Crawler Libraries: requests

Original

2020-04-07 19:12

requests

Documentation: http://cn.python-requests.org/zh_CN/latest/

Installation: pip --timeout=100 install requests

[ python ] Configuring a China-based mirror source for pip (tested and working)

Baidu Search

A simple first example
A GET request with the requests module
Crawl the Baidu search home page

import requests

if __name__ == "__main__":
    url = "https://www.baidu.com"
    response = requests.get(url)
    # Decode the body as UTF-8 instead of relying on the guessed encoding
    response.encoding = 'utf-8'
    print("Status code: " + str(response.status_code))

    page_text = response.text
    print("Page content: " + page_text)
    # Save the page source to a local HTML file
    with open('./baidu.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)

    print('Crawl finished!')
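
The `response.encoding = 'utf-8'` line matters because requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset. The sketch below reproduces the effect with the standard library only; the `raw_body` bytes are a stand-in for `response.content`, and whether Baidu's server actually omits the charset is an assumption.

```python
# Sketch: why an explicit encoding matters when the body is UTF-8.
raw_body = "百度一下，你就知道".encode("utf-8")  # stand-in for response.content

# Decoding with the wrong charset does not raise; it silently yields mojibake.
garbled = raw_body.decode("iso-8859-1")

# Decoding as UTF-8 recovers the original text.
correct = raw_body.decode("utf-8")

print(garbled == correct)
print(correct)
```

If the saved baidu.html shows garbled characters, a mismatch like this is a likely cause.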

Sogou Search

A GET request with the requests module
Crawl the Sogou search results page for a given keyword

import requests

if __name__ == '__main__':
    # Send a browser-like User-Agent so the request resembles a normal browser visit
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    url = 'https://www.sogou.com/web'
    kw = input('Enter a search keyword: ')
    # This dict is sent as the URL query string (?query=...)
    param = {
        'query': kw
    }
    response = requests.get(url, params=param, headers=headers)

    page_text = response.text
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)

    print('Crawl finished!')
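
To see what the parameter dict turns into on the wire, the query string can be built by hand with the standard library. This is only a sketch: `build_search_url` is a hypothetical helper, and requests performs a comparable encoding step internally rather than calling this function.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, params):
    """Append percent-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)


ascii_url = build_search_url("https://www.sogou.com/web", {"query": "python requests"})
print(ascii_url)

# Non-ASCII keywords are percent-encoded as UTF-8 bytes; decoding the query
# string again gives back the original keyword.
cjk_url = build_search_url("https://www.sogou.com/web", {"query": "爬蟲"})
decoded = parse_qs(urlsplit(cjk_url).query)
print(cjk_url)
print(decoded)
```

Printing `response.url` after a real request is a quick way to compare the URL requests actually sent.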

Baidu Translate

A POST request with the requests module
Crack Baidu Translate (call its sug endpoint directly)

import requests
import json

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    post_url = 'https://fanyi.baidu.com/sug'
    word = input('Enter a word to look up: ')
    # This dict is sent as the form-encoded POST body
    data = {
        'kw': word
    }
    response = requests.post(post_url, data=data, headers=headers)
    # The endpoint returns JSON, so parse it into a Python object
    dic_obj = response.json()
    print(dic_obj)
    fileName = word + '.json'
    # Use a with block so the file is closed (and flushed) reliably
    with open(fileName, 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp, ensure_ascii=False)
    print('Crawl finished!')
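
The `ensure_ascii=False` argument is what keeps Chinese text readable in the saved file. A small standard-library sketch of the difference follows; the `sample` dict is hypothetical data made up for illustration, not a real response from the sug endpoint.

```python
import json

sample = {"kw": "dog", "meaning": "狗"}  # hypothetical data, not a real API response

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample)

# ensure_ascii=False: characters are written as-is, so the file must be
# opened with a suitable encoding such as UTF-8.
readable = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings parse back to the same object; only the readability of the file on disk differs.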

Douban Comedy Movie Rankings

An AJAX GET request with the requests module
Target URL: https://movie.douban.com/
Crawl the Douban movie category rankings: comedy

import requests
import json

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    # Query parameters copied from the AJAX request the page itself sends
    param = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": "0",
        "limit": "20",
    }
    url = 'https://movie.douban.com/j/chart/top_list'
    response = requests.get(url, params=param, headers=headers)
    dic_obj = response.json()
    print(dic_obj)
    fileName = 'douban_movie_ranking.json'
    # Use a with block so the file is closed (and flushed) reliably
    with open(fileName, 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp, ensure_ascii=False)
    print('Crawl finished!')
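
The example above fetches a single batch of 20 entries. Assuming `start` is a zero-based offset and `limit` is the batch size (inferred from the parameter names, not confirmed by Douban documentation), later batches could be requested by stepping `start`. The helper `top_list_params` below is hypothetical and only builds the parameter dicts; it sends no requests.

```python
def top_list_params(page, limit=20):
    """Build query parameters for one batch of the ranking (page is 0-based)."""
    return {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": str(page * limit),  # assumed to be an offset into the ranking
        "limit": str(limit),
    }


# Offsets for the first three batches.
starts = [top_list_params(page)["start"] for page in range(3)]
print(starts)
```

Each dict can be passed as `params` to `requests.get`; pausing between requests keeps the load on the site low.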

Crawling Enterprise Information

Target URL: http://125.35.6.84:81/xk/
Crawl enterprises' cosmetics production license information

import requests
import json

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/78.0.3904.108 Safari/537.36'
    }
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    # List of enterprise ids
    id_list = []
    # Detail records, one per enterprise
    detail_list = []

    # Step 1: fetch the first two list pages (15 entries each, 30 ids in total)
    for page in range(1, 3):
        param = {
            "on": "true",
            "page": str(page),
            "pageSize": "15",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": "",
        }
        response = requests.post(url, data=param, headers=headers)
        json_ids = response.json()
        for dic in json_ids['list']:
            id_list.append(dic['ID'])

    # Step 2: request the detail record for each id
    post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    for company_id in id_list:  # avoid shadowing the built-in id()
        data = {
            'id': company_id
        }
        res = requests.post(post_url, data=data, headers=headers)
        detail_json = res.json()
        detail_list.append(detail_json)

    fileName = 'enterprise_info.json'
    # Use a with block so the file is closed (and flushed) reliably
    with open(fileName, 'w', encoding='utf-8') as fp:
        json.dump(detail_list, fp, ensure_ascii=False)
    print('Crawl finished!')
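
The script is a two-stage crawl: list pages yield ids, and each id is then used to fetch a detail record. The id-collection step can be isolated and tried offline, as in the sketch below. `extract_ids` is a hypothetical helper and the `pages` payloads are made-up stand-ins for `response.json()`; only the `list` and `ID` keys come from the code above.

```python
def extract_ids(list_response):
    """Collect the 'ID' field from each entry under the 'list' key."""
    return [entry["ID"] for entry in list_response.get("list", [])]


# Hypothetical payloads standing in for two pages of response.json().
pages = [
    {"list": [{"ID": "a1"}, {"ID": "b2"}]},
    {"list": [{"ID": "c3"}]},
]

id_list = []
for payload in pages:
    id_list.extend(extract_ids(payload))

print(id_list)
```

Using `.get("list", [])` means a page without a `list` key contributes no ids instead of raising a KeyError, which may or may not be the behavior you want when the server returns an error page.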

Source: Introduction to Crawler Development | 老男孩IT教育
