Python Web Scraping Example: Top 100 Chinese Universities of 2019

Original

2020-07-02 20:11


Spoofing the request headers

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    }

URL

http://gaokao.xdf.cn/201812/10838484.html

    url = 'http://gaokao.xdf.cn/201812/10838484.html'

Sending the request

    # Send the request
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    response = response.text

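The Content-Type of this page does not declare charset=utf-8, so we set the encoding ourselves:

    response.encoding = 'utf-8'

Without this the result is all garbled text. Whether it is needed depends on the page.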
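
To see why the explicit encoding matters, here is a minimal sketch that needs no network access. According to the Requests documentation, when a text response carries no charset in its Content-Type header, Requests falls back to ISO-8859-1. The sample string below is made up for illustration and is not taken from the target page.

```python
# Simulate the raw bytes a server would send for a UTF-8 encoded page.
# The sample text is an assumption chosen for illustration.
raw = '清华大学'.encode('utf-8')

# Decoding UTF-8 bytes as ISO-8859-1 yields garbled text (mojibake).
garbled = raw.decode('iso-8859-1')

# Decoding with the correct codec recovers the original text.
correct = raw.decode('utf-8')

print(repr(garbled))
print(correct)
```

Setting response.encoding = 'utf-8' before reading response.text tells Requests to take the second path instead of the first.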

Loading the page into a BeautifulSoup object

    soup = BeautifulSoup(response,'lxml')

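Next, analyze the page and extract the data we want

Because there are a hundred schools, we use the select() method to pick them out. select() returns a list, so we can loop over it and parse each school's data in turn.

    # Get the school rows
    schoole_list = soup.select('.air_con.f-f0 tr')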

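One thing to note here:

In the page source the class attribute is displayed with a space in the middle. In HTML that space separates two class names on the same element, but in select() a space means a descendant level, so a selector written with the space will not find the content.
Copying the selector directly from the browser gives:

body > div.content.wrap1000 > div.conL-box > div > div.article > div.air_con.f-f0

Here the two class names are joined by a "." rather than a space.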
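
The difference between the two selector forms can be checked on a small inline document. This is only a sketch: the HTML below is a made-up stand-in for the real page, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one div carrying two classes.
html = '''
<div class="air_con f-f0">
  <table>
    <tr><td>1</td><td>School A</td></tr>
    <tr><td>2</td><td>School B</td></tr>
  </table>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Compound selector: one element that has BOTH classes, then its tr descendants.
compound = soup.select('.air_con.f-f0 tr')

# Descendant selector: an .f-f0 element nested INSIDE an .air_con element.
descendant = soup.select('.air_con .f-f0 tr')

print(len(compound))
print(len(descendant))
```

The compound form matches both rows, while the form with the space matches nothing because no element with class f-f0 sits inside another element with class air_con.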

Looping over the schools and saving them

    # fp is the output file opened in the complete code
    for li in schoole_list:
        detail = li.text
        school_detail = ' '.join(detail.split())
        print(school_detail + ' scraped successfully!')
        fp.write(school_detail + '\n')

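For persistent storage I chose to write straight to a txt file.

However, the text of each row is full of whitespace, and splitting it produces a list, so it has to be converted back into a string before it is written.

Take the first element of schoole_list as an example: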

    title = schoole_list[0]
    title_data = title.text
    print(title_data)

The printed text contains a lot of whitespace and line breaks, so split it first:

    new_title = title_data.split()

Splitting yields a list of strings.

The next step is to join the list back into a single string:

    new_data = (' '.join(new_title))

Result: a single line with the fields separated by single spaces.
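
The split-then-join step can be tried on its own without fetching the page. This is a minimal sketch: normalize_row is a hypothetical helper name, and the sample text is invented to mimic what .text returns for a table row.

```python
def normalize_row(text):
    """Collapse every run of whitespace in text into a single space."""
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces; ' '.join() glues the pieces back with single spaces.
    return ' '.join(text.split())


# Invented sample that imitates the whitespace-heavy text of a table row.
sample = '\n  1\n  School A\n  Beijing\n  100.0\n'
print(normalize_row(sample))  # → 1 School A Beijing 100.0
```

Calling split() with no arguments is what makes this work: it treats newlines, tabs, and repeated spaces alike, so no separate strip() is needed.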

Complete code:

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Spoof a browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    }
    # Target URL
    url = 'http://gaokao.xdf.cn/201812/10838484.html'
    # Send the request
    response = requests.get(url=url, headers=headers)
    # The page does not declare its charset, so set it explicitly
    response.encoding = 'utf-8'
    response = response.text
    # Load the page into a BeautifulSoup object
    soup = BeautifulSoup(response, 'lxml')
    # Get the school rows
    schoole_list = soup.select('.air_con.f-f0 tr')

    # title = schoole_list[0]
    # title_data = title.text
    # new_title = title_data.split()
    # new_data = (' '.join(new_title))
    # print(new_data)

    # .text reads all the text under a tag. The HTML contains a lot of
    # whitespace, so split on it (which gives a list), join the pieces with
    # single spaces, and write each row to the txt file.
    # The with block closes the file even if an error occurs.
    with open('./高校.txt', 'w', encoding='utf-8') as fp:
        for li in schoole_list:
            detail = li.text
            school_detail = ' '.join(detail.split())
            print(school_detail + ' scraped successfully!')
            fp.write(school_detail + '\n')

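Pitfalls I ran into:

Before scraping a page, inspect it first.
Pay attention to text encoding conversion.
Understand the parsing methods: know what each one does and how they differ.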
