Chinese University Ranking Crawler (Based on Python's requests and BeautifulSoup Libraries)

A Python crawler for Chinese university rankings

First, here is the URL of the Zuihaodaxue (最好大學網) ranking page: http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html

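Feature description

Input: the URL of the university ranking page
Output: the ranking information printed to the screen (rank, university name, total score)
Technical approach: requests + bs4
Targeted crawler: it fetches only the given URL and does not follow links to other pages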

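Feasibility of a targeted crawler

On the Zuihaodaxue page, right-click and choose "View source", then press Ctrl+F and search for Tsinghua University (清華大學). This locates the block of markup that corresponds to each university.
Each university's information is wrapped in a tr tag, and the rank, name, and total score all appear in the page source itself, so a targeted crawler is feasible.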
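
To make the row structure concrete, here is a minimal sketch of pulling the three fields out of each tr row. The HTML fragment, university names, and scores are hypothetical stand-ins, and the column order (rank, name, province, score) is only inferred from the indices used later in this post; the live page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup and values may differ.
SAMPLE = """
<table><tbody>
<tr><td>1</td><td>University A</td><td>Province X</td><td>90.0</td></tr>
<tr><td>2</td><td>University B</td><td>Province Y</td><td>80.0</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    # In this sample, columns 0, 1 and 3 hold rank, name and total score.
    rows.append(tuple(str(tds[i].string) for i in (0, 1, 3)))

print(rows)
```

Because every field sits in its own td cell, indexing the cells of a row is enough and no pattern matching on raw text is needed.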

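You should also check whether the site publishes a robots.txt policy:
You can check manually at http://www.zuihaodaxue.cn/robots.txt. The file does not exist, which means the site states no restrictions for crawlers.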
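
The same check can be done in code. The sketch below uses the standard library's urllib.robotparser and feeds it rule lines directly so that it runs without network access. The helper name allowed_by_rules, the example.com URLs, and the rules themselves are hypothetical and only illustrate the idea.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_rules(robots_lines, url, agent="*"):
    """Return True if `url` may be fetched under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical rules, for illustration only.
rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_rules(rules, "http://example.com/index.html"))
print(allowed_by_rules(rules, "http://example.com/private/a.html"))

# An empty rule set places no restrictions on any path.
print(allowed_by_rules([], "http://example.com/anything.html"))
```

Against a live site you would normally call set_url() and read() on the parser instead of passing lines by hand.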

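Program structure

Step 1: fetch the ranking page from the network: getHTMLText()
Step 2: extract the information from the page into a suitable data structure: fillUnivList()
Step 3: use that data structure to format and print the results: printUnivList()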

On to the code:

Import the required libraries:

import requests
from bs4 import BeautifulSoup
import bs4

Write the function `getHTMLText()`:

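def getHTMLText(url):
    """Fetch url and return the page text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""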

Write the function `fillUnivList()`:

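def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        # .children also yields text nodes (e.g. line breaks); keep only tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            # columns 0, 1 and 3: rank, university name, total score
            ulist.append([tds[0].string, tds[1].string, tds[3].string])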
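
The isinstance check matters because .children yields text nodes as well as tags. The following sketch, on a hypothetical fragment with line breaks between rows, shows the difference between the raw children and the filtered tags.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical fragment with line breaks between rows, as in typical HTML source.
html = "<table><tbody>\n<tr><td>1</td></tr>\n<tr><td>2</td></tr>\n</tbody></table>"
tbody = BeautifulSoup(html, "html.parser").find("tbody")

children = list(tbody.children)
tags = [c for c in children if isinstance(c, bs4.element.Tag)]

print(len(children))  # counts whitespace text nodes as well as <tr> tags
print(len(tags))      # counts only the <tr> tags
```

Without the filter, calling tr('td') on a text node would fail, since a NavigableString is not callable the way a Tag is.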

Write the function `printUnivList()`:

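def printUnivList(ulist, num):
    # {3} is the fill character for the name column: chr(12288), the full-width space
    tplt = "{0:^10}\t{1:{3}^8}\t{2:^10}"
    # header: rank, university name, total score (same template as the rows)
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for i in range(min(num, len(ulist))):  # avoid IndexError on a short list
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))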

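Why Chinese text misaligns, and how to fix it

① Cause: when a string is shorter than the field width, format() pads it with the ASCII space by default. That space is narrower than a Chinese character, so columns containing Chinese text drift out of alignment.
② Solution:
Pass chr(12288), the full-width (ideographic) space, as the fill character in the format specification.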
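
A minimal sketch of the difference, using the same nested-field syntax as the template in printUnivList(). The variable names are chosen here for illustration only.

```python
ASCII_SPACE = " "
FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE

name = "清華大學"

# Centre the name in a field of 8 characters with each fill character.
padded_ascii = "{0:{1}^8}".format(name, ASCII_SPACE)
padded_full = "{0:{1}^8}".format(name, FULLWIDTH_SPACE)

# Brackets make the padding visible when printed in a terminal.
print("[" + padded_ascii + "]")
print("[" + padded_full + "]")
```

Both strings contain 8 characters, but only the second is padded with characters as wide as the Chinese text, so it occupies a predictable width on screen.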

Output: [screenshot omitted]

Complete program:

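import requests
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    """Fetch url and return the page text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        # .children also yields text nodes (e.g. line breaks); keep only tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            # columns 0, 1 and 3: rank, university name, total score
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    # {3} is the fill character for the name column: chr(12288), the full-width space
    tplt = "{0:^10}\t{1:{3}^8}\t{2:^10}"
    # header: rank, university name, total score (same template as the rows)
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for i in range(min(num, len(ulist))):  # avoid IndexError on a short list
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # print the first 20 universities


if __name__ == "__main__":
    main()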

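Note: this post is a summary I wrote after taking Associate Professor Song Tian's MOOC course on Python web crawling and information extraction.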
