Web Crawlers (4): Scraping the 2019 Chinese University Rankings with lxml

Original

hyhooo

2020-04-24 11:19

3. Extraction with lxml

3.1 2019 Chinese University Rankings

3.1.1 Goal

Target URL:

http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html

Scrape the 2019 Chinese university ranking data and extract four fields: rank, school name, province, and total score.

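3.1.2 Environment Setup

Open a command prompt (press Win + R and run cmd).
Run pip install lxml to install the lxml library. The code below also uses requests, which can be installed the same way if it is missing.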
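
To confirm that the install worked, a quick check along these lines can be run in Python. It assumes only that lxml imports successfully; LXML_VERSION is the version tuple exposed by lxml.etree.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```

If the import fails with ModuleNotFoundError, pip most likely installed into a different Python than the one running the script.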

3.1.3 Requesting the Page

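import requests


def get_html(url):
    '''
    Fetch the page and return its HTML source, or None on a non-200 response.
    '''
    headers = {
        # Adjacent string literals are joined, so no stray spaces end up in the value
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/75.0.3770.142 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        # Decode the body as UTF-8 so Chinese text is not garbled
        response.encoding = 'utf-8'
        return response.text
    else:
        return None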

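Compared with the earlier code, this version sets one extra attribute, .encoding:

response.encoding = 'utf-8'

This tells requests to decode the HTML source of the response as UTF-8. Without it, requests may pick the wrong encoding, and any Chinese text in the extracted data will show up as garbled characters.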
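
The effect can be reproduced without any network access. The sketch below is an illustration, not part of the original scraper: it encodes a sample school name as UTF-8 bytes and then decodes them two ways. ISO-8859-1 is used as the wrong encoding because requests falls back to it for text responses whose headers declare no charset; whether that is what happens for this particular site is an assumption.

```python
# Sample text (illustrative data) encoded the way a UTF-8 page would send it
raw = "清華大學".encode("utf-8")

# Decoding with the wrong encoding yields garbled text instead of an error
wrong = raw.decode("iso-8859-1")
# Decoding with the right encoding recovers the original text
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

Setting response.encoding = 'utf-8' before reading response.text corresponds to the second decode.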

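3.1.4 Analyzing the Data

Inspecting the page elements in the browser shows the following:

The data we need lives in <tr class="alt">...</tr> tags. Each university's rank, school name, province, and total score correspond to one <tr class="alt">...</tr> tag, so there are as many of these tags as there are universities. Within each row, the individual values sit in <td>...</td> cells. With the structure understood, we can move on to extraction.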
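
As a rough illustration of that structure, the sketch below parses a small hand-written fragment shaped like the rows just described. The fragment and its values are invented for the example and are not copied from the site; the real rows contain more td cells.

```python
from lxml import etree

# Hypothetical fragment imitating two ranking rows (values are invented)
SAMPLE = '''
<table>
  <tr class="alt"><td>1</td><td><div>School A</div></td><td>Province X</td><td>90.0</td></tr>
  <tr class="alt"><td>2</td><td><div>School B</div></td><td>Province Y</td><td>80.0</td></tr>
</table>
'''

tree = etree.HTML(SAMPLE)
rows = tree.xpath('//tr[@class="alt"]')
# One tr element per university; each holds its values in td cells
cells_per_row = [len(row.xpath('./td')) for row in rows]
print(len(rows), cells_per_row)
```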

3.1.5 Extracting the Data

from lxml import etree

# html is the page source returned by get_html(url)
html = etree.HTML(html)
# Select the row tag for every university
ls = html.xpath('//tr[@class="alt"]')
for info in ls:
    # Rank
    rank = info.xpath('./td[1]/text()')[0]
    # School name
    name = info.xpath('./td[2]/div/text()')[0]
    # Province
    province = info.xpath('./td[3]/text()')[0]
    # Total score
    score = info.xpath('./td[4]/text()')[0]
    data = {
        'rank': rank,
        'name': name,
        'province': province,
        'score': score,
    }
    print(data)

Note that the parsing step has changed again, this time to:

html = etree.HTML(html)

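This is how HTML is parsed with lxml.
Next we select the tags that hold each school's information, the rows described above, 549 of them. The xpath method takes the path of the tags within the HTML source: // matches at any depth in the document, so //tr[@class="alt"] selects every tr tag whose class attribute is alt.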

ls = html.xpath('//tr[@class="alt"]')
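
One detail worth knowing: [@class="alt"] compares the whole attribute value, so a row whose class is, say, "alt odd" would not match. The sketch below shows this on an invented fragment, with contains() as a looser alternative. Whether the real page needs the looser form is not something the original states.

```python
from lxml import etree

# Invented rows: one with class exactly "alt", one with an extra class name
SAMPLE = '''
<table>
  <tr class="alt"><td>1</td></tr>
  <tr class="alt odd"><td>2</td></tr>
  <tr class="head"><td>Rank</td></tr>
</table>
'''

tree = etree.HTML(SAMPLE)
# Exact comparison against the full attribute string
exact = tree.xpath('//tr[@class="alt"]')
# Substring test on the attribute string
loose = tree.xpath('//tr[contains(@class, "alt")]')
print(len(exact), len(loose))
```

Keep in mind that contains() is a plain substring test, so it would also match a class such as "alternate".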

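Once the 549 rows have been selected, remember that xpath returns a list, so we iterate over it and extract each field from every row in turn. Inspecting the child nodes of one row shows the following:

Each row contains many td tags, and each value sits in one of td[1] through td[last()] (XPath positions start at 1, not 0), so we write an XPath for each field we need.
The syntax is as follows:
'.' refers to the current node, that is, the row tag for the current loop iteration
'/' steps down the path one level at a time
text() returns the text inside a tag, which is our actual data
Because xpath returns a list, we take the first result with [0]
The expressions for rank, school name, province, and total score are: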

# Rank
rank = info.xpath('./td[1]/text()')[0]
# School name
name = info.xpath('./td[2]/div/text()')[0]
# Province
province = info.xpath('./td[3]/text()')[0]
# Total score
score = info.xpath('./td[4]/text()')[0]
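
Indexing with [0] raises IndexError when a cell is empty or missing. A small helper like the hypothetical first_text below, which is not part of the original code, returns a default instead. The sample row is invented for the example.

```python
from lxml import etree


def first_text(node, path, default=''):
    """Return the first XPath result for path, stripped, or default if none."""
    results = node.xpath(path)
    return results[0].strip() if results else default


# Invented row whose fourth cell is empty
row = etree.HTML(
    '<table><tr class="alt"><td>1</td><td><div>School A</div></td>'
    '<td>Province X</td><td></td></tr></table>'
).xpath('//tr[@class="alt"]')[0]

rank = first_text(row, './td[1]/text()')
name = first_text(row, './td[2]/div/text()')
# The empty cell yields no text node, so the default is used
score = first_text(row, './td[4]/text()', default='N/A')
print(rank, name, score)
```

Whether to substitute a default or skip the row entirely depends on how the data will be used later.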

3.1.6 Scraping Results

The complete code is as follows:

import requests
import time
from lxml import etree


def get_html(url):
    '''
    Fetch the page and return its HTML source, or None on a non-200 response.
    '''
    headers = {
        # Adjacent string literals are joined, so no stray spaces end up in the value
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/75.0.3770.142 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        # Decode the body as UTF-8 so Chinese text is not garbled
        response.encoding = 'utf-8'
        return response.text
    else:
        return None


def get_infos(html):
    '''
    Extract the data.
    '''
    html = etree.HTML(html)
    # Select the row tag for every university
    ls = html.xpath('//tr[@class="alt"]')
    for info in ls:
        # Rank
        rank = info.xpath('./td[1]/text()')[0]
        # School name
        name = info.xpath('./td[2]/div/text()')[0]
        # Province
        province = info.xpath('./td[3]/text()')[0]
        # Total score
        score = info.xpath('./td[4]/text()')[0]
        data = {
            'rank': rank,
            'name': name,
            'province': province,
            'score': score,
        }
        print(data)


def main():
    '''
    Entry point.
    '''
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = get_html(url)
    if html is None:
        # The request failed, so there is nothing to parse
        return
    get_infos(html)
    time.sleep(1)


if __name__ == '__main__':
    main()

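Running the script prints one dictionary per university, each holding the rank, school name, province, and total score.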
