Hands-On Python Scraping with the urllib Module: Fetching Chinese Academy of Engineering Member Profiles and Saving Them as txt

Original

扒皮狼

2022-12-09 23:44

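Preface

Today I use Python to scrape the profiles of the members of the Chinese Academy of Engineering. The code is shared here for anyone who needs it, along with a few small tips.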

Development Tools

Python version: 3.8

Related modules:

urllib module

re module

os module

time module

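Environment Setup

Install Python and add it to the PATH environment variable. All of the modules used here ship with the Python standard library, so nothing extra needs to be installed with pip.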
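As a quick sanity check of the setup, the sketch below confirms that the interpreter is recent enough and that the modules the script relies on can be imported. The helper name `check_environment` and the `REQUIRED` list are assumptions made for illustration and are not part of the original script.

```python
import importlib
import sys

# Modules the scraper imports; all of them are in the standard library.
REQUIRED = ["urllib.request", "re", "os", "time"]

def check_environment(min_version=(3, 8)):
    """Return (version_ok, missing_modules) for the current interpreter."""
    missing = []
    for mod in REQUIRED:
        try:
            importlib.import_module(mod)
        except ImportError:
            missing.append(mod)
    return sys.version_info >= min_version, missing

version_ok, missing = check_environment()
print("Python version OK:", version_ok)
print("Missing modules:", missing)
```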

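Fetching the Page

Open the official website of the Chinese Academy of Engineering and go to the full list of members. Press F12, or right-click and choose to view the page source, then look through the markup to find the repeating pattern that the regular expressions will be built on later (Google Chrome is recommended).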
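To make the analysis step concrete, here is a minimal sketch that runs a link-extraction pattern against a small hand-written HTML fragment. The fragment, the paths and names in it, and the helper name `extract_links` are all hypothetical and only imitate the structure that the pattern expects; the real markup should be checked in the browser before relying on it.

```python
import re

# Hypothetical fragment imitating the structure of the member list page.
SAMPLE_HTML = '''
<ul>
<li class="name_list"><a href="/cae/html/main/colys/00000001.html" target="_blank">Zhang San</a></li>
<li class="name_list"><a href="/cae/html/main/colys/00000002.html" target="_blank">Li Si</a></li>
</ul>
'''

# Group 1 captures the relative link, group 2 the displayed name.
# The non-greedy (.+?) stops at the first closing quote or tag.
LINK_PATTERN = r'<li class="name_list"><a href="(.+?)" target="_blank">(.+?)</a></li>'

def extract_links(html):
    """Return a list of (relative_url, name) tuples, one per list entry."""
    return re.findall(LINK_PATTERN, html)

for url, name in extract_links(SAMPLE_HTML):
    print(name, url)
```

If the list comes back empty on the real page, the markup has probably changed and the pattern needs to be adjusted to match what the browser shows.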

Complete Code

import re
import os
import time
from urllib.parse import urljoin
from urllib.request import urlopen

# Directory where one txt file per member is saved
dstDir = 'YuanShi'
os.makedirs(dstDir, exist_ok=True)

startUrl = 'http://www.cae.cn/cae/html/main/col48/column_48_1.html'
with urlopen(startUrl) as fp:
    content = fp.read().decode()

# Extract every member's link and name, then visit each link in turn
pattern = r'<li class="name_list"><a href="(.+?)" target="_blank">(.+?)</a></li>'
result = re.findall(pattern, content)
for item in result:
    perUrl, name = item
    # Print the link to confirm it was extracted
    print(perUrl)
    # Refinement made after the first trial run: strip <h3> tags from the name
    name = name.replace('<h3>', '').replace('</h3>', '')
    name = os.path.join(dstDir, name)
    # Build the absolute URL; urljoin avoids a doubled slash
    perUrl = urljoin('http://www.cae.cn/', perUrl)
    with urlopen(perUrl) as fp:
        content = fp.read().decode()

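    # Extract the profile paragraphs
    pattern = r'<p>(.+?)</p>'
    result = re.findall(pattern, content)  # list of every string matched by the pattern
    if result:
        # Non-greedy <a.+?</a> keeps the text between two links on the same line
        intro = re.sub(r'<a.+?</a>|&ensp;|&nbsp;', '', '\n'.join(result))
        with open(name + '.txt', 'w', encoding='utf8') as fp:
            fp.write(intro)
    # Pause briefly between requests to go easy on the server
    time.sleep(1)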
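The cleanup at the end of the script can be tried on its own without touching the network. The sketch below is only an illustration: the helper name `clean_intro` and the sample strings are hypothetical, and it uses a non-greedy `<a.+?</a>` so that text sitting between two links on the same line is not swallowed.

```python
import re

def clean_intro(paragraphs):
    """Join paragraph strings and strip anchor tags and &ensp;/&nbsp; entities."""
    text = '\n'.join(paragraphs)
    return re.sub(r'<a.+?</a>|&ensp;|&nbsp;', '', text)

# Hypothetical paragraph contents, shaped like what <p>(.+?)</p> would capture.
sample = [
    '&ensp;&ensp;First paragraph.<a href="#">more</a>',
    '<a href="#x">x</a>kept text<a href="#y">y</a>&nbsp;',
]
print(clean_intro(sample))
```

With a greedy `<a.+</a>`, the second sample line would lose the words between the two links, which is why the non-greedy form is used here.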

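Closing

That is all for today. If you are interested, give it a try yourself.

If you have questions about this article, or any other questions about Python, leave a comment or send me a private message.

If you found this article useful, feel free to follow me or give it a like (/≧▽≦)/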
