Python crawler: scraping the Weibo super topic 《肺炎患者求助》 (Pneumonia Patient Help)

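A teacher at my school wanted to study the text of the posts in the Weibo super topic 《肺炎患者求助》 (Pneumonia Patient Help). He gave me a link to the PC page and asked me to write a crawler that collects every help request on it. I checked and found 21 pages in total, dated from February 1, 2020 to March 13, 2020. After some inspection I decided to work from the mobile site instead, for two reasons. First, I crawled the Weibo topic 《战疫情》 (Fighting the Epidemic) last month, so I could save a lot of time on analyzing the page structure. Second, the mobile site loads its content via AJAX and the requests return JSON, which is much faster. I won't walk through this project in detail here; if you are interested, read it alongside my write-up of the previous project.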

Previous project: Python crawler for Weibo 《战疫情》 user comments and details

PC page: https://weibo.com/p/1008084882401a015244a2ab18ee43f7772d6f/super_index

Mobile page: https://m.weibo.cn/p/index?containerid=1008084882401a015244a2ab18ee43f7772d6f&luicode=10000011&lfid=100103type%3D1%26q%3D肺炎患者求助超话

Environment: Windows 10, Python 3.7, Jupyter

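Approach:

Open the 《肺炎患者求助》 home page, capture its network traffic, and analyze the structure of the JSON it returns.
The captured JSON gives, for every post, the user, publish time, post ID, repost count, comment count, like count and so on, but not the hidden part of a post that exceeds the displayed length.
A post with hidden content carries a "全文" (full text) marker that expands when clicked. We don't need to click it: joining the post ID obtained above onto the status URL gives the full link to the detail page. Whether the detail page is worth visiting at all can be decided by checking whether "【姓名】" is present, which saves time and makes the program more efficient.
Use regular expressions to pull the details out of the content and store them as: name, age, city, address, time, contact, emergency contact, condition description.
The next API URL differs only in since_id, which can be read from the current API response; changing since_id fetches the next batch of JSON data.
The while True loop has two exit conditions: the first is that no next since_id can be found, the second is that since_id repeats an earlier one.
Write the results to a CSV file.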
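
The two exit conditions of the paging loop can be tried out without touching the network. The following is only a minimal sketch: `crawl_pages`, `fetch_page` and `FAKE_PAGES` are hypothetical names introduced here, and the fake pages merely imitate the `data` / `cards` / `pageInfo` / `since_id` layout that the script below reads.

```python
def crawl_pages(fetch_page):
    """Collect cards page by page until since_id is missing or repeats.

    fetch_page is a hypothetical callable: it takes a since_id (None for
    the first page) and returns the decoded JSON dict of that page.
    """
    seen_ids = set()
    cards = []
    since_id = None
    while True:
        data = fetch_page(since_id).get("data", {})
        cards.extend(data.get("cards", []))
        since_id = data.get("pageInfo", {}).get("since_id")
        # Exit 1: no next since_id. Exit 2: since_id was already seen.
        if not since_id or since_id in seen_ids:
            break
        seen_ids.add(since_id)
    return cards


# Stand-in for the real API: three fake pages, the last repeats since_id 2
FAKE_PAGES = {
    None: {"data": {"cards": ["a"], "pageInfo": {"since_id": 1}}},
    1: {"data": {"cards": ["b"], "pageInfo": {"since_id": 2}}},
    2: {"data": {"cards": ["c"], "pageInfo": {"since_id": 2}}},
}

print(crawl_pages(FAKE_PAGES.get))
```

Reading the cards before testing the exit conditions means the last page is still collected even though it has no usable next since_id.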

import requests, csv, time, re

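startTime = time.time()  # record the start time
csvfile = open('./微博疫情求助+数据匹配.csv', 'a', newline='', encoding = 'utf-8-sig')
writer = csv.writer(csvfile)
# Header columns: post ID, publish time, reposts, comments, likes, content, name, age, city, community, onset time, condition description, contact, other emergency contact
writer.writerow(('文章ID', '发布时间', '转发量', '评论数', '点赞数', '内容','姓名', '年龄', '城市', '小区', '患病时间', '病情描述', '联系方式', '其他紧急联系人'))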

index_url = "https://m.weibo.cn/api/container/getIndex?containerid=1008084882401a015244a2ab18ee43f7772d6f_-_feed&luicode=10000011&lfid=100103type%3D1%26q%3D%E8%82%BA%E7%82%8E%E6%82%A3%E8%80%85%E6%B1%82%E5%8A%A9%E8%B6%85%E8%AF%9D&display=0&retcode=6102"

next_id_list = []

def regular(find_title):
    # Field labels as they appear in the help-request template, in CSV column order
    labels = ('姓名', '年龄', '所在城市', '所在小区、社区', '患病时间', '病情描述', '联系方式', '其他紧急联系人')
    # For each label, capture the text up to the next "<"; a missing field becomes ""
    return tuple(''.join(re.findall('【' + label + '】(.*?)<', find_title)) for label in labels)

base_url = index_url  # first-page URL; since_id is appended to this, not to the previous URL

while True:
    try:
        data = requests.get(url=index_url).json()["data"]
    except (requests.RequestException, ValueError, KeyError):
        break  # request failed or the response was not the expected JSON
    # Extract every post on the current page
    for i in data.get("cards", []):
        for j in i.get("card_group", []):
            mblog = j.get("mblog")
            if not mblog:
                continue  # skip cards that are not posts
            title_id = mblog["id"]  # post ID
            created_time = mblog["created_at"]  # publish time
            sharing = mblog["reposts_count"]  # repost count
            comments_count = mblog["comments_count"]  # comment count
            great = mblog["attitudes_count"]  # like count
            comments_html = mblog["text"]  # content, as HTML
            if ">全文<" in comments_html:
                # Truncated post: open the detail page to get the full text
                all_url = "https://m.weibo.cn/status/" + title_id
                html_text = requests.get(url=all_url).text
                full_text = re.findall('"text": "(.*?)",', html_text)
                if full_text:
                    comments_html = full_text[0]
            comment_text = re.sub('<[^>]*>', '', comments_html)  # strip HTML tags
            row = (title_id, created_time, sharing, comments_count, great, comment_text)
            if "【姓名】" in comments_html:
                row += regular(comments_html)  # append the eight structured fields
            writer.writerow(row)
    # Stop when there is no next since_id or when it repeats an earlier one
    since_id = data.get("pageInfo", {}).get("since_id")
    if not since_id or since_id in next_id_list:
        break
    next_id_list.append(since_id)
    index_url = base_url + "&since_id=" + str(since_id)
    print("Crawling:", since_id)
    
csvfile.close()  # close the file
endTime = time.time()  # record the end time
useTime = (endTime - startTime) / 60
print("Crawling took %s minutes in total" % useTime)
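
To see what the field matching and tag stripping do to a single post, here is a small standalone sketch. The `sample` string is made up for illustration and is not real data from the super topic; `FIELDS`, `strip_tags` and `extract_fields` are hypothetical names, and only four of the eight labels are shown.

```python
import re

# A subset of the labels used in the help-request template
FIELDS = ["姓名", "年龄", "所在城市", "联系方式"]


def strip_tags(html):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]*>", "", html)


def extract_fields(html):
    """Map each label to the text between the label and the next '<'."""
    return {
        label: "".join(re.findall("【" + label + "】(.*?)<", html))
        for label in FIELDS
    }


# Invented sample in the shape of a post body (HTML with <br /> separators)
sample = "【姓名】张三<br />【年龄】65<br />【所在城市】武汉<br />【联系方式】000<br />"

print(extract_fields(sample))
print(strip_tags(sample))
```

Note that the pattern only matches when the value is followed by a "<", so a field that ends the text without a trailing tag comes back as an empty string.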

Jupyter output:

CSV output:
