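Scraping Baidu News

Preface

By scraping the title, link, date, and source of Baidu News results, this post walks through the basic approach to collecting a small amount of data with Python.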

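Fetching the page source of a Baidu News search for "阿里巴巴" (Alibaba)

The request needs a User-Agent header. In Google Chrome, enter about:version in the address bar; the page lists the browser's User-Agent string, which goes into the headers dictionary.

Besides the request headers, we also need to construct the URL.
Search for 阿里巴巴 on Baidu News, copy the URL from the address bar, and strip the parameters that are not needed. That leaves: https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴
With the headers and the URL ready, we can write the basic crawler, hehe.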

import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text
print(res)

Part of the output is shown below:

// 意見反饋
setTimeout(function(){
var s = document.createElement("script");
s.charset="utf-8";
s.src="https://dss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/news/static/protocol/https/global/js/feedback_4acd551.js";
document.body.appendChild(s);
},0);
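
The URL above embeds the Chinese keyword directly. As a minimal sketch, the same URL can be assembled from a dictionary of parameters so that the keyword is percent-encoded explicitly. The helper name build_search_url is hypothetical; the parameter names and values are simply copied from the URL in this post, and their exact meaning is not documented here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://www.baidu.com/s"


def build_search_url(keyword, rtt=1):
    """Return a Baidu News search URL with the keyword percent-encoded.

    Hypothetical helper: the parameters mirror the URL used in this post.
    """
    params = {"tn": "news", "rtt": rtt, "bsst": 1, "cl": 2, "wd": keyword}
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("阿里巴巴")
print(url)

# Round-trip check: decoding the query string gives back the original keyword.
print(parse_qs(urlsplit(url).query)["wd"][0])
```

With requests, passing the same dictionary as requests.get(BASE_URL, params=params, headers=headers) should produce an equivalent request without building the string by hand.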

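Writing regular expressions to extract the news information

With the page source in hand, we need to analyze it before we can extract each item's source and date.

Inspection shows that each item's source and date sit inside <p class="c-author">, so a regular expression anchored on that tag can extract them.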

import requests
import re
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text
p_info = '<p class="c-author">(.*?)</p>'
info = re.findall(p_info, res, re.S)
print(info)

The output still contains stray \n, \t, and &nbsp; sequences.
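
To see why the pattern uses the non-greedy (.*?) together with re.S, here is a small offline sketch. The HTML sample is hypothetical, modelled on the markup described above rather than copied from a real Baidu response.

```python
import re

# Hypothetical sample modelled on the markup described above; not real Baidu output.
sample = '''
<p class="c-author">
    <img src="logo.png">Example Daily&nbsp;&nbsp;
    2020-03-01 10:00
</p>
<p class="c-author">Sample News&nbsp;&nbsp;5 hours ago</p>
'''

p_info = '<p class="c-author">(.*?)</p>'

lazy_dotall = re.findall(p_info, sample, re.S)  # '.' also matches newlines
lazy_plain = re.findall(p_info, sample)         # '.' stops at a newline
greedy_dotall = re.findall('<p class="c-author">(.*)</p>', sample, re.S)

print(len(lazy_dotall))    # both paragraphs are found
print(len(lazy_plain))     # only the single-line paragraph is found
print(len(greedy_dotall))  # one oversized match that swallows both paragraphs
print(repr(lazy_dotall[0]))
```

The last line prints the raw capture, which makes the leftover whitespace, the <img> tag, and the &nbsp; entities visible.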

In the same way, regular expressions can extract each title and link.

import requests
import re
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text
p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
href = re.findall(p_href, res, re.S)
p_title = '<h3 class="c-title">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)
print("Links:", '\n', href)
print("Titles:", '\n', title)
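
The two patterns above can be tried offline first. The snippet below is a sketch on a hypothetical HTML sample (the example.com links and headlines are made up for illustration); it shows what each capture group returns before any cleaning.

```python
import re

# Hypothetical sample; the URLs and headlines are invented for illustration.
sample = '''
<h3 class="c-title">
  <a href="https://example.com/news/1" target="_blank">
    <em>Alibaba</em> reports quarterly results
  </a>
</h3>
<h3 class="c-title">
  <a href="https://example.com/news/2" target="_blank">Cloud unit expands</a>
</h3>
'''

p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
p_title = '<h3 class="c-title">.*?>(.*?)</a>'

href = re.findall(p_href, sample, re.S)
title = re.findall(p_title, sample, re.S)

# zip pairs each link with its title and stops at the shorter list,
# so a missing match cannot raise an IndexError here.
for link, raw_title in zip(href, title):
    print(link, repr(raw_title))
```

The first raw title still carries the <em> tags and surrounding whitespace, which is what the cleaning step has to remove.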

Cleaning the data and printing the output

import requests
import re
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=阿里巴巴'
res = requests.get(url, headers=headers).text
p_info = '<p class="c-author">(.*?)</p>'
info = re.findall(p_info, res, re.S)
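# Clean the source/date strings: strip HTML tags and surrounding whitespace.
# The source and date are still joined by '&nbsp;&nbsp;' here; the complete code splits them.
for i in range(len(info)):
    info[i] = re.sub('<.*?>', '', info[i]).strip()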
p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
href = re.findall(p_href, res, re.S)
p_title = '<h3 class="c-title">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)
# Clean the titles: strip() removes surrounding spaces and newlines; the pattern <.*?> matches any tag, which removes <em></em>
for i in range(len(title)):
    title[i] = title[i].strip()
    title[i] = re.sub('<.*?>', '', title[i])
print("Sources and dates:", '\n', info)
print("Links:", '\n', href)
print("Titles:", '\n', title)

Known defect of this version: the source/date strings are not fully cleaned yet.
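
One way to finish the cleaning is to strip the tags, split on the '&nbsp;&nbsp;' separator, and trim each part. The helper below is a sketch with a hypothetical name (clean_info) and a made-up input string; it assumes the source comes first and the date second, as in the complete code later in this post.

```python
import re


def clean_info(raw):
    """Split one raw c-author string into (source, date). Hypothetical helper."""
    text = re.sub('<.*?>', '', raw)     # drop HTML tags such as <img ...>
    parts = text.split('&nbsp;&nbsp;')  # assumed separator between source and date
    source = parts[0].strip()
    # Tolerate an entry without a date instead of raising IndexError.
    date = parts[1].strip() if len(parts) > 1 else ''
    return source, date


# Made-up input that imitates the raw matches: tag, entities, tabs, and newlines.
raw = '\n  <img src="logo.png">Example Daily&nbsp;&nbsp;\n\t\t2020-03-01 10:00\n  '
print(clean_info(raw))
```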

Complete working code

First, the complete code for scraping the titles, dates, and links of news about Alibaba:

# 1. Scrape several result pages for one company
import requests
import re
def baidu(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    num = (page - 1) * 10  # offset of the first result on this page (10 results per page)
    url = 'https://www.baidu.com/s?tn=news&rtt=4&bsst=1&cl=2&wd=阿里巴巴&pn=' + str(num)
    res = requests.get(url, headers=headers).text
    p_info = '<p class="c-author">(.*?)</p>'
    p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
    p_title = '<h3 class="c-title">.*?>(.*?)</a>'
    info = re.findall(p_info, res, re.S)
    href = re.findall(p_href, res, re.S)
    title = re.findall(p_title, res, re.S)
    source = []  # two empty lists to hold the source and date after splitting
    date = []
    for i in range(len(info)):
        title[i] = title[i].strip()
        title[i] = re.sub('<.*?>', '', title[i])
        info[i] = re.sub('<.*?>', '', info[i])
        source.append(info[i].split('&nbsp;&nbsp;')[0])
        date.append(info[i].split('&nbsp;&nbsp;')[1])
        source[i] = source[i].strip()
        date[i] = date[i].strip()
        print(str(i + 1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')
        print(href[i])
for i in range(10):  # i starts at 0, so the page number is i + 1
    baidu(i + 1)
    print('Page ' + str(i + 1) + ' scraped successfully')
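
The paging logic can be checked without any network access. The sketch below assumes the result offset is passed in a query parameter named pn and that each page holds 10 results, as the num = (page - 1) * 10 line implies; the helper name page_url is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def page_url(keyword, page, per_page=10):
    """Build the URL of one result page. Hypothetical helper.

    Assumes the offset parameter is called 'pn' and counts results, not pages.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {"tn": "news", "rtt": 4, "bsst": 1, "cl": 2,
              "wd": keyword, "pn": (page - 1) * per_page}
    return "https://www.baidu.com/s?" + urlencode(params)


for page in range(1, 4):
    print(page, page_url("阿里巴巴", page))
```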

Next, the code for scraping the titles, dates, and links of news about several companies:

import requests
import re
def baidu(company):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=' + company
    res = requests.get(url, headers=headers).text
    p_info = '<p class="c-author">(.*?)</p>'
    p_href = '<h3 class="c-title">.*?<a href="(.*?)"'
    p_title = '<h3 class="c-title">.*?>(.*?)</a>'
    info = re.findall(p_info, res, re.S)
    href = re.findall(p_href, res, re.S)
    title = re.findall(p_title, res, re.S)
    source = []  # two empty lists to hold the source and date after splitting
    date = []
    for i in range(len(info)):
        title[i] = title[i].strip()
        title[i] = re.sub('<.*?>', '', title[i])
        info[i] = re.sub('<.*?>', '', info[i])
        source.append(info[i].split('&nbsp;&nbsp;')[0])
        date.append(info[i].split('&nbsp;&nbsp;')[1])
        source[i] = source[i].strip()
        date[i] = date[i].strip()
        print(str(i + 1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')
        print(href[i])
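companys = ['華能信託', '阿里巴巴', '萬科集團', '百度', '騰訊', '京東']
while True:  # scrape around the clock, without stopping
    for i in companys:
        try:
            baidu(i)
            print(i + ': Baidu News scraped successfully')
        except Exception:  # unlike a bare except, this lets Ctrl+C stop the loop
            print(i + ': Baidu News scrape failed')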
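
A loop that never pauses sends requests as fast as the machine allows. As a sketch of a gentler variant, the function below runs a fixed number of rounds, waits between them, and records which companies failed. All names here (crawl_rounds, fake_fetch, pause_seconds) are hypothetical, and the one-hour default pause is an arbitrary assumption, not a value from this post.

```python
import time


def crawl_rounds(companies, fetch, rounds=1, pause_seconds=3600, sleep=time.sleep):
    """Call fetch(company) for every company, `rounds` times, pausing between rounds."""
    failures = []
    for round_no in range(rounds):
        for company in companies:
            try:
                fetch(company)
                print(company + ': scraped successfully')
            except Exception as exc:  # KeyboardInterrupt is not caught here
                failures.append((company, repr(exc)))
                print(company + ': scrape failed (' + repr(exc) + ')')
        if round_no < rounds - 1:
            sleep(pause_seconds)  # wait before starting the next round
    return failures


# Stand-in for the real scraper so the demo needs no network access.
def fake_fetch(company):
    if company == '百度':
        raise IndexError('list index out of range')


failed = crawl_rounds(['阿里巴巴', '百度'], fake_fetch, rounds=2, sleep=lambda s: None)
print(failed)
```

To use it with the scraper above, pass baidu as the fetch argument and keep the default sleep.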

貪心的萌萌
