Python Serial Notes (10): Introductory Web Crawler Exercises

Original

2020-06-16 08:32

1. Crawling the URLs in a Web Page

import urllib.request
import re

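# Step 1: choose the URL to crawl
url = "http://www.baidu.com/"

# Step 2: send the request and get the response
response = urllib.request.urlopen(url)

# Step 3: read the response body with response.read() and decode it
html = response.read().decode("utf-8")

# Step 4: print the page source
print(html)

# Extract every double-quoted http:// URL from the page source
links = re.findall(r'"(http://[^"]+)"', html)
for link in links:
    print(link)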
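The extraction step can be tried without any network access. The following is a minimal sketch, assuming a small hand-written HTML string in place of the downloaded page; `SAMPLE_HTML` and `extract_links` are hypothetical names introduced only for illustration, and the pattern is widened to `https?://` so that HTTPS links are matched too.

```python
import re

# Hypothetical sample markup standing in for a downloaded page
SAMPLE_HTML = '''
<a href="http://example.com/a">A</a>
<a href="https://example.com/b">B</a>
<img src="http://example.com/logo.png">
<a href="http://example.com/a">A again</a>
<a href="/relative/path">not matched</a>
'''

def extract_links(html):
    """Return double-quoted absolute http/https URLs, in order, without duplicates."""
    found = re.findall(r'"(https?://[^"]+)"', html)
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(found))

for link in extract_links(SAMPLE_HTML):
    print(link)
```

A regular expression like this only sees URLs written as absolute, double-quoted strings; relative links and single-quoted attributes are skipped, so an HTML parser would probably be a better fit for anything beyond practice.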

2. Finding the User-Agent Value and Decoding the Response

import urllib.request

url = "http://www.baidu.com/"

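# The headers value can be found in your own browser. In Google Chrome, for example,
# press F12, click Network, click any entry under Name, and the User-Agent value
# is shown under Headers.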
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}


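# 1. Build the request object
request = urllib.request.Request(url,headers = headers)

# 2. Get the response object
response = urllib.request.urlopen(request)

# 3. Read the response body and decode it
html = response.read().decode("utf-8")

# urllib stores header names capitalized, so the key to look up is "User-agent"
print(request.get_header("User-agent"))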
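The lookup key in the last line is spelled `"User-agent"` even though the dictionary used `"User-Agent"`. A short sketch of why, assuming a shortened, made-up User-Agent string: `urllib.request.Request` stores each header name in capitalized form, so only that spelling is found. Building a `Request` does not send anything, so this runs offline.

```python
import urllib.request

# Hypothetical shortened User-Agent string; any value behaves the same way
headers = {"User-Agent": "Mozilla/5.0 (example)"}
request = urllib.request.Request("http://www.baidu.com/", headers=headers)

# The header name is stored as "User-agent"
print(request.get_header("User-agent"))   # the stored value
print(request.get_header("User-Agent"))   # None, this spelling is not stored
print(request.has_header("User-agent"))
print(request.header_items())
```

This only affects how the header is looked up on the `Request` object in Python code; HTTP header names are case-insensitive on the wire.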

3. Encoding Search Queries for the Crawler

"""
    https://www.baidu.com/s?wd=圖片
    https://www.baidu.com/s?wd=三峽

    Comparing the two URLs above:
        the prefix https://www.baidu.com/s?wd= never changes; only the value of wd changes
"""
import urllib.request
import urllib.parse
#*********************************************************************************
# Approach 1: encode a single value with quote()
url = "https://www.baidu.com/s?wd="
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
# Encode the keyword and build the URL
key = input("Enter the search keyword: ")
# Percent-encode the keyword with quote()
key = urllib.parse.quote(key)
urls = url + key
# Build the request object
request = urllib.request.Request(urls,headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
print(html)
#*********************************************************************************
# Approach 2: encode a dict of parameters with urlencode()
url = "https://www.baidu.com/s?"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
# Encode the parameters and build the URL
key = input("Enter the search keyword: ")
# urlencode() turns the dict into a percent-encoded query string
key = {'wd':key,'pn':2}
key = urllib.parse.urlencode(key)
urls = url + key
print(urls)
# Build the request object
request = urllib.request.Request(urls,headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
print(html)
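To see what the two encoding approaches produce, here is a small offline sketch. It assumes the keyword 圖片 from the example URLs instead of reading user input, and it sends no request; `url_1` and `url_2` are illustrative names.

```python
from urllib.parse import quote, unquote, urlencode, parse_qs

keyword = "圖片"  # example keyword taken from the URLs analysed above

# Approach 1: percent-encode one value and append it to a fixed prefix
encoded = quote(keyword)
url_1 = "https://www.baidu.com/s?wd=" + encoded

# Approach 2: encode a whole dict of parameters in one call
query = urlencode({"wd": keyword, "pn": 2})
url_2 = "https://www.baidu.com/s?" + query

print(url_1)  # https://www.baidu.com/s?wd=%E5%9C%96%E7%89%87
print(url_2)  # https://www.baidu.com/s?wd=%E5%9C%96%E7%89%87&pn=2

# Both encodings are reversible
print(unquote(encoded) == keyword)
print(parse_qs(query))
```

`quote()` handles a single value, while `urlencode()` also inserts the `=` and `&` separators, which tends to be less error-prone once there is more than one parameter.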

4. Fetching and Saving Baidu Tieba Pages

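"""
    Fetching data from Baidu Tieba
        1. The user enters the name of the Tieba forum
        2. The user chooses the range of pages
        3. Each page is saved to an .html file

    Steps:
        1. Find the URL pattern (build the URL)
            Page 1: http://tieba.baidu.com/f?kw=<forum name>&pn=0
            Page 2: http://tieba.baidu.com/f?kw=<forum name>&pn=50
            Page 3: http://tieba.baidu.com/f?kw=<forum name>&pn=100
            Page n: http://tieba.baidu.com/f?kw=<forum name>&pn=50*(n-1)

        2. Get the response content
        3. Save it locally or to a database

"""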
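import os
import urllib.request
import urllib.parse
"""
    The function-based version follows
"""
#******************************************************************************
# Send the request and return the decoded response body
def zhixing(urls):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
    request = urllib.request.Request(urls, headers=headers)
    response = urllib.request.urlopen(request)
    html = response.read().decode("utf-8")
    return html

# Save one page of HTML to ./test_file/mynote(<page number>).html
def save(i, html):
    try:
        # Create the output directory if it does not exist yet
        os.makedirs("./test_file", exist_ok=True)
        # The with statement closes the file automatically, even if writing fails
        with open("./test_file/mynote(%d).html" % i, 'w', encoding="utf-8") as f:
            f.write(html)
            print("Page %d saved successfully!" % i)
    except Exception as e:
        print("Failed to save the file:", e)

def main():
    url = "http://tieba.baidu.com/f?"
    # Read the forum name and the page range
    str1 = input("Enter the name of the Tieba forum to search: ")
    # int() replaces eval(), which would execute whatever the user types
    p1 = int(input("Enter the first page to fetch: "))
    p2 = int(input("Enter the last page to fetch: "))
    # Build each page URL by encoding the query parameters with urlencode()
    for i in range(p1, p2 + 1):
        key = {'kw': str1, 'pn': 50 * (i - 1)}
        key = urllib.parse.urlencode(key)
        urls = url + key
        html = zhixing(urls)
        save(i, html)

if __name__=="__main__":
    main()
#******************************************************************************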
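The page-to-URL rule, pn = 50*(n-1), can be checked on its own before any request is sent. Below is a minimal sketch of that step only; `page_url`, `page_urls`, `BASE_URL`, and `POSTS_PER_PAGE` are hypothetical names chosen for illustration, and the forum name "python" is just an example value.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f?"
POSTS_PER_PAGE = 50  # step of the pn parameter, taken from the URL pattern above

def page_url(forum_name, page):
    """Build the URL for a 1-based page number of the given forum."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {"kw": forum_name, "pn": POSTS_PER_PAGE * (page - 1)}
    return BASE_URL + urlencode(params)

def page_urls(forum_name, first, last):
    """Return (page, url) pairs for pages first..last, inclusive."""
    return [(n, page_url(forum_name, n)) for n in range(first, last + 1)]

for n, u in page_urls("python", 1, 3):
    print(n, u)
```

Keeping the URL construction in a pure function like this makes the arithmetic easy to verify separately from downloading and saving.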
