# "Life is short, I use Python" (人生苦短, 我用Python)
# NOTE: this line was previously a bare expression of two undefined
# identifiers, which raised NameError as soon as the script ran.
# 數據挖掘
import requests
# 數據清洗
from lxml import etree
# 其他
import random
import time
''' 分析翻頁規律
https://www.telnote.cn/xiaohua/baoxiao/list_1.htm
https://www.telnote.cn/xiaohua/baoxiao/list_2.htm
https://www.telnote.cn/xiaohua/baoxiao/list_3.htm
# <meta name="description" content="笑話內容.......">
'''
# Base URL of the "hilarious jokes" listing; the page number is appended
# to form list_1.htm, list_2.htm, ... (see the pagination note above).
url = "https://www.telnote.cn/xiaohua/baoxiao/list_"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

# Running count of downloaded jokes.  Renamed from `sum`, which shadowed
# the builtin of the same name.
joke_count = 0

# Open the output file ONCE and let the context manager close it, instead
# of re-opening (and leaking) a file handle for every single joke.  An
# explicit encoding avoids UnicodeEncodeError on a non-GBK locale.
with open(r"E:\測試python效果\爬取的笑話.txt", "a", encoding="utf-8") as out_file:
    for page in range(1, 4):  # crawl listing pages 1-3
        list_url = url + str(page) + ".htm"
        # The site serves GBK-encoded HTML; decode defensively so a stray
        # byte cannot abort the whole crawl.
        resp = requests.get(list_url, headers=headers)
        list_tree = etree.HTML(resp.content.decode("gbk", errors="replace"))
        # Each joke link sits in <dd class="content"><h1><a href="...">.
        anchors = list_tree.xpath('//dd[@class="content"]/h1/a')
        # Distinct loop variable: the original reused `i` here, shadowing
        # the outer page index.
        for anchor in anchors:
            detail_url = "https://www.telnote.cn" + anchor.get("href")
            resp = requests.get(detail_url, headers=headers)
            detail_tree = etree.HTML(resp.content.decode("gbk", errors="replace"))
            # The joke text is carried in <meta name="description" content="...">.
            for meta in detail_tree.xpath('//meta[@name="description"]'):
                text = meta.get("content")
                if not text:
                    # A meta tag without content would have crashed the
                    # original `data + "\n"` concatenation with TypeError.
                    continue
                joke_count += 1
                print("正在下載第", joke_count, "個笑話")
                out_file.write(text + "\n")
            # Polite crawl delay — `random` and `time` were imported for
            # this but never actually used in the original.
            time.sleep(random.uniform(0.5, 1.5))