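I. Requirements
Write a program that does the following:
Prompt the user for the Baidu Tieba topic to crawl
Prompt the user for the start page and end page to crawl
Save the web page for each requested page number to a specified directory on the local disk
Crawling and saving must each be implemented as a user-defined function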
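II. Problem analysis: finding the pattern
Take a Baidu Tieba search for "乔丹" (Jordan) as an example
Comparing the two screenshots reveals some patterns
1. The URL prefix of a Baidu Tieba keyword search page is: https://tieba.baidu.com/f?
2. The search keyword is passed as the parameter kw=key_word
3. The page offset parameter pn differs by 100-50=50 between consecutive pages (pn=0 is the first page, pn=50 the second, pn=100 the third, and so on)
With these patterns in hand, we can start writing code to crawl the required pages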
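The three rules above can be sketched as a small helper that builds the URL for a given page. The name `build_tieba_url` and the constants `BASE_URL` and `PAGE_STEP` are hypothetical and introduced here only for illustration; the sketch assumes, as the source code below does, that page 1 corresponds to pn=0.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f?"  # URL prefix from rule 1
PAGE_STEP = 50  # pn grows by 50 per page (rule 3)


def build_tieba_url(key_word, page):
    """Return the list-page URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * PAGE_STEP  # page 1 -> 0, page 2 -> 50, ...
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BASE_URL + urlencode({"kw": key_word, "pn": offset})


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(build_tieba_url("乔丹", page))
```

In the actual script the same effect is achieved by passing `params` to `requests.get`, which performs the encoding itself.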
III. Python crawler source code
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 24 22:21:12 2019
@author: UnderMask
"""
# Crawl the HTML of the result pages for a given Baidu Tieba search topic
import datetime
import requests
import random  # used to pick a random User-Agent for each request
ualist = [  # a list of usable browser User-Agent strings
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" ]
def save_txt(filePath, content):  # write the page content to the file at filePath
    with open(filePath, "w+", encoding='utf-8') as file_write:
        file_write.write(content)
def climb(theme, start_page, end_page):  # takes the search topic, the start page and the end page
    url = "https://tieba.baidu.com/f?"  # Baidu Tieba URL prefix, before the search topic is added
    for i in range(start_page, end_page+1):  # crawl from the start page through the end page
        ua = random.choice(ualist)  # pick a random User-Agent from the list above to mimic a real browser
        headers = {"Connection":"Keep-alive", "User-Agent":ua}  # request headers
        offset = (i-1)*50  # pn advances by 50 per page (page 1 is pn=0, page 2 is pn=50, page 3 is pn=100...)
        response = requests.get(url, params={"kw":theme, "pn":str(offset)}, headers=headers)  # pass the search topic and the pn offset as query parameters
        response.encoding = "utf-8"  # force UTF-8 decoding to avoid garbled text
        savePath = "C:\\Users\\UnderMask\\Desktop\\新建文件夹\\"+"BaiDuTieBa&&theme="+str(theme)+"&&pageNum="+str(i)+"&&datetime="+datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')+".txt"
        save_txt(savePath, response.text)
if __name__ == "__main__":
    theme = input("Enter the Baidu Tieba topic to crawl: ")
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    start_time = datetime.datetime.now()
    print("[" + start_time.strftime('%Y-%m-%d %H:%M:%S') + "]>>>" + "Crawl started!")
    climb(theme, start_page, end_page)
    end_time = datetime.datetime.now()
    print("[" + end_time.strftime('%Y-%m-%d %H:%M:%S') + "]>>>" + "Crawl finished!")
    print("Total time:", (end_time-start_time))
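The save path must be changed to suit your own machine (mine is an absolute path). The file naming scheme is improvised, and each page is stored as its own .txt file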
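Since the save path has to be adapted for each machine, one option is to build it from a configurable directory rather than a hard-coded absolute path. The following is only a sketch: the helper `build_save_path` and the directory name `tieba_pages` are hypothetical and not part of the original script, and the file name layout is a simplified variant of the one used above.

```python
import datetime
import re
from pathlib import Path


def build_save_path(out_dir, theme, page_num, now=None):
    """Return a per-page .txt path inside out_dir (hypothetical helper)."""
    now = now or datetime.datetime.now()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    # Replace characters that Windows does not allow in file names
    safe_theme = re.sub(r'[\\/:*?"<>|]', "_", str(theme))
    stamp = now.strftime("%Y-%m-%d %H-%M-%S")
    return out_dir / f"BaiDuTieBa_theme={safe_theme}_page={page_num}_{stamp}.txt"


if __name__ == "__main__":
    # "tieba_pages" is an assumed folder name, created under the current directory
    print(build_save_path("tieba_pages", "乔丹", 1))
```

The returned `Path` can be passed straight to `save_txt`, because `open` accepts path objects.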
IV. Results
1. Console
2. Output folder
3. Example: the saved .txt file for the first page