Python Basics: Web Scraping Examples for Novels and Images

Original

岩枭

2020-02-22 17:38

1. Scraping a novel from a web page with Python's BeautifulSoup

Source page: http://www.jueshitangmen.info/tian-meng-bing-can-11.html

# Scraper

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the page and decode it as UTF-8
html = urlopen('http://www.jueshitangmen.info/tian-meng-bing-can-11.html').read().decode('utf-8')
# print(html)
soup = BeautifulSoup(html, features='lxml')

# Collect every <p> tag and append its text to flie.txt
all_p = soup.find_all('p')
# print(all_p)
with open('flie.txt', 'a', encoding='utf-8') as f:
    for p in all_p:
        print('\n', p.get_text())
        f.write('\n' + p.get_text())

# Collect every <a> tag and append its text to flie_a.txt
all_a = soup.find_all('a')
with open('flie_a.txt', 'a', encoding='utf-8') as f:
    for a in all_a:
        print('\n', a.get_text())
        f.write('\n' + a.get_text())

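Result of running the program: the text of every paragraph and every link is printed to the console and appended to flie.txt and flie_a.txt respectively.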
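The paragraph-extraction step can also be tried offline, without bs4 or lxml installed. The following is only a minimal sketch that swaps in the standard library's `html.parser` for BeautifulSoup; `ParagraphCollector`, `extract_paragraphs`, and the sample HTML are hypothetical names and data introduced for illustration, not part of the original script.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._depth = 0       # greater than 0 while inside a <p>
        self._chunks = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the paragraph
        if self._depth:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the text of all <p> elements found in an HTML string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample page, so no network access is needed
sample = ("<html><body><h1>Title</h1>"
          "<p>First <b>bold</b> line.</p><p>Second line.</p>"
          "</body></html>")
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

For real pages, BeautifulSoup as used above is more forgiving of malformed markup; this version is mainly useful for checking the extraction logic on a fixed string.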

2. Scraping images with Python. Both versions below successfully download images:

Version 1:

import urllib.request
import urllib.parse
import re
import os

# Request headers. Referer is required, otherwise the server returns 403;
# User-Agent makes the request look like it comes from a browser,
# which helps get past anti-scraping checks.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Referer': 'https://image.baidu.com',
}
url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={word}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word={word}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&cg=girl&pn={pageNum}&rn=30&gsm=1e00000000001e&1490169411926="

keyword = input("Enter a search keyword: ")
# URL-encode the keyword (the encoding must be passed by name;
# the second positional argument of quote() is `safe`)
keyword = urllib.parse.quote(keyword, safe="", encoding="utf-8")

# Create the image directory once, before the loop
save_dir = "D://pictest/图片"
os.makedirs(save_dir, exist_ok=True)

# Each result looks like:
# "thumbURL": "https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=1593875716,602632714&fm=27&gp=0.jpg"
p = re.compile(r'thumbURL.*?\.jpg')

n = 0  # value sent as the pn query parameter
j = 0  # number of images saved so far
while n < 3000:
    # pn appears to be a result offset (Version 2 steps by 30 to match rn=30),
    # so stepping by 1 likely fetches overlapping results
    n += 1
    url1 = url.format(word=keyword, pageNum=str(n))
    req = urllib.request.Request(url1, headers=header)
    # Open and read the page; skip only this page on failure
    try:
        html = urllib.request.urlopen(req).read().decode("utf-8")
    except Exception:
        print("Something went wrong!")
        print("-------current page: " + str(n))
        continue
    # Strip the key prefix from each match to get the image URL, then download it
    for i in p.findall(html):
        i = i.replace('thumbURL":"', "")
        print(i)
        urllib.request.urlretrieve(i, save_dir + "/pic{num}.jpg".format(num=j))
        j += 1
    print("Total images downloaded so far: " + str(j))
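The regex step above matches the `thumbURL` key together with the URL and then strips the key with `replace`. As a possible simplification, a capture group can return the URL directly. This is only a sketch: the pattern `THUMB_RE`, the helper `extract_thumb_urls`, and the sample response text are made up for illustration and are not taken from a real Baidu response.

```python
import re

# Capture only the URL; \s* tolerates an optional space around the colon
THUMB_RE = re.compile(r'"thumbURL"\s*:\s*"(.*?\.jpg)"')


def extract_thumb_urls(text):
    """Return every thumbnail URL found in a response body (hypothetical helper)."""
    return THUMB_RE.findall(text)


# Made-up response fragment with and without a space after the colon
sample = ('{"data":[{"thumbURL":"https://example.com/a.jpg","x":1},'
          '{"thumbURL": "https://example.com/b.jpg"},{}]}')
urls = extract_thumb_urls(sample)
print(urls)
```

With a single capture group, `findall` returns the captured strings rather than the whole match, so no `replace` call is needed afterwards.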

Version 2:

import urllib.request
import urllib.parse
import re
import os

# Request headers. Referer is required, otherwise the server returns 403;
# User-Agent makes the request look like it comes from a browser,
# which helps get past anti-scraping checks.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Referer': 'https://image.baidu.com',
}
url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={word}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word={word}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&cg=girl&pn={pageNum}&rn=30&gsm=1e00000000001e&1490169411926="

keyword = input('Enter a search keyword: ')
# URL-encode the keyword (the encoding must be passed by name;
# the second positional argument of quote() is `safe`)
keyword = urllib.parse.quote(keyword, safe='', encoding='utf-8')

# Create the image directory once, before the loop
os.makedirs('D://pic', exist_ok=True)

# Compile the pattern once; findall() returns a list of matches
p = re.compile(r'thumbURL.*?\.jpg')

n = 0  # value sent as the pn query parameter
j = 0  # number of images saved so far

while n < 3000:
    # Step by 30 to match rn=30 in the URL
    n += 30
    url1 = url.format(word=keyword, pageNum=str(n))
    req = urllib.request.Request(url1, headers=header)
    # Open and read the page; skip only this page on failure
    try:
        html = urllib.request.urlopen(req).read().decode('utf-8')
    except Exception:
        print('Something went wrong')
        print('Failed page: ' + str(n))
        continue
    s = p.findall(html)
    # Log every image URL to testPc.txt; the with block closes the file
    with open('testPc.txt', 'a', encoding='utf-8') as f:
        for i in s:
            i = i.replace('thumbURL":"', '')
            print(i)
            f.write(i)
            f.write('\n')
            # Save the image to D://pic
            urllib.request.urlretrieve(i, 'D://pic/pic{num}.jpg'.format(num=j))
            j += 1

print('Total images downloaded: ' + str(j))
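The two versions differ mainly in how the `pn` value advances. The sketch below shows one way to generate page URLs with `urllib.parse.urlencode` instead of a hand-written format string. It assumes that `pn` is a zero-based result offset and that a reduced parameter set is accepted by the endpoint; neither is confirmed by the original post. `build_page_url` and `page_offsets` are hypothetical helpers.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://image.baidu.com/search/acjson"


def build_page_url(keyword, offset, page_size=30):
    """Build one page URL; urlencode percent-encodes the keyword as UTF-8."""
    # Assumption: this subset of the original query parameters is sufficient
    params = {
        "tn": "resultjson_com",
        "ipn": "rj",
        "queryWord": keyword,
        "word": keyword,
        "ie": "utf-8",
        "oe": "utf-8",
        "pn": offset,      # assumed to be a zero-based result offset
        "rn": page_size,   # results per page
    }
    return BASE + "?" + urlencode(params)


def page_offsets(total, page_size=30):
    """Offsets 0, page_size, 2 * page_size, ... below total."""
    return range(0, total, page_size)


urls = [build_page_url("猫", off) for off in page_offsets(90)]
# Parse the second URL back to check the offset that was encoded
second_query = parse_qs(urlsplit(urls[1]).query)
print(second_query["pn"], second_query["word"])
```

Starting the offsets at 0 would include the first page of results, which the `n += 30` loop above skips because it increments before the first request.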

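Result of running the program: each thumbnail URL is printed to the console and the images are saved as pic0.jpg, pic1.jpg, and so on in the target folder.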
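Both versions send the custom headers only with the search request; `urllib.request.urlretrieve` has no headers parameter, so the image downloads go out with the default Python user agent. If an image host rejects those requests, a download helper along the following lines may help. This is a sketch under that assumption: `download` and the shortened `HEADERS` dictionary are hypothetical and not part of the original scripts.

```python
import shutil
import urllib.request

# Shortened, illustrative headers; the full User-Agent from the scripts also works
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://image.baidu.com",
}


def download(url, dest_path, headers=HEADERS, timeout=10):
    """Fetch url with the given headers and stream the body to dest_path."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp, \
            open(dest_path, "wb") as out:
        # Copy in chunks so large images are not held in memory at once
        shutil.copyfileobj(resp, out)
    return dest_path
```

In the loops above, a call such as `download(i, 'D://pic/pic{num}.jpg'.format(num=j))` could stand in for the `urlretrieve` call. The timeout also keeps a stalled connection from blocking the whole run.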
