Crawler Perks, Part 2: Bulk-Downloading Images from mzitu (妹子圖網)

Reposted from: https://blog.csdn.net/PY0312/article/details/101087356?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task

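After reading this article you will, I hope, find crawlers interesting enough to want to learn how to write them. Good luck with your studies!

Target site: 妹子圖網 (mzitu.com)

Environment: Python 3.x

Third-party modules: requests, beautifulsoup4

Note: to test the script, you only need to set the variable path in the code to the directory where images should be saved on your system, then run it with python xxx.py or from an IDE.

The complete source: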
# -*- coding: utf-8 -*-
import os

import requests
from bs4 import BeautifulSoup

all_url = 'https://www.mzitu.com'

# HTTP request headers used for HTML pages
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# Headers used for image requests; this Referer gets past the hotlink protection
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

# Request the mzitu home page (all_url) and keep the returned HTML for parsing
start_html = requests.get(all_url, headers=Hostreferer)

# Save location on Linux
# path = '/home/Nick/Desktop/mzitu/'

# Save location on Windows
path = 'E:/mzitu/'

# Read the largest page number from the pagination links
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text

# same_url = 'http://www.mzitu.com/page/'  # home page: newest galleries by default
# Base URL of the category to crawl
same_url = 'https://www.mzitu.com/mm/page/'  # another category, such as the qingchun series, also works

for n in range(1, int(max_page) + 1):
    # Build the URL of page n of this category
    ul = same_url + str(n)

    # Request this first-level listing page
    start_html = requests.get(ul, headers=Hostreferer)

    # Collect the gallery links on the page
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')

    # Walk through the gallery links
    for a in all_a:
        # The link text is the gallery title, used as the folder name;
        # links with empty text are skipped
        title = a.get_text()
        if title != '':
            print("Preparing to fetch: " + title)

            # Windows cannot create a directory whose name contains '?', so strip it
            folder = path + title.strip().replace('?', '')
            if os.path.exists(folder):
                # The directory already exists
                flag = 1
            else:
                os.makedirs(folder)
                flag = 0
            # Switch into the gallery directory
            os.chdir(folder)

            # Request the gallery page behind this link
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")

            # Read the gallery's page count (the span index depends on the page layout)
            pic_max = mess.find_all('span')
            pic_max = pic_max[9].text
            if flag == 1 and len(os.listdir(folder)) >= int(pic_max):
                print('Already saved, skipping')
                continue

            # Visit every page of the gallery
            for num in range(1, int(pic_max) + 1):
                # Build the URL of the page that shows image number num
                pic = href + '/' + str(num)

                # Request the page and locate the image whose alt text is the title
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                print(pic_url['src'])
                html = requests.get(pic_url['src'], headers=Picreferer)

                # Use the last URL segment as the file name
                file_name = pic_url['src'].split(r'/')[-1]

                # Save the image
                with open(file_name, 'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page', n, 'done')
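
The script above only strips `?` from the gallery title before using it as a directory name. A minimal sketch of a broader cleanup follows, assuming the usual set of characters that Windows rejects in file names (`< > : " / \ | ? *` and control characters). `safe_dirname` and its `fallback` parameter are hypothetical names, not part of the original script.

```python
import re

# Characters Windows does not allow in file or directory names, plus control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_dirname(title, fallback="untitled"):
    """Return `title` with characters Windows rejects removed (hypothetical helper)."""
    cleaned = _INVALID.sub("", title).strip()
    # Windows also rejects names that end in a dot or a space.
    cleaned = cleaned.rstrip(". ")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or fallback
```

In the script, this could stand in for `title.strip().replace('?', '')` wherever the folder path is built.
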
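Step-by-step analysis (for readers who are interested):
1. Get the page source

Open the mzitu site and press F12 in the browser to see the page's requests and source.

The code for this step: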

# coding=utf-8

import requests

url = 'http://www.mzitu.com'

# Set the headers. Sites use the User-Agent to identify your browser and
# operating system, and many refuse requests that lack it.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

# Send a GET request to url with the headers
html = requests.get(url, headers=header)

# .text is the response body decoded as text, i.e. the page source
print(html.text)
If the request succeeds, the response looks something like the following. This is the page source.

<html>
<body>

......

$("#index_banner_load").find("div").appendTo("#index_banner");
$("#index_banner").css("height", 90);
$("#index_banner_load").remove();
});
</script>
</body>
</html>

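2. Extract the information you need

Convert the page source into a BeautifulSoup object.
Use find to locate the data you need and store it in a container.
The code for this step: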

# coding=utf-8

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url, headers=header)

# Parse with the built-in html.parser: slower, but needs no extra install
soup = BeautifulSoup(html.text, 'html.parser')

# What we want are all the a tags inside the first div with class 'postlist'
all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')

for a in all_a:
    title = a.get_text()  # extract the link text
    print(title)
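This prints the titles of all the galleries on the current page.

Note: BeautifulSoup() returns <class 'bs4.BeautifulSoup'>
find() returns <class 'bs4.element.Tag'> (or None when nothing matches)
find_all() returns <class 'bs4.element.ResultSet'>
A <class 'bs4.element.ResultSet'> does not support further find/find_all calls; iterate over it and call them on each Tag instead.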
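
To make the note above concrete, here is a small sketch that runs against an inline HTML string instead of the live site. The markup is made up for illustration and only mimics the `postlist` structure described in this step; the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above (not real site output).
sample = """
<div class="postlist">
  <ul>
    <li><a href="/1" target="_blank"><img src="a.jpg"></a>
        <span><a href="/1" target="_blank">Gallery one</a></span></li>
    <li><a href="/2" target="_blank"><img src="b.jpg"></a>
        <span><a href="/2" target="_blank">Gallery two</a></span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
postlist = soup.find("div", class_="postlist")    # a single Tag
links = postlist.find_all("a", target="_blank")   # a ResultSet, which is a list of Tags

# A ResultSet has no find()/find_all(); iterate and query each Tag instead.
titles = [a.get_text() for a in links if a.get_text() != ""]
hrefs = [a["href"] for a in links if a.find("img") is not None]

print(titles)
print(hrefs)
```

In this sample, the links that wrap an image have empty text, which is why the scripts in this article skip titles equal to `''`.
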
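3. Go into the second-level pages and download

Open a gallery and you will see that each page shows a single image, so we need to know the total number of pages. For example, http://www.mzitu.com/26685 is the first page of a gallery, and later pages append / and a number: http://www.mzitu.com/26685/2 is the second page. So all we need is the page count; a loop can then build the URL of every page.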
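
The URL pattern just described can be sketched as a small function. `gallery_page_urls` is a hypothetical helper name. Like the scripts in this article, it also addresses the first page as `/1`, and it takes the page count as an argument because how that count is read depends on the page layout.

```python
def gallery_page_urls(gallery_url, page_count):
    """Return the URL of every page in a gallery (hypothetical helper).

    Pages are addressed as <gallery_url>/<n>, with n running from 1 to page_count.
    """
    base = gallery_url.rstrip("/")  # tolerate a trailing slash on the gallery URL
    return [f"{base}/{n}" for n in range(1, page_count + 1)]


urls = gallery_page_urls("http://www.mzitu.com/26685", 3)
for u in urls:
    print(u)
```
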

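On each of those pages, the image itself sits in an img tag like this one:

<img src="https://i5.meizitu.net/2019/07/01b56.jpg" alt="xxxxxxxxxxxxxxxxxxxxxxxxx" width="728" height="485">
The src attribute above is the actual address of the image; all that remains is to save it.

The code for this step: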

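# coding=utf-8

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com/26685'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
# Image requests carry the same Referer as Picreferer in the full script,
# which is what gets past the hotlink protection
pic_header = dict(header, Referer='http://i.meizitu.net')

html = requests.get(url, headers=header)
soup = BeautifulSoup(html.text, 'html.parser')

# The page count is read from the span at index 10 (the full script uses
# index 9; the right index depends on the page layout at the time)
pic_max = soup.find_all('span')[10].text

# Find the gallery title
title = soup.find('h2', class_='main-title').text

# Visit the page of every image
for i in range(1, int(pic_max) + 1):
    href = url + '/' + str(i)
    html = requests.get(href, headers=header)
    mess = BeautifulSoup(html.text, "html.parser")

    # The image is the img tag whose alt attribute equals the title
    pic_url = mess.find('img', alt=title)

    html = requests.get(pic_url['src'], headers=pic_header)

    # Use the last URL segment as the file name
    file_name = pic_url['src'].split(r'/')[-1]

    # An image is binary data, so write html.content in binary mode
    with open(file_name, 'wb') as f:
        f.write(html.content)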
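
The file-naming and saving steps above can also be written without relying on the current working directory. The sketch below is one possible variant, not the original author's code: `image_file_name` and `save_image` are hypothetical helpers, and it assumes the image bytes have already been downloaded (for example from `html.content`).

```python
import os
import posixpath
from urllib.parse import urlsplit


def image_file_name(src):
    """Return the last path segment of an image URL, ignoring any query string."""
    return posixpath.basename(urlsplit(src).path)


def save_image(directory, src, data):
    """Write `data` into `directory`, named after the URL; return the full path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    target = os.path.join(directory, image_file_name(src))
    with open(target, "wb") as f:          # binary mode, since images are not text
        f.write(data)
    return target
```

Passing the directory explicitly removes the need for `os.chdir`, so a failure partway through one gallery does not leave later files in the wrong folder.
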
That concludes the analysis; the complete code is at the top of the article.
