Result screenshot of the scraping example covered at the end of this article (scraping images from mzitu):
BeautifulSoup is a Python library for extracting data from HTML or XML documents. In short, it parses HTML markup into a tree structure, making it easy to retrieve any tag and its attributes.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Installing BeautifulSoup
Installation in PyCharm: File -> Default Settings -> Project Interpreter
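Outside PyCharm, the library can also be installed from the command line with pip (the package is named beautifulsoup4 but is imported as bs4):

```shell
# install BeautifulSoup 4; the import name in code is "bs4"
pip install beautifulsoup4
```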
Getting Started
# import BeautifulSoup
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">the story is beautiful</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify()) # print the structure with standard indentation
all_content = soup.get_text() # get all visible text in the document, i.e. the text with tags removed
title = soup.title # get the document title
title_name = soup.title.name # get the tag name of the document title
title_text = soup.title.string # get the text of the document title
title_header = soup.title.parent.name # get the name of the parent tag of <title>
p_all = soup.find_all('p') # get all paragraph tags in the document
a_links = soup.find_all('a') # get all hyperlink tags in the document
print('all_content = %s' % all_content)
print('title = %s' % title)
print('title_name = %s' % title_name)
print('title_text = %s' % title_text)
print('title_header = %s' % title_header)
for link in a_links:
    print('a = %s ' % link)
for p in p_all:
    print('type(p) = %s ' % type(p)) # each p is a Tag object
    print('p.name = %s ' % p.name) # tag name of the paragraph
    print('p[class] = %s' % p['class']) # value of the paragraph's class attribute
Output:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
and they lived at the bottom of a well.
</p>
<p class="story">
the story is beautiful
</p>
</body>
</html>
all_content =
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie and
Tillie
and they lived at the bottom of a well.
the story is beautiful
title = <title>The Dormouse's story</title>
title_name = title
title_text = The Dormouse's story
title_header = head
a = <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
a = <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
a = <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
type(p) = <class 'bs4.element.Tag'>
p.name = p
p[class] = ['title']
type(p) = <class 'bs4.element.Tag'>
p.name = p
p[class] = ['story']
type(p) = <class 'bs4.element.Tag'>
p.name = p
p[class] = ['story']
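Besides grabbing tags by name, find() and find_all() can also filter by attribute values, and select() accepts CSS selectors. A small sketch using a trimmed copy of the document above:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# look up a tag by its id attribute
link2 = soup.find(id='link2')
print(link2.string)  # Lacie

# find_all filtered by CSS class (note the trailing underscore in class_)
sisters = soup.find_all('a', class_='sister')
print([a['href'] for a in sisters])

# the same kind of lookup using CSS selector syntax via select()
print(soup.select('p.story a#link3')[0].string)  # Tillie
```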
BeautifulSoup Object Types
BeautifulSoup converts a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
(1) Tag
Put simply, a Tag is an individual HTML tag.
<title>The HTML5 Document</title>
<a class="red_link" href="http://www.baidu.com" id = "link"></a>
The title and a tags above, together with the content they enclose, are Tags.
html_tag = '<b class="boldest">bold text</b>'
soup = BeautifulSoup(html_tag,'html.parser')
tag = soup.b
# the type of the Tag object
type_tag = type(tag)
# every Tag has a name, accessible via .name
tag_name = tag.name
# a tag may have many attributes; they are accessed the same way as a dict
# for example, get the value of the class attribute
tag_class = tag['class']
print(type_tag) # Output: <class 'bs4.element.Tag'>
print('type_name = %s' % tag_name) # Output: type_name = b
print('type["class"] = %s ' % tag_class) # Output: type["class"] = ['boldest']
Tag attributes can be added, removed, or modified:
tag['class'] = 'normal' # modify the class attribute value
tag['id'] = 'id_bold_text' # add an id attribute
print(soup.prettify()) # pretty-print
del tag['class'] # remove the class attribute
del tag['id'] # remove the id attribute
print(soup.prettify()) # pretty-print
Output:
<b class="normal" id="id_bold_text">
bold text
</b>
<b>
bold text
</b>
Other Tag operations:
html_str = '<head><title>The document Story</title></head><body><b>text body</b><b>text color</b></body>'
soup = BeautifulSoup(html_str,'html.parser')
print(soup.prettify())
print('head = %s ' % soup.head)
print('title = %s' % soup.title)
print('body = %s ' % soup.body)
# dot access only returns the first tag with the given name
print('body.b = %s ' % soup.body.b)
Output:
<head>
<title>
The document Story
</title>
</head>
<body>
<b>
text body
</b>
<b>
text color
</b>
</body>
head = <head><title>The document Story</title></head>
title = <title>The document Story</title>
body = <body><b>text body</b><b>text color</b></body>
body.b = <b>text body</b>
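Since dot access only returns the first matching tag, find_all() is the way to reach every <b> in the document. A quick sketch reusing html_str from above:

```python
from bs4 import BeautifulSoup

html_str = '<head><title>The document Story</title></head><body><b>text body</b><b>text color</b></body>'
soup = BeautifulSoup(html_str, 'html.parser')

# dot access: only the first <b>
print(soup.body.b.string)  # text body

# find_all: every <b> in the document
for b in soup.find_all('b'):
    print(b.string)
```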
Tag navigation attribute | Description
---|---
.contents | the tag's direct children as a list
.children | a generator over the tag's direct children, for looping
.descendants | recursively iterates over all descendants of the tag
.string | if the tag has a single string child (or a single child tag whose content is a string), returns that string; otherwise None
.strings | when a tag contains multiple strings, .strings iterates over all of them
.parent | the element's parent node
.parents | iterates over all of the element's ancestors
.next_sibling / .previous_sibling | the adjacent sibling nodes
.next_siblings / .previous_siblings | iterate over the current node's siblings
.next_element / .previous_element | the next/previous object (string or tag) in parse order
.next_elements / .previous_elements | iterators over the document's content, forward or backward in parse order
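The navigation attributes in the table can be seen in action on a tiny document (this example is a sketch, not from the original article):

```python
from bs4 import BeautifulSoup

html = '<body><p id="first">one</p><p id="second">two</p></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# .contents: direct children as a list
print(body.contents)  # [<p id="first">one</p>, <p id="second">two</p>]

# .children: the same children, as a generator
print([child.name for child in body.children])  # ['p', 'p']

# .parent / .next_sibling navigation
first = soup.find(id='first')
print(first.parent.name)   # body
print(first.next_sibling)  # <p id="second">two</p>
```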
(2) NavigableString
The .string attribute gives easy access to the text inside a tag.
html_string = '<b class="boldest">bold text</b>'
soup = BeautifulSoup(html_string,'html.parser')
tag_b = soup.b
tag_string = tag_b.string # Output: bold text
print('type(tag.string) = %s ' % type(tag_string)) # Output: type(tag.string) = <class 'bs4.element.NavigableString'>
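A NavigableString behaves much like a Python string: str() converts it into a plain copy, and replace_with() swaps the text inside the tag. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">bold text</b>', 'html.parser')
tag_b = soup.b

# convert the NavigableString to a plain Python string
plain = str(tag_b.string)
print(type(plain))  # <class 'str'>

# replace the string inside the tag
tag_b.string.replace_with('new text')
print(soup.b.string)  # new text
```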
(3) BeautifulSoup
A BeautifulSoup object represents the entire document. Most of the time it can be treated as a Tag object: it is a special Tag whose type, name, and attributes can all be inspected.
from bs4 import BeautifulSoup
html_doc = '<head><title>Document title</title></head>'
soup = BeautifulSoup(html_doc, 'html.parser')
name = soup.name
attrs = soup.attrs
print(type(soup)) # <class 'bs4.BeautifulSoup'>
print(type(name)) # <class 'str'>
print(type(attrs)) # <class 'dict'>
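The document node is given the special name '[document]' and has no attributes of its own, which distinguishes it from real tags. A quick check:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<head><title>Document title</title></head>', 'html.parser')
print(soup.name)   # [document]
print(soup.attrs)  # {} -- the document node itself has no attributes
```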
(4) Comment
A Comment object is a special type of NavigableString. Like a NavigableString, it prints without the comment markers; in other words, the output does not treat it as a comment.
html_href = '<a href="http://www.baidu.com"><!-- a href to baidu--></a>'
soup = BeautifulSoup(html_href,'html.parser')
a = soup.a
string = soup.a.string
string_type = type(soup.a.string)
print(a) # Output: <a href="http://www.baidu.com"><!-- a href to baidu--></a>
print(string) # Output: a href to baidu
print(string_type) # Output: <class 'bs4.element.Comment'>
In practice you can branch on the type to handle comments specially:
from bs4 import element
if string_type == element.Comment:
    pass
Scraping images from mzitu
import requests
from bs4 import BeautifulSoup
import os
hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://m.mzitu.com/'
}
picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}
all_url = 'http://m.mzitu.com/all/'
response = requests.get(all_url, headers=hostreferer)
html_doc = response.text
# print(html_doc)
soup = BeautifulSoup(html_doc, "html.parser")
div_list = soup.find_all('div', class_='archive-brick')
# print(div_list)
root_path = r'C:\mzitu'  # root folder for saving the images (raw string, so the backslash is not an escape)
for div in div_list:
    # print(div)
    a_link = div.find('a')
    # print(a_link)
    href = a_link['href']
    title = a_link.get_text()
    print(title, href)
    # each gallery contains multiple images; create a folder named after the gallery title
    folder_name = str(title).strip().replace(':', '').replace(' ', '').replace('?', '')  # folder name with spaces and illegal characters removed
    # os.path.join(path, name): joins a directory with a file or directory name, giving path/name
    path = os.path.join(root_path, folder_name)
    abspath = os.path.abspath(path)  # absolute path of the folder
    # print(abspath)
    os.makedirs(path)  # create a folder for this gallery
    os.chdir(path)  # switch into the newly created folder
    response_detail = requests.get(href, headers=hostreferer)
    html_detail = response_detail.text  # html of the detail page, which contains the image urls
    # print(html_detail)
    detail_soup = BeautifulSoup(html_detail, 'html.parser')
    # get the max page number, since detail-page urls are built by appending the page number
    max_page = detail_soup.find('div', class_='prev-next').find('span', class_='prev-next-page').get_text()[-3:-1]
    print(max_page)
    # build the url of each page from the page number
    for page in range(1, int(max_page) + 1):  # integers from 1 to max_page, inclusive
        page_url = href + '/' + str(page)  # the url of each page
        page_response = requests.get(page_url, headers=hostreferer)
        page_html = page_response.text
        page_soup = BeautifulSoup(page_html, 'html.parser')
        figure = page_soup.find('figure')
        # print(figure)
        img_src = figure.find('img')['src']
        print(img_src)  # url of each image
        img_result = requests.get(img_src, headers=picreferer)
        f = open(img_src[-9:-4] + '.jpg', 'ab')
        f.write(img_result.content)
        f.close()
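The chained replace() calls above only strip a few characters from the title. A regex-based sanitizer (a sketch, not from the original article) covers the full set of characters Windows forbids in folder names:

```python
import re

def sanitize_folder_name(title):
    # strip characters Windows forbids in file/folder names, then all whitespace
    name = re.sub(r'[\\/:*?"<>|]', '', str(title))
    return re.sub(r'\s+', '', name)

print(sanitize_folder_name('  my: title? with spaces '))  # mytitlewithspaces
```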
Refactored into functions:
# encoding:utf-8
import requests
from requests import HTTPError
from bs4 import BeautifulSoup
import os
all_url = 'http://m.mzitu.com/all/'
root_path = r'C:\mzitu'  # root folder for saving the images
hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://m.mzitu.com/'
}
picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}
def get_html_text(url, headers):
    '''
    Fetch the html source of a url with requests.
    :param url: the request url
    :param headers: http headers
    :return: the response text
    '''
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        response.encoding = 'utf-8'
        html_text = response.text
        return html_text
    except HTTPError as e:
        print(e)
        return 'request failed'
def makedir(title):
    '''
    Create a folder and switch into it, because images are saved
    into one folder per gallery.
    :param title: the gallery title, used as the folder name
    :return:
    '''
    # each gallery contains multiple images; create a folder named after the gallery title
    folder_name = str(title).strip().replace(':', '').replace(' ', '').replace('?', '')  # folder name with spaces and illegal characters removed
    # os.path.join(path, name): joins a directory with a file or directory name, giving path/name
    path = os.path.join(root_path, folder_name)
    abspath = os.path.abspath(path)  # absolute path of the folder
    # print(abspath)
    os.makedirs(path)  # create a folder for this gallery
    os.chdir(path)  # switch into the newly created folder
def get_img_url(page_html):
    '''
    Extract the image url from a detail page.
    :param page_html: html of the detail page
    :return: the image url
    '''
    page_soup = BeautifulSoup(page_html, 'html.parser')
    figure = page_soup.find('figure')
    # print(figure)
    img_src = figure.find('img')['src']
    print(img_src)  # url of each image
    return img_src
def save_img(img_src):
    '''
    Download an image given its url.
    :param img_src: the image url
    :return:
    '''
    img_result = requests.get(img_src, headers=picreferer)
    f = open(img_src[-9:-4] + '.jpg', 'ab')
    f.write(img_result.content)
    f.close()
def get_max_page(page_detail_text):
    '''
    Get the max detail-page number for a gallery.
    :return: the max page number
    '''
    detail_soup = BeautifulSoup(page_detail_text, 'html.parser')
    # get the max page number, since detail-page urls are built by appending the page number
    max_page = detail_soup.find('div', class_='prev-next').find('span', class_='prev-next-page').get_text()[-3:-1]
    print(max_page)
    return max_page
def save_to_disk(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    div_list = soup.find_all('div', class_='archive-brick')
    # print(div_list)
    for div in div_list:
        # print(div)
        a_link = div.find('a')
        # print(a_link)
        href = a_link['href']
        title = a_link.get_text()
        print(title, href)
        makedir(title)
        # html of the detail page, which contains the image urls
        page_detail_text = get_html_text(href, headers=hostreferer)
        # print(page_detail_text)
        max_page = get_max_page(page_detail_text)  # max detail-page number for this gallery
        # build the url of each page from the page number
        for page in range(1, int(max_page) + 1):  # integers from 1 to max_page, inclusive
            page_url = href + '/' + str(page)  # the url of each page
            page_html = get_html_text(page_url, headers=hostreferer)
            img_url = get_img_url(page_html)  # url of the image on each page
            save_img(img_url)

if __name__ == '__main__':
    html_doc = get_html_text(all_url, headers=hostreferer)
    save_to_disk(html_doc)
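The get_max_page() helper above relies on a fixed [-3:-1] slice, which breaks as soon as the page count is not exactly two digits. A regex (a hypothetical alternative, assuming the span text contains the page count as its last run of digits) is more robust:

```python
import re

def extract_max_page(text):
    # pull the last run of digits out of strings like "1/46" or "1 / 46"
    matches = re.findall(r'\d+', text)
    return int(matches[-1]) if matches else 1

print(extract_max_page('1/46'))  # 46
```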
Scraping result: the screenshot at the beginning of the article.