BeautifulSoup: A Bowl of Beautiful Soup with a Hidden Pitfall

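Python web crawlers commonly rely on four parsing helpers: re regular expressions, etree XPath, scrapy XPath, and BeautifulSoup. (etree XPath and scrapy XPath differ considerably in usage, so they are not grouped together here.) This post covers a little-known BeautifulSoup pitfall. See the examples:
Example 1 (the sample text looks unusual because it is Khmer; please bear with it):
from bs4 import BeautifulSoup

content = """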
<html>
<body>
<div class="td-post-content td-pb-padding-side">
<p>
<img alt="" class="alignnone size-full wp-image-122426"
data-recalc-dims="1" height="352"
src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1"
width="630"/>
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122427"
data-recalc-dims="1" height="473"
src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
</p>
<p>
ចំណែកឯប្រេងដូងវិញ មានផ្ទុកអាស៊ីតខ្លាញ់អូមេហ្គា៣
ដែលល្អបំផុតសម្រាប់បំផ្លាញ់មីក្រុបដែលមានវត្តមាននៅក្នុងតំបន់រន្ធគូថ
ហេតុនេះហើយទើបការឆ្លងមេរោគ និងរមាស់ត្រូវបានទប់ស្កាត់។
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122427"
data-recalc-dims="1" height="473"
src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122428"
data-recalc-dims="1" height="473"
src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
<br/>
<em>
<br/>
ចំណាំ៖
</em>
ប្រសិនបើអ្នករមាស់ខ្លាំង មានការឈឺចាប់ ហើយមានឈាមហូរទៀតនោះ
ត្រូវប្រញាប់ទៅជួបជាមួយគ្រូពេទ្យភ្លាម៕
</p>
</div>
</body>
</html>
"""
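soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid the bs4 warning

# Pass 1: read each src and manually put '&amp;' back in front of 'ssl'
inner_src_list = soup.find_all('img', src=True)
for img in inner_src_list:
    url = img["src"].replace("&ssl", "&amp;ssl")
    print(url)

# Pass 2: print the serialized document
print(soup.prettify())
# content = soup.prettify()  # the printed src values are the same

# Pass 3: read src directly from the parsed tags
img_tags = soup.find_all('img')
for img in img_tags:
    print(img['src'])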
The console output is as follows:
![](http://i2.51cto.com/images/blog/201810/19/f709eed65fc5ebf49e98cc7cb67e6b91.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
![](http://i2.51cto.com/images/blog/201810/19/3bda9857b63335670b3dcac69903aa74.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
![](http://i2.51cto.com/images/blog/201810/19/9e41161d11fb22a9f01ec2868e870ead.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
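How can this be? Where did the 'amp;' characters in the text go?
Explanation: when the document is parsed, the entity '&amp;' in an attribute value is decoded to '&', so img['src'] returns the real URL rather than the escaped source text. When the tree is serialized again (for example by prettify()), '&' is escaped back to '&amp;', which is why the printed HTML still shows it. Web parsing does not always match what you see in the source. This is not specific to BeautifulSoup: etree XPath and scrapy XPath behave the same way.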
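The decoding described above can be reproduced with the standard library alone, which suggests it is ordinary HTML parsing behavior rather than a BeautifulSoup quirk. The sketch below is only an illustration: the SrcCollector class and the example.com URL are hypothetical and not part of the original example. It also shows html.escape as a more general alternative to the manual replace("&ssl", "&amp;ssl") when the escaped form is really needed.

```python
from html import escape
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Hypothetical helper: collect the src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already decoded
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.srcs.append(value)


# Placeholder URL (an assumption), escaped the way it would appear in HTML source
escaped_url = "https://example.com/a.jpg?resize=630%2C352&amp;ssl=1"
raw = '<img alt="" src="' + escaped_url + '"/>'

collector = SrcCollector()
collector.feed(raw)
decoded = collector.srcs[0]

print(decoded)          # the parser has turned '&amp;' into '&'
print(escape(decoded))  # escaping again restores the source form
```

If the URL is going to be requested over HTTP, the decoded form is the one to use; re-escaping only makes sense when the value is written back into HTML.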
Example 2:
Same text as above.
soup = BeautifulSoup(content, "html.parser")

inner_src_list = soup.find_all('img', src=True)  # compare: keyword filter on src
for img in inner_src_list:
    url = img["src"].replace("&ssl", "&amp;ssl")
    print(url)

inner_src_list = soup.find_all('img', attr={'src': True})  # compare: 'attr', not 'attrs'
for img in inner_src_list:
    url = img["src"].replace("&ssl", "&amp;ssl")
    print(url)
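The output is omitted here; the behavior is simply this: the first loop prints the URLs normally, while the second loop prints nothing. Why?
Explanation: in the first find_all, src=True matches every img tag that has a src attribute, whatever its value. In the second call the keyword is attr, not attrs. find_all has no parameter named attr, so the argument is treated like any other keyword filter, that is, as a condition on an HTML attribute literally named attr. No img tag in the document has such an attribute, so the result is empty. Written as attrs={'src': True}, the call matches the same tags as src=True.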
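The difference between the three spellings can be checked on a tiny document. This is a minimal sketch under stated assumptions: the HTML snippet and variable names are made up for illustration, the html.parser backend is chosen only to avoid an extra dependency, and the empty result for the misspelled keyword reflects how bs4 treats unknown keyword arguments as attribute filters.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two images with a src attribute, one without
html_doc = '<p><img src="a.jpg"/><img src="b.jpg"/><img alt="no source"/></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword filter: any <img> that has a src attribute
by_keyword = soup.find_all("img", src=True)

# Same filter passed through the attrs dictionary
by_attrs = soup.find_all("img", attrs={"src": True})

# Misspelled keyword: interpreted as a filter on an attribute named "attr"
by_typo = soup.find_all("img", attr={"src": True})

print([tag["src"] for tag in by_keyword])
print([tag["src"] for tag in by_attrs])
print(len(by_typo))
```

Because the misspelling raises no error, an empty result from find_all is worth checking against the keyword names before suspecting the document.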
If anything above is incorrect, corrections are welcome. Thanks!
