BeautifulSoup: A Bowl of Beautiful Soup with a Hidden Pitfall

Python web scrapers commonly rely on four parsing tools: `re` regular expressions, etree XPath, Scrapy XPath, and BeautifulSoup. (etree XPath and Scrapy XPath differ enough in usage that they are not counted as one.) This post covers a little-known BeautifulSoup pitfall. See the examples.

Example 1 (the sample text looks unusual because it is Khmer; please bear with it):

    from bs4 import BeautifulSoup

    content = """
    <html>
      <body>
        <div class="td-post-content td-pb-padding-side">
          <p>
            <img alt="" class="alignnone size-full wp-image-122426" data-recalc-dims="1" height="352" src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1" width="630"/>
          </p>
          <p>
            <img alt="" class="alignnone size-full wp-image-122427" data-recalc-dims="1" height="473" src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
          </p>
          <p>
            ចំណែកឯប្រេងដូងវិញ មានផ្ទុកអាស៊ីតខ្លាញ់អូមេហ្គា៣ ដែលល្អបំផុតសម្រាប់បំផ្លាញ់មីក្រុបដែលមានវត្តមាននៅក្នុងតំបន់រន្ធគូថ ហេតុនេះហើយទើបការឆ្លងមេរោគ និងរមាស់ត្រូវបានទប់ស្កាត់។
          </p>
          <p>
            <img alt="" class="alignnone size-full wp-image-122427" data-recalc-dims="1" height="473" src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
          </p>
          <p>
            <img alt="" class="alignnone size-full wp-image-122428" data-recalc-dims="1" height="473" src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
            <br/>
            <em>
              <br/>
              ចំណាំ៖
            </em>
            ប្រសិនបើអ្នករមាស់ខ្លាំង មានការឈឺចាប់ ហើយមានឈាមហូរទៀតនោះ ត្រូវប្រញាប់ទៅជួបជាមួយគ្រូពេទ្យភ្លាម៕
          </p>
        </div>
      </body>
    </html>
    """

    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(content, "html.parser")

    # First pass: print each src, putting "&amp;" back in front of "ssl" by hand
    inner_src_list = soup.find_all('img', src=True)
    for img in inner_src_list:
        url = img["src"].replace("&ssl", "&amp;ssl")
        print(url)

    # Print the whole tree as markup
    print(soup.prettify())
    # content = soup.prettify()  # the printed src values come out the same

    # Second pass: print each src exactly as BeautifulSoup returns it
    img_tags = soup.find_all('img')
    for img in img_tags:
        print(img['src'])

The console output is as follows:

![](http://i2.51cto.com/images/blog/201810/19/f709eed65fc5ebf49e98cc7cb67e6b91.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
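![](http://i2.51cto.com/images/blog/201810/19/3bda9857b63335670b3dcac69903aa74.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)

![](http://i2.51cto.com/images/blog/201810/19/9e41161d11fb22a9f01ec2868e870ead.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)

How did that happen? Where did the `amp;` in the text go?

The explanation: in HTML source, `&amp;` is the character reference for `&`. The parser decodes it when it builds the tree, so `img['src']` returns the value with a plain `&`. `prettify()` escapes the `&` again when it writes markup, which is why the printed document still shows `&amp;`. What you see in the page source is not always what the parser hands back. This is not specific to BeautifulSoup; etree XPath and Scrapy XPath behave the same way.

Example 2 (same text as above):

    soup = BeautifulSoup(content, "html.parser")

    inner_src_list = soup.find_all('img', src=True)  # compare this call...
    for img in inner_src_list:
        url = img["src"].replace("&ssl", "&amp;ssl")
        print(url)

    inner_src_list = soup.find_all('img', attr={'src': True})  # ...with this one
    for img in inner_src_list:
        url = img["src"].replace("&ssl", "&amp;ssl")
        print(url)

The output is not reproduced here; the behaviour is simply this: the first loop prints the URLs as expected, and the second loop prints nothing. Why?

The explanation: the first `find_all` reads `src=True` as "img tags that have a `src` attribute". The second call passes `attr=`, not `attrs=`. The dictionary parameter of `find_all` is named `attrs`; any other keyword argument is taken as a filter on an attribute of that name, so `attr={'src': True}` looks for img tags carrying an attribute literally called `attr`. None of the tags has one, so the result is empty. Written as `attrs={'src': True}`, the second call returns the same tags as the first.

If anything above is incorrect, corrections are welcome. Thank you!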
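As a footnote to Example 1, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup, on the assumption that the decoding step is ordinary HTML parsing and not a BeautifulSoup quirk. The class name `ImgSrcCollector` and the `example.com` URL are made up for illustration and do not come from the page above.

```python
from html import escape, unescape
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # decoded character references such as "&amp;" in each value.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Illustrative markup: the URL is a placeholder, not a real image
raw_url = "https://example.com/a.jpg?resize=630%2C352&amp;ssl=1"
markup = '<p><img src="' + raw_url + '" width="630"/></p>'

collector = ImgSrcCollector()
collector.feed(markup)
collector.close()

decoded = collector.sources[0]
print(decoded)    # https://example.com/a.jpg?resize=630%2C352&ssl=1

# Escape once when the URL has to go back into HTML markup
reencoded = escape(decoded, quote=True)
print(reencoded)  # https://example.com/a.jpg?resize=630%2C352&amp;ssl=1

# Decoding the escaped form gives the parsed value again
print(unescape(reencoded) == decoded)  # True
```

If the URL has to be written back into HTML, escaping it once with `html.escape` is more general than the hand-written `.replace("&ssl", "&amp;ssl")`, which only covers that one query parameter. For fetching the image, the decoded form with a plain `&` is the one to use.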