Python Crawler Libraries, Part 1: Using BeautifulSoup

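Beautiful Soup is a Python library for extracting data from HTML or XML documents. Put simply, it parses the markup into a tree (a web page is a tree to begin with) and then lets you read the attributes and contents of any tag you specify.

With Beautiful Soup you can pass a class or id value as an argument and get the matching tags' data directly, which makes it one of the most commonly used libraries for Python crawlers. The examples in this article assume Python 3.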

Outline:

  1. Installation
  2. Importing beautifulsoup4 (bs4)
  3. Page parsing: fetch a page and convert it to a bs4 object
  4. Extraction: get individual elements from the bs4 object

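The recommended environment is Anaconda + VS Code.

1. Installing beautifulsoup4

In the VS Code terminal, run pip install beautifulsoup4. urllib is part of the Python 3 standard library, so it does not need to be installed with pip.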

2. Importing bs4

Once installation has finished, try importing the library:

from bs4 import BeautifulSoup

If no error is raised, the library is installed correctly.

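3. Fetching the page

This article uses the page http://reeoo.com for its examples.

First import urllib.request, build a request with Request, open the URL to get the response, and then use the response body to initialize a BeautifulSoup object: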

from bs4 import BeautifulSoup
import urllib.request

url = 'http://reeoo.com'

request = urllib.request.Request(url)

response = urllib.request.urlopen(request, timeout=20)

content = response.read()

soup = BeautifulSoup(content, 'html.parser')

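Passing a document to the BeautifulSoup constructor returns a document object in Beautiful Soup's own format. Here the document is obtained by requesting the URL, and printing the resulting soup gives output like the following excerpt: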

" rel="EditURI" title="RSD" type="application/rsd+xml"/>
<link href="http://reeoo.com/wp-includes/wlwmanifest.xml" rel="wlwmanifest" type="application/wlwmanifest+xml"/>
<meta content="WordPress 4.9.8" name="generator"/>
</link></meta></meta></meta></meta></meta></meta></head>
<body>
<header id="header">
<div id="main_menu">
<div class="box">
<h1 id="logo"><a href="https://reeoo.com" title="Web design inspiration and gallery"><span class="icon-reeoo"></span></a></h1>
<ul>
<li class="active" id="link_web"><a href="https://reeoo.com" title="Web Design Gallery">Web Design</a></li>
<li id="link_iphone"><a href="https://iphone.reeoo.com" title="iPhone Patterns">iPhone App</a></li>
<li id="link_ipad"><a href="https://ipad.reeoo.com" title="iPad Patterns">iPad App</a></li>
<li id="link_icon"><a href="https://icon.reeoo.com" title="iOS Icon Design">Icon</a></li>
<li id="link_designer"><a href="https://designer.reeoo.com" title="Designer Show">Designer</a></li>
<li id="link_download"><a href="https://download.reeoo.com" title="Design resources download">Download</a></li>
</ul>
<div id="more">
<div id="search">
<span class="icon-search"></span>
<form action="https://reeoo.com" id="searchform" method="get">
<input id="s" name="s" placeholder="Search name or tag" required="" size="20" type="text" value=""/>
</form>
</div>
<div id="contact"><a href="http://weibo.com/reeoocom" target="_blank"><span class="icon-weibo"></span></a><a href="https://twitter.com/reeoocom" target="_blank"><span class="icon-twitter"></span></a><a href="mailto:[email protected]" target="_blank"><span class="icon-email"></span></a></div>
</div>
</div>
</div>
<div id="submenu">
<div class="box">
<div class="menu-color-menu-container"><ul class="menu" id="menu-color-menu"><li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3865" id="menu-item-3865"><a href="https://reeoo.com/category/black" title="Black Web Design">Black</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3866" id="menu-item-3866"><a href="https://reeoo.com/category/blue" title="Blue Web Design">Blue</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3867" id="menu-item-3867"><a href="https://reeoo.com/category/brown" title="Brown Web Design">Brown</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3869" id="menu-item-3869"><a href="https://reeoo.com/category/green" title="Green Web Design">Green</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3868" id="menu-item-3868"><a href="https://reeoo.com/category/gray" title="Gray Web Design">Gray</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3871" id="menu-item-3871"><a href="https://reeoo.com/category/orange" title="Orange Web Design">Orange</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3872" id="menu-item-3872"><a href="https://reeoo.com/category/purple" title="Purple Web Design">Purple</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-13232" id="menu-item-13232"><a href="https://reeoo.com/category/pink">Pink</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3873" id="menu-item-3873"><a href="https://reeoo.com/category/red" title="Red Web Design">Red</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3874" id="menu-item-3874"><a href="https://reeoo.com/category/white" title="White Web Design">White</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3875" id="menu-item-3875"><a href="https://reeoo.com/category/yellow" title="Yellow Web Design">Yellow</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3870" id="menu-item-3870"><a href="https://reeoo.com/category/multicolored" title="Multicolored Web Design">Multicolored</a></li>
</ul></div> <div class="filter">
<span class="icon-category"></span>
<div class="menu-header-menu-container"><ul class="menu" id="menu-header-menu"><li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-item menu-item-11736" id="menu-item-11736"><a href="http://reeoo.com/">All</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11737" id="menu-item-11737"><a href="http://reeoo.com/?s=app">App</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11750" id="menu-item-11750"><a href="http://reeoo.com/tag/software">Software</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11754" id="menu-item-11754"><a href="http://reeoo.com/tag/icon">Icon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11747" id="menu-item-11747"><a href="http://reeoo.com/?s=agency">Agency</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11752" id="menu-item-11752"><a href="http://reeoo.com/tag/company">Company</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11740" id="menu-item-11740"><a href="http://reeoo.com/?s=studio">Studio</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11738" id="menu-item-11738"><a href="http://reeoo.com/tag/coming-soon">Coming Soon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11739" id="menu-item-11739"><a href="http://reeoo.com/tag/onepage">Onepage</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11751" id="menu-item-11751"><a href="http://reeoo.com/tag/cartoon">Cartoon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11764" id="menu-item-11764"><a href="http://reeoo.com/?s=animation">Animation</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11766" id="menu-item-11766"><a href="http://reeoo.com/?s=develop">Develop</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11743" id="menu-item-11743"><a href="http://reeoo.com/tag/designer">Designer</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11741" id="menu-item-11741"><a href="http://reeoo.com/tag/food">Food</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11742" id="menu-item-11742"><a href="http://reeoo.com/tag/music">Music</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11749" id="menu-item-11749"><a href="http://reeoo.com/?s=movie">Movie</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11763" id="menu-item-11763"><a href="http://reeoo.com/?s=metting">Metting</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11744" id="menu-item-11744"><a href="http://reeoo.com/?s=shop">Shop</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11756" id="menu-item-11756"><a href="http://reeoo.com/tag/fashion">Fashion</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11745" id="menu-item-11745"><a href="http://reeoo.com/?s=wordpress">WordPress</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11746" id="menu-item-11746"><a href="http://reeoo.com/?s=theme">Theme</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11748" id="menu-item-11748"><a href="http://reeoo.com/?s=official">Official</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11753" id="menu-item-11753"><a href="http://reeoo.com/tag/travel">Travel</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11757" id="menu-item-11757"><a href="http://reeoo.com/?s=tool">Tool</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11755" id="menu-item-11755"><a href="http://reeoo.com/tag/product">Product</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11758" id="menu-item-11758"><a href="http://reeoo.com/?s=bike">Bike</a></li>
</ul></div> </div>
</div>
</div>
</header>
<article class="box">
<div id="main">
<ul id="list">
<li class="sponsor">
<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?serve=CKYIVKJ7&amp;placement=reeoocom" type="text/javascript"></script>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/loop">
<img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/loop">Loop</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/programatorio">
<img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/programatorio">Programatório</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/ultraviolet-way">
<img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/ultraviolet-way">Ultraviolet Way</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/misatoto-town">
<img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/misatoto-town">みさとと。</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/block-studio">
<img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/block-studio">Block Studio</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/composition-no-24">
<img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/composition-no-24">Composition No. 24</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/discovery-land-company">
<img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/discovery-land-company">Discovery Land Company</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/hardies">
<img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/hardies">Hardies</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/welchs-fruit-snacks">
<img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/welchs-fruit-snacks">Welch’s Fruit Snacks</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/exeron">
<img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/exeron">EXERON</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/pop-weaver">
<img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/pop-weaver">Pop Weaver</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/edesign-interactive">
<img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/edesign-interactive">eDesign Interactive</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/obsolete">
<img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/obsolete">OBSOLETE</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/minibricks">
<img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/minibricks">Minibricks</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/your-sport-agent">
<img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/your-sport-agent">Your Sport Agent</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/modulz">
<img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/modulz">Modulz</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/shift-2">
<img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/shift-2">Shift</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/rand">
<img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/rand">Rand</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/rappipay-2">
<img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/rappipay-2">RappiPay</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/real-happiness-project-from-bbc-earth">
<img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/real-happiness-project-from-bbc-earth">Real Happiness Project from BBC Earth</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/opera">
<img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/opera">OPERA</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/kyoto-shin-nyo-do">
<img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/kyoto-shin-nyo-do">真如堂を楽しむ</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/bitbiome">
<img alt="bitBiome" class="lazy" data-original="https://reeoo.xnny.net/bitBiome.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="bitBiome" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/bitbiome">bitBiome</a></div>
</li>
</ul>
<!-- pb265 --><div class="pagebar"><span> </span><span class="this-page">1</span>
<a href="https://reeoo.com/page/2" title="Page 2">2</a>
<a href="https://reeoo.com/page/3" title="Page 3">3</a>
<a href="https://reeoo.com/page/4" title="Page 4">4</a>
<a href="https://reeoo.com/page/5" title="Page 5">5</a>
<a href="https://reeoo.com/page/6" title="Page 6">6</a>
<a href="https://reeoo.com/page/7" title="Page 7">7</a>
<a href="https://reeoo.com/page/8" title="Page 8">8</a>
<a href="https://reeoo.com/page/9" title="Page 9">9</a>
<span class="break">...</span>
<a href="https://reeoo.com/page/172" title="Page 172">172</a>
<a href="https://reeoo.com/page/173" title="Page 173">173</a>
<a href="https://reeoo.com/page/174" title="Page 174">174</a>
<a href="https://reeoo.com/page/175" title="Page 175">175</a>
<a href="https://reeoo.com/page/176" title="Page 176">176</a>
<a href="https://reeoo.com/page/177" title="Page 177">177</a>
<a href="https://reeoo.com/page/2" title="Page 2">&gt;</a>
</div></div>
</article>
<footer id="footer">
<div class="box">
<p>
<span class="link">
<a href="http://designlol.net" target="_blank" title="全球設計精華分享站">Design lol</a>
<a href="http://logojoy.com" target="_blank">Logojoy</a>
<a href="http://www.pplock.com/" target="_blank" title="分享藝術·設計·創意">PPLock</a>
<a href="http://reader.mx/?utm_source=reeoo&amp;utm_medium=web&amp;utm_campaign=link" target="_blank" title="Reader APP">ReaderMX</a>
<a href="http://www.ui.cn" target="_blank">UICN</a>
<a href="http://www.uisdc.com/" target="_blank" title="優秀網頁設計聯盟">UISDC</a>
<a href="http://zmingcx.com/" target="_blank" title="知更鳥">Zmingcx</a>
</span>
<span class="link">
<a href="https://logomaster.ai/" rel="noopener" target="_blank">Online Logo Maker</a>
<a href="http://www.treasurebox.co.nz/outdoor-garden/greenhouse.html" rel="noopener" target="_blank">greenhouse nz</a>
<a href="https://www.payformathhomework.com" target="_blank">Pay For Math Homework</a>- math help
				</span>
<a href="https://www.zessay.com/" target="_blank">Essay services</a> for college students.   
				<a href="https://myhomeworkdone.com/" target="_blank">My Homework Done</a> really makes your homework done.   
				<a href="http://mydissertations.com/" target="_blank">MyDissertations</a> - dissertation help on design topics.   
						<br/>
			Powered by <a href="http://wordpress.org/" target="_blank">WordPress</a>. © <a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>.</p>
</div>
</footer>
<script type="text/javascript">
/* <![CDATA[ */
var image_lazy_load = {"image_unveil_load":"0"};
/* ]]> */
</script>
<script src="http://reeoo.com/wp-content/plugins/image-lazy-load/js/min/frontend-min.js?ver=1.0.9" type="text/javascript"></script>
<script src="http://reeoo.com/wp-includes/js/wp-embed.min.js?ver=4.9.8" type="text/javascript"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-11594399-2', 'auto');
  ga('send', 'pageview');

</script>
</body>
</html>

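The request above has no exception handling; that is ignored for now. In practice, the exceptions raised by urllib are used to determine whether a request succeeded. The second argument of the BeautifulSoup constructor (for example lxml or html.parser) names the parser. If it is omitted, BeautifulSoup picks the best parser available on the system, but it emits a warning. See the bs4 documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for details.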
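As a minimal sketch of the exception handling mentioned above, the request can be wrapped in a helper that catches urllib's two main exception types. The function name fetch_html and its "return None on failure" convention are assumptions made for illustration, not part of the original code.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=20):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper: the name and the None-on-failure convention
    are illustrative choices.
    """
    request = urllib.request.Request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', err.code)
    except urllib.error.URLError as err:
        # The resource could not be reached at all.
        print('URL error:', err.reason)
    return None
```

It would be called as content = fetch_html('http://reeoo.com'), and the result checked for None before being passed to BeautifulSoup.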

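A soup can also be initialized from a file handle. Save the HTML source to a file named reo.html in the same directory as the script, then pass the open file as the argument:

with open('reo.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

This way the page can be downloaded once and analyzed afterwards, which avoids hitting the site repeatedly during testing and getting blocked. Printing soup with print() shows output essentially identical to the HTML text; internally it is a complex tree in which every node is a Python object.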
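The save-once, parse-many workflow can be sketched as follows. The markup in content is a hypothetical stand-in for the bytes returned by response.read(), and the file is written to the system temporary directory only so that the sketch runs anywhere.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by response.read().
content = b'<html><head><title>Saved page</title></head><body></body></html>'

# Save the page once ...
path = os.path.join(tempfile.gettempdir(), 'reo.html')
with open(path, 'wb') as f:
    f.write(content)

# ... then parse the local copy as often as needed, without a new request.
with open(path, 'rb') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup.title.string)
```

Opening the file in binary mode leaves encoding detection to Beautiful Soup, which is one reasonable choice when the page's encoding is not known in advance.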

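4. Getting specific tags

All the example code that follows uses the soup object created above.

4.1 Tag

A Tag object corresponds to a tag in the original HTML document and can be accessed directly by its name: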

tag = soup.title
print(tag)

Output:

<title>Reeoo - web design inspiration and website gallery</title>

4.2 Name

A Tag object's name attribute gives the name of the tag:

print(tag.name)

# title

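4.3 Attributes

A tag can have many attributes, such as id and class. Tag attributes are accessed the same way as dictionary entries.

For example, the page contains an article tag that wraps the thumbnail area: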

...
<article class="box">
   <div id="main">
   <ul id="list">
       <li id="sponsor"><div class="sponsor_tips"></div>
           <script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" id="_carbonads_js"></script>
       </li>
...

Get the value of its class attribute:

tag = soup.article
c = tag['class']
print(c)

# ['box']

All attributes can also be read at once through .attrs:

tag = soup.article
attrs = tag.attrs
print(attrs)

# {'class': ['box']}

Note: class is a multi-valued attribute, so its value is a list.
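The dictionary-style access described above can be tried on a small inline document. The markup below is a hypothetical sample, not taken from reeoo.com; it gives article two classes and an id so that the single-valued and multi-valued cases can be compared.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '<article class="box wide" id="main-article"><p>text</p></article>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.article

print(tag['id'])         # single-valued attribute: a plain string
print(tag['class'])      # multi-valued attribute: a list of strings
print(tag.get('title'))  # .get() returns None for a missing attribute
print(tag.attrs)         # all attributes as a dictionary
```

Using tag.get() instead of tag['title'] avoids a KeyError when an attribute may be absent, just as with an ordinary dict.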

 

-1- Strings inside a tag

The .string attribute returns the string contained in a tag:

tag = soup.title

s = tag.string

print(s)

# Reeoo - web design inspiration and website gallery
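One detail worth illustrating: .string only returns a value when the tag has a single child, so for tags with mixed content get_text() is usually the safer choice. The markup below is a hypothetical sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '<div class="title"><a href="/loop">Loop</a> <span>new</span></div>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.a
div = soup.div

print(link.string)     # the <a> tag has exactly one string child
print(div.string)      # None: the <div> has more than one child
print(div.get_text())  # concatenates every string inside the <div>
```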

 

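-2- Traversing the document tree

A Tag may contain several strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over children.

Child nodes

Accessing a tag name as an attribute of a Tag returns the matching tag beneath it; chaining these accesses walks down through the child nodes.

For example, to get the li inside the article tag: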

tag = soup.article.div.ul.li

print(tag)

Output:

<li id="sponsor"><div class="sponsor_tips"></div>
<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" type="text/javascript"></script>
</li>

Intermediate nodes can be omitted, and the result is the same:

tag = soup.article.li

Attribute access with . returns only the first matching tag. To get all the li tags, use the find_all() method:

ls = soup.article.div.ul.find_all('li')

The result is a list containing all the li tags.
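The difference between attribute access and find_all() is easy to see on a short list. The markup below is a hypothetical sample that borrows three site names from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '<ul id="list"><li>Loop</li><li>Hardies</li><li>Modulz</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.ul.li              # attribute access: first match only
items = soup.ul.find_all('li')  # find_all(): every match, as a list
names = [li.string for li in items]

print(first)
print(len(items))
print(names)
```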

A tag's .contents attribute returns the tag's children as a list:

tag = soup.article.div.ul
contents = tag.contents
print(contents)
for i in contents:
    print(i)

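Printing contents shows that the list holds not only the li tags but also newline strings ('\n'). Loop over it as well to see each item separately.

A tag's .children iterator lets you loop over the tag's children: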

tag = soup.article.div.ul

children = tag.children

print(children)

for child in children:

   print(child)

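Printing children shows an iterator object rather than a list. Comparing the two for loops, the nodes they yield are the same; the only difference is at the start of the output, where contents prints as a list with its brackets and escaped '\n' entries, while children prints only as an object reference.

.contents and .children contain only a tag's direct children. To iterate over the children's children as well, use the .descendants attribute, which is used in the same way as the other two.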
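A short sketch of the difference between direct children and all descendants follows. The markup is a hypothetical sample written without whitespace between tags, so no newline strings appear among the children.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup for illustration.
html = ('<ul><li><a href="/loop">Loop</a></li>'
        '<li><a href="/rand">Rand</a></li></ul>')
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# Direct children only: the two <li> tags.
direct = [child.name for child in ul.children]

# Every nested node; strings are filtered out so only tags are kept.
nested = [node.name for node in ul.descendants if isinstance(node, Tag)]

print(direct)
print(nested)
```

The isinstance check is there because .descendants also yields the text nodes ("Loop", "Rand"), which are not Tag objects.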

-3- Parent nodes

The .parent attribute returns an element's parent. The parent of article is body.

tag = soup.article
print(tag.parent.name)

# body

The .parents attribute iterates over all of an element's ancestors:

tag = soup.article

for p in tag.parents:

   print(p.name)
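To see what the .parents loop produces, here is a minimal sketch on a hypothetical document that mirrors the article/div/ul/li nesting of the page. The walk ends at the BeautifulSoup object itself, whose name is the special value [document].

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the page's nesting.
html = ('<html><body><article><div><ul><li>x</li></ul></div>'
        '</article></body></html>')
soup = BeautifulSoup(html, 'html.parser')

li = soup.li
# Collect ancestor names from the nearest parent up to the document root.
names = [p.name for p in li.parents]
print(names)
```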

 

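-4- Sibling nodes

The .next_sibling and .previous_sibling attributes are used to query sibling nodes; they are used in the same way as the other navigation attributes.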
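One thing to watch for with siblings: in real pages the node next to a tag is often a whitespace string rather than another tag. The sketch below uses a hypothetical two-item list to show this, along with find_next_sibling(), which skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; note the newline between the two <li> tags.
html = '<ul><li id="a">A</li>\n<li id="b">B</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find(id='a')

print(repr(first.next_sibling))        # the newline string, not the <li>
print(first.find_next_sibling('li'))   # the next <li> tag
print(first.previous_sibling)          # None: nothing comes before it
```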

 

-5- Searching the document tree

Searching the tree for specific elements is the most common operation when scraping.

find_all()

find_all(name, attrs, recursive, string, **kwargs)

4.4 The name argument

Finds all tags whose name is name:

soup.find_all('title')

# [<title>Reeoo - web design inspiration and website gallery</title>]

soup.find_all('footer')

# [<footer id="footer"> <div class="box"> <p> ... </div> </footer>]

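4.5 Keyword arguments

If an argument's name is not one of the built-in parameter names (name, attrs, recursive, string), it is treated as a tag attribute to search on. If no tag name is given, all tags are searched.

For example, to find all tags whose id is footer: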

soup.find_all(id='footer')

# [<footer id="footer"> <div class="box"> <p> ... </div> </footer>]

Adding the tag name as an argument:

soup.find_all('footer', id='footer')

[<footer id="footer">
 <div class="box">
 <p>
 <span class="link">
 <a href="http://designlol.net" target="_blank" title="全球設計精華分享站">Design lol</a>
 <a href="http://logojoy.com" target="_blank">Logojoy</a>
 <a href="http://www.pplock.com/" target="_blank" title="分享藝術·設計·創意">PPLock</a>
 <a href="http://reader.mx/?utm_source=reeoo&amp;utm_medium=web&amp;utm_campaign=link" target="_blank" title="Reader APP">ReaderMX</a>
 <a href="http://www.ui.cn" target="_blank">UICN</a>
 <a href="http://www.uisdc.com/" target="_blank" title="優秀網頁設計聯盟">UISDC</a>
 <a href="http://zmingcx.com/" target="_blank" title="知更鳥">Zmingcx</a>
 </span>
 <span class="link">
 <a href="https://logomaster.ai/" rel="noopener" target="_blank">Online Logo Maker</a>
 <a href="http://www.treasurebox.co.nz/outdoor-garden/greenhouse.html" rel="noopener" target="_blank">greenhouse nz</a>
 <a href="https://www.payformathhomework.com" target="_blank">Pay For Math Homework</a>- math help
 				</span>
 <a href="https://www.zessay.com/" target="_blank">Essay services</a> for college students.   
 				<a href="https://myhomeworkdone.com/" target="_blank">My Homework Done</a> really makes your homework done.   
 				<a href="http://mydissertations.com/" target="_blank">MyDissertations</a> - dissertation help on design topics.   
 						<br/>
 			Powered by <a href="http://wordpress.org/" target="_blank">WordPress</a>. © <a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>.</p>
 </div>
 </footer>]

 

To get all the thumbnail div tags, which are marked with the class thumb:

soup.find_all('div', class_='thumb')

Note that class is a reserved keyword in Python, so the argument is written with a trailing underscore: class_.
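The keyword searches above can be reproduced on a small inline document. The markup is a hypothetical sample; the third div carries two classes to show that class_ matches a tag if any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '''
<div class="thumb"><a href="/loop">Loop</a></div>
<div class="title"><a href="/loop">Loop</a></div>
<div class="thumb large"><a href="/rand">Rand</a></div>
<footer id="footer">footer text</footer>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ matches against each of a tag's classes individually.
thumbs = soup.find_all('div', class_='thumb')

# Any other keyword is treated as an attribute filter on all tags.
footers = soup.find_all(id='footer')

print(len(thumbs))
print(footers[0].name)
```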

The value of an attribute argument can be a string, a regular expression, a list, or True/False.
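Each of these value types can be sketched on a few links. The markup is a hypothetical sample modeled on the page's menu; the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the page's menu links.
html = ('<a id="link_web" href="https://reeoo.com">Web</a>'
        '<a id="link_icon" href="https://icon.reeoo.com">Icon</a>'
        '<a href="http://weibo.com/reeoocom">Weibo</a>')
soup = BeautifulSoup(html, 'html.parser')

by_string = soup.find_all(id='link_web')                 # exact value
by_regex = soup.find_all(href=re.compile(r'^https://'))  # pattern search
by_list = soup.find_all(id=['link_web', 'link_icon'])    # any listed value
by_true = soup.find_all(id=True)                         # attribute present

print(len(by_string), len(by_regex), len(by_list), len(by_true))
```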

True/False

Tests whether the given attribute is present.

Find all tags that have a target attribute:

soup.find_all(target=True)

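Find all tags that do not have a target attribute. (Look closely and the printed results still contain tags with target: those are children of tags that lack it, printed as part of their parent. Keep this in mind.)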

soup.find_all(target=False)
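The caveat about target=False can be seen on a hypothetical two-link paragraph: the outer p has no target attribute, so it matches, and printing it also prints the child link that does have one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = ('<p id="outer"><a href="/x" target="_blank">X</a>'
        '<a href="/y">Y</a></p>')
soup = BeautifulSoup(html, 'html.parser')

with_target = soup.find_all(target=True)
without_target = soup.find_all(target=False)

print([t.name for t in with_target])
print([t.name for t in without_target])

# The matched <p> itself has no target, but its printed form includes
# the nested <a target="_blank"> child.
print(without_target[0])
```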

Several arguments can be combined as filter conditions. For example, the thumbnail part of the page is marked up like this:

<li>
   <div class="thumb">
       <a href="http://reeoo.com/aim-creative-studios">![AIM Creative Studios](http://upload-images.jianshu.io/upload_images/1346917-f6281ffe1a8f0b18.gif?imageMogr2/auto-orient/strip)</a>
   </div>
   <div class="title">
       <a href="http://reeoo.com/aim-creative-studios">AIM Creative Studios</a>
   </div>
</li>

Find the tags whose src attribute matches reeoo.com and whose class is lazy:

Note: re is Python's regular expression module and must be imported first (import re).

soup.find_all(src=re.compile("reeoo.com"), class_='lazy')

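The result is all the thumbnail img tags.

Some attributes cannot be used as keyword arguments, such as data-* attributes. In the example above, data-original cannot be passed as a keyword argument; doing so raises SyntaxError: keyword can't be an expression.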
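A minimal sketch of this limitation: data-original is not a valid Python identifier, so the call fails at compile time, before Beautiful Soup is ever involved. The compile() call below is used only to trigger that error safely; the same filter passed in a dictionary works. The img tag is a sample in the style of the page's thumbnails.

```python
from bs4 import BeautifulSoup

# Sample markup in the style of the page's thumbnails.
html = ('<img class="lazy" '
        'data-original="https://reeoo.xnny.net/Loop.jpg!page" '
        'src="https://reeoo.com/assets/white.gif"/>')
soup = BeautifulSoup(html, 'html.parser')

# Python reads data-original=True as "data minus original = True",
# so the source does not even compile.
try:
    compile('soup.find_all(data-original=True)', '<demo>', 'eval')
    keyword_form_ok = True
except SyntaxError:
    keyword_form_ok = False

# Passing the attribute name as a dictionary key avoids the problem.
matches = soup.find_all(attrs={'data-original': True})

print(keyword_form_ok)
print(matches[0]['data-original'])
```

The exact SyntaxError message differs between Python versions, which is why the sketch only checks that the error is raised.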

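4.6 The attrs argument

Passing a dictionary as attrs searches for tags with the corresponding attributes, which goes some way toward solving the problem above of attributes that cannot be used as keyword arguments.

For example, to find the tags that have a data-original attribute:

print(soup.find_all(attrs={'data-original': True}))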

[<img alt="Travelshift" class="lazy" data-original="https://reeoo.xnny.net/Travelshift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Travelshift" width="300"/>,
 <img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>,
 <img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>,
 <img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>,
 <img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>,
 <img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>,
 <img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>,
 <img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>,
 <img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>,
 <img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>,
 <img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>,
 <img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>,
 <img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>,
 <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>,
 <img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>,
 <img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>,
 <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>,
 <img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>,
 <img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>,
 <img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>,
 <img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>,
 <img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>,
 <img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>]

Find the tags whose data-original attribute contains the string reeoo:

soup.find_all(attrs={'data-original':re.compile('reeoo')})

[<img alt="Travelshift" class="lazy" data-original="https://reeoo.xnny.net/Travelshift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Travelshift" width="300"/>,
 <img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>,
 <img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>,
 <img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>,
 <img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>,
 <img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>,
 <img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>,
 <img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>,
 <img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>,
 <img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>,
 <img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>,
 <img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>,
 <img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>,
 <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>,
 <img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>,
 <img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>,
 <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>,
 <img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>,
 <img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>,
 <img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>,
 <img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>,
 <img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>,
 <img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>]

Search for tags whose data-original attribute equals a given value

soup.find_all(attrs={'data-original': 'https://reeoo.xnny.net/OBSOLETE.png!page'})

[<img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>]
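The difference between the two attrs searches above can be reproduced offline. The following is a minimal sketch, assuming a small hand-written HTML fragment modelled on the output above (the third img tag and its example.com URL are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the output above
html = """
<div>
  <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page"/>
  <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page"/>
  <img alt="Other" class="lazy" data-original="https://example.com/other.png"/>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# A compiled pattern matches anywhere inside the attribute value
by_pattern = soup.find_all(attrs={'data-original': re.compile('reeoo')})

# A plain string must equal the attribute value exactly
by_value = soup.find_all(
    attrs={'data-original': 'https://reeoo.xnny.net/OBSOLETE.png!page'}
)

print([img['alt'] for img in by_pattern])  # ['Modulz', 'OBSOLETE']
print([img['alt'] for img in by_value])    # ['OBSOLETE']
```

Passing the filter through attrs is needed here because data-original contains a hyphen and so cannot be written as a Python keyword argument.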

4.7、string parameter

Works like the name parameter, but matches against the text strings in the document instead of tag names.

Search for strings containing Reeoo (the result is a list of the matching text strings, not of tags)

soup.find_all(string=re.compile("Reeoo"))
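To make the return type concrete, here is a minimal sketch, assuming a made-up two-paragraph fragment rather than the real page. It shows that the matches are strings and that the enclosing tag can be reached through .parent:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
html = (
    '<p>Welcome to <a href="https://reeoo.com">Reeoo gallery</a></p>'
    '<p>No match here</p>'
)
soup = BeautifulSoup(html, 'html.parser')

# Each match is a NavigableString (a str subclass), not a Tag
matches = soup.find_all(string=re.compile('Reeoo'))

# .parent gives the tag that directly contains each matched string
parents = [s.parent.name for s in matches]

print(matches)  # ['Reeoo gallery']
print(parents)  # ['a']
```

The string argument was introduced in Beautiful Soup 4.4.0; earlier versions call it text.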

 

4.8、limit parameter

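find_all() returns every match in the whole document, which can be slow when the document is large. With limit set, the search stops and returns as soon as that many results have been found.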

Search for div tags with class thumb, returning at most 3

soup.find_all('div', class_='thumb', limit=3)

The result is a list of 3 elements, even though more than 3 tags in the document match.
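The effect of limit can be checked without fetching the page. A minimal sketch, assuming five generated div tags stand in for the real thumbnails:

```python
from bs4 import BeautifulSoup

# Five hypothetical thumbnails, generated for illustration
html = ''.join('<div class="thumb">item %d</div>' % i for i in range(5))
soup = BeautifulSoup(html, 'html.parser')

all_thumbs = soup.find_all('div', class_='thumb')
first_three = soup.find_all('div', class_='thumb', limit=3)

print(len(all_thumbs))                      # 5
print([d.get_text() for d in first_three])  # ['item 0', 'item 1', 'item 2']
```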

4.9、recursive parameter

find_all() searches all descendants of the current tag. To search only its direct children, pass recursive=False.
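A minimal sketch of the difference, assuming a made-up nested list (the menu id is only for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: two top-level items, one of which contains a sub-list
html = '<ul id="menu"><li>top 1<ul><li>nested</li></ul></li><li>top 2</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
menu = soup.find('ul', id='menu')

# Default: every li below menu, at any depth
all_items = menu.find_all('li')

# recursive=False: only li tags that are direct children of menu
direct_items = menu.find_all('li', recursive=False)

print(len(all_items))     # 3
print(len(direct_items))  # 2
```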

 

4.10、find()

find(name, attrs, recursive, string, **kwargs)

find() takes essentially the same parameters as find_all(), but returns only the first match. It is equivalent to calling find_all() with limit set to 1.

soup.find_all('div', class_='thumb', limit=1)

soup.find('div', class_='thumb')

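Both calls find the same tag. The only difference is that find_all() returns a list containing that one element, while find() returns the element itself.

When nothing matches, find() returns None, whereas find_all() returns an empty list.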
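Because find() can return None, chaining an attribute access directly onto it raises an error when nothing matches. The sketch below, assuming a two-div fragment invented for illustration, compares the two return shapes and shows one way to guard against a missing tag:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
html = '<div class="thumb">a</div><div class="thumb">b</div>'
soup = BeautifulSoup(html, 'html.parser')

as_list = soup.find_all('div', class_='thumb', limit=1)  # list with one Tag
as_tag = soup.find('div', class_='thumb')                # the Tag itself
same = as_list[0] == as_tag

# No span exists in this fragment
missing = soup.find('span')           # None
missing_list = soup.find_all('span')  # empty list

# Guard before using the result of find()
text = missing.get_text() if missing is not None else ''

print(same, missing, len(missing_list))  # True None 0
```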

 

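4.11、CSS selectors

Passing a string to the select() method of a Tag or BeautifulSoup object finds tags using CSS selector syntax.

The semantics are the same as in CSS. Search for li tags inside ul tags inside article tags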

print(soup.select('article ul li'))

Look up by class name. Both lines below give the same result: tags whose class includes thumb

soup.select('.thumb')

soup.select('[class~=thumb]')

Look up by id: search for the tag whose id is submenu

soup.select('#submenu')

Look up by the presence of an attribute: search for li tags that have an id attribute

soup.select('li[id]')

Look up by attribute value: search for li tags whose class attribute is exactly sponsor

soup.select('li[class="sponsor"]')
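The selectors above can be tried on a small fragment. This is a minimal sketch, assuming hand-written markup that reuses the thumb, submenu and sponsor names from the examples (the featured and large classes are made up). It also shows that [class="sponsor"] compares the whole attribute value, unlike .sponsor:

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the names from the examples above
html = """
<article>
  <ul id="submenu">
    <li id="first" class="sponsor">A</li>
    <li class="sponsor featured">B</li>
    <li>C</li>
  </ul>
  <div class="thumb large">T</div>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')

nested = soup.select('article ul li')         # all three li tags
by_class = soup.select('.thumb')              # the div
by_class_attr = soup.select('[class~=thumb]') # same div
by_id = soup.select('#submenu')               # the ul
has_id = soup.select('li[id]')                # only the first li
exact = soup.select('li[class="sponsor"]')    # only the first li
contains = soup.select('li.sponsor')          # first and second li

print(len(nested), len(exact), len(contains))  # 3 1 2
```

select() always returns a list. select_one() returns the first match or None, in the same way that find() relates to find_all().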

 

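Other

Other search methods include:

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

Their parameters work much the same as those of find_all() and find(), so they are not covered in detail here. In practice, find_all() and find() are enough for the vast majority of lookups.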
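For orientation only, here is a brief sketch of how these methods navigate from a tag that has already been found, assuming a made-up three-item list:

```python
from bs4 import BeautifulSoup

# Hypothetical list used only for illustration
html = '<ul id="menu"><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

middle = soup.find('li', id='b')

parent_id = middle.find_parent('ul')['id']              # nearest enclosing ul
next_id = middle.find_next_sibling('li')['id']          # first li after middle
prev_id = middle.find_previous_sibling('li')['id']      # first li before middle

# The plural forms return every match as a list
first = soup.find('li', id='a')
later_ids = [li['id'] for li in first.find_next_siblings('li')]

print(parent_id, next_id, prev_id, later_ids)  # menu c a ['b', 'c']
```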

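There are also methods for modifying the document tree. A crawler mostly just reads information from a page and rarely needs to change its source, so those methods are not covered here either.

For full details, see the official Beautiful Soup documentation.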

 
