BeautifulSoup (Douban Example)

Original

2020-06-02 05:19

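I. Installation

1. Open a command prompt and go to the Scripts directory of your Python installation.

2. Run the following command to install BeautifulSoup: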

pip install beautifulsoup4

II. Importing

import urllib.request  # part of the standard library
from urllib.request import urlopen  # part of the standard library
from bs4 import BeautifulSoup

III. Basic Usage

1. Create a BeautifulSoup object from a string

>>> helloworld='<p>Hello World</p>'
>>> soup_string=BeautifulSoup(helloworld,"html.parser")
>>> soup_string  # display the parsed document
<p>Hello World</p>

2. Create a BeautifulSoup object from a file-like object

>>> url = "http://www.baidu.com"
>>> page = urllib.request.urlopen(url)
>>> soup = BeautifulSoup(page,"html.parser")

3. Create a BeautifulSoup object from a local file

with open('index.html', 'r') as foo_file:
    soup_foo = BeautifulSoup(foo_file, "html.parser")
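BeautifulSoup accepts any object that has a read() method, so the same call also works for in-memory streams. The snippet below is a minimal sketch, assuming an io.StringIO buffer stands in for the network response or the index.html file. The markup and the variable names are made up for illustration.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page or a local file.
fake_file = io.StringIO("<html><body><p>Hello World</p></body></html>")

# BeautifulSoup reads from the file-like object just as it would from a string.
soup_stream = BeautifulSoup(fake_file, "html.parser")

first_p = soup_stream.find("p")
print(first_p.name)    # tag name
print(first_p.string)  # text inside the tag
```

This makes it easy to try parsing logic without touching the network or the disk.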

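4. Using BeautifulSoup's find() and findAll() functions

Both functions are flexible. You can search for tags by their id attribute, by their class attribute, by a dictionary of attributes, or by matching a regular expression, among other options. find() returns the first matching tag, while findAll() returns a list of all matches. In BeautifulSoup 4, findAll() is the older spelling of find_all().

Basic usage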

pid = soup.find(attrs={"id":"aa"})
pid = soup.findAll('a',{'class':'sister'})
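The search styles listed above can be tried side by side on a small document. This is only a sketch: the markup, the href values, and the variable names are invented for illustration, and it uses find_all(), the current spelling of findAll().

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<div id="aa">
  <a class="sister" href="/elsie">Elsie</a>
  <a class="sister" href="/lacie">Lacie</a>
  <a class="brother" href="/tom">Tom</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by id: find() returns the first matching tag, or None.
by_id = soup.find(id="aa")

# Search by class: class is a reserved word in Python, hence class_.
by_class = soup.find_all("a", class_="sister")

# Search with a dictionary of attributes: find_all() returns a list-like result.
by_dict = soup.find_all("a", attrs={"class": "brother"})

# Search with a regular expression applied to an attribute value.
by_regex = soup.find_all("a", href=re.compile(r"^/l"))

print(by_id.name)
print([a.string for a in by_class])
print([a.string for a in by_dict])
print([a.string for a in by_regex])
```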

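IV. Example (Book Titles)

1. Goal: use BeautifulSoup to scrape the book titles from the Douban tag page (the URL appears in step 7).

2. First, analyze the layout of the HTML.

3. The page structure shows that every book sits inside a div whose id is "book", so the markup can be simplified to the following: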

<div id="book">
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
	<dl>Book</dl>
</div>

4. Next, expand a dl tag and look inside it.

5. The book title is an a tag with class="title". With that in mind, the markup can be simplified to the following:

<div id="book">
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
	<dl><a class="title">Book</a></dl>
</div>

6. Approach:

Use find() to locate the div with id="book"
Then use findAll() to collect every a tag with class="title" into a list
Loop over the list with a for loop

7. Python implementation

>>> import urllib.request
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> res = urllib.request.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
>>> soup = BeautifulSoup(res,"html.parser")
>>> book_div = soup.find(attrs={"id":"book"})
>>> book_a = book_div.findAll(attrs={"class":"title"})
>>> for book in book_a:
...     print(book.string)
...
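The same three steps can be checked offline against the simplified markup from step 5. The following is a sketch under stated assumptions: the HTML string and its placeholder titles are invented, and a second div is added only to show that the search stays inside id="book".

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page; the titles are placeholders.
html = """
<div id="book">
  <dl><a class="title">Book A</a></dl>
  <dl><a class="title">Book B</a></dl>
  <dl><a class="title">Book C</a></dl>
</div>
<div id="other"><a class="title">Not a book</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the container div.
book_div = soup.find(attrs={"id": "book"})

titles = []
if book_div is not None:  # find() returns None when nothing matches
    # Steps 2 and 3: collect the title links inside the container and loop over them.
    for book in book_div.find_all(attrs={"class": "title"}):
        titles.append(str(book.string))

print(titles)
```

Checking for None before searching inside book_div avoids an AttributeError if the live page changes its layout.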

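V. Example (Images)

Once you understand how to get the book titles, getting the images is easy to follow. Here the title-scraping logic is wrapped in a getText() function.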

from urllib import request
from bs4 import BeautifulSoup

# Request the page
def getHtmlCode(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }
    url1 = request.Request(url, headers=headers)
    page = request.urlopen(url1).read().decode()
    return page

# Get the book titles
def getText(page):
    soup = BeautifulSoup(page, 'html.parser')
    book_div = soup.find(attrs={"id": "book"})
    book_a = book_div.findAll(attrs={"class": "title"})
    text = "Book titles:" + "\n"
    for book in book_a:
        text += str(book.string) + "\n"
    # Write the titles to a text file
    with open('D:\\text.txt', 'w', encoding='utf-8') as f:
        f.write(text)

# Get the images
def getImg(page):
    soup = BeautifulSoup(page, 'html.parser')
    book_list = soup.find(attrs={"id": "book"})
    book_list_img = book_list.find_all('img')
    x = 0
    for book_one in book_list_img:
        book_img_url = book_one.get('src')
        # Escape the backslash so the files are saved as D:\0.jpg, D:\1.jpg, ...
        request.urlretrieve(book_img_url, 'D:\\%s.jpg' % x)
        x += 1

# Set the URL
url = 'http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book'
# Request the page
page = getHtmlCode(url)
# Get the book titles
getText(page)
# Get the images
getImg(page)
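getImg() passes each src value straight to urlretrieve(), which only works when the value is present and is an absolute URL. The sketch below shows one possible way to prepare the download list first, without touching the network. The function plan_downloads(), the example.com URLs, and the "covers" directory are all hypothetical and not part of the original script.

```python
import os
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup; the image paths are invented.
page_url = "http://example.com/books/"
html = """
<div id="book">
  <dl><img src="http://example.com/img/a.jpg"></dl>
  <dl><img src="/img/b.jpg"></dl>
  <dl><img alt="cover without src"></dl>
</div>
"""

def plan_downloads(page, base_url, out_dir):
    """Return (absolute_url, local_path) pairs for every img under id="book"."""
    soup = BeautifulSoup(page, "html.parser")
    book_list = soup.find(attrs={"id": "book"})
    if book_list is None:
        return []
    plan = []
    for img in book_list.find_all("img"):
        src = img.get("src")
        if not src:  # skip tags that have no src attribute
            continue
        # Number the files by their position in the plan.
        local_path = os.path.join(out_dir, "%d.jpg" % len(plan))
        # urljoin() resolves relative paths against the page URL.
        plan.append((urljoin(base_url, src), local_path))
    return plan

downloads = plan_downloads(html, page_url, "covers")
for img_url, path in downloads:
    print(img_url, "->", path)
```

Each pair could then be handed to request.urlretrieve(img_url, path), as in getImg().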

Acknowledgements
https://www.cnblogs.com/sunnywss/p/6644542.html
https://blog.csdn.net/huxiny/article/details/79679066
