Basic Usage of Python BeautifulSoup

2020-06-04 07:43

Sample code:

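# coding: utf-8
from bs4 import BeautifulSoup
import urllib.request

# Fetch the page; give up if the server does not respond within 20 seconds.
url = 'http://reeoo.com'
request = urllib.request.Request(url)
response = urllib.request.urlopen(request, timeout=20)
content = response.read()  # raw bytes; BeautifulSoup works out the encoding

# Parse the HTML with Python's built-in parser and print the document.
soup = BeautifulSoup(content, 'html.parser')
print(soup)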

# html_doc = """
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title" style="color:red"><b>The Dormouse's story</b></p>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
# <a href="http://example.com/tillie" rel="external nofollow"  class="sister" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
#
# <p class="story">...</p>
# """
#
# soup = BeautifulSoup(html_doc, 'lxml')
# # print(soup.prettify())
# print(soup.p.attrs)

Key points:
1. Import the library:

from bs4 import BeautifulSoup

The library has to be installed first:

pip install beautifulsoup4

2. Initialization: parse the response with BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')

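Here 'html.parser' is the parser. It is part of Python's standard library, reasonably fast, and lenient with malformed markup (though less lenient in versions before Python 2.7.3 and Python 3.2.2). The other parsers are:
lxml: very fast and lenient, but requires the external lxml package, which depends on C libraries.
xml (also written lxml-xml): very fast and the only XML parser supported; it also requires lxml.
html5lib: the most lenient, parses pages the way a web browser does and produces valid HTML5, but is very slow. It is pure Python, yet it still has to be installed separately.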
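Since lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is only a minimal sketch: make_soup and its preferred argument are hypothetical names made up for this example, not part of Beautiful Soup. It relies on bs4 raising FeatureNotFound when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises this when the named parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

Keep in mind that different parsers can build different trees from the same malformed document, so a fallback like this is best suited to reasonably clean HTML.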

3. Features:
Print the document structure with standard indentation:

print(soup.prettify())

Print the first matching tag:

print(soup.title)

Print the text content of the first matching tag:

print(soup.title.string)

Print all matching tags:

print(soup.find_all('a'))

Find the first tag that matches an attribute value:

print(soup.find(id='link1'))
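The calls above can be tried offline against a small document. This sketch uses a shortened version of the "three sisters" sample from the commented-out code, so no network access is needed; the variable names are chosen just for the illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title              # first <title> tag in the document
title_text = soup.title.string      # the text inside that tag
links = soup.find_all("a")          # list of every <a> tag
first_link = soup.find(id="link1")  # first tag whose id is "link1"

print(soup.prettify())              # indented view of the whole tree
print(title_tag)                    # <title>The Dormouse's story</title>
print(title_text)                   # The Dormouse's story
print(len(links))                   # 3
print(first_link["href"])           # http://example.com/elsie
```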

4. Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup and Comment.

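Tag: has two important attributes, name and attrs. name is the name of the tag (for example p or a); attrs is a dictionary holding all of the tag's attributes, such as class, id and href.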

print(soup.p.name)
print(soup.p.attrs)
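To make the shape of attrs concrete, here is a small sketch on a made-up one-line document. It shows that attrs holds every attribute, that class comes back as a list because it can hold several values, and that a Tag can be indexed like a dictionary.

```python
from bs4 import BeautifulSoup

markup = '<p class="title intro" id="lead" style="color:red"><b>Hi</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

print(p.name)         # p
print(p.attrs)        # every attribute of the tag, as a dictionary
print(p["class"])     # ['title', 'intro']  (multi-valued, so a list)
print(p["id"])        # lead
print(p.get("href"))  # None, because the attribute is absent
```

Using p.get() instead of p[...] avoids a KeyError when an attribute may be missing.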

NavigableString: the text inside a tag, which can be read through the tag's .string attribute

print(soup.p.string)
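One detail worth a quick illustration is that .string only yields a value when the tag leads to exactly one piece of text; otherwise it is None. The markup below is invented for this sketch.

```python
from bs4 import BeautifulSoup

markup = "<p><b>one</b></p><div>two <i>three</i></div>"
soup = BeautifulSoup(markup, "html.parser")

single = soup.p.string   # <p> has one child <b>, which holds one string
mixed = soup.div.string  # <div> has two children, so there is no single string

print(type(single).__name__)   # NavigableString
print(single)                  # one
print(mixed)                   # None
print(list(soup.div.strings))  # every string under <div>, in document order
```

A NavigableString behaves like a regular Python str, so it can be compared, sliced, or converted with str() when a plain string is needed.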

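BeautifulSoup: represents the whole document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.
Comment: a Comment object is a special type of NavigableString. Its printed output does not include the comment delimiters, so it looks just like ordinary text; if it is not handled explicitly, it can cause unexpected problems in text processing.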

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> comment = soup.b.string
>>> print(comment)
Hey, buddy. Want to buy a used parser?
>>> type(comment)
<class 'bs4.element.Comment'>
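One way to keep comments from being mistaken for real text is to check the node type, or to remove the comments from the tree before processing. This is only a sketch on a made-up snippet, not the one required approach.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every Comment node in the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))  # 1

# <p> currently has two children (text + comment), so .string is None.
print(soup.p.string)  # None

# Remove the comments from the tree, then read the text again.
for comment in comments:
    comment.extract()
print(soup.p.string)  # Visible text
```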

5. find_all(name, attrs, recursive, string, limit, **kwargs)
(The string argument was called text before Beautiful Soup 4.4.0.)

soup.find_all('b')
soup.find_all(id='link1')
soup.find_all(attrs={'id':'link1'})
soup.find_all(class_='sister')
soup.find_all('a', string='Elsie')

The recursive argument: by default Beautiful Soup searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.
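The effect of recursive=False is easiest to see on nested markup. The snippet below is a small sketch with an invented document; it also shows the limit argument, which stops the search after a given number of matches.

```python
from bs4 import BeautifulSoup

markup = "<div><p>outer</p><section><p>inner</p></section></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

all_p = div.find_all("p")                      # every descendant <p>
direct_p = div.find_all("p", recursive=False)  # direct children only
first_only = div.find_all("p", limit=1)        # stop after the first match

print([p.string for p in all_p])       # ['outer', 'inner']
print([p.string for p in direct_p])    # ['outer']
print([p.string for p in first_only])  # ['outer']
```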

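6. find()
The only difference from find_all() is the return value: find_all() returns a list of every match (an empty list if nothing matches), while find() returns only the first match (or None if nothing matches).

7. get_text()
To get only the text contained in a tag, use get_text(). It returns all the text inside the tag, including the text of its descendants, as a single string.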
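The return values of find(), find_all() and get_text() can be compared side by side. This is a minimal sketch on an invented snippet; the separator and strip arguments of get_text() are optional and shown only to illustrate how the pieces of text are joined.

```python
from bs4 import BeautifulSoup

markup = '<div id="box"><p>Hello, <b>world</b></p><p> again </p></div>'
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.find("p")              # a single Tag
all_p = soup.find_all("p")            # a list of Tags
missing = soup.find("table")          # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches

print(first_p)       # <p>Hello, <b>world</b></p>
print(len(all_p))    # 2
print(missing)       # None
print(missing_all)   # []

# All text under <div>, concatenated as it appears in the document.
text = soup.div.get_text()
print(repr(text))    # 'Hello, world again '

# Strip each piece of text and join the pieces with a separator.
joined = soup.div.get_text(separator="|", strip=True)
print(joined)        # Hello,|world|again
```

Because find() can return None, it is worth checking the result before accessing attributes on it.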
