beautifulsoup

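I. Introduction

Beautiful Soup is a parsing library for HTML and XML. It can navigate, search, and modify the parse tree.

It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.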
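
The encoding behaviour can be checked with a small sketch. The sample markup is made up for illustration, the built-in html.parser is used so that nothing beyond bs4 needs to be installed, and from_encoding is passed explicitly so the result does not depend on encoding auto-detection.

```python
from bs4 import BeautifulSoup

# UTF-8 encoded bytes as input (hypothetical sample markup).
raw = "<p>café</p>".encode("utf-8")
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.p.string   # text inside the tree is a Unicode str
out = soup.p.encode()  # encode() serializes to bytes, UTF-8 by default

print(isinstance(text, str), text)
print(out)
```

Use decode() or str() instead of encode() when a Unicode string is wanted as output.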

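II. Parsers supported by Beautiful Soup

Parser	Usage	Notes
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser listed here that supports XML
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses the document the way a browser does and produces HTML5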
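
To see how the parsers differ in repairing broken markup, the following sketch feeds the same unclosed tag to each one. It also tries html.parser, Python's built-in parser, which is not in the table above and is included here only so the example produces output when lxml and html5lib are not installed. The sample string is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<p>unclosed"  # hypothetical malformed input
results = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are optional third-party packages
        results[parser] = None

for parser, repaired in results.items():
    print(parser, "->", repaired)
```

Each parser closes the p tag; whether html, head, and body wrappers are added depends on the parser.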

III. Attributes of Beautiful Soup objects

Node selectors

html='''
<html><head><title>the dormouse's story </title></head>
<body>
<p class="title" name="dromouse"><b>the dormouse story</b></p>
<p class="story"> once up on a time there were three little sisters;and their names were
<a href="http://exampe.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://exampe.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://exampe.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

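from bs4 import BeautifulSoup

# Create the BeautifulSoup object; the parser fills in the missing closing tags
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)   # name: the node's name -> title
print(soup.p.attrs)      # attrs: the node's attributes as a dict -> {'class': ['title'], 'name': 'dromouse'}
print(soup.p['class'])   # index by attribute name to get its value -> ['title']
print(soup.p.string)     # string: the node's text content -> the dormouse story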

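name: the node's name

attrs: the node's attributes (id, class, and so on), returned as a dict

string: the node's text content

Note: selecting by tag name, as in soup.p, returns only the first matching node.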
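
A minimal sketch of the first-match behaviour, using a shortened made-up document and the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>first</b></p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p              # attribute access stops at the first <p>
all_p = soup.find_all("p")  # find_all collects every <p>

print(first["class"], first.string)
print(len(all_p))
```

To reach the second p, index into the find_all() result or use the sibling attributes described further on.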

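Related-node selection:

(1) Child and descendant nodes

The contents attribute returns a node's direct children, both tags and text, as a single list.

The children attribute returns the same direct children as an iterator rather than a list.

The descendants attribute returns a generator over all descendant nodes.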
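
The difference between the three attributes shows up on a small nested fragment. This is a sketch on made-up markup, parsed with the built-in html.parser:

```python
from bs4 import BeautifulSoup, Tag

html = "<p>Hello <b>big <i>world</i></b></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

contents = p.contents            # list of direct children: text and tags
children = list(p.children)      # same nodes, consumed from an iterator
descendants = list(p.descendants)  # every node below p, in document order

# Keep only the tag nodes among the descendants
tag_names = [d.name for d in descendants if isinstance(d, Tag)]

print(contents)
print(len(children), len(descendants), tag_names)
```

Here p has two direct children (the text "Hello " and the b tag) but five descendants, because the text and tags nested inside b are counted as well.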

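(2) Parent and ancestor nodes

parent returns the node's direct parent (the parent node together with everything inside it).

parents returns a generator over all ancestor nodes.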
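
As a quick illustration on made-up markup (built-in html.parser), parent gives one node while parents walks all the way up to the BeautifulSoup object itself, whose name is '[document]':

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p><b>text</b></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

parent_name = b.parent.name                      # direct parent only
ancestor_names = [node.name for node in b.parents]  # nearest ancestor first

print(parent_name)
print(ancestor_names)
```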

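(3) Sibling nodes (nodes at the same level)

There are four attributes:

next_sibling and previous_sibling return the node's next and previous sibling.

next_siblings and previous_siblings return generators over all following and all preceding siblings.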
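
A short sketch of the four sibling attributes. The markup is made up and deliberately has no whitespace between tags, because text nodes (including line breaks) count as siblings too.

```python
from bs4 import BeautifulSoup

html = "<p><a>one</a><b>two</b><i>three</i></p>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

next_name = b.next_sibling.name      # the <i> that follows <b>
prev_name = b.previous_sibling.name  # the <a> that precedes <b>

after_a = [s.name for s in soup.a.next_siblings]       # everything after <a>
before_i = [s.name for s in soup.i.previous_siblings]  # nearest sibling first

print(next_name, prev_name)
print(after_a, before_i)
```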

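IV. Methods of Beautiful Soup objects

1 find_all(name, attrs, recursive, text, **kwargs)

name: query elements by node name

attrs: query elements by attribute

print(soup.find_all(attrs={'id': 'list-1'}))  # assumes the parsed document has an element with id="list-1"

text: match against node text. The value can be a string or a compiled regular expression. Beautiful Soup 4.4.0 and later call this parameter string; the older name text is still accepted.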
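
The three kinds of filter can be tried side by side. This sketch uses a cut-down version of the sister links and the built-in html.parser; it passes string= rather than text=, which is the name this parameter has had since Beautiful Soup 4.4.0.

```python
import re
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                       # filter on the tag name
by_attrs = soup.find_all(attrs={"id": "link2"})    # filter on an attribute
by_text = soup.find_all(string=re.compile("^L"))   # filter on the text itself

print(len(by_name))
print(by_attrs[0]["href"])
print(by_text)
```

Filtering on text returns the matching strings, not the tags that contain them.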

2 find()

find() returns only the first matching element, or None if nothing matches,

whereas find_all() returns a list of all matching elements.
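
The return types are the practical difference, as this sketch on made-up markup (built-in html.parser) shows:

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")         # a single Tag
every = soup.find_all("li")     # a list of Tags
missing = soup.find("table")    # no match: None
none_found = soup.find_all("table")  # no match: empty list

print(first.string, len(every), missing, none_found)
```

Because find() can return None, check the result before accessing attributes on it.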

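3 find_parents() and find_parent()

The former returns all matching ancestor nodes; the latter returns the direct parent.

4 find_next_siblings() and find_next_sibling()

The former returns all following siblings; the latter returns the first following sibling.

5 find_all_next() and find_next()

The former returns all matching nodes after the current node; the latter returns the first one.

6 find_all_previous() and find_previous()

The former returns all matching nodes before the current node; the latter returns the first one.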
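
The direction each method searches in is easiest to see from a node in the middle of a document. A sketch on made-up markup, using the built-in html.parser and starting from the second a tag:

```python
from bs4 import BeautifulSoup

html = "<div><p><a id='x'>one</a><a>two</a><a>three</a></p><span>tail</span></div>"
soup = BeautifulSoup(html, "html.parser")
mid = soup.find_all("a")[1]  # the <a>two</a> node

parents = [t.name for t in mid.find_parents(["p", "div"])]  # upward
parent = mid.find_parent("p").name

later_siblings = [t.string for t in mid.find_next_siblings("a")]  # same level, after
later_sibling = mid.find_next_sibling("a").string

later_spans = [t.string for t in mid.find_all_next("span")]  # anywhere after
next_a = mid.find_next("a").string

earlier_as = [t.string for t in mid.find_all_previous("a")]  # anywhere before
prev_p = mid.find_previous("p").name

print(parents, parent)
print(later_siblings, later_sibling)
print(later_spans, next_a)
print(earlier_as, prev_p)
```

Unlike the sibling methods, find_all_next() and find_all_previous() are not limited to the same level: the span is found even though it is not a sibling of the a tag.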

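V. CSS selectors

To use CSS selectors, call the select() method and pass in the selector string.

print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])

select() always returns a list of the nodes that match the selector; indexing with [0] picks the first one.

For example, select('ul li') selects every li node under every ul node, so the result is a list of all those li nodes.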

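Nested selection is supported: each node in the result can itself be searched with select().

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

This prints, for each ul node, the list of li nodes under it.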
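
The select() calls above refer to a document with ul and li nodes that these notes do not show. The sketch below supplies a hypothetical one (the ids list-1 and list-2 and the class element are assumed from the selectors) so the calls can be run end to end with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical document matching the selectors used above
html = '''
<div class="panel">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Baz</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

all_li = soup.select("ul li")                # every li under any ul
in_list2 = soup.select("#list-2 .element")   # .element nodes inside #list-2
first_ul = soup.select("ul")[0]              # first ul node

# Nested selection: run select() again on each ul
per_ul = [[li.get_text() for li in ul.select("li")] for ul in soup.select("ul")]

print(len(all_li), in_list2[0].get_text(), first_ul["id"])
print(per_ul)
```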

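Summary:

Node selectors offer limited filtering but are fast.

If you are comfortable with CSS selectors, you can use the select() method instead.

Looks like I need to go learn the front-end trio (HTML, CSS, JavaScript).
HTML: the content inside a tag.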
