Class notes: the BeautifulSoup module for web scraping

Original

2020-06-04 06:06

Class notes

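1. Introduction to bs4

1.1 Basic concept
Beautiful Soup is a Python library for extracting data from HTML or XML files.
1.2 Source code and installation
• Download the source code from GitHub
• Install
• pip install lxml
• pip install bs4

2. Using bs4

2.1 Quick start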

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


# Build the BeautifulSoup object, using lxml as the parser
bs = BeautifulSoup(html_doc, 'lxml')

# Print the document with normalized indentation
print(bs.prettify())
print(bs.title)         # the whole title tag: <title>The Dormouse's story</title>
print(bs.title.name)    # the tag name: title
print(bs.title.string)  # the text inside the tag: The Dormouse's story
print(bs.p)             # the first p tag in the document

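2.2 Kinds of objects in bs4

• Tag : an HTML or XML tag
• NavigableString : the text inside a tag, which can itself be navigated
• BeautifulSoup : the object representing the whole parsed document
• Comment : a comment in the document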
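3. Traversing the tree: child nodes

There are three kinds of operations on a bs object: traversing, searching, and modifying.

3.1 contents children descendants

• contents returns a list of the tag's direct children
• children returns an iterator over the same direct children
• descendants returns a generator that walks every nested node recursively (children, grandchildren, and so on)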
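The difference between the three attributes is easiest to see side by side. A minimal sketch, assuming a made-up nested snippet and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a div holding two paragraphs, one with a nested <b>
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct_list = div.contents                             # a plain list of direct children
direct_names = [child.name for child in div.children]  # iterate over direct children
all_nodes = list(div.descendants)                      # every nested node, text included

print(len(direct_list))  # 2
print(direct_names)      # ['p', 'p']
print(len(all_nodes))    # 6: p, 'one ', b, 'two', p, 'three'
```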

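3.2 .string .strings .stripped_strings

• string returns the text of a tag that contains a single string; it is None when the tag has several children
• strings returns a generator that yields every string inside a tag
• stripped_strings works like strings, but strips surrounding whitespace and skips strings that are only whitespace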
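A short sketch of the three attributes on the same tag, assuming a made-up snippet with deliberate extra whitespace and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with padding spaces and a newline between the paragraphs
soup = BeautifulSoup("<div><p>  first  </p>\n<p>second</p></div>", "html.parser")

single = soup.p.string                   # the one string inside the first <p>
none_case = soup.div.string              # None, because <div> has several children
raw = list(soup.div.strings)             # all strings, whitespace kept
clean = list(soup.div.stripped_strings)  # trimmed, whitespace-only strings dropped

print(repr(single))  # '  first  '
print(none_case)     # None
print(raw)           # ['  first  ', '\n', 'second']
print(clean)         # ['first', 'second']
```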

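4. Traversing the tree: parent nodes

parent and parents
• parent returns the direct parent node
• parents iterates over all ancestors, from the direct parent up to the document itself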
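As an illustration, walking upwards from a nested tag might look like this. The snippet is made up, and html.parser is used so that no wrapper tags are added by the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical document with a <b> nested three levels deep
soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")
b = soup.b

direct_parent = b.parent.name            # name of the direct parent
ancestors = [p.name for p in b.parents]  # every ancestor, innermost first

print(direct_parent)  # p
print(ancestors)      # ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.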

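5. Traversing the tree: sibling nodes

• next_sibling : the next sibling node
• previous_sibling : the previous sibling node
• next_siblings : all following sibling nodes
• previous_siblings : all preceding sibling nodes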
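A minimal sketch of sibling navigation, assuming a made-up snippet written without whitespace between the tags and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three links with no whitespace between them
html = "<p><a id='a1'>A</a><a id='a2'>B</a><a id='a3'>C</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="a1")
middle = soup.find(id="a2")
last = soup.find(id="a3")

next_id = middle.next_sibling["id"]                  # the node right after
prev_id = middle.previous_sibling["id"]              # the node right before
after = [s["id"] for s in first.next_siblings]       # all following siblings
before = [s["id"] for s in last.previous_siblings]   # all preceding, nearest first

print(next_id, prev_id)  # a3 a1
print(after)             # ['a2', 'a3']
print(before)            # ['a2', 'a1']
```

In real pages the tags are usually separated by newlines, so the sibling of a tag is often a whitespace string rather than the next tag.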

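6. Searching the tree

• String filter
• Regular expression filter
Compile a pattern with re.compile() and pass it to find() or find_all(); tag names are then matched against the regular expression.
• List filter
• True filter
• Function filter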
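The five filter kinds can be compared on one small document. This is a sketch under assumptions: the HTML and the helper name has_id_but_no_href are made up for illustration, and html.parser is used as the parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with a handful of different tags
html = ("<html><head><title>T</title></head>"
        "<body><p id='x'>one</p><b>two</b><a href='/l'>three</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

# String filter: exact tag name
by_string = [t.name for t in soup.find_all("b")]
# Regular expression filter: tag names starting with "b"
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# List filter: any tag name in the list
by_list = [t.name for t in soup.find_all(["a", "b"])]
# True filter: every tag, but no text nodes
by_true = [t.name for t in soup.find_all(True)]

# Function filter: a function that takes a tag and returns True or False
def has_id_but_no_href(tag):
    return tag.has_attr("id") and not tag.has_attr("href")

by_function = [t.name for t in soup.find_all(has_id_but_no_href)]

print(by_string)    # ['b']
print(by_regex)     # ['body', 'b']
print(by_list)      # ['b', 'a']
print(by_true)      # ['html', 'head', 'title', 'body', 'p', 'b', 'a']
print(by_function)  # ['p']
```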

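7. find_all() and find()

7.1 find_all()

• find_all() returns all matching tags as a list
• find() returns only the first match
• Parameters of find_all()
def find_all(self, name=None, attrs={}, recursive=True, text=None,
limit=None, **kwargs):
• name : the tag name
• attrs : the tag's attributes
• recursive : whether to search all descendants or only direct children
• text : the text content to match (Beautiful Soup 4.4.0 and later also accept it under the name string)
• limit : the maximum number of results to return
• kwargs : keyword arguments, treated as attribute filters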
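The parameters listed above can be tried one at a time. A minimal sketch, assuming a made-up snippet and html.parser; it passes the text filter under its newer name string, which assumes Beautiful Soup 4.4.0 or later:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two links directly in the div, one nested inside a <p>
html = """<div><a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a>
<p><a class="other" id="link3" href="/tillie">Tillie</a></p></div>"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                            # name
sisters = soup.find_all("a", attrs={"class": "sister"})   # attrs
by_kwarg = soup.find_all(id="link2")                      # kwargs as attribute filter
first_two = soup.find_all("a", limit=2)                   # limit
direct_only = soup.div.find_all("a", recursive=False)     # direct children only
by_text = soup.find_all("a", string="Elsie")              # text filter (assumes bs4 >= 4.4.0)
first = soup.find("a")                                    # first match only

print(len(all_links), len(sisters), len(by_kwarg))  # 3 2 1
print([a["id"] for a in first_two])                  # ['link1', 'link2']
print([a["id"] for a in direct_only])                # ['link1', 'link2']
print(by_text[0]["id"], first["id"])                 # link1 link1
```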

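7.2 find_parents() find_parent() find_next_siblings() find_next_sibling()

• find_parents() searches all ancestors
• find_parent() returns the first matching ancestor
• find_next_siblings() searches all following siblings
• find_next_sibling() returns the first matching following sibling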
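A short sketch of these four methods on a small list, assuming made-up HTML and html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a list with three items inside a div
html = "<div><ul><li id='a'>A</li><li id='b'>B</li><li id='c'>C</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", id="a")

div_ancestors = [t.name for t in first.find_parents("div")]  # all matching ancestors
nearest_ul = first.find_parent("ul").name                    # first matching ancestor
later = [t["id"] for t in first.find_next_siblings("li")]    # all following siblings
nearest = first.find_next_sibling("li")["id"]                # first following sibling

print(div_ancestors)  # ['div']
print(nearest_ul)     # ul
print(later)          # ['b', 'c']
print(nearest)        # b
```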

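7.3 find_previous_siblings() find_previous_sibling() find_all_next() find_next()

• find_previous_siblings() searches all preceding siblings
• find_previous_sibling() returns the first matching preceding sibling
• find_all_next() searches all elements that come later in the document
• find_next() returns the first matching element that comes later in the document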
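The backward and forward searches can be sketched the same way. The HTML is made up for illustration and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a heading followed by three paragraphs
html = ("<div><h2>Title</h2><p id='p1'>one</p><p id='p2'>two</p>"
        "<p id='p3'>three <b>bold</b></p></div>")
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="p1")
last = soup.find(id="p3")

earlier = [t["id"] for t in last.find_previous_siblings("p")]  # nearest first
heading = last.find_previous_sibling("h2").name                # first matching one
following = [t["id"] for t in first.find_all_next("p")]        # later in the document
next_bold = first.find_next("b").string                        # first later match

print(earlier)    # ['p2', 'p1']
print(heading)    # h2
print(following)  # ['p2', 'p3']
print(next_bold)  # bold
```

Unlike the sibling methods, find_all_next() and find_next() are not limited to siblings: here find_next("b") reaches a tag nested inside a later paragraph.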

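8. Modifying the document tree

• Change a tag's name and attributes
• Assign to the .string attribute: the new value replaces the tag's existing content
• append() adds content to a tag, much like the .append() method of a Python list
• decompose() removes a tag from the tree and destroys it, which is useful for dropping passages that are not needed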
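The four modifications above can be applied in sequence to one small document. A minimal sketch, assuming made-up HTML and html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag to edit and one paragraph to delete
soup = BeautifulSoup('<div><b class="old">bold</b><p>drop me</p></div>', "html.parser")
tag = soup.b

tag.name = "strong"           # rename the tag
tag["class"] = "new"          # change an existing attribute
tag["id"] = "t1"              # add a new attribute
tag.string = "replaced"       # replace the tag's content
tag.append(" and appended")   # add content after the existing content
soup.p.decompose()            # delete the paragraph from the tree

print(soup)
# <div><strong class="new" id="t1">replaced and appended</strong></div>
```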
