Python Web Crawling and Information Extraction [Installing and Using the BeautifulSoup ("Delicious Soup") Library]

1. Installing the Beautiful Soup Library

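Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".
1. Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it.
2. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save you programming time.
Beautiful Soup parses HTML and XML documents. Such a document corresponds one-to-one with a tag tree, and after processing by the BeautifulSoup class the document (that is, its tag tree) becomes a BeautifulSoup object. The BeautifulSoup class is therefore the type that represents a tag tree.
Installation method 1: press Windows+R, enter cmd, and run pip install beautifulsoup4 at the command line. When it finishes, check the installation by running from bs4 import BeautifulSoup; if no error is raised, the installation succeeded.
Installation method 2: download the Beautiful Soup package from the official site, unpack it, change into the unpacked directory in cmd, and run python setup.py install. On Python 3, install the beautifulsoup4 package; the older BeautifulSoup 3 release does not support Python 3.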
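The installation check described above can also be run as a small script instead of at the interactive prompt. This is a minimal sketch that assumes beautifulsoup4 is already installed; the sample markup string is made up for illustration.

```python
# Minimal installation check (the markup string is an illustrative assumption).
import bs4
from bs4 import BeautifulSoup

# Report which bs4 release is importable in this environment.
print("bs4 version:", bs4.__version__)

# Parse a tiny document with Python's built-in parser and read it back.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.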

2. Basic Elements of the Beautiful Soup Library

import requests
r = requests.get("http://www.baidu.com")
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())

Beautiful Soup parsers:

Parser	Usage	Requirement
bs4's HTML parser (Python's built-in html.parser)	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
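Because lxml and html5lib are optional installs, a script can try the preferred parser first and fall back to the built-in one. The sketch below is only one way to do that: make_soup and its preferred argument are hypothetical names introduced here, and the code relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>hi</p>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees from the same malformed markup, so pinning a single parser is the safer choice when results must be reproducible.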

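Basic elements of the BeautifulSoup class:

Element	Description
Tag	A tag, the most basic unit of information; its start and end are marked by `<>` and `</>`
Name	The tag's name; the name of `<p>...</p>` is 'p'. Form: `<tag>.name`
Attributes	The tag's attributes, organized as a dictionary. Form: `<tag>.attrs`
NavigableString	The non-attribute string inside a tag, i.e. the string in `<>...</>`. Form: `<tag>.string`
Comment	The comment part of a string inside a tag; a special type, Comment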

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> tag.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>','html.parser')
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
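As the session shows, .string returns the comment text without the `<!-- -->` markers, so a comment and ordinary text look the same unless the type is checked. A minimal sketch of such a check follows; text_or_none is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def text_or_none(tag):
    """Hypothetical helper: return the tag's string, or None if it is a comment or missing."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(text_or_none(newsoup.b))  # the <b> tag holds only a comment
print(text_or_none(newsoup.p))  # the <p> tag holds ordinary text
```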

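3. Traversing HTML Content with the bs4 Library

Think of a page's tags as a tree: for example, <html> contains two sibling tags, <head> and <body>.
There are three ways to traverse the tree: downward, upward, and sideways (between nodes that share the same parent).
Downward traversal of the tag tree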

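Attribute	Description
.contents	A list of child nodes; all direct children of `<tag>` are stored in the list
.children	An iterator over child nodes; like .contents, used for looping over direct children
.descendants	An iterator over all descendant nodes, used for looping over them

Downward traversal of the tag tree (examples)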

for child in soup.body.children:	# iterate over direct children
    print(child)
for child in soup.body.descendants:	# iterate over all descendants
    print(child)
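The difference between direct children and all descendants is easier to see on a tiny document. The sketch below uses a made-up HTML string rather than the demo page, and keeps only Tag nodes so that text nodes do not clutter the output.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup (an assumption, not the demo page used elsewhere).
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list, so it supports len() and indexing.
n_children = len(body.contents)

# .children yields only direct children; .descendants walks the whole subtree.
child_tags = [c.name for c in body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]

print(n_children)       # 2
print(child_tags)       # ['p', 'p']
print(descendant_tags)  # ['p', 'b', 'p']
```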

Upward traversal of the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags

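import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo,'html.parser')
# Walk upward from the first <a> tag: p, body, html, [document]
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# The BeautifulSoup object is the root of the tree, so it has no parent
print(soup.parent)    # None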
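One practical use of .parents is to build the path from the root of the tree down to a tag. The following is a sketch on a made-up HTML string; tag_path is a hypothetical helper name. The BeautifulSoup object itself reports its name as '[document]'.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the demo page).
html = '<html><body><p><a href="#">x</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")


def tag_path(tag):
    """Hypothetical helper: join the names of a tag and its ancestors, root first."""
    names = [tag.name] + [parent.name for parent in tag.parents]
    return " > ".join(reversed(names))


print(tag_path(soup.a))  # [document] > html > body > p > a
```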

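Sideways traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML document order (it may be a NavigableString rather than a Tag)
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order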

Sideways traversal of the tag tree (examples)

for sibling in soup.a.next_siblings:        # iterate over following siblings
    print(sibling)
for sibling in soup.a.previous_siblings:    # iterate over preceding siblings
    print(sibling)
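Siblings include the text between tags, so a loop over .next_siblings usually yields a mix of strings and tags. The sketch below, on a made-up paragraph, shows the mix and two ways to get only the tags: filtering by type, or calling find_next_sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup (an assumption, loosely modelled on the demo page).
html = '<p>Start <a id="link1">A</a> and <a id="link2">B</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

following = list(first.next_siblings)      # text node, <a> tag, text node
preceding = list(first.previous_siblings)  # the text before the first link

# Keep only Tag siblings, skipping the text nodes between them.
tag_only = [s for s in following if isinstance(s, Tag)]

# find_next_sibling searches the following siblings for a matching tag.
next_link = first.find_next_sibling("a")

print(following)
print([t["id"] for t in tag_only])
print(next_link["id"])
```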

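4. Formatting and Encoding HTML with the bs4 Library

How can <html> content be displayed in a more "friendly" way?
The prettify() method of bs4 adds line breaks and indentation around the tags and content of an HTML document. It can also be called on an individual tag.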

print(soup.prettify())

bs4 converts every HTML file or string it reads to Unicode, and writes it back out as UTF-8 by default.
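The encoding behaviour can be checked with a short round trip. This sketch feeds bs4 bytes in GBK (an arbitrary choice for illustration) and passes from_encoding so that no guessing is needed; inside the tree the text is an ordinary Unicode string, and encode() produces UTF-8 bytes.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding; the sample text and GBK are illustrative assumptions.
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # Unicode string inside the tree
pretty = soup.prettify()           # formatted output as a str
utf8_bytes = soup.encode("utf-8")  # serialized output as UTF-8 bytes

print(text)
print(pretty)
print(utf8_bytes)
```

When from_encoding is omitted, bs4 tries to detect the encoding itself, which can be unreliable on very short inputs.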

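5. Three Forms of Information Markup

Information markup:
Marked-up information forms an organizational structure, which adds a dimension to the information.
Marked-up information can be used for communication, storage, or display.
The markup structure is as valuable as the information itself.
Marked-up information is easier for programs to understand and use.
HTML (HyperText Markup Language): organizes different types of information with predefined <>…</> tags.
The three internationally recognized forms of information markup: XML, JSON, YAML
XML: a general-purpose format for expressing information. It uses the same tag-based style as HTML; both descend from SGML.
Examples: <name>...</name>, <name />
JSON: a format built from typed key-value pairs (key: value). Pairs can be nested, and a key can hold multiple values:
"key": "value"
"key": ["value1", "value2"]
"key": {"subkey": "subvalue"}
YAML: builds information from untyped key-value pairs.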

key: value
key:    # a list of parallel values
- value1
- value2
key:
  subkey: subvalue
text: |    # a long block of text can follow

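Comparison of the three forms
XML: the earliest general-purpose markup language; highly extensible but verbose. Used for exchanging and transferring information on the Internet.
JSON: values are typed, which suits processing by programs (JavaScript); more concise than XML. Used for communication between mobile apps, the cloud, and nodes; has no comments.
YAML: values are untyped and the proportion of actual text is the highest, so readability is good. Used for configuration files of all kinds; supports comments and is easy to read.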
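The typed versus untyped distinction can be seen by writing one record in two of the formats with Python's standard library. The sketch below uses a made-up record; YAML is left out because parsing it needs a third-party package. JSON returns the year as an integer, while XML returns every value as text.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative record (all field names and values are assumptions).
record = {"name": "Basic Python", "tags": ["python", "course"], "year": 2020}

# JSON: types survive the round trip (str, list, int).
json_text = json.dumps(record)
parsed = json.loads(json_text)

# XML: every value is stored as element text, so types must be restored by hand.
root = ET.Element("course")
ET.SubElement(root, "name").text = record["name"]
for tag in record["tags"]:
    ET.SubElement(root, "tag").text = tag
ET.SubElement(root, "year").text = str(record["year"])
xml_text = ET.tostring(root, encoding="unicode")

xml_root = ET.fromstring(xml_text)
xml_year = xml_root.find("year").text
xml_tags = [e.text for e in xml_root.findall("tag")]

print(json_text)
print(xml_text)
print(type(parsed["year"]).__name__, type(xml_year).__name__)
```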
Note: this post is still being updated.
