1. Preparation: installing the module

1.1 Command-line installation (development environment: Python 3.6)

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html

Official documentation (Chinese translation): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

pip install beautifulsoup4

easy_install beautifulsoup4

Alternatively, download the source package (tar.gz), unpack it, and run:

python setup.py install

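Installing a parser: BeautifulSoup4 itself only provides the parsing interface. It supports the HTML parser built into Python's standard library as well as third-party parsers such as lxml and html5lib. If you do not specify a parser, BeautifulSoup4 picks the best one it can find in the current Python environment, so the same code can behave differently on different machines; it is safer to name the parser explicitly.

pip install lxml
pip install html5lib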

1.2 Comparison of the parsers

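Parser	Usage	Advantages	Disadvantages
Standard library (html.parser)	BeautifulSoup(html, 'html.parser')	Built into Python; decent speed; lenient	Slower than lxml; less lenient than html5lib
lxml (HTML)	BeautifulSoup(html, 'lxml')	Very fast; lenient	Requires the external lxml C library
lxml (XML)	BeautifulSoup(html, 'xml')	Very fast; the only XML parser supported	Requires the external lxml C library
html5lib	BeautifulSoup(html, 'html5lib')	Most lenient; parses the page the way a browser does and produces valid HTML5	Very slow; must be installed separately (pure Python, no C extension)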

In summary: lxml is the recommended first choice.
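Since the parser that is actually available differs between machines, one option is to try the parsers in order of preference and fall back to the built-in one. The following is only a sketch: the helper name make_soup and the preference order are assumptions for illustration, not part of BeautifulSoup4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not a bs4 function.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: every parser listed above repairs them
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.get_text())
```

Naming the parser explicitly, as here, also avoids the warning that BeautifulSoup4 prints when it has to guess.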

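2. BeautifulSoup4 initialization and node objects

BeautifulSoup4 (BS4) can load an HTML document directly from an open file handle (a document stream) or from an HTML string, and returns a BeautifulSoup document object.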

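from bs4 import BeautifulSoup

# Parse a local file (index.html must exist; otherwise fetch the page from the site first)
soup = BeautifulSoup(open('index.html'), 'lxml')
# Parse an HTML string
soup = BeautifulSoup('<html>……</html>', 'lxml')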
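A small self-contained sketch of both ways of building the document object. It writes a throwaway index.html to a temporary directory (an assumption made so that the example runs anywhere) and uses the built-in html.parser so that no extra package is needed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# Create a temporary index.html so the file-based call has something to read
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# From an open file handle; the with-block closes the file afterwards
with open(path, encoding="utf-8") as fp:
    soup_from_file = BeautifulSoup(fp, "html.parser")

# From a string
soup_from_string = BeautifulSoup(html, "html.parser")

print(soup_from_file.title.string)
print(soup_from_string.title.string)
```

Using a with-block rather than a bare open(...) avoids leaving the file handle open after parsing.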

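BeautifulSoup4 parses an HTML document into a tree. The tree is mainly made up of the following four kinds of node data:

Tag: a tag node
name: the name of a tag, available as tag.name
attributes: the attributes of a tag, available as tag.attrs; they behave like a dict, and a multi-valued attribute such as class has a list as its value
NavigableString: the text inside a tag; convert it to a plain string with str() (unicode() in Python 2)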
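The items above can be inspected on a one-tag document. This is a minimal sketch; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    '<p class="story intro" id="first">Once upon a time</p>', "html.parser"
)

tag = soup.p
print(tag.name)      # p
print(tag["id"])     # first
print(tag["class"])  # ['story', 'intro']  (multi-valued attribute -> list)
print(tag.attrs)     # the full dict of attributes

text = tag.string
print(isinstance(text, NavigableString))  # True

# str() turns the NavigableString into an ordinary Python string
plain = str(text)
print(plain)         # Once upon a time
```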

Worked example:

# Target data
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

from bs4 import BeautifulSoup

# Build the bs4 document object
soup = BeautifulSoup(html_doc, 'lxml')

print(type(soup)) # <class 'bs4.BeautifulSoup'>

3. Node queries: child nodes

# Query by tag name (returns the first matching tag)

print(soup.title) # <title>The Dormouse's story</title>

print(soup.p) # <p class="title"><b>The Dormouse's story</b></p>

print(soup.a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Direct children (the counts shown depend on the document and the parser)

chs = soup.contents
print(type(chs), len(chs)) # <class 'list'> 2

chs2 = soup.children
print(type(chs2), len(list(chs2))) # <class 'list_iterator'> 2

# All descendants (children, grandchildren, and so on)

chs3 = soup.descendants
print(type(chs3), len(list(chs3))) # <class 'generator'> 28
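The difference between direct children and descendants is easier to see on a tiny document. The markup below is invented for this sketch and parsed with html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>one <b>two</b></p><p>three</p></div>", "html.parser"
)
div = soup.div

# .contents is a list; .children iterates over the same direct children
children = list(div.children)
print(len(div.contents))            # 2
print([c.name for c in children])   # ['p', 'p']

# .descendants walks the whole subtree, text nodes included:
# <p>, 'one ', <b>, 'two', <p>, 'three'
descendants = list(div.descendants)
print(len(descendants))             # 6
```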

# Getting the content of a node

# Text inside a tag

print(soup.title.string) # The Dormouse's story

# Text of a text node

print(soup.title.contents[0].string) # The Dormouse's story

# .string is None when the tag has more than one child

print(soup.body.string) # None

# All text inside the tag, including whitespace-only strings (a generator)

print(soup.body.strings) # <generator object Tag._all_strings at 0x0000021B6360F6D8>

# All text inside the tag, with whitespace stripped and empty strings skipped (a generator)

print(soup.body.stripped_strings) # <generator object Tag.stripped_strings at 0x0000022BBB38F6D8>
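Because .strings and .stripped_strings are generators, printing them only shows the generator object; wrap them in list() to see the text. A short sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body>\n<p>Hello <b>big</b> world</p>\n</body>", "html.parser"
)
body = soup.body

# More than one child, so .string cannot pick a single text node
print(body.string)                    # None

# Every text node, whitespace-only ones included
all_strings = list(body.strings)
print(all_strings)                    # ['\n', 'Hello ', 'big', ' world', '\n']

# Whitespace stripped, empty strings dropped
clean = list(body.stripped_strings)
print(clean)                          # ['Hello', 'big', 'world']

# get_text() joins the pieces into one string
joined = body.get_text(" ", strip=True)
print(joined)                         # Hello big world
```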

4. Advanced queries: find/find_all

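⚫ find(name, attrs, recursive, text, **kwargs)

◼ Returns the first matching node, or None if nothing matches
◼ param name: the name of the tag to look for
◼ param attrs: a dict of attributes to match, e.g. {'class': 'title'}
◼ param recursive: whether to search all descendants (default True); with False only direct children are searched
◼ param text: match on the text content of a node (called string since bs4 4.4.0; text is the older name)
◼ param kwargs: any other keyword argument is treated as an attribute filter, e.g. id='link2'

⚫ find_all(name, attrs, recursive, text, limit, **kwargs)

◼ Returns a list of all matching nodes, or [] if nothing matches
◼ param limit: the maximum number of results to return
◼ The other parameters are the same as for find(..)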

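⚫ Filtering

◼ Tag plus attribute: find_all('p', {'class': 'title'})
◼ Any of several tags (a list means OR, not AND): find_all(['p', 'a'])
◼ By attribute: find_all(id='link2')
◼ By regular expression: find_all(text=re.compile('sisters'))
◼ By class: find_all(class_='sister') (class is a reserved word in Python, hence class_)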

⚫ Other search methods

◼ find_parent()/find_parents()
◼ find_previous_sibling()/find_previous_siblings()/find_next_sibling()/find_next_siblings()
◼ find_next()/find_all_next()/find_previous()/find_all_previous()
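The filters and navigation methods listed above can be tried on a shortened version of the sample document. This is a sketch under two assumptions: it uses html.parser, and it uses the string= keyword (bs4 4.4.0 or later) where older code uses text=.

```python
import re

from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")        # first match only
missing = soup.find("table")       # None when nothing matches

titles = soup.find_all("p", {"class": "title"})       # tag + attribute
p_or_a = soup.find_all(["p", "a"])                    # a list means OR
by_id = soup.find_all(id="link2")                     # keyword attribute
two_sisters = soup.find_all(class_="sister", limit=2) # class_ plus limit
texts = soup.find_all(string=re.compile("sisters"))   # text nodes by regex

# Navigating away from a found node
parent = soup.find(id="link2").find_parent("p")
next_link = soup.find(id="link1").find_next_sibling("a")

print(first_link["id"], missing)
print(len(titles), len(p_or_a), len(by_id), len(two_sisters), len(texts))
print(parent["class"], next_link["id"])
```

Since find() returns None when nothing matches, check the result before accessing attributes on it.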

5. Advanced queries: CSS selectors

⚫ soup.select(css_selector) # query with a CSS selector; returns a list of matching tags

print(soup.select("title")) # tag selector
print(soup.select("#link1")) # id selector
print(soup.select(".sister")) # class selector
print(soup.select("p > a")) # child selector (direct children only)
print(soup.select("p a")) # descendant selector
print(soup.select("p, a, b")) # group selector
print(soup.select("#link1 ~ .sister")) # general sibling selector (all following siblings)
print(soup.select("#link1 + .sister")) # adjacent sibling selector (the next sibling only)
print(soup.select("p[class='title']")) # attribute selector
print(soup.select("a:nth-of-type(2)")) # pseudo-class selector
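To see what each selector returns, the sketch below counts the matches on a shortened copy of the sample document. It assumes a BeautifulSoup4 release that ships with the soupsieve package (4.7 or later), which provides the CSS selector support, and uses html.parser.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Map each selector to the ids (or tag names) of the elements it matches
selectors = [
    "title", "#link1", ".sister", "p > a", "p a", "p, a, b",
    "#link1 ~ .sister", "#link1 + .sister", "p[class='title']",
    "a:nth-of-type(2)",
]
results = {
    css: [el.get("id") or el.name for el in soup.select(css)]
    for css in selectors
}
for css, matched in results.items():
    print(f"{css!r:22} -> {matched}")

# select_one() returns the first match (or None) instead of a list
last_href = soup.select_one("#link3")["href"]
print(last_href)
```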

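Practice tasks

https://www.autohome.com.cn Get the name and new-car guide price of every car listed
http://www.6pifa.net/ Scrape product images, names and prices, and store the information in separate text files by category
https://b.faloo.com/ Store the chapters of a novel and their content in separate files named after the chapters
http://www.williamlong.info/ Scrape the content and publication date of every news post
http://sports.sina.com.cn/global/ Get the title and body text of every news item about La Liga
http://www.zhcw.com/ssq/ Get the title and content of every news item about the Double Color Ball lottery
https://www.cnblogs.com/cate/python/ Get the titles and content of the Python-related posts

Reference code for the case study: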

"""
Version 1.1.0
Author lkk
Email [email protected]
date 2018-12-01 14:36
DESC News scraper
http://www.williamlong.info/
"""
from urllib import request
from bs4 import BeautifulSoup
import chardet
import re
from fake_useragent import UserAgent
import pymysql


class DownMysql:
    def __init__(self, date, title, author_1, content, classify, target, scan):
        self.date = date
        self.title = title
        self.author_1 = author_1
        self.content = content
        self.classify = classify
        self.target = target
        self.scan = scan
        self.connect = pymysql.connect(
            host='localhost',
            db='data',
            port=3306,
            user='root',
            passwd='123456',
            charset='utf8',
            use_unicode=False
        )
        self.cursor = self.connect.cursor()

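    # Save the record to MySQL, then release the connection
    def save_mysql(self):
        sql = "insert into blog(date, title, author_1, content, classify, target, scan) VALUES (%s,%s,%s,%s,%s,%s,%s)"
        try:
            self.cursor.execute(sql, (self.date, self.title, self.author_1, self.content, self.classify, self.target, self.scan))
            self.connect.commit()
            print('Row inserted')
        except Exception as e:
            self.connect.rollback()
            print(e, 'Insert failed')
        finally:
            # Each DownMysql object opens its own connection, so close it after the insert
            self.cursor.close()
            self.connect.close()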


# Create a DownMysql object for one record and save it
def mysql(date, title, author_1, content, classify, target, scan):
    down = DownMysql(date, title, author_1, content, classify, target, scan)
    down.save_mysql()


def target_data(url):
    ua = UserAgent()
    headers = {
        'User-agent': ua.random
    }
    start_url = request.Request(url, headers=headers)
    response = request.urlopen(start_url)
    data = response.read()
    # chardet may report None for the encoding; fall back to UTF-8 in that case
    encoding = chardet.detect(data).get('encoding') or 'utf-8'
    data_info = data.decode(encoding, 'ignore')
    soup = BeautifulSoup(data_info, 'lxml')
    return soup


def core(url):
    soup = target_data(url)
    date = soup.select('h4[class="post-date"]')
    title = soup.select('h2[class="post-title"]')
    content = soup.select('div[class=post-body]')
    author = soup.select('h6[class="post-footer"]')
    classify = soup.select('h6[class="post-footer"] > a:nth-of-type(1)')
    target = soup.select('h6[class="post-footer"] > a:nth-of-type(2)')
    for i in range(len(date)):
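        authors = author[i].text.strip()
        # The trailing '|' adds an empty alternative, so findall(...)[0] is '' rather than
        # an IndexError when the pattern does not match at the start of the footer text
        scan = re.findall(r'.*?瀏覽:(.*?)\s+|', authors)[0]
        author_1 = re.findall(r'作者:(.*?)\s+|', authors)[0]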
        mysql(date[i].text.strip(), title[i].text.strip(), author_1, content[i].text.strip(), classify[i].text.strip(), target[i].text.strip(), scan)


url = 'https://www.williamlong.info/cat/?page='
if __name__ == '__main__':
    for j in range(1, 185):
        next_url = url + str(j)
        print(next_url)
        core(next_url)
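The extraction step in core() can be tested without touching the network or the database. The sketch below does that on an inline sample page. The markup is hypothetical: it only reuses the class names from the selectors above, and the wrapping div class="post", the Author:/Views: footer labels and the helper names text_of, field and parse_posts are all assumptions made for illustration. Parsing one post at a time keeps the fields of a post together even when an element is missing, which index-based pairing of separate lists cannot guarantee.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: it only mirrors the class names used by the selectors above
SAMPLE_PAGE = """
<div class="post">
  <h4 class="post-date">2018-12-01</h4>
  <h2 class="post-title">First post</h2>
  <div class="post-body">Body of the first post</div>
  <h6 class="post-footer">Author:alice | Views:10</h6>
</div>
<div class="post">
  <h4 class="post-date">2018-12-02</h4>
  <h2 class="post-title">Second post</h2>
  <div class="post-body">Body of the second post</div>
  <h6 class="post-footer">Author:bob</h6>
</div>
"""


def text_of(parent, selector):
    """Stripped text of the first match under parent, or '' if there is none."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else ""


def field(pattern, text):
    """First capture group of pattern in text, or '' if it does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


def parse_posts(html):
    """Return one dict per post found in html."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for post in soup.select("div.post"):
        footer = text_of(post, "h6.post-footer")
        records.append({
            "date": text_of(post, "h4.post-date"),
            "title": text_of(post, "h2.post-title"),
            "content": text_of(post, "div.post-body"),
            "author": field(r"Author:(\w+)", footer),
            "views": field(r"Views:(\d+)", footer),
        })
    return records


records = parse_posts(SAMPLE_PAGE)
for record in records:
    print(record)
```

Each dict could then be handed to a storage function such as save_mysql, keeping parsing and storage separate.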

Python Web Scraping for Beginners, Part 4: Extracting Data with bs4
