Web Scraping: Parsing Web Pages

Any time you scrape data from the web you have to parse page content. This post summarizes the approaches I use most often:

(1) BeautifulSoup

https://www.bilibili.com/video/av93140655?from=search&seid=18437810415575324694

The BeautifulSoup class in the bs4 library makes parsing a page straightforward. A few of its commonly used methods are worth knowing:

soup.find
soup.find_all
soup.select_one
soup.select

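There are also attributes such as soup.head.contents (a list of the tag's direct children) and soup.head.children (an iterator over the same children). For a single tag a, note a.attrs (a dict of all its attributes) and a.string (the tag's text; it matches a.get_text() only when the tag has a single text child, and is None when the tag has several children).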
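
The methods and attributes above can be tried on a small inline document. This is a minimal sketch: the HTML snippet and every variable name are made up for illustration, and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = """<html><head><title>Demo</title></head>
<body>
  <a id="first" class="link" href="/a">Alpha</a>
  <a class="link" href="/b">Beta <b>bold</b></a>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                # first matching tag
all_links = soup.find_all("a")        # list of every matching tag
by_id = soup.select_one("#first")     # first match for a CSS selector
by_class = soup.select("a.link")      # list of matches for a CSS selector

head_children = list(soup.head.children)  # same items as soup.head.contents

attrs = first.attrs                   # dict of attributes; class is a list
second = all_links[1]
text_single = first.string            # 'Alpha': the tag has one text child
text_none = second.string             # None: the tag has several children
text_full = second.get_text()         # 'Beta bold': all nested text joined

print(attrs, text_single, text_none, text_full)
```

The second link shows why get_text() is usually the safer choice when a tag may contain nested markup.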

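Note: in CSS, a tag name is written bare, a class name (the value inside the quotes in class="className") is prefixed with a dot, and an id (the value inside the quotes in id="idName") is prefixed with #. The same syntax can be used here to filter elements through soup.select(), which returns a list. See https://www.cnblogs.com/kangblog/p/9153871.html. The :nth-of-type(n) selector matches the nth sibling of its own element type. It is close to :nth-child(n), which matches an element only if it is the nth child of its parent regardless of type.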
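
The difference between :nth-of-type and :nth-child is easiest to see when the siblings are of mixed types. A small sketch, assuming bs4 is installed; the HTML and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <h2> followed by two <p> siblings
html = """<div id="box">
  <h2>Title</h2>
  <p>one</p>
  <p>two</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# First <p> among the <p> siblings
first_p_of_type = [p.get_text() for p in soup.select("#box > p:nth-of-type(1)")]

# A <p> that is also the first child of #box: none, the first child is <h2>
p_as_first_child = [p.get_text() for p in soup.select("#box > p:nth-child(1)")]

# A <p> that is the second child of #box
p_as_second_child = [p.get_text() for p in soup.select("#box > p:nth-child(2)")]

print(first_p_of_type, p_as_first_child, p_as_second_child)
```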

import requests
from bs4 import BeautifulSoup

class Douban:
    def __init__(self):
        self.URL = 'https://movie.douban.com/top250'
        # Page offsets 0, 25, ..., 225: ten pages of 25 films each
        self.starnum = list(range(0, 250, 25))
        self.header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

    def get_top250(self):
        for start in self.starnum:
            html = requests.get(self.URL, params={'start': start}, headers=self.header)
            soup = BeautifulSoup(html.text, "html.parser")
            # '>' means "direct child of"; leave a space on each side of it
            names = soup.select('#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-of-type(1)')
            for name in names:
                print(name.get_text())

if __name__ == "__main__":
    douban = Douban()
    douban.get_top250()

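soup.select() and soup.find_all() differ in how you describe the target: the former takes a CSS selector, often a path through the tree, while the latter filters on local features such as tag name and attributes. For the differences in practice see https://www.cnblogs.com/suancaipaofan/p/11786046.html; pick whichever you prefer.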
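
To make the comparison concrete, here is the same extraction done both ways. It is only a sketch: the HTML loosely imitates a film list and is not the real Douban markup, and the class names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a film list
html = """<ol>
  <li><div class="hd"><a href="/m/1">
    <span class="title">Movie A</span><span class="other">alias A</span></a></div></li>
  <li><div class="hd"><a href="/m/2">
    <span class="title">Movie B</span><span class="other">alias B</span></a></div></li>
</ol>"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: describe the path down to the first <span> in each link
via_select = [s.get_text() for s in soup.select("div.hd > a > span:nth-of-type(1)")]

# find_all: filter on a local feature, here the tag name plus its class
via_find_all = [s.get_text() for s in soup.find_all("span", class_="title")]

print(via_select, via_find_all)
```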

(2) re

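A regular expression is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"). It is typically used to match, search, replace, and split text that fits a pattern.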

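Single characters:
    .   : any character except a newline
    []  : any one character from the set, e.g. [aoe], [a-w]
    \d  : a digit ([0-9] for ASCII text)
    \D  : a non-digit
    \w  : a digit, letter, or underscore (for Python 3 str patterns, also other Unicode word characters such as Chinese)
    \W  : anything \w does not match
    \s  : any whitespace character, including space, tab, and form feed; for ASCII text equivalent to [ \f\n\r\t\v]
    \S  : any non-whitespace character

Quantifiers:
    *     : any number of times (>= 0)
    +     : at least once (>= 1)
    ?     : optional, 0 or 1 time
    {m}   : exactly m times, e.g. hello{3} matches "hellooo"
    {m,}  : at least m times, e.g. hello{3,}
    {m,n} : between m and n times

Anchors:
    $ : ends with (end of string, or end of each line with re.M)
    ^ : starts with (start of string, or start of each line with re.M)

Grouping:
    (ab)

Greedy: .*
Non-greedy (lazy): .*?

Flags:
    re.I : ignore case
    re.M : multi-line mode; ^ and $ also match at each line boundary
    re.S : "single-line" (DOTALL) mode; . also matches newlines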
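
The greedy/lazy distinction and the three flags can be checked with a few lines of standard-library Python. The sample strings and variable names below are invented for the sketch.

```python
import re

text = "<b>one</b><b>two</b>"
# Greedy: .* runs to the last </b> it can reach
greedy = re.findall(r"<b>(.*)</b>", text)
# Lazy: .*? stops at the first </b>
lazy = re.findall(r"<b>(.*?)</b>", text)

multi = "first line\nsecond line"
# Without re.M, ^ matches only at the start of the whole string
start_default = re.findall(r"^\w+", multi)
# With re.M, ^ also matches right after each newline
start_multiline = re.findall(r"^\w+", multi, re.M)

# Without re.S, . refuses to match the newline
dot_default = re.findall(r"line.second", multi)
# With re.S, . matches the newline too
dot_dotall = re.findall(r"line.second", multi, re.S)

# re.I ignores case
ignore_case = re.findall(r"html", "HTML Html", re.I)

print(greedy, lazy, start_default, start_multiline,
      dot_default, dot_dotall, ignore_case)
```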

re.sub(pattern, replacement, string)
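
A few illustrative calls to re.sub, using made-up input strings. The replacement can refer to captured groups with \1, \2, and so on.

```python
import re

# Replace every run of digits with a placeholder
masked = re.sub(r"\d+", "#", "tel 123 and 4567")

# Swap two captured groups using backreferences in the replacement
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# Collapse any run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "a  b\t\nc")

print(masked, swapped, collapsed)
```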

Worked examples:

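import re

# Extract 170
string = 'I like girls who are 170 cm tall'
re.findall(r'\d+', string)  # ['170']

# Extract both http:// and https://
key = 'http://www.baidu.com and https://boob.com'
re.findall(r'https?://', key)  # ['http://', 'https://']

# Extract hello from between the mixed-case tags
key = 'lalala<hTml>hello</HtMl>hahah'
# findall returns only the captured group, not the whole <hTml>hello</HtMl> match
re.findall(r'<[Hh][Tt][mM][lL]>(.*)</[Hh][Tt][mM][lL]>', key)  # ['hello']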



### Scrape images from Qiushibaike and save them
import requests
import re
import os

# Create the output folder if it does not exist yet
os.makedirs('./qiutuLibs', exist_ok=True)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# A reusable URL template; %d is the page number
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = url % page                                  # fill in the page number
    page_text = requests.get(url=new_url, headers=headers).text

    # Parse out the image addresses
    ex = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    src_list = re.findall(ex, page_text, re.S)            # re.S lets . match the \n in the page source

    # The src value is not a complete URL: the scheme is missing
    for src in src_list:
        src = 'https:' + src
        # Request each image URL separately; .content returns the response body as bytes
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!')

(3) lxml

Parse the fetched page with lxml, then filter its content with XPath. Two details:

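/ versus //: / selects only direct children, while // selects descendants at any depth. // is used more often, but it depends on the situation.
contains: when an attribute holds several values, use the contains() function to match on one of them.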
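
Both points can be seen on a tiny document. This sketch assumes lxml is installed; the HTML and the variable names are hypothetical.

```python
from lxml import etree

# Hypothetical snippet: one <a> directly under <body>, one nested in a <div>
html = """<html><body>
  <a href="/top">top</a>
  <div class="nav main"><a href="/inner">inner</a></div>
</body></html>"""
page = etree.HTML(html)

# / walks one level at a time: only the <a> that is a direct child of <body>
direct = page.xpath("/html/body/a/text()")

# // searches all descendants: both <a> elements
nested = page.xpath("//a/text()")

# An exact comparison fails because the attribute value is "nav main"
exact = page.xpath('//div[@class="nav"]')

# contains() matches on part of the attribute value
partial = page.xpath('//div[contains(@class, "nav")]/a/@href')

print(direct, nested, exact, partial)
```

Note that contains() is a plain substring test, so "nav" would also match a class such as "navbar".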

The basic usage looks like this:

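from lxml import etree
# html is assumed to hold the raw bytes of a page fetched earlier
page = etree.HTML(html.decode('utf-8'))

# <a> tags
tags = page.xpath('/html/body/a')
print(tags)
# Every <a> that is a direct child of <body> under <html>
# Result: [<Element a at 0x34b1f08>, ...]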

Example: scraping the top 20 most popular programming languages:

# Import the required libraries
import urllib.request as urlrequest
from lxml import etree
import pandas as pd

# Fetch the HTML
url = r'https://www.tiobe.com/tiobe-index/'
page = urlrequest.urlopen(url).read()
# Build the lxml tree
html = etree.HTML(page)

# Parse the HTML and pick out the text of every cell in the top-20 table
df = html.xpath('//table[contains(@class, "table-top20")]/tbody/tr//text()')
# Group the flat list into rows of 5 values and load them into a DataFrame
tmp = []
for i in range(0, len(df), 5):
    tmp.append(df[i: i+5])
df = pd.DataFrame(tmp)

All three approaches above can parse a page and pull out the content you want to scrape. To finish, a bonus comparison of the three in a single script:

"""
Download every image from the meizitu listing pages
"""
import requests
import re
import os
from bs4 import BeautifulSoup
from lxml import etree

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
    "cookie": "__cfduid=db8ad4a271aefd9fe52709ba6d3d94f561583915551; UM_distinctid=170c8b96544233-0c58b81612557b-404b032d-100200-170c8b96545354" ,
    "accept": "text/html,application/xhtml+xml,application/xml;q=0."
}


# Make sure the output directory exists
os.makedirs("./meinv", exist_ok=True)
cnt = 48
for page in range(2):
    response = requests.get(f"https://www.meizitu.com/a/list_1_{page+1}.html", headers=header, stream=True, timeout=(3,7))
    response.encoding = "gb2312"
    if response.status_code == 200:
        # Option 1: regular expression
        #result = re.findall("""<a target='_blank' href=".*?"><img src="(.*?)" alt="(.*?)"></a>""", response.text)
        # Option 2: BeautifulSoup
        result2 = BeautifulSoup(response.text, 'html.parser').find_all("img")
        # Option 3: lxml
        #result3 = etree.HTML(response.text).xpath("//img/@src")
        for i in result2:
            path = i.attrs["src"]
            try:
                img = requests.get(path, headers=header, stream=True, timeout=(3,7))
                with open(f"./meinv/{cnt}.jpg", "wb") as f:
                    # .content holds the raw bytes of the response
                    f.write(img.content)
                cnt += 1
            except (requests.RequestException, OSError):
                # Skip images that fail to download or save
                pass
        print(f"Page {page+1} fetched successfully, enjoy")
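
The three parsing options in the script above can also be compared offline on one inline snippet. This is a sketch under stated assumptions: the HTML and the example.com URLs are invented, and bs4 and lxml must be installed.

```python
import re
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup shaped like the image links matched above
html = """<div>
<a target='_blank' href="/p/1"><img src="https://example.com/1.jpg" alt="one"></a>
<a target='_blank' href="/p/2"><img src="https://example.com/2.jpg" alt="two"></a>
</div>"""

# Regular expression: capture the src attribute value
via_re = re.findall(r'<img src="(.*?)"', html)

# BeautifulSoup: find every <img> and read its src attribute
via_bs4 = [img["src"] for img in BeautifulSoup(html, "html.parser").find_all("img")]

# lxml: select the src attribute of every <img> with XPath
via_lxml = [str(src) for src in etree.HTML(html).xpath("//img/@src")]

print(via_re == via_bs4 == via_lxml)
```

The regular expression depends on the exact attribute order and quoting, while the two parser-based versions keep working if the markup is reformatted.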

To be continued!
