Python Web Scraping for Beginners: Organizing and Extracting Information (2)

1. General methods of information extraction

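Information extraction means pulling the content you care about out of marked-up information. The previous chapter covered three markup formats: XML, JSON, and YAML.
In general there are two basic approaches:
Method 1: Fully parse the markup, then extract the key information. Formats such as XML, JSON, and YAML require a markup parser, for example tag-tree traversal in the bs4 library: whatever information you need, you walk the tree to reach it.
Pros: the extracted information is accurate. Cons: the extraction process is tedious and slow.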

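Method 2: Ignore the markup entirely and search for the key information directly. This is like searching for a keyword in a Word document: you do not care about the document's headings or formatting, and only need a text search function.
Pros: extraction is simple and fast. Cons: the results are less accurate, because the search ignores structure.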

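Which of the two is better? In practice, a combination of both is used.
Combined method: use structural parsing together with searching to extract the key information.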
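To make the trade-off concrete, here is a minimal sketch that runs all three approaches on one tiny document. The HTML string and the variable names are made up for illustration; the pattern used for the plain search is just one possible choice.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: the paragraph text mentions href="..." without being a link.
html = ('<p>Write href="http://not-a-link" inside a tag, as in '
        '<a href="http://example.com/a">A</a>.</p>')

# Method 2: plain text search, ignoring the markup.
by_search = re.findall(r'href="(.*?)"', html)

# Method 1: parse everything and walk the whole tag tree.
soup = BeautifulSoup(html, "html.parser")
by_traversal = [node.get("href") for node in soup.descendants
                if isinstance(node, Tag) and node.name == "a"]

# Combined: search the parsed tree for <a> tags, then read the attribute.
combined = [a.get("href") for a in soup.find_all("a")]

print(by_search)
print(by_traversal)
print(combined)
```

The plain search also picks up the href that only appears in the paragraph text, while both tree-based versions return just the real link. The combined version gets the same accuracy as the full traversal with less code.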

2. A small example

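Source page for the example: http://python123.io/ws/demo.html
Task: extract every URL link in the HTML.
Approach: (1) Inspect the page source and note that all URL links are in <a> tags.
(2) Find all <a> tags.
(3) Parse each <a> tag and extract the link that follows href.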

import requests
from bs4 import BeautifulSoup      # BeautifulSoup is a class
r = requests.get("http://python123.io/ws/demo.html")
r.encoding = r.apparent_encoding
demo = r.text
soup = BeautifulSoup(demo, "html.parser")    # two arguments: the document to parse and the name of the HTML parser
for link in soup.find_all('a'):      # find_all is explained in the next section
    print(link.get('href'))

3. Searching HTML content with the bs4 library

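The method used above has the signature <>.find_all(name, attrs, recursive, string, **kwargs).
It returns a list-like object holding the search results. The parameters are described below. The regular-expression examples assume that import re has already been run.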

name: a search string matched against tag names

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):  # True matches every tag
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> for tag in soup.find_all(re.compile('b')):  # tags whose names contain 'b'; use '^b' for names that start with 'b'
...     print(tag.name)
...
body
b

attrs: a search string matched against tag attribute values; a specific attribute can be named as a keyword argument

>>> soup.find_all('p','course')   # <p> tags whose class attribute is "course"
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')   # any tag whose id is exactly "link1"; a plain string must match the whole value
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> soup.find_all(id=re.compile('link'))    # any tag whose id contains "link"
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

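recursive: whether to search all descendants; defaults to True. With recursive=False only direct children are searched, and the <a> tags are not direct children of soup.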

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]

string: a search string matched against the text between <>…</>

>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> soup.find_all(string=re.compile('python'))   # strings that contain "python"; the strings are returned, not the tags
['This is a python demo page', 'The demo python introduces several python courses.']

find_all() is used so often that it has a shorthand form:

<tag>(…) is equivalent to <tag>.find_all(…)
soup(…) is equivalent to soup.find_all(…)
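The parameters above can be tried without a network connection. The sketch below uses a shortened, hand-written stand-in for the demo page (the exact markup is an assumption, not a copy of the real page) and exercises name, attrs, recursive, string, and the call shorthand.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page; the real page may differ.
html = """<html><head><title>This is a python demo page</title></head>
<body>
<p><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# name: a single tag name or a list of names
names = [tag.name for tag in soup.find_all(["a", "b"])]

# attrs: second positional argument filters on the class attribute
course_paragraphs = soup.find_all("p", "course")

# keyword argument with a regular expression: ids containing "link"
link_ids = [tag["id"] for tag in soup.find_all(id=re.compile("link"))]

# recursive=False looks at direct children only
top_level_links = soup.find_all("a", recursive=False)
body_paragraphs = soup.body.find_all("p", recursive=False)

# string: returns the matching strings themselves
python_strings = soup.find_all(string=re.compile("python"))

# shorthand: calling a tag or the soup is the same as find_all
same = soup("a") == soup.find_all("a")
```

Note how recursive=False returns nothing when called on soup, yet finds both paragraphs when called on soup.body, because the paragraphs are direct children of body.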

Related methods:

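Method	Description
<>.find()	Searches and returns a single result (the first match, or None if nothing matches). Same parameters as .find_all().
<>.find_parents()	Searches the ancestor nodes and returns a list. Same parameters as .find_all().
<>.find_parent()	Searches the ancestor nodes and returns a single result (or None). Same parameters as .find_all().
<>.find_next_sibling()	Searches the following sibling nodes and returns a single result (or None). Same parameters as .find_all().
<>.find_next_siblings()	Searches the following sibling nodes and returns a list. Same parameters as .find_all().
<>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list. Same parameters as .find_all().
<>.find_previous_sibling()	Searches the preceding sibling nodes and returns a single result (or None). Same parameters as .find_all().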
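A short sketch of these methods on a one-line document may help. The HTML string is a made-up fragment modelled on the demo page, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up fragment modelled on the demo page.
html = ('<p class="course"><a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                      # first matching tag
second = soup.find(id="link2")
nothing = soup.find("table")                # no match gives None, not an empty list

parent_class = first.find_parent("p")["class"]
parent_count = len(first.find_parents("p"))

next_id = first.find_next_sibling("a")["id"]
next_ids = [tag["id"] for tag in first.find_next_siblings("a")]
previous_id = second.find_previous_sibling("a")["id"]
previous_ids = [tag["id"] for tag in second.find_previous_siblings("a")]
```

Because the single-result methods return None when nothing matches, it is worth checking the result before indexing into it or reading its attributes.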

4. Hands-on: a Chinese university ranking scraper

Technical approach: requests + bs4
This is a targeted scraper: it fetches only the given URL and does not follow links to other pages.

4.1 Analysis

Page to scrape: http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html
Start by opening the page and viewing its source.

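Step 1: Confirm feasibility by checking that the content we want can be found in the HTML source. Some data may be generated by a scripting language such as JavaScript, so the information is produced dynamically when you visit the page. In that case the requests library and bs4 alone cannot extract it. Scraping dynamic pages will be covered later; here we scrape a static page.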
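One quick way to do this check in code is to look for a few values you can see in the browser inside the raw HTML text. The helper below is a hypothetical sketch, not part of the original program; static_html stands in for the text that requests would return.

```python
def missing_from_static_html(html, expected_texts):
    """Return the expected strings that do NOT appear in the raw HTML."""
    return [text for text in expected_texts if text not in html]

# Hypothetical stand-in for the text returned by requests.get(url).text
static_html = "<tbody><tr><td>1</td><td>清華大學</td><td>95.9</td></tr></tbody>"

missing = missing_from_static_html(static_html, ["清華大學", "95.9", "北京大學"])
print(missing)
```

If values that are visible in the browser come back as missing, the page is probably filling them in with a script, and requests plus bs4 will not be enough on their own.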

Step 2: Check the robots protocol (robots.txt). This site places no restrictions on crawlers.
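The robots check can also be done programmatically with the standard library's urllib.robotparser. The sketch below parses robots.txt text that has already been downloaded; the is_allowed helper, the sample rules, and the example.com URLs are assumptions for illustration, not the real site's rules.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules, for illustration only.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

blocked = is_allowed(SAMPLE_ROBOTS, "http://example.com/private/data.html")
open_page = is_allowed(SAMPLE_ROBOTS, "http://example.com/ranking.html")
no_rules = is_allowed("", "http://example.com/ranking.html")
```

An empty robots.txt, as in the last call, means nothing is disallowed, which matches the "no restrictions" case described in this step.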

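Step 3: Define the functionality: print the rank, university name, and total score. We need to locate exactly where these targets sit in the HTML document.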

Step 4: Design the program structure.

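Fetch the ranking page from the network: implemented by getHTMLText()
Extract information from the page into a suitable data structure (the key step): implemented by fillUnivList()
Display and output the results from the data structure: implemented by printUnivList()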

4.2 Implementation

(1) Fetch the ranking page from the network with getHTMLText(). This code is simple, so it is not analyzed in detail.

def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:      # the request failed or returned an error status
        return ""

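(2) Extract the needed information from the page into a suitable data structure with fillUnivList().

First read the HTML and work out which tag holds our targets (rank, university name, and total score), then locate that tag.
Inspection shows that everything we need sits under a <tbody> tag.
To analyze tbody further, print its tag tree with soup.tbody.prettify(). Each university's information is in a tr tag under tbody. Printing one tr tag with soup.tbody.tr.prettify() gives: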

<tr class="alt">
 <td>
  1
 </td>
 <td>
  <div align="left">
   清華大學
  </div>
 </td>
 <td>
  北京市
 </td>
 <td>
  95.9
 </td>
 <td class="hidden-xs need-hidden indicator5">
  100.0
 </td>
 <td class="hidden-xs need-hidden indicator6" style="display:none;">
  97.90%
 </td>
 <td class="hidden-xs need-hidden indicator7" style="display:none;">
  37342
 </td>
 <td class="hidden-xs need-hidden indicator8" style="display:none;">
  1.298
 </td>
 <td class="hidden-xs need-hidden indicator9" style="display:none;">
  1177
 </td>
 <td class="hidden-xs need-hidden indicator10" style="display:none;">
  109
 </td>
 <td class="hidden-xs need-hidden indicator11" style="display:none;">
  1137711
 </td>
 <td class="hidden-xs need-hidden indicator12" style="display:none;">
  1187
 </td>
 <td class="hidden-xs need-hidden indicator13" style="display:none;">
  593522
 </td>
</tr>

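This makes the layout clear: the rank is the string in the first td tag, the university name is in the second, the province is in the third, and the total score is in the fourth. What remains is to collect and print them. The code is as follows (read the comments carefully):

def fillUnivList(ulist,html):
    soup=BeautifulSoup(html,"html.parser")
    for tr in soup.find('tbody').children:    # each tr is a child of tbody; .children returns an iterator
        if isinstance(tr,bs4.element.Tag):    # skip non-tag children such as newline strings; isinstance() checks an object's type, much like type()
            tds=tr("td")   # same as tr.find_all("td"); returns a list of all td tags in this tr
            ulist.append([tds[0].string,tds[1].string,tds[3].string])    # rank, name, total score (tds[2] is the province)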
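To see which cell holds what, the same loop can be run offline on a trimmed copy of the row shown above. This is a sketch: row_html keeps only the first four cells of the sample row and wraps them in a table so that find('tbody') works.

```python
import bs4
from bs4 import BeautifulSoup

# Trimmed copy of the sample row: only the first four td cells are kept.
row_html = """<table><tbody>
<tr class="alt"><td>1</td><td><div align="left">清華大學</div></td><td>北京市</td><td>95.9</td></tr>
</tbody></table>"""

soup = BeautifulSoup(row_html, "html.parser")
rows = []
for tr in soup.find("tbody").children:
    if isinstance(tr, bs4.element.Tag):      # skips the newline strings between tags
        rows.append([td.string for td in tr("td")])

print(rows)
```

The second cell wraps its text in a div, yet .string still returns the name, because a tag whose only child is another tag takes that child's .string. The printed row also shows the province at index 2 and the total score at index 3.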

(3) Display and output the results from the data structure with printUnivList()

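def printUnivList(ulist,num):           # \t is a tab character
    tplt="{0:^10}\t{1:{3}^10}\t{2:^10}"  # {3} pads the university-name field with format() argument 3 (the fourth argument), the full-width space
    print(tplt.format("排名","學校","分數",chr(12288)))    # header row (rank, university, score); chr(12288) is the full-width space that keeps Chinese text aligned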
    for i in range(num):
        u=ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))

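This relies on Python's format() method, which is not covered in detail here; see the Python documentation for the full format specification. The alignment options are: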

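Option	Meaning
'<'	Forces the field to be left-aligned within the available space (the default for most objects).
'>'	Forces the field to be right-aligned within the available space (the default for numbers).
'^'	Forces the field to be centered within the available space.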
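Here is a small sketch of what the template does to a single row. The row values come from the sample tr above; FULLWIDTH_SPACE and name_field are names introduced only for this illustration.

```python
FULLWIDTH_SPACE = chr(12288)   # U+3000, as wide as one Chinese character

tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
line = tplt.format("1", "清華大學", "95.9", FULLWIDTH_SPACE)

# The middle field is centered in 10 characters and padded with full-width spaces.
name_field = line.split("\t")[1]
fill_count = name_field.count(FULLWIDTH_SPACE)

print(repr(name_field))
print(len(name_field), fill_count)
```

The name takes four of the ten characters and the remaining six are full-width spaces, so every character in the column has the same display width. With the default ASCII space as padding, Chinese names of different lengths would drift out of alignment.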

4.3 Complete code

import requests
import bs4
from bs4 import BeautifulSoup      # BeautifulSoup is a class

def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except requests.RequestException:      # the request failed or returned an error status
        return ""

def fillUnivList(ulist,html):
    soup=BeautifulSoup(html,"html.parser") 
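    for tr in soup.find('tbody').children:    # each tr is a child of tbody; .children is an iterator
        if isinstance(tr,bs4.element.Tag):    # keep only tag children; isinstance() checks an object's type, much like type()
            tds=tr("td")   # same as tr.find_all("td"); returns a list
            ulist.append([tds[0].string,tds[1].string,tds[3].string])    # rank, name, total score; in the sample row in 4.2, tds[2] is the province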

def printUnivList(ulist,num):           # \t is a tab character
    tplt="{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名","學校","分數",chr(12288)))    # header row (rank, university, score); the full-width space fixes Chinese alignment
    for i in range(num):
        u=ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))

def main():
    uinfo=[]
    url="http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    html=getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo,20)   # print only the top 20 universities

main()

5. Summary

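This article covered the general methods of information extraction. With the BeautifulSoup and Requests libraries, extraction generally comes down to three steps: locate where the information sits (searching), traverse that part of the document with the parser (structural parsing), and print what you find.
A few examples used regular expressions from the re module, though only briefly. Regular expressions are important and will be covered in detail in a later article.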
