A Crawler for Encyclopedia Entries

I. A First Look at Crawlers

1. urllib

Use urllib to fetch a page's HTML content.

from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())
'''
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
'''

2. Beautiful Soup

Use bs4 to parse the HTML.

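from urllib.request import urlopen
from bs4 import BeautifulSoup

# A response body can be read only once, so fetch the page again before parsing.
html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
# Equivalently, pass the response object itself:
# bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)   # returns the first h1 node on the page
# Equivalently:
# print(bs.html.body.h1)
# print(bs.html.h1)
# print(bs.body.h1)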
'''
<h1>An Interesting Title</h1>
'''

The second argument to BeautifulSoup selects the HTML parser. html.parser ships with Python; lxml and html5lib can be used as well.

lxml and html5lib must be installed separately.
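Because lxml and html5lib may not be installed, a script can try the preferred parser first and fall back to the built-in one. The following is a minimal sketch of that idea, not part of the original post: make_soup and its preferred tuple are hypothetical names, and the sketch relies on bs4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<html><body><h1>An Interesting Title</h1></body></html>")
print(used, soup.h1.get_text())
```

Since html.parser is always present, the loop should always find a usable parser as long as it is listed last.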

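3. Exception handling

Several kinds of exceptions usually need to be handled:

  • HTTPError: the server returned an error status, for example because the requested page does not exist
  • URLError: the server is down or cannot be reached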
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://gathub.com")
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    print(html.read())
'''
The server could not be found!
'''
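  • AttributeError: apart from problems with the site, the code itself can fail, for example by accessing an attribute of a DOM node that does not exist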
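The order of the except clauses matters because HTTPError is a subclass of URLError: if URLError were caught first, the HTTPError branch would never run. The sketch below illustrates this without touching the network. describe is a hypothetical helper, and the two exceptions are constructed by hand purely for demonstration.

```python
from urllib.error import HTTPError, URLError


def describe(exc):
    """Return a short, human-readable label for a urllib exception."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, HTTPError):
        return "HTTP error {}".format(exc.code)
    if isinstance(exc, URLError):
        return "server unreachable: {}".format(exc.reason)
    return "unexpected error"


# Exceptions built by hand so the example runs without network access.
not_found = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
no_server = URLError("name resolution failed")

print(issubclass(HTTPError, URLError))
print(describe(not_found))
print(describe(no_server))
```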

4. Putting it all together

A complete request with error handling:

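from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTPError: the server answered with an error status
        # URLError: the server could not be reached at all
        return None
    try:
        # "lxml" must be installed separately; "html.parser" works out of the box
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body is None when the page has no <body> tag
        return None
    return title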


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

II. HTML parsing

Avoid writing code like the following:

bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a')

It depends too heavily on the exact layout of the DOM nodes, so even a minor change to the site can easily break the crawler.

1. bs.find_all(), bs.find()

Query by node attributes

bs.find_all(tagName, tagAttributes) # get every tag of a given type that has the given attributes

'''
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
'''
nameList = bs.find_all('span', {'class': 'green'})  # findAll is the legacy alias of find_all
for name in nameList:
    print(name.get_text())
'''
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
...

'''

name.get_text() # strips the tags and returns only the text content

Match several attribute values at once

allText = bs.find_all('span', {'class':{'green', 'red'}})

Find all headings

titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])
'''
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
'''

Query by content

nameList = bs.find_all(string='the prince')  # string= replaces the older text= argument (Beautiful Soup 4.4.0 and later)
print(len(nameList))

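Lambda expressions

find_all() also accepts a function that takes a tag and returns True or False. The call below returns every tag that has exactly two attributes: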

bs.find_all(lambda tag: len(tag.attrs) == 2)
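The filters in this section can be tried side by side on a small inline document. The sketch below uses hand-written markup modelled on the snippet above (the id attribute is made up for the lambda example), so it runs without fetching any page.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the id attribute is invented for illustration.
html = (
    '<h1>War and Peace</h1><h2>Chapter 1</h2>'
    '<p><span class="red">Heavens! what a virulent attack!</span> replied '
    '<span class="green">the prince</span>, not in the least disconcerted. '
    '<span class="green" id="anna">Anna Pavlovna</span></p>'
)
bs = BeautifulSoup(html, 'html.parser')

# By tag name and a single attribute value
green = [tag.get_text() for tag in bs.find_all('span', {'class': 'green'})]
# By tag name and any of several attribute values
either = bs.find_all('span', {'class': ['green', 'red']})
# By a list of tag names
headings = [tag.name for tag in bs.find_all(['h1', 'h2', 'h3'])]
# By exact text content
exact_text = bs.find_all(string='the prince')
# By an arbitrary predicate on the tag
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)

print(green)
print(len(either), headings, len(exact_text), two_attrs[0]['id'])
```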

2. Navigating the DOM tree

Use a tag's parent, children, and siblings.

children

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

sibling

next_siblings and previous_siblings return generators
next_sibling and previous_sibling return a single node (a tag or a text string)

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling) 

parent

# The outer parentheses let the chained call continue across lines.
(bs.find('img',
         {'src':'../img/gifts/img1.jpg'})
   .parent.previous_sibling.get_text())
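The three navigation styles can be checked on a small inline table. The markup below is a hand-written stand-in for the gift list page, deliberately written without whitespace between tags. On real pages the whitespace between tags appears as extra text nodes among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the gift list table (no whitespace between tags).
html = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th><th>Image</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td>'
    '<td><img src="../img/gifts/img2.jpg"></td></tr>'
    '</table>'
)
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# children: the direct descendants of the table (here, its three rows)
child_names = [child.name for child in table.children]

# next_siblings: every row after the first one, i.e. the data rows
data_row_ids = [row['id'] for row in table.tr.next_siblings]

# parent + previous_sibling: from the image, up to its cell, then one cell to the left
price = (bs.find('img', {'src': '../img/gifts/img1.jpg'})
           .parent.previous_sibling.get_text())

print(child_names, data_row_ids, price)
```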

Using regular expressions

Admittedly, this is a rather trivial regular expression:

import re

images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images: 
    print(image['src'])
'''
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
'''
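To see which src values this pattern accepts, it can be applied to a few strings directly. Beautiful Soup tests a compiled pattern with its search() method, so the sketch below does the same. The candidate paths other than img1.jpg and img6.jpg are made up for illustration.

```python
import re

pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# The last two paths are invented counter-examples.
candidates = [
    '../img/gifts/img1.jpg',
    '../img/gifts/img6.jpg',
    '../img/gifts/logo.jpg',
    '../img/gifts/img1.png',
]

matches = [src for src in candidates if pattern.search(src)]
print(matches)
```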

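III. Building a real web crawler

Random walk

A random walk starts from an initial page, picks one of its links at random, follows it to the next page, and repeats.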

import requests
from bs4 import BeautifulSoup
import datetime
import random
import re
from urllib.parse import unquote

random.seed()  # seed from OS randomness or the current time; passing a datetime object raises TypeError on Python 3.11+

headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }

def getLinks(articleUrl):
    html = requests.get("http://wiki.hk.wjbk.site/baike-{}".format(articleUrl), headers=headers).text
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'content'}).find_all('a', href=re.compile('^(https:).*(baike-)((?!:).)*$'))

links = getLinks('衛生')

while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(unquote(newArticle))
    keywords = newArticle.split('/')[-1].split('-')[-1]
    links = getLinks(keywords)
'''
Output:
https://wiki.hk.wjbk.site/baike-衛生
https://wiki.hk.wjbk.site/baike-Anova
https://wiki.hk.wjbk.site/baike-趨勢圖
https://wiki.hk.wjbk.site/baike-統計學
https://wiki.hk.wjbk.site/baike-調和平均數
...
'''
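The link filter and the keyword extraction in the loop can be checked in isolation. The sketch below reuses the regular expression from getLinks and compares the original split-based extraction with an alternative that keeps everything after "baike-". is_article_link and article_key are hypothetical helper names, and the sample URLs (including the hyphenated title) are invented for illustration.

```python
import re

# Same pattern as in getLinks: https links to baike- pages whose title has no colon.
LINK_PATTERN = re.compile('^(https:).*(baike-)((?!:).)*$')


def is_article_link(href):
    """True if href looks like an ordinary article link."""
    return LINK_PATTERN.match(href) is not None


def article_key(href):
    """Keep everything after the first 'baike-', so hyphens in titles survive."""
    return href.split('baike-', 1)[1]


def original_key(href):
    """The extraction used in the loop above; it keeps only the text after the last hyphen."""
    return href.split('/')[-1].split('-')[-1]


print(is_article_link('https://wiki.hk.wjbk.site/baike-Anova'))
print(is_article_link('https://wiki.hk.wjbk.site/baike-Category:Foo'))
print(is_article_link('http://wiki.hk.wjbk.site/baike-Anova'))

hyphenated = 'https://wiki.hk.wjbk.site/baike-Go-kart'  # made-up title containing a hyphen
print(original_key(hyphenated), article_key(hyphenated))
```

For titles without a hyphen both extractions agree, so the original loop behaves correctly on the pages shown in the output above.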

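A few points in the code above deserve explanation:

unquote

Decodes the percent-encoded Chinese characters in a URL. This is purely for human readability; without it, the characters are printed in encoded form, such as %E8%86%9C%E8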
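A short round trip shows what unquote does. The URL below is the starting page from the output above, written in its percent-encoded form; quote is the inverse operation from the same module.

```python
from urllib.parse import quote, unquote

encoded = 'https://wiki.hk.wjbk.site/baike-%E8%A1%9B%E7%94%9F'

# Percent-encoded UTF-8 bytes are turned back into readable characters.
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way: each UTF-8 byte becomes a %XX escape.
print(quote('衛生'))
```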

headers

Without the headers, the site redirects the crawler into an endless loop, which is presumably one of its anti-scraping measures.

IV. Data storage with MySQL

Common SQL

Some commonly used SQL commands

CREATE DATABASE scraping; # create the database

USE scraping;  # select the database

CREATE TABLE pages (
id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), 
content VARCHAR(10000),
created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
PRIMARY KEY(id));  # create the table

DESCRIBE pages;  # show the table structure

INSERT INTO pages (title, content) VALUES ("Test page title",
"This is some test page content. It can be up to 10,000 characters long.");  # insert a row

SELECT * FROM pages WHERE id = 1;  # query a row

DELETE FROM pages WHERE id = 1;   # delete a row

DROP DATABASE scraping;  # drop the database

Running all of the SQL statements above in one go produces the following:

mysql>  CREATE DATABASE scraping;USE scraping;CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,title VARCHAR(200), content VARCHAR(10000),created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));DESCRIBE pages;INSERT INTO pages (title, content) VALUES ("Test page title","This is some test page content. It can be up to 10,000 characters long.");SELECT * FROM pages WHERE id = 1;DELETE FROM pages WHERE id = 1;DROP DATABASE scraping;
Query OK, 1 row affected (0.01 sec)

Database changed
Query OK, 0 rows affected, 1 warning (0.02 sec)

+---------+----------------+------+-----+-------------------+-------------------+
| Field   | Type           | Null | Key | Default           | Extra             |
+---------+----------------+------+-----+-------------------+-------------------+
| id      | bigint(7)      | NO   | PRI | NULL              | auto_increment    |
| title   | varchar(200)   | YES  |     | NULL              |                   |
| content | varchar(10000) | YES  |     | NULL              |                   |
| created | timestamp      | YES  |     | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
+---------+----------------+------+-----+-------------------+-------------------+
4 rows in set (0.00 sec)

Query OK, 1 row affected (0.00 sec)

+----+-----------------+-------------------------------------------------------------------------+---------------------+
| id | title           | content                                                                 | created             |
+----+-----------------+-------------------------------------------------------------------------+---------------------+
|  1 | Test page title | This is some test page content. It can be up to 10,000 characters long. | 2020-01-01 16:43:23 |
+----+-----------------+-------------------------------------------------------------------------+---------------------+
1 row in set (0.00 sec)

Query OK, 1 row affected (0.01 sec)

Query OK, 1 row affected (0.01 sec)

mysql>

Working with the database through pymysql

from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re
import requests
from urllib.parse import unquote

headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }

conn = pymysql.connect(host='127.0.0.1',
                       user='root', 
                       passwd='password', 
                       db='mysql', 
                       charset='utf8')
cur = conn.cursor()
cur.execute('USE scraping')

random.seed()  # seed from OS randomness or the current time; passing a datetime object raises TypeError on Python 3.11+

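def store(title, content):
    # Leave the %s placeholders unquoted: pymysql quotes and escapes the values itself.
    # Writing "%s" stores each value wrapped in literal single quotes.
    cur.execute('INSERT INTO pages (title, content) VALUES (%s, %s)', (title, content))
    cur.connection.commit()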

def getLinks(articleUrl):
    html = requests.get("http://wiki.hk.wjbk.site/baike-{}".format(articleUrl), headers=headers).text
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    p = bs.find('div', {'id':'mw-content-text'}).find('p')
    content = ''.join([text.strip() for text in p.find_all(string=True) 
                       if text.parent.name not in ['script','div']])  # drop JavaScript and the annotations inside div tags
    content = re.sub(r'\[[0-9]+\]', '',content)  # remove bracketed citation markers such as [1]
    store(title, content)
    return bs.find('div', {'id':'content'}).find_all('a', href=re.compile('^(https:).*(baike-)((?!:).)*$'))

links = getLinks('柯南')
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(unquote(newArticle))
        keywords = newArticle.split('/')[-1].split('-')[-1]
        links = getLinks(keywords)
finally:
    cur.close()
    conn.close()
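The store() function relies on parameter binding: the SQL text contains placeholders and the values travel separately, so the driver handles quoting and escaping. The sketch below demonstrates the same idea with the standard-library sqlite3 module, swapped in here only because it needs no running server. It is not pymysql: sqlite3 writes placeholders as ? where pymysql uses %s, and the table definition is simplified.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL `scraping` database.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE pages ('
    'id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT)'
)


def store(title, content):
    # Placeholders are never wrapped in quotes; the driver binds the values.
    cur.execute('INSERT INTO pages (title, content) VALUES (?, ?)', (title, content))
    conn.commit()


# Values containing quotes and SQL fragments are stored verbatim.
store('It\'s a "test"', "Robert'); DROP TABLE pages;--")
row = cur.execute('SELECT title, content FROM pages').fetchone()
print(row)
conn.close()
```

The stored row comes back exactly as it was passed in, with no extra quotes and without the embedded SQL being executed.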

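Result (the stored values are wrapped in single quotes because the INSERT statement that produced these rows quoted its %s placeholders):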

mysql> delete from pages;
Query OK, 49 rows affected (0.01 sec)

mysql> select * from pages;
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id | title                 | content                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | created             |
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| 83 | '柯南'                | '柯南,為塞爾特語名字,可能是指:'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:06 |
| 84 | '名偵探柯南 (電視劇)' | ''                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:08 |
| 85 | '亂馬?'               | '《亂馬?》(日語:らんま1/2)是日本漫畫家高橋留美子的戀愛喜劇漫畫,及後來陸續改編成動畫、電子遊戲、電視劇等衍生作品。 從1987年36號到1996年12號在小學館《週刊少年Sunday》連載,單行本全38冊。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 2020-01-01 17:48:13 |
| 86 | '別讓我成為潑婦'      | '《別讓我成為潑婦》(日語:じゃじゃ馬にさせないで)是日本女歌手西尾悅子(日語:西尾悅子)的歌曲。1989年4月25日由環球唱片(舊名Kitty Record)發行。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 2020-01-01 17:48:17 |
| 87 | '無線電'              | '無線電,又稱無線電波、射頻電波、電波,或射頻,是指在自由空間(包括空氣和真空)傳播的電磁波,在電磁波譜上,其波長長於 紅外線光(IR)。頻率範圍為300 GHz以下,其對應的波長範圍為1毫米以上。就像其他電磁波一樣,無線電波以光速前進。經由閃電或天文物體,可以產生自然的無線電 波。由人工產生的無線電波,被應用在無線通訊、廣播、雷達、通訊衛星、導航系統、電腦網路等應用上。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | 2020-01-01 17:48:19 |
| 88 | '國際廣播'            | '國際廣播,又稱對外廣播,是指向非本國的聽衆進行的廣播,有多種目的類型,例如新聞傳播、文化交流、也有的是政治宣傳。其以 電臺廣播為大宗,但也有電視廣播。在大部分情況下,一般通過短波波段進行,對鄰國有時也使用中波波段進行廣播。同時還使用衛星和互聯網進行廣播。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 2020-01-01 17:48:22 |
| 89 | '緬甸廣播電視臺'      | ''                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:24 |
| 90 | '泰國公共電視臺'      | ''                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:27 |
| 91 | '達新·欽那瓦'        | '塔克辛·欽那瓦(泰語:?????? ???????,皇家轉寫:Thaksin Chinnawat泰語發音:[t?ák.sǐn t??īn.nā.wát];1949年7月26日-),漢名丘達新,生於泰國北部清邁府,前泰國首相和泰愛泰黨創立人,也為前泰國皇家警察中校和商人。塔克辛的妹妹英叻·欽那瓦亦為前總理,兩人屬於第四代泰國華 人,祖籍廣東潮州府豐順縣,客家人後裔。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 2020-01-01 17:48:29 |
| 92 | '清邁府'              | '清邁府(泰語:????????????????,皇家轉寫:Changwat Chiang Mai,泰語發音:[t??ā?.wàt t???īa?.màj];蘭納語:?????????,.mw-parser-output .IPA{font-family:"Charis SIL","Doulos SIL","Linux Libertine","Segoe UI","Lucida Sans Unicode","Code2000","Gentium","Gentium Alternative","TITUS Cyberbit Basic","Arial Unicode MS","IPAPANNEW","Chrysanthi Unicode","GentiumAlt","Bitstream Vera","Bitstream Cyberbit","Hiragino Kaku Gothic Pro","Lucida Grande",sans-serif;text-decoration:none!important}.mw-parser-output .IPA a:link,.mw-parser-output .IPA a:visited{text-decoration:none!important}t?ia?.màj),泰國北部邊陲的一個府,北面與緬甸接壤。另外,清邁府與另外5個府為鄰:清萊府(東北)、南邦府(東)、南奔府(東南)、達府(南)及夜豐頌府(西)。面積為20,107平方公里。府治為清邁市。'                                                                                                                                                | 2020-01-01 17:48:32 |
| 93 | '西班牙'              | '座標:40°27′49″N3°44′57″W? / ?40.46366700000001°N 3.74922°W? /40.46366700000001; -3.74922'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 2020-01-01 17:48:35 |
| 94 | '馬德里'              | '馬德里(西班牙語:Madrid;/m??dr?d/;西班牙語:.mw-parser-output .IPA{font-family:"Charis SIL","Doulos SIL","Linux Libertine","Segoe UI","Lucida Sans Unicode","Code2000","Gentium","Gentium Alternative","TITUS Cyberbit Basic","Arial Unicode MS","IPAPANNEW","Chrysanthi Unicode","GentiumAlt","Bitstream Vera","Bitstream Cyberbit","Hiragino Kaku Gothic Pro","Lucida Grande",sans-serif;text-decoration:none!important}.mw-parser-output .IPA a:link,.mw-parser-output .IPA a:visited{text-decoration:none!important}[ma???i?])是西班牙首都及最大都市,也是馬德里自治區首府,其位置處於西班牙國土中部,曼薩納雷斯河貫穿市區。市內人口約340萬,都會區人口則約627.1萬(2010年),均佔西班牙首位。其建城於9世紀,是在摩爾人邊貿站「馬格立特」舊址上發展起來的城市;1561年,西班牙國王腓力二世將首都從托萊多遷入於此[註 1],由於其特殊的地位而得到迅速的發展,成為往後西班牙殖民帝國的運籌 中心,現今則與巴塞羅那並列為西班牙的兩大對外文化窗口。' | 2020-01-01 17:48:39 |
| 95 | '卑爾根'              | '卑爾根(挪威語:Bergen聆聽幫助·信息)是挪威第二大城市。根據政府的統計,直至2006年7月1日,卑爾根市區的人口有243,219人,如果連同郊區和周邊區域的話,則有369,099人。整個城市共分為八個區域:Arna、Bergenhus、Fana、Fyllingsdalen、Laksev?g、Ytrebygda、?rstad和?sane。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 2020-01-01 17:48:44 |
| 96 | '德國'                | '–歐洲(綠色及深灰色)–歐盟(綠色)? —? [圖例放大]'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 2020-01-01 17:48:48 |
| 97 | '國家和地區頂級域'    | '國家和地區頂級域名(Country code top-level domain,英語:ccTLD),簡稱國家頂級域,又譯國碼域名、頂級國碼域名、國碼頂 級網域名稱,或頂級國碼網域名稱,是用兩字母的國家或地區名縮寫代稱的頂級域,其域名的指定及分配,政治因素考量凌駕在技術和商業因素之上。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2020-01-01 17:48:49 |
| 98 | '.hu'                 | '.hu為匈牙利國家及地區頂級域(ccTLD)的域名。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 2020-01-01 17:48:50 |
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
16 rows in set (0.00 sec)
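The stored text for 馬德里 still contains the marker [註 1], because the pattern \[[0-9]+\] only matches purely numeric markers. A slightly broader pattern could cover both forms. The sketch below is an assumption about which markers are unwanted, not something the crawler above does. strip_markers is a hypothetical helper and the sample sentence is made up.

```python
import re

# Matches [12] as well as note markers of the form [註 1].
MARKER = re.compile(r'\[(?:註\s*)?[0-9]+\]')


def strip_markers(text):
    """Remove numeric citation markers and [註 n] note markers from text."""
    return MARKER.sub('', text)


sample = 'moved the capital here[註 1], and the city grew quickly[2].'
cleaned = strip_markers(sample)
print(cleaned)
```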