I. A First Look at Web Scraping

1. urllib

Use urllib to fetch a page's HTML content

from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())
'''
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
'''
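
The b'...' prefix in the output shows that html.read() returns bytes rather than str. A minimal sketch of decoding them, assuming the page is UTF-8 encoded (the sample bytes are a shortened copy of the response above; raw and text are illustrative names):

```python
# Shortened copy of the bytes printed above; a real response comes from urlopen(...).read()
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded; other pages may declare a different charset
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)
print(text.splitlines()[2])
```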

2. Beautiful Soup

Use bs4 to parse the HTML

from bs4 import BeautifulSoup

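# html is the response object returned by urlopen(). A response body can only be
# read once, so call urlopen() again if html.read() has already been consumed.
bs = BeautifulSoup(html.read(), 'html.parser')
# Equivalently:
# bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)   # returns the first h1 node on the page
# Equivalently:
# print(bs.html.body.h1)
# print(bs.html.h1)
# print(bs.body.h1)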
'''
<h1>An Interesting Title</h1>
'''

The second argument to BeautifulSoup selects the HTML parser. html.parser ships with Python; lxml and html5lib can be used as well.

lxml and html5lib have to be installed separately.
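
Because the optional parsers may be missing on a given machine, one possible approach is to try them in order and fall back to the built-in one. This is only a sketch: make_soup and its preferred argument are hypothetical names, and it relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name)
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError('none of the requested parsers is available')


soup = make_soup('<html><body><h1>An Interesting Title</h1></body></html>')
print(soup.h1)
```

html.parser is listed last because it is always available, so the loop ends there at the latest.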

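3. Exception handling

Two kinds of exceptions usually need handling:

HTTPError: the server responded with an error status, for example because the requested page does not exist
URLError: the server could not be reached at all, for example because the site is down or the domain is wrong

HTTPError is a subclass of URLError, so its except clause has to come first.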

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://gathub.com")
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    print(html.read())
'''
The server could not be found!
'''

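AttributeError: besides problems with the site, the code itself can fail. BeautifulSoup returns None for a DOM node that does not exist, and accessing anything on that None raises AttributeError.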
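
A small sketch of that failure mode, using an inline document instead of a live page (the nav tag and the helper link_text_or_none are made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>',
                     'html.parser')

# A tag that is not in the document comes back as None rather than raising
missing = soup.find('nav')


def link_text_or_none(soup):
    """Return the text of the first <a> inside <nav>, or None if either is absent."""
    try:
        # soup.find('nav') is None here, so .a raises AttributeError
        return soup.find('nav').a.get_text()
    except AttributeError:
        return None


result = link_text_or_none(soup)
print(missing, result)
```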

4. Putting it all together

A complete request:

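from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    """Return the first <h1> inside <body>, or None if the page cannot be fetched or parsed."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The server returned an error status, or could not be reached at all
        return None
    try:
        # "lxml" has to be installed separately; "html.parser" also works here
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError:
        # Raised when bsObj.body is None
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)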

II. HTML Parsing

Avoid writing code like this:

bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a')

Code that depends this heavily on how the DOM nodes are arranged breaks as soon as the site changes even slightly.

1. bs.find_all() (older alias: findAll()) and bs.find()

Query by node attributes

bs.find_all(tagName, tagAttributes) # get every tag of a given type that has the given attributes

'''
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
'''
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())
'''
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
...

'''

name.get_text() # strips the markup and returns only the tag's text content

Match several attribute values at once

allText = bs.find_all('span', {'class':{'green', 'red'}})

Find all headings

titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])
'''
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
'''

Query by text content

nameList = bs.find_all(string='the prince')  # string= is the current name of the older text= argument
print(len(nameList))

Lambda expressions

bs.find_all(lambda tag: len(tag.attrs) == 2)
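
The query styles above can be tried offline on a small inline document. This is a sketch only: the sample markup is made up in the spirit of the War and Peace page, and the id="anna" attribute exists purely so that the lambda filter has something to match.

```python
from bs4 import BeautifulSoup

sample = '''
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted.
<span class="green" id="anna">Anna Pavlovna</span></p>
'''
soup = BeautifulSoup(sample, 'html.parser')

# By tag name and attribute value
green = [span.get_text() for span in soup.find_all('span', {'class': 'green'})]

# Several attribute values at once (a list works the same way as a set here)
green_or_red = soup.find_all('span', {'class': ['green', 'red']})

# Several tag names at once
headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# By exact text content
exact_text = soup.find_all(string='the prince')

# By an arbitrary predicate: tags that carry exactly two attributes
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)

print(green, len(green_or_red), len(headings), len(exact_text), len(two_attrs))
```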

2. Navigating the DOM tree

Use a tag's parent, children, and siblings

children

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

sibling

next_siblings and previous_siblings return a generator
next_sibling and previous_sibling return a single node

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)

parent

# The outer parentheses let the chained call span several lines
(bs.find('img', {'src': '../img/gifts/img1.jpg'})
   .parent.previous_sibling.get_text())
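
The three navigation styles can be checked on a tiny stand-in for the gift table. The markup below is invented for illustration and is written without whitespace between tags, so that every sibling is a tag; on a real page, whitespace text nodes also show up among children and siblings, which is why the sketch filters on Tag.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified version of the giftList table
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th><th>Image</th></tr>'
        '<tr><td>Vegetable Basket</td><td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td></tr>'
        '<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td>'
        '<td><img src="../img/gifts/img2.jpg"></td></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# children: direct descendants only (here the three rows)
rows = [child for child in table.children if isinstance(child, Tag)]

# next_siblings: everything after the header row, i.e. the data rows
data_rows = [sib for sib in table.tr.next_siblings if isinstance(sib, Tag)]

# parent + previous_sibling: from the image to its cell, then to the price cell
img = soup.find('img', {'src': '../img/gifts/img1.jpg'})
price = img.parent.previous_sibling.get_text()

print(len(rows), len(data_rows), price)
```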

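Using regular expressions

Admittedly, this is a trivial regular expression.

import re

# A raw string keeps the backslashes intact for the regex engine
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images: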
    print(image['src'])
'''
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
'''
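
The pattern itself can be tested without any HTML. The sketch below applies it to a handful of made-up src values (only the img1 and img6 paths appear in the output above; the others are invented counterexamples) and uses search, since the pattern is unanchored:

```python
import re

pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Hypothetical src values: two that should match and three that should not
candidates = [
    '../img/gifts/img1.jpg',
    '../img/gifts/logo.jpg',
    '../img/gifts/img6.jpg',
    '../img/other/img2.jpg',
    '../img/gifts/img3.png',
]

matches = [src for src in candidates if pattern.search(src)]
print(matches)
```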

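III. Building a Real Web Crawler

Random walk

A random walk starts from an initial page, picks one of its links at random, follows it to the next page, and repeats.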
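
The idea can be sketched offline on an in-memory link graph before any network code is involved. Everything here is hypothetical: LINK_GRAPH stands in for the pages and their outgoing links, and max_steps is an added safety bound so the walk always terminates.

```python
import random

# Hypothetical site: each page maps to the pages it links to
LINK_GRAPH = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A', 'D'],
    'D': [],          # dead end: no outgoing links
}


def random_walk(graph, start, max_steps, rng):
    """Follow randomly chosen links from `start` until a dead end or max_steps."""
    path = [start]
    current = start
    for _ in range(max_steps):
        links = graph.get(current, [])
        if not links:
            break
        current = rng.choice(links)
        path.append(current)
    return path


path = random_walk(LINK_GRAPH, 'A', max_steps=10, rng=random.Random(0))
print(' -> '.join(path))
```

Passing a random.Random instance keeps the walk reproducible for a given seed without touching the global generator.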

import requests
from bs4 import BeautifulSoup
import datetime
import random
import re
from urllib.parse import unquote

random.seed()  # seed from system entropy; passing a datetime object raises TypeError on Python 3.11+

headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }

def getLinks(articleUrl):
    html = requests.get("http://wiki.hk.wjbk.site/baike-{}".format(articleUrl), headers=headers).text
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'content'}).find_all('a', href=re.compile('^(https:).*(baike-)((?!:).)*$'))

links = getLinks('卫生')

while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(unquote(newArticle))
    keywords = newArticle.split('/')[-1].split('-')[-1]
    links = getLinks(keywords)
'''
Output:
https://wiki.hk.wjbk.site/baike-卫生
https://wiki.hk.wjbk.site/baike-Anova
https://wiki.hk.wjbk.site/baike-趋势图
https://wiki.hk.wjbk.site/baike-统计学
https://wiki.hk.wjbk.site/baike-调和平均数
...
'''

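A few points in the code above deserve a note:

unquote

Decodes the percent-encoded Chinese characters in a URL. This is purely for human readers; without it you would see something like %E8%86%9C%E8

headers

Without the headers, the site redirects the crawler into an endless loop, presumably as an anti-scraping measure.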
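
The unquote step is easy to verify in isolation. A short sketch using only the standard library, with the first keyword from the crawl above as sample input (quote is the inverse function, shown here to produce the encoded form):

```python
from urllib.parse import quote, unquote

keyword = '卫生'

# Percent-encode the UTF-8 bytes, as they appear in an href
encoded = quote(keyword)

# Decode back to readable text
decoded = unquote(encoded)

full_url = 'https://wiki.hk.wjbk.site/baike-' + encoded
print(encoded)
print(unquote(full_url))
```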

IV. Data Storage with MySQL

Common SQL

Some frequently used SQL commands:

CREATE DATABASE scraping; # create a database

USE scraping;  # select the database

CREATE TABLE pages (
id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), 
content VARCHAR(10000),
created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
PRIMARY KEY(id));  # create a table

DESCRIBE pages;  # describe the table

INSERT INTO pages (title, content) VALUES ("Test page title",
"This is some test page content. It can be up to 10,000 characters long.");  # insert a row

SELECT * FROM pages WHERE id = 1;  # query

DELETE FROM pages WHERE id = 1;   # delete a row

DROP DATABASE scraping;  # drop the database

Running all of the statements above in one go gives:

mysql>  CREATE DATABASE scraping;USE scraping;CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,title VARCHAR(200), content VARCHAR(10000),created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));DESCRIBE pages;INSERT INTO pages (title, content) VALUES ("Test page title","This is some test page content. It can be up to 10,000 characters long.");SELECT * FROM pages WHERE id = 1;DELETE FROM pages WHERE id = 1;DROP DATABASE scraping;
Query OK, 1 row affected (0.01 sec)

Database changed
Query OK, 0 rows affected, 1 warning (0.02 sec)

+---------+----------------+------+-----+-------------------+-------------------+
| Field   | Type           | Null | Key | Default           | Extra             |
+---------+----------------+------+-----+-------------------+-------------------+
| id      | bigint(7)      | NO   | PRI | NULL              | auto_increment    |
| title   | varchar(200)   | YES  |     | NULL              |                   |
| content | varchar(10000) | YES  |     | NULL              |                   |
| created | timestamp      | YES  |     | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
+---------+----------------+------+-----+-------------------+-------------------+
4 rows in set (0.00 sec)

Query OK, 1 row affected (0.00 sec)

+----+-----------------+-------------------------------------------------------------------------+---------------------+
| id | title           | content                                                                 | created             |
+----+-----------------+-------------------------------------------------------------------------+---------------------+
|  1 | Test page title | This is some test page content. It can be up to 10,000 characters long. | 2020-01-01 16:43:23 |
+----+-----------------+-------------------------------------------------------------------------+---------------------+
1 row in set (0.00 sec)

Query OK, 1 row affected (0.01 sec)

Query OK, 1 row affected (0.01 sec)

mysql>

Working with the database through pymysql

from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re
import requests
from urllib.parse import unquote

headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }

conn = pymysql.connect(host='127.0.0.1',
                       user='root', 
                       passwd='password', 
                       db='mysql', 
                       charset='utf8')
cur = conn.cursor()
cur.execute('USE scraping')

random.seed()  # seed from system entropy; passing a datetime object raises TypeError on Python 3.11+

def store(title, content):
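    # Leave %s unquoted: pymysql escapes and quotes the values itself, and writing "%s"
    # would store literal quote characters around every value
    cur.execute('INSERT INTO pages (title, content) VALUES (%s, %s)', (title, content))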
    cur.connection.commit()

def getLinks(articleUrl):
    html = requests.get("http://wiki.hk.wjbk.site/baike-{}".format(articleUrl), headers=headers).text
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    p = bs.find('div', {'id':'mw-content-text'}).find('p')
    content = ''.join([text.strip() for text in p.find_all(string=True)
                       if text.parent.name not in ['script', 'style', 'div']])  # skip text inside script, style, and div nodes
    content = re.sub(r'\[[0-9]+\]', '', content)  # strip numeric citation markers such as [1]
    store(title, content)
    return bs.find('div', {'id':'content'}).find_all('a', href=re.compile('^(https:).*(baike-)((?!:).)*$'))

links = getLinks('柯南')
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(unquote(newArticle))
        keywords = newArticle.split('/')[-1].split('-')[-1]
        links = getLinks(keywords)
finally:
    cur.close()
    conn.close()
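
Parameter binding can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 as a stand-in for pymysql, so two things differ from the code above and are assumptions of this example: the placeholder is ? instead of %s, and the table definition is adapted to SQLite. The point it illustrates is the same: the placeholders stay unquoted and the driver handles quoting.

```python
import sqlite3

# In-memory database standing in for the MySQL "scraping" database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE pages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title VARCHAR(200),
    content VARCHAR(10000),
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')


def store(title, content):
    # Placeholders are left unquoted; the driver binds the values safely
    cur.execute('INSERT INTO pages (title, content) VALUES (?, ?)', (title, content))
    conn.commit()


# Quotes inside the values need no manual escaping
store('Test "page" title', "It's some test page content.")

cur.execute('SELECT title, content FROM pages WHERE id = 1')
row = cur.fetchone()
print(row)

cur.close()
conn.close()
```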

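Result (from a run in which the %s placeholders were still wrapped in quotes and text inside style nodes was not filtered out, which is why the values carry literal single quotes and some rows contain CSS fragments):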

mysql> delete from pages;
Query OK, 49 rows affected (0.01 sec)

mysql> select * from pages;
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id | title                 | content                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | created             |
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| 83 | '柯南'                | '柯南，为塞尔特语名字，可能是指：'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:06 |
| 84 | '名侦探柯南 (电视剧)' | ''                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:08 |
| 85 | '乱马?'               | '《乱马?》（日语：らんま1/2）是日本漫画家高桥留美子的恋爱喜剧漫画，及后来陆续改编成动画、电子游戏、电视剧等衍生作品。 从1987年36号到1996年12号在小学馆《周刊少年Sunday》连载，单行本全38册。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 2020-01-01 17:48:13 |
| 86 | '别让我成为泼妇'      | '《别让我成为泼妇》（日语：じゃじゃ马にさせないで）是日本女歌手西尾悦子（日语：西尾悦子）的歌曲。1989年4月25日由环球唱片（旧名Kitty Record）发行。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 2020-01-01 17:48:17 |
| 87 | '无线电'              | '无线电，又称无线电波、射频电波、电波，或射频，是指在自由空间（包括空气和真空）传播的电磁波，在电磁波谱上，其波长长于 红外线光（IR）。频率范围为300 GHz以下，其对应的波长范围为1毫米以上。就像其他电磁波一样，无线电波以光速前进。经由闪电或天文物体，可以产生自然的无线电 波。由人工产生的无线电波，被应用在无线通讯、广播、雷达、通讯卫星、导航系统、电脑网路等应用上。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | 2020-01-01 17:48:19 |
| 88 | '国际广播'            | '国际广播，又称对外广播，是指向非本国的听众进行的广播，有多种目的类型，例如新闻传播、文化交流、也有的是政治宣传。其以 电台广播为大宗，但也有电视广播。在大部分情况下，一般通过短波波段进行，对邻国有时也使用中波波段进行广播。同时还使用卫星和互联网进行广播。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 2020-01-01 17:48:22 |
| 89 | '缅甸广播电视台'      | ''                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:24 |
| 90 | '泰国公共电视台'      | ''                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 2020-01-01 17:48:27 |
| 91 | '达新·钦那瓦'        | '塔克辛·钦那瓦（泰语：?????? ???????，皇家转写：Thaksin Chinnawat泰语发音：[t?ák.sǐn t??īn.nā.wát]；1949年7月26日－），汉名丘达新，生于泰国北部清迈府，前泰国首相和泰爱泰党创立人，也为前泰国皇家警察中校和商人。塔克辛的妹妹英叻·钦那瓦亦为前总理，两人属于第四代泰国华 人，祖籍广东潮州府丰顺县，客家人后裔。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 2020-01-01 17:48:29 |
| 92 | '清迈府'              | '清迈府（泰语：????????????????，皇家转写：Changwat Chiang Mai，泰语发音：[t??ā?.wàt t???īa?.màj]；兰纳语：?????????，.mw-parser-output .IPA{font-family:"Charis SIL","Doulos SIL","Linux Libertine","Segoe UI","Lucida Sans Unicode","Code2000","Gentium","Gentium Alternative","TITUS Cyberbit Basic","Arial Unicode MS","IPAPANNEW","Chrysanthi Unicode","GentiumAlt","Bitstream Vera","Bitstream Cyberbit","Hiragino Kaku Gothic Pro","Lucida Grande",sans-serif;text-decoration:none!important}.mw-parser-output .IPA a:link,.mw-parser-output .IPA a:visited{text-decoration:none!important}t?ia?.màj），泰国北部边陲的一个府，北面与缅甸接壤。另外，清迈府与另外5个府为邻：清莱府（东北）、南邦府（东）、南奔府（东南）、达府（南）及夜丰颂府（西）。面积为20,107平方公里。府治为清迈市。'                                                                                                                                                | 2020-01-01 17:48:32 |
| 93 | '西班牙'              | '座标：40°27′49″N3°44′57″W? / ?40.46366700000001°N 3.74922°W? /40.46366700000001; -3.74922'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 2020-01-01 17:48:35 |
| 94 | '马德里'              | '马德里（西班牙语：Madrid；/m??dr?d/；西班牙语：.mw-parser-output .IPA{font-family:"Charis SIL","Doulos SIL","Linux Libertine","Segoe UI","Lucida Sans Unicode","Code2000","Gentium","Gentium Alternative","TITUS Cyberbit Basic","Arial Unicode MS","IPAPANNEW","Chrysanthi Unicode","GentiumAlt","Bitstream Vera","Bitstream Cyberbit","Hiragino Kaku Gothic Pro","Lucida Grande",sans-serif;text-decoration:none!important}.mw-parser-output .IPA a:link,.mw-parser-output .IPA a:visited{text-decoration:none!important}[ma???i?]）是西班牙首都及最大都市，也是马德里自治区首府，其位置处于西班牙国土中部，曼萨纳雷斯河贯穿市区。市内人口约340万，都会区人口则约627.1万（2010年），均占西班牙首位。其建城于9世纪，是在摩尔人边贸站「马格立特」旧址上发展起来的城市；1561年，西班牙国王腓力二世将首都从托莱多迁入于此[注 1]，由于其特殊的地位而得到迅速的发展，成为往后西班牙殖民帝国的运筹 中心，现今则与巴塞罗那并列为西班牙的两大对外文化窗口。' | 2020-01-01 17:48:39 |
| 95 | '卑尔根'              | '卑尔根（挪威语：Bergen聆听帮助·信息）是挪威第二大城市。根据政府的统计，直至2006年7月1日，卑尔根市区的人口有243,219人，如果连同郊区和周边区域的话，则有369,099人。整个城市共分为八个区域：Arna、Bergenhus、Fana、Fyllingsdalen、Laksev?g、Ytrebygda、?rstad和?sane。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 2020-01-01 17:48:44 |
| 96 | '德国'                | '–欧洲（绿色及深灰色）–欧盟（绿色）? —? [图例放大]'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 2020-01-01 17:48:48 |
| 97 | '国家和地区顶级域'    | '国家和地区顶级域名（Country code top-level domain，英语：ccTLD），简称国家顶级域，又译国码域名、顶级国码域名、国码顶 级网域名称，或顶级国码网域名称，是用两字母的国家或地区名缩写代称的顶级域，其域名的指定及分配，政治因素考量凌驾在技术和商业因素之上。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2020-01-01 17:48:49 |
| 98 | '.hu'                 | '.hu为匈牙利国家及地区顶级域（ccTLD）的域名。'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 2020-01-01 17:48:50 |
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
16 rows in set (0.00 sec)
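
Row 94 above still contains the marker [注 1], which the numeric-only pattern \[[0-9]+\] does not catch. One possible extension is sketched below; the pattern and the helper strip_markers are assumptions for illustration, and the sample sentence is made up.

```python
import re

# Matches [1], [12], and note markers such as [注 1]; other bracketed text is left alone
MARKER = re.compile(r'\[(?:注\s*)?[0-9]+\]')


def strip_markers(text):
    """Remove citation and note markers from extracted article text."""
    return MARKER.sub('', text)


sample = 'capital[1] of Spain[注 1], population 3.4 million[12] [图例放大]'
cleaned = strip_markers(sample)
print(cleaned)
```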
