Web Scraping with Python: Reading Notes (1)

Introduction to Web Crawlers

Checking robots.txt

A well-behaved crawler checks a site's robots.txt to decide which parts of the site it is allowed to crawl.

Estimating the size of a website

You can estimate a site's size with Baidu's or Google's site: search operator; for example, searching site:example.python-scraping.com shows roughly how many pages the engine has indexed.

Identifying the technology a website uses

The detectem module requires Docker and Python 3.5 or later. Run the following commands (on Windows, it is easiest to install the Windows build of Docker directly):

docker pull scrapinghub/splash
pip install detectem

I do not recommend using this module to work out a site's technology stack unless you really cannot get anywhere with Chrome's F12 developer tools.

Finding out who owns a website

Install the module:

pip install python-whois

Then run the following code:

#!/usr/bin/env python
# encoding: utf-8

import whois

print(whois.whois('www.baidu.com'))

Writing the first crawler

Setting a user agent and retrying downloads

When the crawler hits a 5xx error it should retry the download; other errors are not worth retrying. The code below retries recursively, passing along the number of remaining attempts so the recursion has a termination condition.

#!/usr/bin/env python
# encoding: utf-8

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html


if __name__ == "__main__":
    download("http://httpstat.us/500")  # returns a 500, retries twice, then gives up and returns None

Sitemap crawler

This one simply downloads the sitemap and picks out its contents with re. The nice touch is that download() reads the charset declared in the response headers and uses it to decode the body.

#!/usr/bin/env python
# encoding: utf-8

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
import re

def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()  # the nice touch: read the charset declared in the response headers
        if not cs:
            cs = charset
        html = resp.read().decode(cs)  # decode the body with the charset the site reports
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    if sitemap is not None:
        links = re.findall('<loc>(.*?)</loc>', sitemap)
        # download each link
        for link in links:
            html = download(link)
            # scrape html here

if __name__ == "__main__":
    crawl_sitemap("http://example.python-scraping.com/sitemap.xml")
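
To see exactly what that regex extracts, here is a tiny self-contained check; the XML fragment below is made up to mimic a sitemap's structure, and the URL paths are only illustrative:

import re

# a made-up sitemap fragment, just to exercise the <loc> regex used above
sitemap = """<urlset>
  <url><loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc></url>
  <url><loc>http://example.python-scraping.com/places/default/view/Albania-3</loc></url>
</urlset>"""

print(re.findall('<loc>(.*?)</loc>', sitemap))  # prints the two URLs found between the <loc> tags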

Link crawler

To behave more like an ordinary user, the crawler should follow links to the content we care about. Following every link would download a lot of pages we do not need, so we filter candidate links with a regular expression; and because pages link back to each other, we also need to skip URLs we have already seen.

import re
from urllib.parse import urljoin

# download() is the function defined in the previous section
def link_crawler(start_url, link_regex):
    " Crawl from the given start URL following links matched by link_regex "
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            # if re.match(link_regex, link):
            if re.search(link_regex, link):
                abs_link = urljoin(start_url, link)
                print(abs_link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

def get_links(html):
    " Return a list of links from html "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
    
if __name__ == "__main__":
    link_crawler("http://example.python-scraping.com", '/(index|view)/')

Running it prints the matched links and the download progress.

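The commented-out re.match line in link_crawler is worth a note: re.match only matches at the beginning of the string, so against the pattern '/(index|view)/' it would never match a relative link such as '/places/default/view/Australia-14' (an illustrative path in the style the example site uses), while re.search scans the whole string:

import re

link = '/places/default/view/Australia-14'  # illustrative link; real paths may differ slightly
print(re.match('/(index|view)/', link))     # None: match() anchors at the start of the string
print(re.search('/(index|view)/', link))    # a match object: search() scans the whole string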

Advanced features

Parsing robots.txt

Parsing this file lets the crawler avoid downloading URLs it is forbidden to fetch; the robotparser module that ships with urllib makes this straightforward.

from urllib import robotparser
def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

if __name__ == "__main__":
    rp = get_robots_parser("http://example.python-scraping.com/robots.txt")
    user_agent = 'BadCrawler'
    print(rp.can_fetch(user_agent, "http://example.python-scraping.com")) # False
    user_agent = 'GoodCrawler'
    print(rp.can_fetch(user_agent, "http://example.python-scraping.com")) # True
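
Why BadCrawler is rejected while GoodCrawler is allowed depends entirely on the site's robots.txt. The sketch below parses an in-memory robots.txt of roughly the shape that would produce those results (the content here is a guess, not fetched from the live site):

from urllib import robotparser

# guessed robots.txt content: BadCrawler is disallowed everywhere, everyone else is allowed
robots_txt = """User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch('BadCrawler', 'http://example.python-scraping.com'))   # False
print(rp.can_fetch('GoodCrawler', 'http://example.python-scraping.com'))  # True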

Proxy support

With the requests library you only need to pass a proxies argument (a plain dict) to the request call; note that the final crawler below sticks with urllib and uses ProxyHandler instead.

requests.get(url, headers=headers, proxies=proxies)
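
A minimal sketch of what that call looks like; the proxy addresses here are placeholders, not working servers:

import requests

# placeholder proxy addresses; substitute a proxy you are actually allowed to use
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
headers = {'User-Agent': 'wswp'}

resp = requests.get('http://example.python-scraping.com', headers=headers, proxies=proxies)
print(resp.status_code)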

Throttling downloads

The Throttle class keeps a dict of when each domain was last accessed; if less than delay seconds have passed since the previous request to that domain, it sleeps for the remaining time.

from urllib.parse import urlparse
import time


class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
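
A quick sketch of how the class is used together with download() from earlier (the URLs are illustrative):

throttle = Throttle(delay=3)
for url in ['http://example.python-scraping.com/places/default/index/1',
            'http://example.python-scraping.com/places/default/index/2']:
    throttle.wait(url)   # sleeps if this domain was hit less than 3 seconds ago
    html = download(url)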

Avoiding crawler traps

A crawler trap is a site that can generate links endlessly, for example pagination that keeps serving empty search-result pages, so the crawler never runs out of new URLs to request.

A simple way to avoid falling into a trap is to record how many links were followed to reach the current page, i.e. its depth. Once the maximum depth is reached, the links found on that page are no longer added to the queue:

depth = seen.get(url, 0)
if depth == max_depth:
    print('Skipping %s due to depth' % url)
    continue

The final link crawler, combining all of the features above:

#!/usr/bin/env python
# encoding: utf-8

# final version
from urllib.parse import urlparse
import time


class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()


import re
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, preserving the original settings
                return download(url, num_retries - 1, user_agent, charset, proxy)
    return html


def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    " Return a list of links (using simple regex matching) from the html content "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex. In the current
        implementation, we do not actually scrape any information.

        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before (mapping URL -> depth)
    seen = {}
    if not robots_url:
        # resolve robots.txt against the site root, even if start_url includes a path
        robots_url = urljoin(start_url, '/robots.txt')
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                # if re.match(link_regex, link):
                if re.search(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    print(abs_link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)

if __name__ == "__main__":
    link_regex = '/(index|view)/'
    link_crawler('http://example.python-scraping.com/index', link_regex, max_depth=1)

Running it prints each crawled link much as before, now throttled and restricted by robots.txt and the depth limit.


Adding a scrape callback to the link crawler

To reuse the code above on other websites, we add a callback that performs the actual scraping; then only that function needs to change for each new site. The code below is a trimmed-down version of link_crawler that keeps just the parts needed here.

#!/usr/bin/env python
# encoding: utf-8

import re
import urllib.request
from lxml.html import fromstring
from urllib.error import URLError, HTTPError, ContentTooShortError
from urllib.parse import urljoin

def scrape_callback(url, html):
    """ Scrape each row from the country or district data using XPath and lxml """
    fields = ('flag_img', 'area', 'population', 'iso', 'country_or_district', 'capital',
              'continent', 'tld', 'currency_code', 'currency_name',
              'phone', 'postal_code_format', 'postal_code_regex',
              'languages', 'neighbours')
    if re.search('/view/', url):
        tree = fromstring(html)
        try:
            all_rows = [
                tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)[0].text_content()
                for field in fields]
            print(url, all_rows)
        except Exception as ee:
            print("ee:", ee)


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()  # read the charset declared in the response headers
        if not cs:
            cs = charset
        html = resp.read().decode(cs)  # decode the body with the charset the site reports
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

def get_links(html):
    " Return a list of links from html "
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 max_depth=4, scrape_callback=None):
    crawl_queue = [start_url]
    seen = {}
    data = []
    while crawl_queue:
        url = crawl_queue.pop()
        # (robots.txt checking and throttling were dropped in this trimmed-down version)
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        html = download(url, user_agent=user_agent)
        if not html:
            continue
        if scrape_callback:
            data.extend(scrape_callback(url, html) or [])
        for link in get_links(html):
            if re.search(link_regex, link):
                abs_link = urljoin(start_url, link)
                print(abs_link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)

if __name__ == "__main__":
    link_regex = '/(index|view)/'
    link_crawler('http://example.python-scraping.com', link_regex, scrape_callback=scrape_callback)

For each page matched by /view/, the callback prints the URL together with the extracted row of country or district data.

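To illustrate the point about swapping callbacks, here is a hypothetical variant that appends each scraped row to a CSV file instead of printing it. The XPath pattern is copied from scrape_callback above; the filename and the reduced field list are made up for the example:

import csv
import re
from lxml.html import fromstring

def csv_callback(url, html):
    " Hypothetical callback: write each country or district row to a CSV file "
    fields = ('area', 'population', 'iso', 'country_or_district', 'capital')  # reduced field list
    if re.search('/view/', url):
        tree = fromstring(html)
        row = [tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)[0].text_content()
               for field in fields]
        with open('rows.csv', 'a', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow(row)

# plug it in exactly like the printing version:
# link_crawler('http://example.python-scraping.com', '/(index|view)/', scrape_callback=csv_callback)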

References

Web Scraping with Python, 2nd Edition
Python crawler notes: re.match vs re.search vs re.findall
