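Learning Web Crawling: An Analysis of the webmagic Source Code

Abstract

I recently came across a very approachable Java crawler framework that seemed well suited to learning both Java and web crawling. After searching online, reading its official manual, and working through its source code, I picked up a lot of useful knowledge: the basic philosophy of Java development, object-oriented design, design patterns, and, most importantly, the techniques of crawler development. This post is the first in a series on the webmagic source code. It covers the overall structure of the framework and its key core code.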

0x00: What Is a Crawler 🕷

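Simply put, a crawler is a program that automatically fetches specific URLs from the network and extracts the content we care about from the response (usually HTML). A crawler therefore has roughly two steps: 1. fetch (download) the URL; 2. parse the downloaded content. This is a drastic simplification, but it is enough to explain how a crawler works.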
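The two steps can be sketched in a few lines of Java. This is only an illustration: to stay runnable without network access, the "download" step reads from a hypothetical in-memory map (FAKE_WEB), and the parsing uses simple regular expressions rather than a real HTML parser. All names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepCrawlSketch {
    // Stand-in for the network: a hypothetical in-memory "web" mapping URL -> HTML.
    static final Map<String, String> FAKE_WEB = Map.of(
        "http://example.com/",
        "<html><title>Home</title><a href=\"http://example.com/a\">A</a></html>",
        "http://example.com/a",
        "<html><title>Page A</title></html>");

    // Step 1: "download" the content behind a URL (empty string if unknown).
    static String download(String url) {
        return FAKE_WEB.getOrDefault(url, "");
    }

    // Step 2a: extract the content we care about, here the page title.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Step 2b: collect the links found on the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(.*?)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = download("http://example.com/");
        System.out.println(extractTitle(html)); // prints Home
        System.out.println(extractLinks(html)); // prints [http://example.com/a]
    }
}
```

A real crawler would replace download with an HTTP client and the regular expressions with a proper selector library, which is exactly the work webmagic takes over.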

0x01: An Overview of webmagic 🕷

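Here is the official description of the webmagic framework: WebMagic is structured around four components, Downloader, PageProcessor, Scheduler, and Pipeline, which Spider ties together. These four components correspond to the download, processing, management, and persistence stages of the crawler life cycle. WebMagic's design is modeled on Scrapy, but the implementation is more Java-like.

Spider organizes these components so that they can interact and run as a pipeline. You can think of Spider as a large container; it is also the core of WebMagic's logic.

The overall architecture of WebMagic is shown below:

From the diagram we can see roughly the following: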

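Downloader is responsible for downloading. Its input is a Request and its output is a Page. Because it fetches content from the internet, it is I/O-bound. It is in effect both a consumer and a producer: it consumes Requests and produces Pages.
PageProcessor is responsible for processing pages: typically parsing the page, extracting the content of interest, and discovering new URLs. Its input is a Page and its output is Requests. Because it analyzes page text, which involves fairly complex operations, it is CPU-bound. It is also both a consumer and a producer, the mirror image of Downloader: it consumes Pages and produces Requests. Note that this is the one class a crawler developer must implement, and it is the required argument when creating a crawler.
Scheduler coordinates PageProcessor and Downloader. It manages both by managing Requests, much like a blocking queue synchronizing producers and consumers. This involves a good deal of multithreading, which later posts will cover in detail.
Pipeline is the abstraction for persistence. Each persistence mechanism corresponds to one Pipeline class, for example writing results to a file, saving them to a database, or printing them to the console.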
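To make the division of labor concrete, here is a toy, single-threaded model of the four roles and the loop that wires them together. The interfaces below are hypothetical simplifications written for this post, not WebMagic's real API (for one thing, they pass plain URL strings instead of Request objects), and the "web" is an in-memory map of page to outgoing links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class FourComponentSketch {
    // Hypothetical toy types; these are NOT WebMagic's real classes.
    static final class Page {
        final String url;
        final String body;
        Page(String url, String body) { this.url = url; this.body = body; }
    }
    interface Downloader { Page download(String url); }           // consumes a request, produces a Page
    interface PageProcessor { List<String> process(Page page); }  // consumes a Page, produces new URLs
    interface Pipeline { void persist(Page page); }               // stores results

    // Scheduler role: a FIFO queue plus a seen-set so each URL is crawled once.
    static final class Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    // Spider role: wire the components into one loop and return what was persisted.
    static List<String> crawl(String seed, Map<String, List<String>> fakeWeb) {
        List<String> stored = new ArrayList<>();
        Downloader downloader = u -> new Page(u, "body of " + u);
        PageProcessor processor = p -> fakeWeb.getOrDefault(p.url, List.of());
        Pipeline pipeline = p -> stored.add(p.url);
        Scheduler scheduler = new Scheduler();

        scheduler.push(seed);
        String next;
        while ((next = scheduler.poll()) != null) {
            Page page = downloader.download(next);             // download
            processor.process(page).forEach(scheduler::push);  // process and discover URLs
            pipeline.persist(page);                            // persist
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fakeWeb = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"));
        System.out.println(crawl("a", fakeWeb)); // prints [a, b, c]
    }
}
```

The real Spider runs the body of this loop on worker threads, which is why the Scheduler has to behave like a synchronization point rather than a plain queue.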

The following code gives a feel for how Spider is used:

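import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;

public class ZhihuPageProcessor implements PageProcessor {
    // Site-level settings: retry a failed download 3 times, sleep 1000 ms between requests
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Discover new URLs: queue every link that looks like an answer page
        page.addTargetRequests(page.getHtml().links().
               regex("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*").all());
        // Extract the fields of interest
        page.putField("title", page.getHtml().
               xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("question", page.getHtml().
               xpath("//div[@class='QuestionRichText']//tidyText()").toString());
        page.putField("answer", page.getHtml().
               xpath("//div[@class='QuestionAnswercontent']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // No title found: skip this page so it is not passed to the pipelines
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new ZhihuPageProcessor())
                // Start crawling from this seed URL
                .addUrl("https://zhihu.com/question/")
                // Set the Scheduler: use Redis to manage the URL queue
                .setScheduler(new RedisScheduler("localhost"))
                // Set the Pipeline: save results to files as JSON
                .addPipeline(new JsonFilePipeline("D:\\data\\webmagic"))
                // Run with 5 threads
                .thread(5)
                // Start the crawler
                .run();
    }
}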

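As you can see, to use Spider we only need to implement our own PageProcessor and then create the crawler through Spider's static create method, whose argument is the required PageProcessor. The configuration calls that follow can be seen as an application of the Builder pattern: each returns the Spider itself, which makes method chaining possible.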
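The chaining style is easy to reproduce. The sketch below is a hypothetical miniature (FluentSpiderSketch is not a WebMagic class): a static factory takes the one required argument, and every optional setter returns this so that calls can be strung together.

```java
import java.util.ArrayList;
import java.util.List;

class FluentSpiderSketch {
    private final String processorName;                        // required, fixed at creation
    private final List<String> startUrls = new ArrayList<>();  // optional
    private int threadNum = 1;                                  // optional, with a default

    private FluentSpiderSketch(String processorName) {
        this.processorName = processorName;
    }

    // Static factory: the only way to obtain an instance, so the required argument cannot be forgotten.
    static FluentSpiderSketch create(String processorName) {
        return new FluentSpiderSketch(processorName);
    }

    // Each setter returns this, which is what enables chaining.
    FluentSpiderSketch addUrl(String url) {
        startUrls.add(url);
        return this;
    }

    FluentSpiderSketch thread(int threadNum) {
        if (threadNum <= 0) {
            throw new IllegalArgumentException("threadNum should be more than one!");
        }
        this.threadNum = threadNum;
        return this;
    }

    String describe() {
        return processorName + " urls=" + startUrls + " threads=" + threadNum;
    }

    public static void main(String[] args) {
        System.out.println(
            FluentSpiderSketch.create("demo").addUrl("http://example.com/").thread(5).describe());
        // prints demo urls=[http://example.com/] threads=5
    }
}
```

Strictly speaking this is a fluent interface on the object itself rather than a separate builder class, but the benefit for the caller is the same: required arguments go to the factory, and everything else is optional and readable.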

0x02: Core Fields of the Spider Class

Let us first look at Spider's core fields:

public class Spider implements Runnable,Task {
	protected Downloader downloader;
    protected List<Pipeline> pipelines = new ArrayList<Pipeline>();
    protected PageProcessor pageProcessor;
    protected List<Request> startRequests;
    protected Site site;
    protected String uuid;
    protected Scheduler scheduler = new QueueScheduler();
    protected Logger logger = LoggerFactory.getLogger(getClass());
    protected CountableThreadPool threadPool;
    protected ExecutorService executorService;
    protected int threadNum = 1;
    protected AtomicInteger stat = new AtomicInteger(STAT_INIT);
    protected boolean exitWhenComplete = true;
    protected final static int STAT_INIT = 0;
    protected final static int STAT_RUNNING = 1;
    protected final static int STAT_STOPPED = 2;
    protected boolean spawnUrl = true;
    protected boolean destroyWhenExit = true;
    private ReentrantLock newUrlLock = new ReentrantLock();
    private Condition newUrlCondition = newUrlLock.newCondition();
    private List<SpiderListener> spiderListeners;
    private final AtomicLong pageCount = new AtomicLong(0);
    private Date startTime;
    private int emptySleepTime = 30000;
}

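All of the core components mentioned above (Downloader, Pipeline, PageProcessor, Scheduler) appear here. Since section 0x01 already covered them, I will not repeat that. The other important fields are explained below.

startRequests: List<Request>. As the name suggests, these are the entry points of the crawl.
threadPool: CountableThreadPool. A thread pool that manages the worker threads. As run() in section 0x03 shows, each worker executes processRequest for one Request; reusing threads avoids the overhead of creating a new thread for every request.
newUrlLock: ReentrantLock. Together with newUrlCondition, it lets the main loop wait when the Scheduler has no Request and be woken up once a worker finishes, which helps the Scheduler keep Downloader and PageProcessor in step.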
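The idea of a "countable" thread pool can be sketched as follows. This is not CountableThreadPool's actual implementation, which this post does not show. It is a minimal stand-in built on a fixed pool, a Semaphore, and an AtomicInteger, assuming the pool's job is to (a) reuse a bounded number of threads and (b) report how many tasks are currently alive. Apart from getThreadAlive, which appears in run(), all names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CountingPoolSketch {
    private final ExecutorService executor;
    private final Semaphore permits;                           // one permit per worker slot
    private final AtomicInteger alive = new AtomicInteger();   // tasks submitted but not yet finished

    CountingPoolSketch(int threadNum) {
        this.executor = Executors.newFixedThreadPool(threadNum);
        this.permits = new Semaphore(threadNum);
    }

    int getThreadAlive() {
        return alive.get();
    }

    // Blocks the caller while every slot is busy, then hands the task to a reused thread.
    void execute(Runnable task) {
        permits.acquireUninterruptibly();
        alive.incrementAndGet();
        executor.execute(() -> {
            try {
                task.run();
            } finally {
                alive.decrementAndGet();
                permits.release();
            }
        });
    }

    void shutdownAndWait() {
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs the given number of trivial tasks and returns how many completed.
    static int runDemo(int tasks, int threads) {
        CountingPoolSketch pool = new CountingPoolSketch(threads);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdownAndWait();
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(20, 3)); // prints 20
    }
}
```

The alive counter is what makes the exit check in run() possible: an empty queue only means the crawl is over if no worker is still running and able to add URLs.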

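A few more core classes in Spider's design deserve mention; they are helper classes that carry data through the pipeline.

Request wraps a URL: one URL corresponds to one Request object. It is also the medium through which PageProcessor, via Scheduler, drives Downloader.
Page wraps the result of a download by Downloader. It is also the main object that PageProcessor works on.

The remaining fields are explained later as we analyze the code.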

0x03: Spider's Core Code 🕷

public void run() {
    checkRunningStat();
    initComponent();
    logger.info("Spider {} started!",getUUID());
    while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
        final Request request = scheduler.poll(this);
        if (request == null) {
            if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                break;
            }
            // wait until new url added
            waitNewUrl();
        } else {
            threadPool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        processRequest(request);
                        onSuccess(request);
                    } catch (Exception e) {
                        onError(request);
                        logger.error("process request " + request + " error", e);
                    } finally {
                        pageCount.incrementAndGet();
                        signalNewUrl();
                    }
                }
            });
        }
    }
    stat.set(STAT_STOPPED);
    // release some resources
    if (destroyWhenExit) {
        close();
    }
    logger.info("Spider {} closed! {} pages downloaded.", getUUID(), pageCount.get());
}

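Let us walk through the code step by step. First, checkRunningStat() checks whether the crawler is already running, and throws an exception if it is. Once the check passes, the components are initialized by initComponent(). Here is what initComponent() does: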

protected void initComponent() {
    if (downloader == null) {
        this.downloader = new HttpClientDownloader();
    }
    if (pipelines.isEmpty()) {
        pipelines.add(new ConsolePipeline());
    }
    downloader.setThread(threadNum);
    if (threadPool == null || threadPool.isShutdown()) {
        if (executorService != null && !executorService.isShutdown()) {
            threadPool = new CountableThreadPool(threadNum, executorService);
        } else {
            threadPool = new CountableThreadPool(threadNum);
        }
    }
    if (startRequests != null) {
        for (Request request : startRequests) {
            addRequest(request);
        }
        startRequests.clear();
    }
    startTime = new Date();
}

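In effect this applies default configuration: wherever the user has not supplied a custom component or specified a concrete one, the framework fills in a default. This covers Downloader, Pipeline, and the thread pool, as well as pushing the startRequests into the Scheduler.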
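The body of checkRunningStat(), the first call in run(), is not shown in this post. Given that stat is an AtomicInteger, one plausible way to implement "fail if already running, otherwise mark as running" is a compare-and-set loop. The sketch below is an assumption along those lines, not a copy of the framework's code; only the field name stat and the STAT_* constants come from the listing in section 0x02.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RunStateSketch {
    static final int STAT_INIT = 0;
    static final int STAT_RUNNING = 1;
    static final int STAT_STOPPED = 2;

    private final AtomicInteger stat = new AtomicInteger(STAT_INIT);

    // Atomically switch to RUNNING; refuse if another thread already did.
    void checkRunningStat() {
        while (true) {
            int now = stat.get();
            if (now == STAT_RUNNING) {
                throw new IllegalStateException("Spider is already running!");
            }
            // compareAndSet fails only if the state changed in between; then re-read and retry.
            if (stat.compareAndSet(now, STAT_RUNNING)) {
                return;
            }
        }
    }

    void stop() {
        stat.set(STAT_STOPPED);
    }

    int state() {
        return stat.get();
    }

    public static void main(String[] args) {
        RunStateSketch s = new RunStateSketch();
        s.checkRunningStat();            // INIT -> RUNNING
        try {
            s.checkRunningStat();        // second start is rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints Spider is already running!
        }
        s.stop();
        s.checkRunningStat();            // STOPPED -> RUNNING is allowed again
        System.out.println(s.state());   // prints 1
    }
}
```

A plain int with an if-check would leave a window in which two threads both see "not running" and both start; the compare-and-set closes that window without a lock.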

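Continuing with Spider's core code: once the components are initialized, the crawler begins fetching, parsing, and persisting pages. Let us see how this is implemented.

First, the loop checks the current running state and then polls a Request from the scheduler.
If the poll returns null, the queue is empty and some synchronization is needed. The loop checks whether any threads in the pool are still alive. If none are and exit-on-completion is set, the loop exits and the crawl ends. Otherwise, workers may still be processing pages and discovering new URLs, so the loop waits by calling waitNewUrl().
If the poll returns a Request, the page goes through download, extraction, and persistence in processRequest(request). Since this function is fairly involved, let us look at its source first: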

private void processRequest(Request request) {
    // Perform the actual download
    Page page = downloader.download(request, this);
    // If the download succeeded, go on to analyze the page
    if (page.isDownloadSuccess()){
        // The analysis itself is wrapped in this method
        onDownloadSuccess(request, page);
    } else {
        onDownloaderFail(request);
    }
}

Extraction and persistence can only run after a successful download, and they are wrapped in turn in onDownloadSuccess(). Its source is as follows:

private void onDownloadSuccess(Request request, Page page) {
    if (site.getAcceptStatCode().contains(page.getStatusCode())){
        // The actual extraction work
        pageProcessor.process(page);
        // New URL discovery: add the newly found URLs to the scheduler
        extractAndAddRequests(page, spawnUrl);
        if (!page.getResultItems().isSkip()) {
            // Persist the results
            for (Pipeline pipeline : pipelines) {
                pipeline.process(page.getResultItems(), this);
            }
        }
    } else {
        logger.info("page status code error, page {} , code: {}", 
                    request.getUrl(), page.getStatusCode());
    }
    sleep(site.getSleepTime());
    return;
}

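Finally, once a Request has been consumed and any new URLs have been discovered, the worker signals (signalNewUrl() in the finally block of run()) to tell the waiting main loop that new URLs may be available, so a new round of downloading, extraction, and persistence can begin.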
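The bodies of waitNewUrl() and signalNewUrl() are not shown in this post. The following sketch shows how such a pair could be built on the newUrlLock and newUrlCondition fields from section 0x02. It is an assumption, not the framework's code: the timeout parameter stands in for the role emptySleepTime appears to play, and demo() exists only to exercise the sketch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class NewUrlSignalSketch {
    private final ReentrantLock newUrlLock = new ReentrantLock();
    private final Condition newUrlCondition = newUrlLock.newCondition();

    // Called by the dispatching loop when the queue is empty.
    // Returns true if woken by a signal, false if the timeout elapsed or the thread was interrupted.
    boolean waitNewUrl(long timeoutMillis) {
        newUrlLock.lock();
        try {
            // await releases the lock while waiting and reacquires it before returning
            return newUrlCondition.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            newUrlLock.unlock();
        }
    }

    // Called by a worker when it has finished a request.
    void signalNewUrl() {
        newUrlLock.lock();
        try {
            newUrlCondition.signalAll();
        } finally {
            newUrlLock.unlock();
        }
    }

    // Demonstrates one wake-up. The lock is taken before the worker starts so the
    // worker cannot signal until this thread is actually waiting.
    static boolean demo() {
        NewUrlSignalSketch s = new NewUrlSignalSketch();
        s.newUrlLock.lock();
        try {
            new Thread(s::signalNewUrl).start();
            return s.waitNewUrl(5000);
        } finally {
            s.newUrlLock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that a signal sent while nobody is waiting is simply lost. In a design like this the timeout matters: it bounds how long the loop can sleep after a missed signal before it polls the scheduler again.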
