1. Preface

NetDiscovery is a general-purpose crawler framework that I developed on top of frameworks such as Vert.x and RxJava 2. It offers a rich set of features.

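2. Using multithreading

Although NetDiscovery relies on RxJava 2 for thread switching, it still uses multithreading directly in many places. This article lists some multithreading scenarios that are common in crawler frameworks.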

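2.1 Pausing and resuming the crawler

Pausing and resuming are among the most common crawler operations. NetDiscovery implements them with the CountDownLatch class.

CountDownLatch is a synchronization utility that lets one or more threads wait until operations running in other threads have completed.

The pause method creates a new CountDownLatch, pauseCountDown, with its count set to 1.

The resume method calls countDown() on pauseCountDown, which brings the count to exactly zero.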

    /**
     * Pauses the crawler. Requests that are already being fetched run to completion; later requests are fetched only after resume() is called.
     */
    public void pause() {
        this.pauseCountDown = new CountDownLatch(1);
        this.pause = true;
        stat.compareAndSet(SPIDER_STATUS_RUNNING, SPIDER_STATUS_PAUSE);
    }

    /**
     * Resumes the crawler.
     */
    public void resume() {

        if (stat.get() == SPIDER_STATUS_PAUSE
                && this.pauseCountDown!=null) {

            this.pauseCountDown.countDown();
            this.pause = false;
            stat.compareAndSet(SPIDER_STATUS_PAUSE, SPIDER_STATUS_RUNNING);
        }
    }

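Before taking a Request from the message queue, the crawler checks whether it should pause. If so, it calls await() on pauseCountDown. await() blocks the thread, which is what pauses the crawler, until the CountDownLatch count reaches 0; at that point the crawler resumes running.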

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // pause fetching
            if (pause && pauseCountDown!=null) {
                try {
                    this.pauseCountDown.await();
                } catch (InterruptedException e) {
                    log.error("can't pause : ", e);
                }

                initialDelay();
            }
            // take a request from the message queue
            final Request request = queue.poll(name);
            ......
        }
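
The pause and resume mechanism can be tried in isolation. The following is a minimal sketch under stated assumptions, not NetDiscovery's code: the class PauseResumeSketch, its fields, and the demo() helper are hypothetical names introduced for illustration. A worker thread checks a latch before each unit of work, so pause() stops it at the next iteration and resume() releases it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it is not part of NetDiscovery.
class PauseResumeSketch {

    private volatile CountDownLatch pauseLatch; // null while the worker may run
    private volatile boolean running = true;
    final AtomicInteger processed = new AtomicInteger();

    void pause() {
        pauseLatch = new CountDownLatch(1);
    }

    void resume() {
        CountDownLatch latch = pauseLatch;
        if (latch != null) {
            pauseLatch = null;
            latch.countDown(); // the count reaches zero and releases await()
        }
    }

    void stop() {
        running = false;
        resume(); // make sure a paused worker can observe the stop flag
    }

    // Worker loop: blocks on the latch, if any, before each unit of work.
    void runLoop() {
        try {
            while (running) {
                CountDownLatch latch = pauseLatch;
                if (latch != null) {
                    latch.await();
                }
                processed.incrementAndGet(); // stands in for handling one request
                Thread.sleep(5);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
    }

    // Returns {count once paused, count a little later, count after resume}.
    static int[] demo() {
        PauseResumeSketch sketch = new PauseResumeSketch();
        Thread worker = new Thread(sketch::runLoop);
        worker.start();
        try {
            Thread.sleep(100);
            sketch.pause();
            Thread.sleep(200); // let the in-flight iteration finish
            int whenPaused = sketch.processed.get();
            Thread.sleep(200);
            int stillPaused = sketch.processed.get();
            sketch.resume();
            Thread.sleep(200);
            int afterResume = sketch.processed.get();
            sketch.stop();
            worker.join(2000);
            return new int[] {whenPaused, stillPaused, afterResume};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            sketch.stop();
            return new int[0];
        }
    }
}
```

In demo(), the first two counts should match because the worker is blocked in await(), and the third should be larger because resume() released it.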

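2.2 Controlling crawl speed along multiple dimensions

The figure below shows the workflow of a single crawler.

A crawler that fetches too fast is very likely to be detected by the target system. NetDiscovery can rate-limit its requests as a basic countermeasure against anti-crawler protection.

Internally, NetDiscovery supports rate limiting along several dimensions. These dimensions roughly correspond to the stages of a single crawler's workflow.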

2.2.1 Request

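First, the Request object that wraps a crawler request supports a pause. After a Request is taken from the message queue, the crawler checks whether that Request needs to sleep.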

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // pause fetching
            ......

            // take a request from the message queue
            final Request request = queue.poll(name);

            if (request == null) {

                waitNewRequest();
            } else {

                if (request.getSleepTime() > 0) {

                    try {
                        Thread.sleep(request.getSleepTime());
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                ......
            }
        }

2.2.2 Download

When the crawler downloads, the downloader creates an RxJava Maybe object. Download rate limiting is implemented with RxJava's compose operator and a Transformer.

The code below shows DownloaderDelayTransformer:

import cn.netdiscovery.core.domain.Request;
import io.reactivex.Maybe;
import io.reactivex.MaybeSource;
import io.reactivex.MaybeTransformer;

import java.util.concurrent.TimeUnit;

/**
 * Created by tony on 2019-04-26.
 */
public class DownloaderDelayTransformer implements MaybeTransformer {

    private Request request;

    public DownloaderDelayTransformer(Request request) {
        this.request = request;
    }

    @Override
    public MaybeSource apply(Maybe upstream) {

        return request.getDownloadDelay() > 0 ? upstream.delay(request.getDownloadDelay(), TimeUnit.MILLISECONDS) : upstream;
    }
}

A downloader only needs compose and DownloaderDelayTransformer to rate-limit its downloads.

Take UrlConnectionDownloader as an example:

        Maybe.create(new MaybeOnSubscribe<InputStream>() {

                @Override
                public void subscribe(MaybeEmitter<InputStream> emitter) throws Exception {

                    emitter.onSuccess(httpUrlConnection.getInputStream());
                }
            })
             .compose(new DownloaderDelayTransformer(request))
             .map(new Function<InputStream, Response>() {

                @Override
                public Response apply(InputStream inputStream) throws Exception {

                    ......
                    return response;
                }
            });
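
For readers who want to experiment with the timing without an RxJava dependency, the effect of the transformer can be approximated with the JDK alone. The sketch below is an analogy, not the downloader's implementation: it swaps RxJava's delay operator for CompletableFuture.delayedExecutor (Java 9 or later), and DelayedDownloadSketch, download, and demo are hypothetical names. It assumes the same ordering as delay applied to a Maybe: the fetch starts immediately and only the delivery of its result is shifted by the configured delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical JDK-only analogy of a download delay; requires Java 9+.
class DelayedDownloadSketch {

    // Runs fetch right away and hands its result on after delayMillis.
    static <T> CompletableFuture<T> download(Supplier<T> fetch, long delayMillis) {
        CompletableFuture<T> result = CompletableFuture.supplyAsync(fetch);
        if (delayMillis <= 0) {
            return result; // no delay configured: pass the result straight through
        }
        Executor delayed = CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS);
        return result.thenApplyAsync(value -> value, delayed);
    }

    // Returns the elapsed milliseconds for a download delayed by 200 ms.
    static long demo() {
        long start = System.nanoTime();
        String body = download(() -> "<html></html>", 200).join();
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(body + " delivered after " + elapsed + " ms");
        return elapsed;
    }
}
```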

2.2.3 Domain

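Domain rate limiting is modeled on the Scrapy framework's implementation: each domain and its most recent access time are stored in a ConcurrentHashMap. The domainDelay property can be set on each Request, which rate-limits that single Request against its Domain.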

import cn.netdiscovery.core.domain.Request;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Created by tony on 2019-05-06.
 */
public class Throttle {

    private Map<String,Long> domains = new ConcurrentHashMap<String,Long>();

    private static class Holder {
        private static final Throttle instance = new Throttle();
    }

    private Throttle() {
    }

    public static final Throttle getInsatance() {
        return Throttle.Holder.instance;
    }

    public void wait(Request request) {

        String domain = request.getUrlParser().getHost();
        Long lastAccessed = domains.get(domain);

        if (lastAccessed!=null && lastAccessed>0) {
            long sleepSecs = request.getDomainDelay() - (System.currentTimeMillis() - lastAccessed);
            if (sleepSecs > 0) {
                try {
                    Thread.sleep(sleepSecs);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

        domains.put(domain,System.currentTimeMillis());
    }
}

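Once a Request has been taken from the message queue, the crawler first checks whether the Request itself needs to sleep, and then checks whether access to its Domain needs to be delayed.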

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // pause fetching
            ......

            // take a request from the message queue
            final Request request = queue.poll(name);

            if (request == null) {

                waitNewRequest();
            } else {

                if (request.getSleepTime() > 0) {

                    try {
                        Thread.sleep(request.getSleepTime());
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }

                Throttle.getInsatance().wait(request);
 
                ......
            }
        }
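
The Throttle class reads the last access time with get and writes it back with put, which are two separate steps. When several threads request the same domain at once, more than one of them may pass the check together. One possible variation, shown purely as a sketch, reserves the next slot for a domain in a single atomic compute call. DomainSlotSketch and reserve are hypothetical names, and the slot-reservation behavior is an assumption of this example rather than how NetDiscovery works. The current time is passed in as a parameter so that the logic can be checked without sleeping.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; the names are illustrative and not NetDiscovery's API.
class DomainSlotSketch {

    // domain -> earliest time (ms) at which the next request may start
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    /**
     * Atomically reserves the next slot for the domain and returns how many
     * milliseconds the caller should sleep before sending (0 means send now).
     */
    long reserve(String domain, long delayMillis, long nowMillis) {
        long[] start = new long[1];
        nextAllowed.compute(domain, (key, next) -> {
            // The slot starts now, or when the previous reservation expires.
            start[0] = (next == null) ? nowMillis : Math.max(nowMillis, next);
            return start[0] + delayMillis;
        });
        return start[0] - nowMillis;
    }

    public static void main(String[] args) {
        DomainSlotSketch throttle = new DomainSlotSketch();
        long now = System.currentTimeMillis();
        System.out.println(throttle.reserve("example.com", 1000, now)); // prints 0
        System.out.println(throttle.reserve("example.com", 1000, now)); // prints 1000
    }
}
```

A caller would sleep for the returned number of milliseconds before sending the request. Because each reservation pushes the next slot forward, concurrent callers for the same domain end up spaced delayMillis apart.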

2.2.4 Pipeline

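The crawler handles a Request roughly as follows: issue the network request (including the retry mechanism) -> store the response in a page -> parse the page -> run the pipelines in order -> finish the Request.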

                // the request is being processed
                downloader.download(request)
                        .retryWhen(new RetryWithDelay(maxRetries, retryDelayMillis, request)) // retry mechanism for network requests
                        .map(new Function<Response, Page>() {

                            @Override
                            public Page apply(Response response) throws Exception {
                                // store the response in a page
                                ......
                                return page;
                            }
                        })
                        .map(new Function<Page, Page>() {

                            @Override
                            public Page apply(Page page) throws Exception {

                                if (parser != null) {

                                    parser.process(page);
                                }

                                return page;
                            }
                        })
                        .map(new Function<Page, Page>() {

                            @Override
                            public Page apply(Page page) throws Exception {

                                if (!page.getResultItems().isSkip() && Preconditions.isNotBlank(pipelines)) {

                                    pipelines.stream()
                                            .forEach(pipeline -> {
                                                pipeline.process(page.getResultItems());
                                            });
                                }

                                return page;
                            }
                        })
                        .observeOn(Schedulers.io())
                        .subscribe(new Consumer<Page>() {

                            @Override
                            public void accept(Page page) throws Exception {

                                log.info(page.getUrl());

                                if (request.getAfterRequest() != null) {

                                    request.getAfterRequest().process(page);
                                }

                                signalNewRequest();
                            }
                        }, new Consumer<Throwable>() {
                            @Override
                            public void accept(Throwable throwable) throws Exception {

                                log.error(throwable.getMessage(), throwable);
                            }
                        });

Pipeline rate limiting is essentially implemented with RxJava's delay operator and a blocking operator (blockingFirst).

map(new Function<Page, Page>() {

        @Override
        public Page apply(Page page) throws Exception {

               if (!page.getResultItems().isSkip() && Preconditions.isNotBlank(pipelines)) {

                   pipelines.stream()
                          .forEach(pipeline -> {

                                if (pipeline.getPipelineDelay()>0) {

                                        // Pipeline Delay
                                        Observable.just("pipeline delay").delay(pipeline.getPipelineDelay(),TimeUnit.MILLISECONDS).blockingFirst();
                                 }

                                pipeline.process(page.getResultItems());
                          });
               }

                return page;
       }
})

In addition, NetDiscovery can be configured through an application.yaml or application.properties file. The rate-limiting parameters can be set there as well, and they can also be given random values.
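
The article does not show how the random values are drawn, so the following is only an illustration of the idea. It assumes a delay specified as a closed range of milliseconds. RandomDelaySketch and randomDelay are hypothetical names, and no actual NetDiscovery configuration key is implied.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; not taken from NetDiscovery.
class RandomDelaySketch {

    /** Returns a delay in milliseconds drawn uniformly from [minMillis, maxMillis]. */
    static long randomDelay(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis || maxMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("invalid delay range");
        }
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = randomDelay(500, 1500);
        System.out.println("sleeping " + delay + " ms before the next request");
        Thread.sleep(delay);
    }
}
```

Varying the delay makes the interval between requests less regular than a fixed value would.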

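2.3 Non-blocking crawler operation

In earlier versions, new Requests could not be added once a crawler was running, because the crawler exited by default after consuming every Request in the queue.

The new version uses Condition, so Requests can be added to a crawler's message queue even while that crawler is running.

A Condition gives finer-grained control over a lock. It replaces the traditional Object wait() and notify() for coordinating threads. Compared with Object's wait() and notify(), coordinating threads through Condition's await() and signal() is safer and more efficient.

Spider needs to define a ReentrantLock and a Condition.

It then defines the waitNewRequest() and signalNewRequest() methods. The first suspends the current crawler thread to wait for a new Request; the second wakes the crawler thread so that it consumes Requests from the message queue.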

    private ReentrantLock newRequestLock = new ReentrantLock();
    private Condition newRequestCondition = newRequestLock.newCondition();
  
    ......

    private void waitNewRequest() {
        newRequestLock.lock();

        try {
            newRequestCondition.await(sleepTime, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            log.error("waitNewRequest - interrupted, error {}", e);
        } finally {
            newRequestLock.unlock();
        }
    }

    public void signalNewRequest() {
        newRequestLock.lock();

        try {
            newRequestCondition.signalAll();
        } finally {
            newRequestLock.unlock();
        }
    }
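
The same wait-and-signal pattern can be exercised outside the framework. The sketch below is a self-contained illustration under assumptions, not Spider's code: NewRequestSignalSketch, push, take, and demo are hypothetical names, and a plain ArrayDeque stands in for the message queue. Unlike waitNewRequest(), take re-checks the queue in a loop around the wait, which is the usual way to guard against spurious wakeups.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; it is not part of NetDiscovery.
class NewRequestSignalSketch {

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Queue<String> urls = new ArrayDeque<>();

    // Adds a URL and wakes every thread waiting in take().
    void push(String url) {
        lock.lock();
        try {
            urls.add(url);
            notEmpty.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits up to timeoutMillis for a URL; returns null on timeout.
    String take(long timeoutMillis) throws InterruptedException {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (urls.isEmpty()) {
                if (nanos <= 0L) {
                    return null;
                }
                nanos = notEmpty.awaitNanos(nanos); // returns the remaining wait time
            }
            return urls.poll();
        } finally {
            lock.unlock();
        }
    }

    // Returns {result of a take on an empty queue, result of a take woken by push}.
    static String[] demo() {
        NewRequestSignalSketch queue = new NewRequestSignalSketch();
        try {
            String first = queue.take(50); // nothing queued yet, so this times out
            Thread producer = new Thread(() -> {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                queue.push("https://example.com/late");
            });
            producer.start();
            String second = queue.take(5000); // woken by push well before the timeout
            producer.join();
            return new String[] {first, second};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new String[0];
        }
    }
}
```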

As the loop below shows, if no Request can be taken from the message queue, waitNewRequest() runs.

        while (getSpiderStatus() != SPIDER_STATUS_STOPPED) {

            // pause fetching
            if (pause && pauseCountDown!=null) {
                try {
                    this.pauseCountDown.await();
                } catch (InterruptedException e) {
                    log.error("can't pause : ", e);
                }

                initialDelay();
            }

            // take a request from the message queue
            final Request request = queue.poll(name);

            if (request == null) {

                waitNewRequest();
            } else {
                ......
            }
     }

Next, the Queue interface contains a default method, pushToRunninSpider(), which not only pushes the request onto the queue but also calls spider.signalNewRequest().

    /**
     * Adds a Request to the Queue of a running crawler without blocking the crawler.
     *
     * @param request request
     */
    default void pushToRunninSpider(Request request, Spider spider) {

        push(request);
        spider.signalNewRequest();
    }

Finally, even when a crawler is already running, a Request can be added to that crawler's Queue at any time.

        Spider spider = Spider.create(new DisruptorQueue())
                .name("tony")
                .url("http://www.163.com");

        CompletableFuture.runAsync(()->{
            spider.run();
        });

        try {
            Thread.sleep(2000L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        spider.getQueue().pushToRunninSpider(new Request("https://www.baidu.com", "tony"),spider);

        try {
            Thread.sleep(2000L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        spider.getQueue().pushToRunninSpider(new Request("https://www.jianshu.com", "tony"),spider);

        System.out.println("end....");

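Summary

The crawler framework is on GitHub: https://github.com/fengzhizi715/NetDiscovery

This article summarized how a general-purpose crawler framework uses multithreading in certain specific scenarios. In the future, NetDiscovery will add more general-purpose features.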

