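WebMagic study series: the downloader module

Abstract:

This post dissects WebMagic's downloader module and looks at the HTTP concepts that the underlying HttpClient code touches, such as Request, Response, and the redirect strategy. It first lays out the structure of the module, then walks through the execution flow, and finally reads the code behind the design pattern involved (the singleton pattern) and the related logic.

0x00: Module structure of the downloader

The main classes and interfaces of the downloader are listed in the table below: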

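Class	Role	Methods	Notes
Downloader	Defines the downloader interface contract	download(r:Request,t:Task):Page	Interface
AbstractDownloader	Defines the downloader status callbacks	onSuccess(), onError(), @Override: download()	Abstract class
HttpClientDownloader	The concrete downloader implementation	Extends AbstractDownloader	Concrete class
CustomRedirectStrategy	Defines the redirect strategy
HttpClientGenerator	Helper class that configures the HttpClient	getClient(s:Site):CloseableHttpClient
HttpClientRequestContext	Data class	Holds the HttpUriRequest and the HttpClientContext
HttpUriRequestConverter	Helper class that configures the request	convert(r:Request,s:Site,p:Proxy):HttpClientRequestContext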
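
To make the three layers in the table concrete, here is a minimal sketch of how an interface, an abstract class with status hooks, and a concrete class can fit together. The Request, Task, and Page types below are simplified stand-ins, and EchoDownloader is a hypothetical class invented for illustration; none of this is WebMagic's actual source.

```java
// Simplified stand-ins for the WebMagic types (assumptions, not the real classes).
final class Request {
    final String url;
    Request(String url) { this.url = url; }
}

final class Task {
    final String domain;
    Task(String domain) { this.domain = domain; }
}

final class Page {
    final String url;
    final boolean downloadSuccess;
    Page(String url, boolean downloadSuccess) {
        this.url = url;
        this.downloadSuccess = downloadSuccess;
    }
}

// The contract: turn a Request into a Page.
interface Downloader {
    Page download(Request request, Task task);
}

// The abstract layer adds status hooks that subclasses may override.
abstract class AbstractDownloader implements Downloader {
    protected void onSuccess(Request request) { }
    protected void onError(Request request) { }
}

// Hypothetical concrete downloader that performs no network I/O.
class EchoDownloader extends AbstractDownloader {
    int successCount = 0;

    @Override
    public Page download(Request request, Task task) {
        Page page = new Page(request.url, true);
        onSuccess(request); // report the outcome through the hook
        return page;
    }

    @Override
    protected void onSuccess(Request request) {
        successCount++;
    }
}
```

Callers depend only on the Downloader interface, so a different implementation can be swapped in without touching the code that consumes the Page.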

0x01: Execution logic of download

First, the download code itself:

    @Override
    public Page download(Request request, Task task) {
        if (task == null || task.getSite() == null) {
            throw new NullPointerException("task or site can not be null");
        }
        CloseableHttpResponse httpResponse = null;
        CloseableHttpClient httpClient = getHttpClient(task.getSite());
        Proxy proxy = proxyProvider != null ? proxyProvider.getProxy(task) : null;
        HttpClientRequestContext requestContext = httpUriRequestConverter.convert(request, task.getSite(), proxy);
        Page page = Page.fail();
        try {
            httpResponse = httpClient.execute(requestContext.getHttpUriRequest(), requestContext.getHttpClientContext());
            page = handleResponse(request, request.getCharset() != null ? request.getCharset() : task.getSite().getCharset(), httpResponse, task);
            onSuccess(request);
            logger.info("downloading page success {}", request.getUrl());
            return page;
        } catch (IOException e) {
            logger.warn("download page {} error", request.getUrl(), e);
            onError(request);
            return page;
        } finally {
            if (httpResponse != null) {
                //ensure the connection is released back to pool
                EntityUtils.consumeQuietly(httpResponse.getEntity());
            }
            if (proxyProvider != null && proxy != null) {
                proxyProvider.returnProxy(proxy, page, task);
            }
        }
    }

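The main flow is clear. It first obtains a configured httpClient through getHttpClient(); that method relies on the singleton pattern, which is covered in detail later. It then derives the request and the client context (wrapped together in an HttpClientRequestContext) from the Request that was passed in, and calls the httpClient's execute method. This is the call that actually sends the request to the server, and it wraps what the server returns in a Response object. Finally, handleResponse packs the Request and the Response into a Page for the PageProcessor to use later.

The same flow in pseudocode: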

fun download(r:Request,t:Task):Page
    httpClient = getHttpClient(t.site())
    proxy = proxyProvider.getProxy(t)
    context = convert(r,t.site(),proxy)
    response = httpClient.execute(context.requestContext,context.clientContext)
    page = handle(r,response)
    return page

The heart of the download function is really just the httpClient's execute method; everything else can be regarded as preparation.
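
One detail of the flow worth isolating is the try/catch/finally skeleton: the result starts out as a failure, is replaced only if the I/O step succeeds, and cleanup runs on both paths. The sketch below shows just that skeleton. Fetcher and Result are hypothetical stand-ins for httpClient.execute and Page, and the released counter merely stands in for releasing the connection and returning the proxy.

```java
import java.io.IOException;

class DownloadFlowSketch {
    // Hypothetical stand-in for httpClient.execute(...).
    interface Fetcher {
        String fetch(String url) throws IOException;
    }

    // Hypothetical stand-in for Page: a success flag plus a body.
    static final class Result {
        final boolean success;
        final String body;
        Result(boolean success, String body) {
            this.success = success;
            this.body = body;
        }
    }

    int released = 0; // counts how many times the cleanup step ran

    Result download(String url, Fetcher fetcher) {
        Result result = new Result(false, null); // start as failed, like Page.fail()
        try {
            String body = fetcher.fetch(url);    // the only step that does I/O
            result = new Result(true, body);
            return result;
        } catch (IOException e) {
            return result;                       // still the failed result
        } finally {
            released++;                          // runs on success and on failure
        }
    }
}
```

Because the failed result is created before the try block, the finally block always has a valid object to work with, which mirrors how the original code can pass page to returnProxy on either path.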

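0x02: Initialization strategy

Initializing the httpClient involves a series of settings, covering both the socket parameters and the HTTP connection configuration. Because there are so many parameters, both the socket configuration and the httpClient itself are assembled with the Builder pattern. The code is as follows: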

    private CloseableHttpClient generateClient(Site site) {
        HttpClientBuilder httpClientBuilder = HttpClients.custom();
        
        httpClientBuilder.setConnectionManager(connectionManager);
        if (site.getUserAgent() != null) {
            httpClientBuilder.setUserAgent(site.getUserAgent());
        } else {
            httpClientBuilder.setUserAgent("");
        }
        if (site.isUseGzip()) {
            httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {

                public void process(
                        final HttpRequest request,
                        final HttpContext context) throws HttpException, IOException {
                    if (!request.containsHeader("Accept-Encoding")) {
                        request.addHeader("Accept-Encoding", "gzip");
                    }
                }
            });
        }
        // work around the POST/redirect/POST 302 redirect problem
        httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());

        SocketConfig.Builder socketConfigBuilder = SocketConfig.custom();
        socketConfigBuilder.setSoKeepAlive(true).setTcpNoDelay(true);
        socketConfigBuilder.setSoTimeout(site.getTimeOut());
        SocketConfig socketConfig = socketConfigBuilder.build();
        httpClientBuilder.setDefaultSocketConfig(socketConfig);
        connectionManager.setDefaultSocketConfig(socketConfig);
        httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
        generateCookie(httpClientBuilder, site);
        return httpClientBuilder.build();
    }

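As the code shows, this is simply a matter of configuring the client from the site. In other words, custom parameters can be placed on the Site instance and they will be picked up here. This is a common initialization strategy: when there are many parameters, pull them into a dedicated configuration class so that they are managed in one place and the code stays structured.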
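
The combination of a configuration class and a builder can be sketched without any HTTP library. SiteConfig and ClientSettings below are hypothetical names made up for this illustration; they only mimic the shape of Site and HttpClientBuilder and are not part of WebMagic or Apache HttpClient.

```java
// Hypothetical, trimmed-down counterpart of Site: it only holds settings.
final class SiteConfig {
    final String userAgent; // may be null when the user did not set one
    final int timeoutMillis;
    final int retryTimes;

    SiteConfig(String userAgent, int timeoutMillis, int retryTimes) {
        this.userAgent = userAgent;
        this.timeoutMillis = timeoutMillis;
        this.retryTimes = retryTimes;
    }
}

// Hypothetical immutable settings object assembled step by step by a builder.
final class ClientSettings {
    final String userAgent;
    final int socketTimeoutMillis;
    final int retryTimes;

    private ClientSettings(Builder b) {
        this.userAgent = b.userAgent;
        this.socketTimeoutMillis = b.socketTimeoutMillis;
        this.retryTimes = b.retryTimes;
    }

    static Builder custom() {
        return new Builder();
    }

    static final class Builder {
        private String userAgent = "";
        private int socketTimeoutMillis = 0;
        private int retryTimes = 0;

        Builder setUserAgent(String userAgent) {
            this.userAgent = userAgent;
            return this; // returning this allows call chaining
        }

        Builder setSocketTimeoutMillis(int millis) {
            this.socketTimeoutMillis = millis;
            return this;
        }

        Builder setRetryTimes(int retryTimes) {
            this.retryTimes = retryTimes;
            return this;
        }

        ClientSettings build() {
            return new ClientSettings(this);
        }
    }

    // Mirrors the shape of generateClient: every value is read from the config object.
    static ClientSettings fromSite(SiteConfig site) {
        return custom()
                .setUserAgent(site.userAgent != null ? site.userAgent : "")
                .setSocketTimeoutMillis(site.timeoutMillis)
                .setRetryTimes(site.retryTimes)
                .build();
    }
}
```

With this split, adding a new setting means adding one field to the configuration class and one builder call, while the callers that pass the configuration around stay unchanged.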

0x03: Singleton pattern

As mentioned in section 0x01, the httpClient is obtained through the singleton pattern. Here is the implementation:

    private CloseableHttpClient getHttpClient(Site site) {
        if (site == null) {
            return httpClientGenerator.getClient(null);
        }
        String domain = site.getDomain();
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    httpClient = httpClientGenerator.getClient(site);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }

The key part of the code, which keeps one client per domain in the httpClients map, boils down to:

if (httpClient == null) {
    synchronized (this) {
        httpClient = httpClients.get(domain); // re-read under the lock
        if (httpClient == null) {
            httpClient = httpClientGenerator.getClient(site);
        }
    }
}

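In other words, the code checks twice whether the instance is null. The first check, followed by taking the lock, is easy to understand. But why check again inside the lock? Consider the following scenario:

The instance has not been created yet, so httpClient is null. Thread 1 sees null but has not yet taken the lock when a context switch hands execution to thread 2. Since thread 1 has not created the instance, thread 2 creates one. Execution then switches back to thread 1, which has already concluded that httpClient is null; it takes the lock and, without a second check, would create another instance, which violates the singleton pattern. The second null check is therefore necessary: only with it is a single instance guaranteed under multithreading.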
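
The scenario above can be exercised with a small self-contained sketch. PerDomainCache is a hypothetical class, and a plain Object stands in for the expensive client. The type of the httpClients map is not shown in the excerpt, so the sketch assumes a ConcurrentHashMap, which makes the unlocked first read safe.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class PerDomainCache {
    // Assumption: a concurrent map, so reading without the lock is safe.
    private final Map<String, Object> clients = new ConcurrentHashMap<>();
    final AtomicInteger created = new AtomicInteger();

    Object get(String domain) {
        Object client = clients.get(domain);      // first check, no lock
        if (client == null) {
            synchronized (this) {
                client = clients.get(domain);     // second check, under the lock
                if (client == null) {
                    client = new Object();        // stands in for an expensive client
                    created.incrementAndGet();
                    clients.put(domain, client);
                }
            }
        }
        return client;
    }

    // Starts several threads that request the same domain at once
    // and returns how many instances were created.
    static int race(int threads) {
        PerDomainCache cache = new PerDomainCache();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();                // wait so all threads begin together
                    cache.get("example.com");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cache.created.get();
    }
}
```

However the threads interleave, race should report exactly one creation, because the second check runs while the lock is held. With a ConcurrentHashMap, the same guarantee is also available from clients.computeIfAbsent(domain, d -> new Object()), which applies the function at most once per key.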
