Crawling All JD Product Information with the Java Crawler Gecco (Part 2)

Crawling All JD Product Information with the Java Crawler Gecco (Part 1)

Crawling the product list

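AllSortPipeline has already extracted the links to the product lists that need further crawling. These links have the form http://list.jd.com/list.html?cat=9987,653,659&delivery=1&JL=4_10_0&go=0. We therefore create a bean for the product list, ProductList, with the following code: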

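import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="http://list.jd.com/list.html?cat={cat}&delivery={delivery}&page={page}&JL={JL}&go=0", pipelines={"consolePipeline", "productListPipeline"})
public class ProductList implements HtmlBean {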

    private static final long serialVersionUID = 4369792078959596706L;

    @Request
    private HttpRequest request;

    /**
     * Details of each list item, including title, price, detail-page URL, etc.
     */
    @HtmlField(cssPath="#plist .gl-item")
    private List<ProductBrief> details;
    /**
     * Current page of the product list
     */
    @Text
    @HtmlField(cssPath="#J_topPage > span > b")
    private int currPage;
    /**
     * Total number of pages in the product list
     */
    @Text
    @HtmlField(cssPath="#J_topPage > span > i")
    private int totalPage;

    public List<ProductBrief> getDetails(){
        return details;
    }

    public void setDetails(List<ProductBrief> details){
        this.details = details;
    }

    public int getCurrPage(){
        return currPage;
    }

    public void setCurrPage(int currPage){
        this.currPage = currPage;
    }

    public int getTotalPage(){
        return totalPage;
    }

    public void setTotalPage(int totalPage){
        this.totalPage = totalPage;
    }

    public HttpRequest getRequest(){
        return request;
    }

    public void setRequest(HttpRequest request){
        this.request = request;
    }

}
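To make the role of the {cat}, {delivery}, {page} and {JL} placeholders in matchUrl more concrete, here is a minimal sketch of how a template of this shape could be turned into a regular expression and matched against a URL. It is an illustration only and not Gecco's own implementation: the class name UrlTemplate and the method match are hypothetical, and the rule that a placeholder value runs up to the next "&" is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Gecco. */
final class UrlTemplate {

    // Finds {name} placeholders in a template.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private UrlTemplate() {}

    /**
     * Matches url against template and returns the placeholder values,
     * or an empty Optional when the URL does not fit the template.
     */
    static Optional<Map<String, String>> match(String template, String url) {
        StringBuilder regex = new StringBuilder();
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so that '?' and '.' match themselves.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a value is one or more characters other than '&'.
            regex.append("([^&]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return Optional.empty();
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), urlMatcher.group(i + 1));
        }
        return Optional.of(params);
    }
}
```

Called with the matchUrl template above and the URL http://list.jd.com/list.html?cat=9987,653,659&delivery=1&page=1&JL=4_10_0&go=0, this sketch yields cat=9987,653,659, delivery=1, page=1 and JL=4_10_0. A URL from another host yields an empty result.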

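currPage and totalPage hold the pagination information from the page and support the paginated crawling described later. The ProductBrief object is a short summary of a product, consisting mainly of the title, preview image, detail-page URL, and so on.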

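import com.geccocrawler.gecco.annotation.Attr;
import com.geccocrawler.gecco.annotation.Href;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Image;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

public class ProductBrief implements HtmlBean {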

    private static final long serialVersionUID = -377053120283382723L;

    @Attr("data-sku")
    @HtmlField(cssPath=".j-sku-item")
    private String code;

    @Text
    @HtmlField(cssPath=".p-name> a > em")
    private String title;

    @Image({"data-lazy-img", "src"})
    @HtmlField(cssPath=".p-img > a > img")
    private String preview;

    @Href(click=true)
    @HtmlField(cssPath=".p-name > a")
    private String detailUrl;

    public String getTitle(){
        return title;
    }

    public void setTitle(String title){
        this.title = title;
    }

    public String getPreview(){
        return preview;
    }

    public void setPreview(String preview){
        this.preview = preview;
    }

    public String getDetailUrl(){
        return detailUrl;
    }

    public void setDetailUrl(String detailUrl){
        this.detailUrl = detailUrl;
    }

    public String getCode(){
        return code;
    }

    public void setCode(String code){
        this.code = code;
    }

}

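The click attribute of @Href(click=true) deserves a note. As the name suggests, it marks a link that we want Gecco to "click" and keep crawling. Gecco automatically adds links annotated with click=true to the download queue, so there is no need to call SchedulerContext.into() manually for them.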

Writing the business logic for ProductList

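Once a ProductList has been crawled, it usually needs to be persisted, that is, the basic product information is written to a database. There are many ways to do this and this example does not cover them. Gecco can be integrated with Spring, so pipelines can be developed with Spring; see the gecco-spring project for reference. This example simply prints to the console. The ProductList business logic has another important job: handling pagination. A list usually spans many pages, and to crawl all of them we need to add the link to the next page to the crawl queue.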

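import org.apache.commons.lang3.StringUtils;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;

@PipelineName("productListPipeline")
public class ProductListPipeline implements Pipeline<ProductList> {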

    @Override
    public void process(ProductList productList){
        HttpRequest currRequest = productList.getRequest();
        // Continue crawling with the next page
        int currPage = productList.getCurrPage();
        int nextPage = currPage + 1;
        int totalPage = productList.getTotalPage();
        if(nextPage <= totalPage) {
            String nextUrl = "";
            String currUrl = currRequest.getUrl();
            if(currUrl.indexOf("page=") != -1) {
                nextUrl = StringUtils.replaceOnce(currUrl, "page=" + currPage, "page=" + nextPage);
            } else {
                nextUrl = currUrl + "&" + "page=" + nextPage;
            }
            SchedulerContext.into(currRequest.subRequest(nextUrl));
        }
    }

}

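JD's list pages specify the page number through the page parameter, so we crawl page by page by replacing that parameter. At this point, the list information for all products can be crawled.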
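The page-parameter rewrite can also be tried in isolation, without Gecco or commons-lang on the classpath. The following is a minimal sketch under stated assumptions: the class PageUrls and the method nextPageUrl are hypothetical names, and instead of StringUtils.replaceOnce it uses a regular expression that replaces whatever numeric value the page parameter currently has. This is a variant of the pipeline logic, not the pipeline code itself.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Gecco. */
final class PageUrls {

    // A "page" query parameter preceded by '?' or '&', so that e.g. "subpage=3" is not touched.
    private static final Pattern PAGE_PARAM = Pattern.compile("([?&])page=\\d+");

    private PageUrls() {}

    /**
     * Builds the URL of the next list page.
     * Returns an empty Optional when currPage is already the last page.
     */
    static Optional<String> nextPageUrl(String currUrl, int currPage, int totalPage) {
        int nextPage = currPage + 1;
        if (nextPage > totalPage) {
            return Optional.empty();
        }
        Matcher m = PAGE_PARAM.matcher(currUrl);
        if (m.find()) {
            // Keep the leading '?' or '&' (group 1) and swap in the new page number.
            return Optional.of(m.replaceFirst("$1page=" + nextPage));
        }
        // No page parameter yet: append one, assuming the URL already has a query string.
        return Optional.of(currUrl + "&page=" + nextPage);
    }
}
```

For a URL containing page=1 and a total of 3 pages, the result contains page=2 in the same position. For a URL without a page parameter, &page=2 is appended. When currPage equals totalPage the result is empty, which corresponds to the pipeline not scheduling any further request.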

xtuhcy

