How to scrape all JD product information with the Java crawler gecco (Part 1)

Original article

xtuhcy

2020-02-21 00:54

The gecco crawler

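If you are not yet familiar with gecco, see the gecco home page on GitHub. gecco is simple and easy to use: scraping all of JD's product information takes only 9 classes.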

Analyzing the JD site

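To scrape all product information from JD, we first need to analyze the site. The JD site can be roughly divided into three levels: the categories on the home page link to product list pages, and each product on a list page has its own detail page. So once we have found every category, we can scrape product information one category at a time.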
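The three-level structure described above can be sketched as a breadth-first walk over a queue of pending pages. This is only a toy model that uses the JDK and nothing else: the `links` map is a hypothetical in-memory stand-in for "download a page and extract its links", and the page names are made up for illustration. It is not how gecco is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Toy model of the three-level crawl: category index, then list pages, then detail pages. */
final class ThreeLevelCrawlSketch {

    /**
     * Visits pages breadth-first starting from the entry page.
     * The links map is a hypothetical replacement for real downloading and parsing.
     */
    static List<String> crawl(String entry, Map<String, List<String>> links) {
        Deque<String> queue = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        queue.add(entry);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (visited.contains(page)) {
                continue; // never process the same page twice
            }
            visited.add(page);
            // Enqueue every link found on this page for later processing.
            queue.addAll(links.getOrDefault(page, List.of()));
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical site: one category index, two list pages, three detail pages.
        Map<String, List<String>> site = Map.of(
                "allSort", List.of("list/phones", "list/appliances"),
                "list/phones", List.of("item/1", "item/2"),
                "list/appliances", List.of("item/3"));
        System.out.println(crawl("allSort", site));
    }
}
```

Every page is handled the same way (take it from the queue, extract links, enqueue them), which is why finding all categories is enough to reach every product.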

Entry URL

http://www.jd.com/allSort.aspx lists every product category on JD. We use this page as the start page for scraping all of JD's product information.

Creating AllSort, the HtmlBean class for the start page

import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

    private static final long serialVersionUID = 665662335318691818L;

    // The request that produced this page; used later to derive sub-requests
    @Request
    private HttpRequest request;

    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
    private List<Category> mobile;

    // Household appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
    private List<Category> domestic;

    public List<Category> getMobile() {
        return mobile;
    }

    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }

    public List<Category> getDomestic() {
        return domestic;
    }

    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}

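This example scrapes only two top-level categories, mobile phones and household appliances. Each top-level category contains several subcategories, represented as List<Category>. gecco supports nested beans, which map well onto the structure of an HTML page. Category holds the subcategory information, and HrefBean is a shared bean for links.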

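import java.util.List;

import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

public class Category implements HtmlBean {

    private static final long serialVersionUID = 3018760488621382659L;

    // Text of the group heading (the link inside dt)
    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;

    // Links to the subcategories (the links inside dd)
    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }

}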

A tip for finding an element's cssPath

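The hard part of the two classes above is working out the cssPath values, so here is a tip. Open the page you want to scrape in Chrome, press F12 to open the developer tools, and select the element you want to extract.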

With the element selected in the panel on the right side of the browser, right-click it and choose Copy > Copy selector to get the element's cssPath:

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

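If you are familiar with jQuery selectors, you will see that the leading steps are not needed. Since we also only want the dl elements, the selector simplifies to: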

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
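The manual simplification above can also be expressed as a small string transformation. The sketch below is an illustration only: `SelectorSimplifier` and its parameters are hypothetical names, it is not part of gecco or Chrome, and it assumes the copied selector uses only ` > ` child combinators, as the one above does.

```java
import java.util.Arrays;

/** Shortens a selector copied from the browser to one anchored at a distinctive class. */
final class SelectorSimplifier {

    /**
     * Drops every step before the first step that carries anchorClass,
     * reduces that step to the bare class, and appends the wanted child tag.
     */
    static String simplify(String copiedSelector, String anchorClass, String childTag) {
        String[] steps = copiedSelector.trim().split("\\s*>\\s*");
        for (int i = 0; i < steps.length; i++) {
            if (hasClass(steps[i], anchorClass)) {
                StringBuilder result = new StringBuilder("." + anchorClass);
                for (int j = i + 1; j < steps.length; j++) {
                    result.append(" > ").append(steps[j]);
                }
                return result.append(" > ").append(childTag).toString();
            }
        }
        throw new IllegalArgumentException("No step with class ." + anchorClass);
    }

    /** True if a step such as "div.category-items.clearfix" lists the given class. */
    private static boolean hasClass(String step, String cls) {
        String[] parts = step.split("\\.");
        // parts[0] is the tag name; the remaining parts are class names.
        return Arrays.asList(parts).subList(1, parts.length).contains(cls);
    }

    public static void main(String[] args) {
        String copied = "body > div:nth-child(5) > div.main-classify > div.list"
                + " > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2)"
                + " > div.mc > div.items";
        System.out.println(simplify(copied, "category-items", "dl"));
    }
}
```

Anchoring on a class name instead of the full path from body makes the selector less sensitive to changes near the top of the page, although the nth-child steps that remain still depend on the page layout.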

Writing the business-logic class for AllSort

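Once AllSort has been populated, we need to process it. Here we do not persist the category information; we only extract the category links so that the product list pages can be scraped next. The code: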

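import java.util.List;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

    @Override
    public void process(AllSort allSort) {
        // Only the mobile phone categories are followed in this example
        List<Category> categorys = allSort.getMobile();
        for (Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for (HrefBean href : hrefs) {
                // Restrict the list page to JD self-operated, in-stock products
                String url = href.getUrl() + "&delivery=1&page=1&JL=4_10_0&go=0";
                HttpRequest currRequest = allSort.getRequest();
                // Queue the list page as a sub-request of the current request
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }

}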

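@PipelineName defines the name of the pipeline, which is referenced in the @Gecco annotation on AllSort. After gecco has downloaded the page and populated the bean, it calls the pipelines listed in @Gecco one by one. Appending "&delivery=1&page=1&JL=4_10_0&go=0" to each sub-link restricts the results to products that are sold by JD itself and are in stock. SchedulerContext.into() puts the link into the queue to await further scraping.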
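The pipeline appends the filter with plain string concatenation, which assumes that every category link already contains a query string. A slightly more defensive variant is sketched below using only the JDK. `ListUrlBuilder` and `withFilter` are hypothetical names, the sample URLs in `main` are made up for illustration, and the filter string is taken from the article as-is without checking that JD still honors it.

```java
/** Appends a fixed filter to a list-page URL, choosing the right separator. */
final class ListUrlBuilder {

    // Filter from the article: JD self-operated and in-stock products only.
    static final String SELF_OPERATED_IN_STOCK = "delivery=1&page=1&JL=4_10_0&go=0";

    /** Joins url and query with '?' or '&', depending on what the URL already has. */
    static String withFilter(String url, String query) {
        if (url.endsWith("?") || url.endsWith("&")) {
            return url + query; // separator already present
        }
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + query;
    }

    public static void main(String[] args) {
        // Hypothetical category links, used only to show both cases.
        System.out.println(withFilter("http://list.example.com/list.html?cat=1,2,3", SELF_OPERATED_IN_STOCK));
        System.out.println(withFilter("http://list.example.com/phones.html", SELF_OPERATED_IN_STOCK));
    }
}
```

In the pipeline, the result of such a helper would take the place of the concatenated `url` string before it is passed to `subRequest`.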
