Jsoup + WebMagic + Selenium + PhantomJS: a simple crawler for housing-listing sites, with parsing

It has been a long time since my last blog post.

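I recently needed data from several housing-listing platforms. After looking at the sites involved, and given my earlier experience crawling sites with Jsoup, I did not expect any trouble.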

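So I wrote a simple Jsoup demo to crawl a certain site with an "8" in its name, and within a few minutes of crawling I was being served a CAPTCHA~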

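Evidently the site is alert to crawlers, which makes sense for a business that lives off its data. But wait: the prices look normal in the browser, so why are they garbled in the HTML I crawled?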

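The cause is a custom font: each detail page loads its own font dynamically. For parsing such obfuscated ("encrypted") fonts, see the following articles: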

https://www.cnblogs.com/a595452248/p/10800845.html

https://www.jianshu.com/p/a5d904c5d88e

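Looking more closely at the HTML, I found that the description meta tag contains the correct price in plain text, so I don't see what obfuscating the font is supposed to achieve (side-eye smirk).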

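Maoyan (貓眼) is different: every digit there is rendered with an obfuscated font. Which font it is does not matter: download the base64 content, save it as a file, then work out the glyph mapping to recover the actual digits.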

For this, see a Python write-up on handling font-based anti-crawling:

https://blog.csdn.net/xing851483876/article/details/82928607
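The two steps described above (pull the base64 font out of the page, then apply a glyph-to-digit mapping) might look roughly like this in Java. This is only a sketch under assumptions: the regular expression assumes the font is embedded as a `base64,` data URI, the class name `FontDeobfuscator` and the file name `page-font.ttf` are made up for illustration, and the mapping itself is hypothetical. In practice it has to be worked out by inspecting the saved font file, as the linked articles explain.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FontDeobfuscator {

    // Assumption: the font is embedded as url(data:...;base64,<payload>)
    private static final Pattern BASE64_FONT =
            Pattern.compile("base64,([A-Za-z0-9+/=]+)");

    /** Returns the decoded font bytes, or null if no base64 payload is found. */
    public static byte[] extractFontBytes(String html) {
        Matcher m = BASE64_FONT.matcher(html);
        if (!m.find()) {
            return null;
        }
        return Base64.getDecoder().decode(m.group(1));
    }

    /** Replaces each character that has an entry in the mapping; others pass through. */
    public static String decode(String text, Map<Character, Character> glyphToDigit) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(glyphToDigit.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page fragment; "Zm9udA==" is just the text "font" in base64
        String html = "@font-face{src:url(data:font/ttf;base64,Zm9udA==)}";
        byte[] font = extractFontBytes(html);
        // Save the font so its glyphs can be inspected with a font tool
        Files.write(Paths.get("page-font.ttf"), font);

        // Hypothetical mapping; a real one comes from inspecting the font file
        Map<Character, Character> mapping = new HashMap<>();
        mapping.put('\uE001', '1');
        mapping.put('\uE002', '2');
        System.out.println(decode("\uE001\uE00200", mapping)); // prints 1200
    }
}
```

Since every detail page loads its own font, the mapping most likely has to be rebuilt per page rather than cached.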

Later, while searching for material, I came across WebMagic, a crawler framework that is also quite convenient to use.

WebMagic project: https://github.com/code4craft/webmagic  Chinese documentation: http://webmagic.io/

Take the CSDN forum as an example: I want to save the HTML of the forum list pages locally, along with the detail-page HTML of every post in those lists.

DemoProcessor

package com.personal.secondhand.processor;

import com.personal.secondhand.pipeline.FileInfoPipeline;
import com.personal.secondhand.pipeline.FilePagePipeline;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Demo crawler for the CSDN forum
 */
public class DemoProcessor implements PageProcessor {

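    /**
     * Site settings assembled from request headers observed in the browser's F12 dev tools.
     * The header name, header value and user agent below are placeholders.
     */
    private Site site = Site.me()
            .addHeader("header-name-seen-in-F12", "header-value")
            .setUserAgent("spoofed-user-agent")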
            .setSleepTime(3000)
            .setRetryTimes(3)
            .setCycleRetryTimes(3);

    @Override
    public void process(Page page) {
        String html = page.getHtml().get();
        String url = page.getUrl().get();
        if (url.contains("page")) {
            System.out.println("list page");
            // Parse the HTML with Jsoup and extract the detail link of each post
            Document document = Jsoup.parse(html);
            Elements elements = document.select("a[class=forums_title]");
            for (Element ele : elements) {
                String infoUrl = ele.attr("href");
                // Queue each detail URL found on the list page as a new task
                page.addTargetRequest(infoUrl);
            }
            // Hand the list-page HTML to FilePagePipeline
            page.putField("pageHtml", html);
        } else {
            System.out.println("detail page");
            // Hand the detail-page HTML to FileInfoPipeline
            page.putField("infoHtml", html);
        }

    }

    public static void main(String[] args) {
        // Create a spider for this processor
        Spider spider = Spider.create(new DemoProcessor());
        // Seed it with several start URLs
        spider.addUrl("https://bbs.csdn.net/forums/J2EE?page=1");
        spider.addUrl("https://bbs.csdn.net/forums/J2EE?page=2");
        // Hand the extracted fields to FileInfoPipeline / FilePagePipeline
        spider.addPipeline(new FileInfoPipeline("d:/csdnhtml/"));
        spider.addPipeline(new FilePagePipeline("d:/csdnhtml/"));
        // Use multiple threads
        spider.thread(4);
        // Start crawling
        spider.run();
    }


    @Override
    public Site getSite() {
        return site;
    }
}

FileInfoPipeline

package com.personal.secondhand.pipeline;

import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.joda.time.DateTime;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.FilePipeline;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

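/**
 * Shared pipeline that saves detail pages locally.
 * Whenever infoHtml is received, a detail HTML file is written.
 * File-name substitutions (with the original naming it may be impossible to trace a file back to the URL it came from):
 *  (1) remove https://
 *  (2) replace ? with #
 *  (3) replace / with _
 */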
@Slf4j
public class FileInfoPipeline extends FilePipeline {

    public FileInfoPipeline() {
        super();
    }

    public FileInfoPipeline(String path) {
        super(path);
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Fetch the HTML that the PageProcessor stored under this key
        String html = resultItems.get("infoHtml");
        if (StringUtils.isBlank(html)) {
            // Nothing for this pipeline to handle, so return
            return;
        }
        String url = resultItems.getRequest().getUrl();
        url = url.replaceAll("https://", "").replaceAll("\\?", "#").replaceAll("/", "_");
        // The file is named after the URL, with characters unsuitable for file names replaced
        // The rest is adapted from FilePipeline#process, with the written content replaced
        String today = new DateTime().toString("yyyyMMdd");
        String path = super.path + PATH_SEPERATOR + today + PATH_SEPERATOR + "infoHtml" + PATH_SEPERATOR + url;
//            PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(new FileOutputStream(this.getFile(path + DigestUtils.md5Hex(resultItems.getRequest().getUrl()) + ".html")), "UTF-8"));
        // try-with-resources closes the writer even if writing fails
        try (PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(new FileOutputStream(this.getFile(path + ".html")), "UTF-8"))) {
            printWriter.println(html);
        } catch (IOException e) {
            log.error("Failed to write info HTML file", e);
        }
    }
}

FilePagePipeline

package com.personal.secondhand.pipeline;

import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.joda.time.DateTime;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.FilePipeline;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

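/**
 * Shared pipeline that saves list pages locally.
 * Whenever pageHtml is received, a list HTML file is written.
 * File-name substitutions (with the original naming it may be impossible to trace a file back to the URL it came from):
 *  (1) remove https://
 *  (2) replace ? with #
 *  (3) replace / with _
 */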
@Slf4j
public class FilePagePipeline extends FilePipeline {

    public FilePagePipeline() {
        super();
    }

    public FilePagePipeline(String path) {
        super(path);
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Fetch the HTML that the PageProcessor stored under this key
        String html = resultItems.get("pageHtml");
        if (StringUtils.isBlank(html)) {
            // Nothing for this pipeline to handle, so return
            return;
        }
        String url = resultItems.getRequest().getUrl();
        url = url.replaceAll("https://", "").replaceAll("\\?", "#").replaceAll("/", "_");
        // The rest is adapted from FilePipeline#process, with the written content replaced
        String today = new DateTime().toString("yyyyMMdd");
        String path = super.path + PATH_SEPERATOR + today + PATH_SEPERATOR + "pageHtml" + PATH_SEPERATOR + url;
//            PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(new FileOutputStream(this.getFile(path + DigestUtils.md5Hex(resultItems.getRequest().getUrl()) + ".html")), "UTF-8"));
        // try-with-resources closes the writer even if writing fails
        try (PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(new FileOutputStream(this.getFile(path + ".html")), "UTF-8"))) {
            printWriter.println(html);
        } catch (IOException e) {
            log.error("Failed to write page HTML file", e);
        }
    }
}

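Run the main method and the csdnhtml folder on drive D fills with files; opening them shows the HTML we wanted. From there, the saved HTML files can be parsed with Jsoup.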
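The file-naming rule shared by both pipelines can be tried out in isolation. The sketch below is not part of the project: `UrlFileNames` and `toFileName` are names invented here, and it uses `String.replace` (literal replacement) instead of the regex-based `replaceAll` in the listings, which gives the same result for these three substitutions.

```java
public class UrlFileNames {

    /** Applies the three substitutions from the pipelines: drop https://, ? to #, / to _. */
    public static String toFileName(String url) {
        return url.replace("https://", "")
                  .replace("?", "#")
                  .replace("/", "_");
    }

    public static void main(String[] args) {
        // One of the seed URLs from DemoProcessor
        System.out.println(toFileName("https://bbs.csdn.net/forums/J2EE?page=1"));
        // prints bbs.csdn.net_forums_J2EE#page=1
    }
}
```

Note that only `https://` is stripped. A plain `http://` URL would keep its `http:` prefix, and the colon is not a valid file-name character on Windows, so such URLs would need an extra substitution.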

That completes a simple crawl.

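In DemoProcessor's process method, each page is classified as a list task or a detail task and its HTML is stored under the matching key. FileInfoPipeline and FilePagePipeline both receive every result, but each one only acts when its own key is present. The official demos offer plenty of further examples.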

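Some sites also navigate through JavaScript-generated links, so neither Jsoup nor WebMagic can fetch the correct HTML by requesting the URL directly; Mango listings (芒果房源) is one example.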

In that case Selenium has to be combined with PhantomJS. (PhantomJS is no longer maintained and the latest Selenium no longer supports it; headless Chrome or Firefox can be used as a replacement.)

The code is in the project linked below.

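I won't say much about Jsoup itself: its main use here is fetching simple URLs and parsing the page document. [I have also used it for XSS filtering, and one pitfall left a deep impression: in some version (1.7.1?), cleaning an ordinary string such as (&orderby=xx) swallowed part of it and turned it into something that looked garbled (a garbled character + by), which broke the parameter. Upgrading to the latest version fixed it, although some methods had changed as well.]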

Jsoup's selector syntax is easy to look up on your own.

Project: https://github.com/UncleY/secondhand

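Also, major sites declare their restrictions on crawlers in the robots.txt file in the site root; a search engine will explain the robots protocol in more detail than I can. I mention it so that everyone tries to consume as little of the other side's server resources as possible.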
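As a rough illustration of checking robots.txt before crawling a path, here is a deliberately simplified sketch. It only reads the `Disallow` lines in the `User-agent: *` group and treats each as a plain path prefix; it ignores `Allow`, wildcards and `Crawl-delay`, so it is not a full implementation of the protocol. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsCheck {

    /** Collects the Disallow path prefixes that apply to "User-agent: *". */
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                          // blank or malformed line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                if (!lastWasAgent) {
                    inWildcardGroup = false;       // a new group starts here
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed"
                if (key.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    /** True if the path matches none of the disallowed prefixes. */
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt content, not taken from any real site
        String robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: bot\nDisallow: /\n";
        List<String> rules = disallowedForAll(robots);
        System.out.println(isAllowed("/forums/J2EE", rules)); // prints true
        System.out.println(isAllowed("/private/a", rules));   // prints false
    }
}
```

Combined with a generous sleep time such as the `setSleepTime(3000)` already used in DemoProcessor, a check like this keeps the crawler away from paths the site has asked crawlers to avoid.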

That's all for now~
