Java Crawler Series, Part 4: Using selenium-java to Scrape Data Loaded by Asynchronous JS Requests

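Earlier articles in this series covered how to fetch a page's HTML with httpclient and how to parse that HTML with jsoup to extract the data we want. Sometimes, though, these two tools alone cannot get the data, as the following example shows.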

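1. Requirement scenario

We want to scrape a stock's latest price. The page information shown in the browser's F12 developer tools is as follows:

Following the approach from the previous articles, the scraping code looks like this: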

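import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @description: Scrape the latest price of a stock
 * @author: JAVA開發老菜鳥
 * @date: 2021-10-16 21:47
 */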
public class StockPriceSpider {

    Logger logger = LoggerFactory.getLogger(this.getClass());

    public static void main(String[] args) {

        StockPriceSpider stockPriceSpider = new StockPriceSpider();
        String html = stockPriceSpider.httpClientProcess();
        stockPriceSpider.jsoupProcess(html);
    }

    private String httpClientProcess() {
        String html = "";
        String uri = "http://quote.eastmoney.com/sh600036.html";
        //1. Create the httpclient, which is like opening a browser
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = null;
        //2. Create the GET request, which is like typing the URL into the address bar
        HttpGet request = new HttpGet(uri);
        try {
            request.setHeader("user-agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36");
            request.setHeader("accept", "application/json, text/javascript, */*; q=0.01");

//            HttpHost proxy = new HttpHost("3.211.17.212", 80);
//            RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
//            request.setConfig(config);

            //3. Execute the GET request, which is like pressing Enter in the address bar
            response = httpClient.execute(request);

            //4. Process the response only when the status is 200
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                //5. Read the response body
                HttpEntity httpEntity = response.getEntity();
                html = EntityUtils.toString(httpEntity, "utf-8");
                logger.info("Request to {} succeeded, page data returned: {}", uri, html);
            } else {
                // For any status other than 200, such as 404 (page not found), handle as appropriate; omitted here
                logger.info("Request to {} returned a status other than 200", uri);
                logger.info(EntityUtils.toString(response.getEntity(), "utf-8"));
            }
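        } catch (ClientProtocolException e) {
            // Log through the logger instead of printing the stack trace to stderr
            logger.error("HTTP protocol error while requesting {}", uri, e);
        } catch (IOException e) {
            logger.error("I/O error while requesting {}", uri, e);
        } finally {
            //6. Release the response and the client
            HttpClientUtils.closeQuietly(response);
            HttpClientUtils.closeQuietly(httpClient);
        }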
        return html;
    }

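    private void jsoupProcess(String html) {
        Document document = Jsoup.parse(html);
        Element price = document.getElementById("price9");
        if (price == null) {
            // getElementById returns null when no element has this id, e.g. when the request failed
            logger.warn("Element price9 not found in the page");
            return;
        }
        logger.info("Stock price: >>> {}", price.text());
    }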

}

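Result:

What? The stock price is "-"? That can't be right.
The correct value cannot be scraped because the site fills it in through asynchronous loading after the page is rendered, so it is not present in the HTML returned by the initial request.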

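2. Ways to scrape asynchronously loaded data in Java

So how can asynchronously loaded data be scraped? There are usually two approaches:

2.1 Embedded browser engine

With this approach, the scraping program starts a browser engine, so that the page we get after JS rendering can be treated just like a static page. Commonly used tools include:

Selenium
PhantomJS
HtmlUnit

I chose Selenium here. It is a browser automation tool, originally built for automated testing, that provides a set of APIs for driving a real browser. It works for crawlers as well.
The steps are as follows: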

Add the pom dependency

<dependency>
   <groupId>org.seleniumhq.selenium</groupId>
   <artifactId>selenium-java</artifactId>
   <version>3.141.59</version>
</dependency>

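Configure the driver for your browser
To use selenium you need to download a browser driver. Different browsers need different drivers. Download address: https://npm.taobao.org/mirrors/chromedriver/
I use Google Chrome, so I downloaded the Windows and Linux drivers matching my browser version.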

After downloading, point the Java system property webdriver.chrome.driver at the driver file:

System.getProperties().setProperty("webdriver.chrome.driver", "F:/download/chromedriver_win32_1/chromedriver.exe");
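
Since I keep both a Windows and a Linux driver, the path can be chosen at runtime instead of being hard-coded. The following is only a minimal sketch: the class name ChromeDriverPathResolver and the Linux path are assumptions made for illustration, so replace both paths with your own download locations.

```java
import java.util.Locale;

/** Picks a chromedriver path based on the operating system name (illustrative helper). */
final class ChromeDriverPathResolver {

    // Windows path taken from the example above.
    static final String WINDOWS_DRIVER = "F:/download/chromedriver_win32_1/chromedriver.exe";
    // Hypothetical Linux location, assumed for this sketch only.
    static final String LINUX_DRIVER = "/opt/chromedriver/chromedriver";

    private ChromeDriverPathResolver() {
    }

    /** Returns the driver path that matches the given os.name value. */
    static String driverPathFor(String osName) {
        String normalized = osName == null ? "" : osName.toLowerCase(Locale.ROOT);
        return normalized.startsWith("windows") ? WINDOWS_DRIVER : LINUX_DRIVER;
    }

    /** Sets webdriver.chrome.driver for the current OS and returns the chosen path. */
    static String configure() {
        String path = driverPathFor(System.getProperty("os.name"));
        System.setProperty("webdriver.chrome.driver", path);
        return path;
    }
}
```

Calling ChromeDriverPathResolver.configure() once before creating the ChromeDriver would then replace the hard-coded setProperty line.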

Code:

Logger logger = LoggerFactory.getLogger(this.getClass());

  public static void main(String[] args) {

      StockPriceSpider stockPriceSpider = new StockPriceSpider();
      stockPriceSpider.seleniumProcess();
  }

  private void seleniumProcess() {

      String uri = "http://quote.eastmoney.com/sh600036.html";

      // Set the location of chromedriver
      System.getProperties().setProperty("webdriver.chrome.driver", "F:/download/chromedriver_win32_1/chromedriver.exe");

      // Set browser options
      ChromeOptions chromeOptions = new ChromeOptions();
      chromeOptions.addArguments("--no-sandbox"); // disable the sandbox
      chromeOptions.addArguments("--disable-dev-shm-usage"); // do not use /dev/shm for shared memory
      chromeOptions.addArguments("--headless"); // headless mode, so no browser window is opened
      WebDriver webDriver = new ChromeDriver(chromeOptions);

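      try {
          webDriver.get(uri);
          WebElement webElements = webDriver.findElement(By.id("price9"));
          String stockPrice = webElements.getText();
          logger.info("Latest stock price >>> {}", stockPrice);
      } finally {
          // quit() ends the whole driver session and its chromedriver process; close() only closes the current window
          webDriver.quit();
      }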
  }

Result:

Scraping succeeded!
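
Because the price is filled in asynchronously, reading the element right after the page loads may occasionally still return the "-" placeholder. Selenium ships WebDriverWait for this situation. The sketch below shows the same polling idea using only the JDK, so it makes no assumptions about a particular Selenium version. The class name PlaceholderPoller and the timeout values are my own choices for illustration.

```java
import java.util.function.Supplier;

/** Re-reads a value until it is no longer a placeholder or a timeout elapses (illustrative helper). */
final class PlaceholderPoller {

    private PlaceholderPoller() {
    }

    /** A value counts as loaded when it is non-null, non-blank and differs from the placeholder. */
    static boolean isLoaded(String value, String placeholder) {
        if (value == null) {
            return false;
        }
        String trimmed = value.trim();
        return !trimmed.isEmpty() && !trimmed.equals(placeholder);
    }

    /**
     * Calls reader repeatedly, sleeping intervalMillis between attempts.
     * Returns the last value read, which may still be the placeholder if the timeout was reached.
     */
    static String pollUntilLoaded(Supplier<String> reader, String placeholder,
                                  long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        String value = reader.get();
        while (!isLoaded(value, placeholder) && System.nanoTime() - start < timeoutNanos) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting
                Thread.currentThread().interrupt();
                break;
            }
            value = reader.get();
        }
        return value;
    }
}
```

In seleniumProcess the reader could be a lambda such as () -> webDriver.findElement(By.id("price9")).getText(), with "-" as the placeholder. The caller should still check the result with isLoaded, since a timeout returns the last value read.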

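2.2 Reverse analysis

Reverse analysis means using F12 to find the link that the Ajax call uses to fetch the data asynchronously, calling that link directly to get the JSON result, and then parsing the JSON to extract the data we want.
The key to this approach is finding that Ajax link. I have not looked into it further; interested readers can look it up on Baidu. It is omitted here.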
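
As a rough idea of what this approach looks like once the link is known, here is a minimal JDK-only sketch. It is not the site's actual API: the class name AjaxQuoteClient is made up, the URL has to be whatever you find in the F12 Network tab, and any field name you pass in must match what that response really contains. The regex extraction only handles simple, flat numeric fields; a real project would use a JSON library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Calls an Ajax link directly and pulls one numeric field out of the JSON text (illustrative helper). */
final class AjaxQuoteClient {

    private AjaxQuoteClient() {
    }

    /** Fetches the raw response body of the Ajax link found through F12. */
    static String fetch(String ajaxUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(ajaxUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("user-agent", "Mozilla/5.0");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    /** Returns the first numeric value (quoted or unquoted) stored under the given field name. */
    static Optional<String> extractNumber(String json, String field) {
        Pattern pattern = Pattern.compile(
                "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"?(-?\\d+(?:\\.\\d+)?)\"?");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Usage would be along the lines of AjaxQuoteClient.extractNumber(AjaxQuoteClient.fetch(url), fieldName), where both url and fieldName come from inspecting the real request in the browser.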

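3. Closing remarks

That is how to scrape asynchronously loaded data with selenium-java. Using this method, I wrote a small tool:
a holdings market value notification system. Every day it automatically calculates the total market value of the account from your holdings configuration and emails the result to a specified mailbox.
The technologies used are as follows:

The code has been uploaded to my Gitee (碼雲) repository; take a look if you are interested.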
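
The tool's code is not reproduced in this article, but the market value arithmetic it needs is simple: multiply each holding's share count by its scraped latest price and add up the results. The sketch below is only an assumption about how that could look. The class name PortfolioValueCalculator and the two map parameters are hypothetical and not taken from the actual project.

```java
import java.math.BigDecimal;
import java.util.Map;

/** Sums shares times latest price over all holdings (illustrative helper). */
final class PortfolioValueCalculator {

    private PortfolioValueCalculator() {
    }

    /**
     * sharesByCode maps a stock code to the number of shares held;
     * priceByCode maps a stock code to its latest scraped price.
     * Holdings without a price are skipped rather than counted as zero silently elsewhere.
     */
    static BigDecimal totalMarketValue(Map<String, Integer> sharesByCode,
                                       Map<String, BigDecimal> priceByCode) {
        BigDecimal total = BigDecimal.ZERO;
        for (Map.Entry<String, Integer> holding : sharesByCode.entrySet()) {
            BigDecimal price = priceByCode.get(holding.getKey());
            if (price == null) {
                continue; // price could not be scraped for this code
            }
            total = total.add(price.multiply(BigDecimal.valueOf(holding.getValue())));
        }
        return total;
    }
}
```

BigDecimal is used instead of double so that prices with two decimal places add up without floating-point rounding noise.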
