Learning Java Web Scraping with the JSoup Library: Small Steps, Fast Pace

2020-07-01 09:31

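For a recent requirement I needed to scrape some data from web pages using Java, so I spent a little time looking at JSoup. Here is a brief introduction.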

"jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods." (from the official jsoup site, Java HTML Parser)

Put simply, the jsoup library lets you locate the data you want by HTML tags and elements. Let's get straight to learning how to use it.

I. Import the required jar

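This article uses the Maven dependency below. If you need to download the jar directly, reference 2 at the end of the article has the download links.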

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.12.1</version>
</dependency>

II. Testing in main

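1. Load a document from a URL (this article only tests this approach; for other ways to load a document, see reference 3). To keep it simple, we just visit the Baidu home page.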

        try {
            // First, connect to the URL through the Jsoup helper class and fetch the page
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            // Read the title from the parsed document
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            e.printStackTrace();
        }

Output:

百度一下，你就知道
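
Fetching a URL is only one way to load a document. As a minimal sketch of another one, jsoup can also parse HTML that is already in memory with `Jsoup.parse(String)`, which is handy for trying things out without a network request. The class name `OfflineParseDemo`, the helper `titleOf`, and the sample HTML are invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: parses an in-memory HTML string instead of fetching a URL.
class OfflineParseDemo {

    // Returns the text of the <title> element, or an empty string if there is none.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Sample markup made up for this example
        String html = "<html><head><title>Demo page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

The rest of the API (`select`, `attr`, `text`, and so on) works the same way on a document parsed from a string as on one fetched with `Jsoup.connect(...).get()`.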

2. Get the URL and text of each <a> tag

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            /* Select every <a> element that has an href attribute */
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println("link : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

Output (partial):

text : 百度首頁
link : javascript:;
text : 設置
link : https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
text : 登錄
link : http://news.baidu.com
text : 新聞
link : https://www.hao123.com
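
Not every href on a real page is a full URL; some are relative or protocol-relative. A small sketch, assuming the base URI of the page is known: `absUrl("href")` resolves such values against it. The sample HTML and the names `AbsoluteLinksDemo` and `absoluteLinks` are illustrative only. When a page is loaded with `Jsoup.connect(...).get()`, the base URI comes from the request URL, so no second argument is needed there.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: turns href values into absolute URLs.
class AbsoluteLinksDemo {

    // Collects the absolute form of every a[href] found in the given HTML.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup made up for this example
        String html = "<a href='/s?wd=jsoup'>search</a>"
                + "<a href='//news.baidu.com/'>news</a>"
                + "<a href='https://www.hao123.com'>hao123</a>";
        for (String url : absoluteLinks(html, "https://www.baidu.com/")) {
            System.out.println(url);
        }
    }
}
```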

3. Get the page's meta information

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
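            /* Read the page's meta information */
            // Take the first <meta> tag whose name attribute is "referrer" and read its content attribute
            // (first() returns null when nothing matches, so this assumes the tag exists)
            String keywords = doc.select("meta[name=referrer]").first().attr("content");
            System.out.println("Meta keyword : " + keywords);
            // Same lookup for the theme-color meta tag, using get(0) instead of first()
            String description = doc.select("meta[name=theme-color]").get(0).attr("content");
            System.out.println("Meta description : " + description);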

        } catch (IOException e) {
            e.printStackTrace();
        }

Output:

Meta keyword : always
Meta description : #2932e1
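
The snippet above calls `first()` and `get(0)` directly, which throws an exception if the page has no matching meta tag. A hedged alternative sketch: calling `attr(...)` on the selection itself returns an empty string when nothing matches, so a small helper can fall back to a default value. `MetaLookupDemo`, `metaContent`, and the sample HTML are names invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: reads meta tags without risking a NullPointerException.
class MetaLookupDemo {

    // Returns the content of <meta name="..."> or the fallback when the tag is absent.
    static String metaContent(Document doc, String name, String fallback) {
        // Elements.attr yields "" if no element matched or none has the attribute
        String value = doc.select("meta[name=" + name + "]").attr("content");
        return value.isEmpty() ? fallback : value;
    }

    public static void main(String[] args) {
        // Sample markup made up for this example
        Document doc = Jsoup.parse("<head>"
                + "<meta name='referrer' content='always'>"
                + "<meta name='theme-color' content='#2932e1'>"
                + "</head>");
        System.out.println(metaContent(doc, "referrer", "(none)"));
        System.out.println(metaContent(doc, "theme-color", "(none)"));
        System.out.println(metaContent(doc, "description", "(none)"));
    }
}
```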

4. Get the page's image information

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            /* Select <img> elements whose src contains .png, .jpg/.jpeg, or .gif (case-insensitive) */
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
            for (Element image : images) {
                System.out.println("src : " + image.attr("src"));
                System.out.println("height : " + image.attr("height"));
                System.out.println("width : " + image.attr("width"));
                System.out.println("alt : " + image.attr("alt"));
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

Output (partial):

src : //www.baidu.com/img/baidu_jgylogo3.gif
height :
width :
alt : 到百度首頁
src : //www.baidu.com/img/[email protected]
height :
width :
alt : 到百度首頁
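
To see what the `src~=` selector actually matches, here is a small offline sketch. `[attr~=regex]` keeps elements whose attribute value contains a match for the regular expression, and `(?i)` makes the match case-insensitive. The sample markup and the names `ImageSelectorDemo` and `imageSources` are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows which <img> tags the regex attribute selector keeps.
class ImageSelectorDemo {

    // Returns the src of every <img> whose src looks like a png, jpg/jpeg, or gif file.
    static List<String> imageSources(String html) {
        Document doc = Jsoup.parse(html);
        List<String> sources = new ArrayList<>();
        for (Element image : doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]")) {
            sources.add(image.attr("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Sample markup made up for this example: only the first and third tags should match
        String html = "<img src='/img/logo.PNG'>"
                + "<img src='/img/icon.svg'>"
                + "<img src='/img/photo.jpeg'>"
                + "<img alt='no src attribute'>";
        System.out.println(imageSources(html));
    }
}
```

An `<img>` without `height` or `width` attributes simply yields empty strings from `attr(...)`, which is why those fields are blank in the output above.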

5. Get form parameters

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
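            /* Read the form parameters */
            // First locate the form element by its id
            Element loginform = doc.getElementById("form");
            // Get the input tags; looking up by tag name returns a collection of elements
            Elements inputElements = loginform.getElementsByTag("input");
            // Iterate over the collection and read the attributes of each input tag (the same approach can locate any data you need)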
            for (Element inputElement : inputElements) {
                String key = inputElement.attr("name");
                String value = inputElement.attr("value");
                System.out.println("Param name: "+key+" \nParam value: "+value);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

Output (partial):

Param name: rsv_pq
Param value: 935e81bd0003a4fa
Param name: rsv_t
Param value: 995c4PHOYhjruVrvWzHXHuwlKcndZzriFTV+H6ELp2VaJNhvTjAP9/aule8
Param name: rqlang
Param value: cn
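
If the parameters are needed later, for example to build a request, it can be more convenient to collect them into a map than to print them. A minimal sketch under that assumption: the class `FormParamsDemo`, the helper `formParams`, and the sample form are hypothetical and not taken from the Baidu page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class: collects the named inputs of a form into an ordered map.
class FormParamsDemo {

    // Returns name -> value for every named <input> inside the form with the given id.
    static Map<String, String> formParams(String html, String formId) {
        Document doc = Jsoup.parse(html);
        Map<String, String> params = new LinkedHashMap<>();
        Element form = doc.getElementById(formId);
        if (form == null) {
            return params; // no such form: return an empty map instead of throwing
        }
        // input[name] skips inputs that have no name attribute (such as a bare submit button)
        for (Element input : form.select("input[name]")) {
            params.put(input.attr("name"), input.attr("value"));
        }
        return params;
    }

    public static void main(String[] args) {
        // Sample markup made up for this example
        String html = "<form id='form'>"
                + "<input type='hidden' name='rqlang' value='cn'>"
                + "<input type='text' name='wd'>"
                + "<input type='submit' value='Search'>"
                + "</form>";
        System.out.println(formParams(html, "form"));
    }
}
```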

Hands-on test

Scraping the Henan gaokao (college entrance exam) score lines for the past ten years from a website: http://www.gaokao.com/henan/fsx/

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Created with CosmosRay
 *
 * @author CosmosRay
 * @date 2019/6/24
 * Function:
 */
public class MyJSoup {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.gaokao.com/henan/fsx/").get();
            // Take the first table on the page and walk through its rows
            Element table = doc.getElementsByTag("table").first();
            Elements rows = table.getElementsByTag("tr");
            boolean headerPrinted = false;
            for (Element row : rows) {
                // The first row holds header cells (th); every later row holds data cells (td)
                Elements cells = row.getElementsByTag(headerPrinted ? "td" : "th");
                for (Element cell : cells) {
                    System.out.print(cell.text() + "  ");
                }
                System.out.println();
                headerPrinted = true;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

2018 2017 2016 2015 2014 2013 2012 2011 2010 2009
一本 547 516 517 513 536 519 557 562 532 552
二本 436 389 458 455 483 465 509 515 489 510
專科 200 180 183 180 200 - 360 393 397 417
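
Printing is fine for a quick look, but the rows usually need to end up in a data structure. The following is a sketch only, run against a hand-written two-row table rather than the live page: the numbers mirror the first columns of the output above, while the header label `Year`, the row label `Tier 1`, and the names `TableRowsDemo` and `tableRows` are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: reads the first table of a page into a list of rows.
class TableRowsDemo {

    // Each inner list holds the cell texts (th or td) of one table row, in order.
    static List<List<String>> tableRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        Element table = doc.select("table").first();
        if (table == null) {
            return rows; // page has no table
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // "th, td" matches header and data cells alike, so no flag is needed
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hand-written sample table, not the markup of the real page
        String html = "<table>"
                + "<tr><th>Year</th><th>2018</th><th>2017</th></tr>"
                + "<tr><td>Tier 1</td><td>547</td><td>516</td></tr>"
                + "</table>";
        for (List<String> row : tableRows(html)) {
            System.out.println(String.join("  ", row));
        }
    }
}
```

Selecting `th, td` in one query is a design choice that avoids tracking whether the header row has been seen; it assumes header and data cells can be treated the same way downstream.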

References: 1. Java HTML Parser (official jsoup site) 2. jsoup Cookbook (Chinese edition) 3. 易百教程 (Yiibai tutorials)
