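Java Web Crawler Practice

Why crawl data?

In the era of big data, getting more data means mining, analysing, and filtering it. When a project needs a large amount of real data, for example, we often have to crawl it from websites. Data crawled from some sites cannot be used directly once it is saved to a database; it has to be cleaned and filtered first. Some of this data is very valuable.

Today we will use jsoup to crawl the data of an entire page.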

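What is jsoup?

jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its parser supports HTML5, so it parses HTML the same way current browsers do, while reducing parsing time and memory usage.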

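Main features of jsoup

Parse HTML from a URL, a file, or a string;
Find and extract data using DOM traversal or CSS selectors;
Manipulate HTML elements, attributes, and text.

For a detailed introduction, see https://www.cnblogs.com/zhangyinhua/p/8037599.html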

Next, let's write the code.

1. Add the dependency to pom.xml (creating the project is skipped here)

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>

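2. Fetch the page, using http://www.baidu.com/ (the seed URL in the code below) as the example

RequestAndResponseTool requests the HTML page at the given URL and wraps the response in our own Page object. It imports org.apache.commons.httpclient, which comes from Apache Commons HttpClient 3.x (commons-httpclient:commons-httpclient), so that dependency is needed in pom.xml as well.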

package com.etoak.crawl.page;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.IOException;

public class RequestAndResponseTool {

    public static Page sendRequstAndGetResponse(String url) {
        Page page = null;
        // 1. Create the HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // HTTP connection timeout: 5 s
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // 2. Create the GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Socket (read) timeout for the GET request: 5 s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Retry handler for the request
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
        // 3. Execute the HTTP GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code and log anything other than 200
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            // 4. Handle the HTTP response
            byte[] responseBody = getMethod.getResponseBody(); // read the body as a byte array
            // The Content-Type header may be missing, so guard against null
            Header contentTypeHeader = getMethod.getResponseHeader("Content-Type");
            String contentType = contentTypeHeader == null ? "" : contentTypeHeader.getValue();

            page = new Page(responseBody, url, contentType); // wrap the response in a Page
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the returned content is invalid
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return page; // null if the request failed
    }
}
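
For readers who would rather not add the Commons HttpClient dependency, the same fetch step can be approximated with the JDK's own HttpURLConnection. This is only a sketch under assumptions, not the article's implementation: the class JdkFetchSketch and its Fetched holder are hypothetical names, and, unlike the tool above, it returns null for any non-200 status.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical JDK-only stand-in for the fetch step (not the article's class). */
class JdkFetchSketch {

    /** Holds the same three values the article passes to its Page constructor. */
    static final class Fetched {
        final byte[] body;
        final String url;
        final String contentType;

        Fetched(byte[] body, String url, String contentType) {
            this.body = body;
            this.url = url;
            this.contentType = contentType;
        }
    }

    /** Returns the response, or null if the URL is not a valid HTTP(S) URL or the request fails. */
    static Fetched fetch(String url) {
        HttpURLConnection conn = null;
        try {
            URLConnection raw = new URL(url).openConnection();
            if (!(raw instanceof HttpURLConnection)) {
                return null; // e.g. file: or ftp: URLs
            }
            conn = (HttpURLConnection) raw;
            conn.setConnectTimeout(5000); // connection timeout: 5 s
            conn.setReadTimeout(5000);    // read timeout: 5 s
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                System.err.println("Request failed: HTTP " + status);
                return null;
            }
            String contentType = conn.getContentType(); // may be null
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return new Fetched(buffer.toByteArray(), url,
                        contentType == null ? "" : contentType);
            }
        } catch (IOException e) {
            // Malformed URL or network error
            System.err.println("Fetch failed: " + e.getMessage());
            return null;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        String target = args.length > 0 ? args[0] : "http://www.baidu.com/";
        Fetched page = fetch(target);
        System.out.println(page == null
                ? "nothing fetched"
                : page.body.length + " bytes, Content-Type: " + page.contentType);
    }
}
```

Returning null on failure means the caller has to check the result before saving or parsing it.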

The response contents: Page

package com.etoak.crawl.page;


import com.etoak.crawl.util.CharsetDetector;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.UnsupportedEncodingException;

/*
 * Page
 *   Holds the contents of the response that was fetched.
 */
public class Page {

    private byte[] content;
    private String html;        // page source as a string
    private Document doc;       // parsed DOM document
    private String charset;     // character encoding
    private String url;         // URL of the page
    private String contentType; // content type


    public Page(byte[] content , String url , String contentType){
        this.content = content ;
        this.url = url ;
        this.contentType = contentType ;
    }

    public String getCharset() {
        return charset;
    }
    public String getUrl(){return url ;}
    public String getContentType(){ return contentType ;}
    public byte[] getContent(){ return content ;}

    public void setContent(byte[] content) {
        this.content = content;
    }

    /**
     * Returns the page source as a string.
     *
     * @return the page source, or null if there is no content or the charset is unsupported
     */
    public String getHtml() {
        if (html != null) {
            return html;
        }
        if (content == null) {
            return null;
        }
        if (charset == null) {
            charset = CharsetDetector.guessEncoding(content); // guess the character encoding from the content
        }
        try {
            this.html = new String(content, charset);
            return html;
        } catch (UnsupportedEncodingException ex) {
            ex.printStackTrace();
            return null;
        }
    }

    /**
     * Returns the parsed DOM document, or null if there is no content or parsing fails.
     */
    public Document getDoc() {
        if (doc != null) {
            return doc;
        }
        String source = getHtml();
        if (source == null) {
            return null;
        }
        try {
            this.doc = Jsoup.parse(source, url);
            return doc;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }


}
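
Page relies on CharsetDetector.guessEncoding(content), which this article does not show. The sketch below is not that class; it is a simpler, swapped-in heuristic that assumes the charset is declared either in the Content-Type header or in a meta tag near the top of the page, and it falls back to UTF-8 otherwise. The class name CharsetGuessSketch and the two-argument signature are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical charset guesser based on declared charsets only (no byte statistics). */
class CharsetGuessSketch {

    // Matches charset=utf-8, charset="utf-8" and charset='utf-8'
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Tries the Content-Type header first, then the first 2 KB of the body, then UTF-8. */
    static String guessEncoding(String contentType, byte[] content) {
        String fromHeader = find(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        if (content != null) {
            int length = Math.min(content.length, 2048);
            // ISO-8859-1 maps every byte to a char, so ASCII markup survives decoding
            String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
            String fromMeta = find(head);
            if (fromMeta != null) {
                return fromMeta;
            }
        }
        return "UTF-8";
    }

    /** Returns the canonical name of the first supported charset declared in text, or null. */
    private static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();
                }
            } catch (IllegalArgumentException ignored) {
                // Illegal charset name: keep looking
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] body = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(guessEncoding("text/html", body));
    }
}
```

A declared charset can be wrong or missing, which is why detectors that inspect the bytes themselves exist; this sketch trades that accuracy for having no extra dependency.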

Saving page data: FileTool

package com.etoak.crawl.util;



import com.etoak.crawl.page.Page;

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

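/* Saves the pages and files that have already been visited. */
public class FileTool {

    private static String dirPath;


    /**
     * Builds the local file name from the URL and the content type
     * (the value of getMethod.getResponseHeader("Content-Type").getValue()),
     * replacing characters that are not allowed in file names.
     */
    private static String getFileNameByUrl(String url, String contentType) {
        // text/html: turn the whole URL into a file name
        if (contentType.indexOf("html") != -1) {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        }
        // Other types (images, CSS, ...): keep the last path segment, sanitised the same way
        int i = url.lastIndexOf("/");
        return url.substring(i + 1).replaceAll("[\\?/:*|<>\"]", "_");
    }

    /*
     * Creates the output directory if it does not exist yet.
     */
    private static void mkdir() {
        if (dirPath == null) {
            // "temp" under the classpath root, using the platform's path separator
            dirPath = FileTool.class.getResource("/").getPath() + "temp" + File.separator;
        }
        File fileDir = new File(dirPath);
        if (!fileDir.exists()) {
            fileDir.mkdirs();
        }
    }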

    /**
     * Saves the page's raw bytes to a local file under dirPath.
     */
    public static void saveToLocal(Page page) {
        if (page == null || page.getContent() == null) {
            return; // nothing was fetched, so there is nothing to save
        }
        mkdir();
        String fileName = getFileNameByUrl(page.getUrl(), page.getContentType());

        String filePath = dirPath + fileName;
        byte[] data = page.getContent();
        // try-with-resources closes the stream even if the write fails
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)))) {
            out.write(data);
            out.flush();
            System.out.println("File " + fileName + " has been saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
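
To make the file-naming rule concrete, here is a small standalone sketch that applies the same character class as getFileNameByUrl to a few URLs. The class name FileNameSketch and the example.com URLs are made up for illustration, and this sketch sanitises the last path segment of non-HTML resources as well.

```java
/** Hypothetical standalone version of the file-naming rule used by FileTool. */
class FileNameSketch {

    // Same character class as in the article: ? / : * | < > "
    private static final String ILLEGAL = "[\\?/:*|<>\"]";

    static String fileNameFor(String url, String contentType) {
        if (contentType != null && contentType.contains("html")) {
            // HTML pages: the whole URL becomes the file name
            return url.replaceAll(ILLEGAL, "_") + ".html";
        }
        // Other resources: only the last path segment is kept
        String lastSegment = url.substring(url.lastIndexOf('/') + 1);
        return lastSegment.replaceAll(ILLEGAL, "_");
    }

    public static void main(String[] args) {
        // prints http___www.baidu.com_.html
        System.out.println(fileNameFor("http://www.baidu.com/", "text/html; charset=utf-8"));
        // prints logo.png
        System.out.println(fileNameFor("http://example.com/img/logo.png", "image/png"));
        // prints b.png_x=1
        System.out.println(fileNameFor("http://example.com/a/b.png?x=1", "image/png"));
    }
}
```

Two different URLs can still map to the same file name under this rule (for example, two images called logo.png on different hosts), in which case the later download overwrites the earlier one.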

The entry point: the crawl loop and the pages to crawl.

package com.etoak.crawl.main;

import com.etoak.crawl.link.LinkFilter;
import com.etoak.crawl.link.Links;
import com.etoak.crawl.page.Page;
import com.etoak.crawl.page.PageParserTool;
import com.etoak.crawl.page.RequestAndResponseTool;
import com.etoak.crawl.util.FileTool;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.Set;

public class MyCrawler {

    /**
     * Initialises the URL queue with the seed URLs.
     *
     * @param seeds seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++){
            Links.addUnvisitedUrlQueue(seeds[i]);
        }
    }

    /**
     * The crawl loop.
     *
     * @param seeds seed URLs
     */
    public void crawling(String[] seeds) {

        // Initialise the URL queue
        initCrawlerWithSeeds(seeds);

        // Filter that accepts only links starting with http://www.baidu.com.
        // Note: the loop below never calls filter.accept(), so no link is actually filtered out.
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                return url.startsWith("http://www.baidu.com");
            }
        };

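        // Loop while there are links left to crawl and no more than 1000 pages have been visited
        while (!Links.unVisitedUrlQueueIsEmpty()  && Links.getVisitedUrlNum() <= 1000) {

            // Take the first URL from the unvisited queue
            String visitUrl = (String) Links.removeHeadOfUnVisitedUrlQueue();
            if (visitUrl == null){
                continue;
            }

            // Fetch the page for this URL
            Page page = RequestAndResponseTool.sendRequstAndGetResponse(visitUrl);

            // Mark the URL as visited, whether or not the request succeeded
            Links.addVisitedUrlSet(visitUrl);

            // The request failed (page is null): skip saving and link extraction
            if (page == null) {
                continue;
            }

            // Save the page to a local file
            FileTool.saveToLocal(page);

            // Queue the URLs extracted with the "link" selector
            Set<String> links = PageParserTool.getLinks(page,"link");
            for (String link : links) {
                Links.addUnvisitedUrlQueue(link);
                System.out.println("Added URL to crawl: " + link);
            }
            // Queue the URLs extracted with the "img" selector
            Set<String> links2 = PageParserTool.getLinks(page,"img");
            for (String link2 : links2) {
                Links.addUnvisitedUrlQueue(link2);
                System.out.println("Added URL to crawl: " + link2);
            }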

        }
    }


    // Entry point
    public static void main(String[] args) {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[]{"http://www.baidu.com/"});
    }
}
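
MyCrawler depends on Links and LinkFilter from com.etoak.crawl.link, neither of which is shown in this article. The sketch below is a guess at what they might look like, inferred only from the method names called above: a FIFO queue of URLs still to visit plus a set of URLs already visited. The Sketch-suffixed names and the de-duplication rule in addUnvisitedUrlQueue are assumptions, and the demo applies the filter before queueing, which the crawl loop above does not do.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

/** Assumed shape of LinkFilter, inferred from the anonymous class in MyCrawler. */
interface LinkFilterSketch {
    boolean accept(String url);
}

/** Assumed shape of Links: an unvisited FIFO queue plus a visited set. */
class LinksSketch {

    private static final Set<String> visitedUrlSet = new HashSet<String>();
    private static final LinkedList<String> unVisitedUrlQueue = new LinkedList<String>();

    static void addVisitedUrlSet(String url) {
        visitedUrlSet.add(url);
    }

    static int getVisitedUrlNum() {
        return visitedUrlSet.size();
    }

    /** Removes and returns the head of the queue, or null when the queue is empty. */
    static String removeHeadOfUnVisitedUrlQueue() {
        return unVisitedUrlQueue.poll();
    }

    /** Queues a URL unless it is blank, already visited, or already queued. */
    static void addUnvisitedUrlQueue(String url) {
        if (url != null && !url.trim().isEmpty()
                && !visitedUrlSet.contains(url)
                && !unVisitedUrlQueue.contains(url)) {
            unVisitedUrlQueue.add(url);
        }
    }

    static boolean unVisitedUrlQueueIsEmpty() {
        return unVisitedUrlQueue.isEmpty();
    }
}

class LinksDemo {
    public static void main(String[] args) {
        LinkFilterSketch filter = url -> url.startsWith("http://www.baidu.com");

        // Made-up links, as if they had been extracted from a page
        String[] found = {
            "http://www.baidu.com/a",
            "http://example.com/b",
            "http://www.baidu.com/a"
        };
        for (String url : found) {
            if (filter.accept(url)) {
                LinksSketch.addUnvisitedUrlQueue(url); // the duplicate is ignored
            }
        }
        while (!LinksSketch.unVisitedUrlQueueIsEmpty()) {
            String next = LinksSketch.removeHeadOfUnVisitedUrlQueue();
            LinksSketch.addVisitedUrlSet(next);
            System.out.println("visiting " + next);
        }
    }
}
```

Checking the visited set before queueing is what keeps the crawler from fetching the same page repeatedly when pages link to each other.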

Summary:

That completes a simple demo of a Java crawler, which mainly exercises some of jsoup's features. (A little progress every day.)
