A Simple Web Crawler in Java: Collecting Every Link, Image, and File on a Site (Approach + Code)

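I wrote this purely as a hobby. A couple of days ago I wanted to play with crawlers, but after a long search on Baidu I could not find a decent post, so I worked it out myself. In my own tests it gets through small sites without any trouble. It is single-threaded, so as for speed... well, you know.
(I never learned multithreading properly; I will add it later.)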

First, a few screenshots of the results.

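What you need to know

Network requests (which client you use is a matter of taste; this article uses OkHttp)
Reading and writing files with File
The Jsoup library (an HTML parser)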

Required jars

Note: OkHttp depends on Okio internally, so remember to add the Okio jar as well.

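The hard part

See the figure (a rough sketch I threw together).
If there is a technical difficulty here, it is simply how to traverse the whole site. Once you fetch one link, that page will contain dozens or even hundreds of further links, each of those pages contains more links, and so on, layer after layer. How do you collect them all?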

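Approach

1. Storing links: all links are stored in a file. As for why I do not use a collection: as far as I understand, crawlers generally avoid keeping their data in a collection, because once there are many links you run into out-of-memory errors. The collection ends up holding too much, and you still have to search it, so I do not recommend it for storage.
2. Reading links: every link found is written to a .txt file. Note that each link must be followed by \r\n (a line break), so that every link sits on its own line. That makes it easy to read the links back line by line later.
3. Traversing links:
   ① Fetch the child links of the home page and store them in the file, one per line.
   ② Define a variable num (initially -1) that records which link is currently being read. After each link is processed, if num < the number of lines in the link file, then num++.
   ③ In the method that parses a link, the target of each pass is line num of the file.

That is essentially all it takes to traverse the links.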

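An example
Suppose index.html contains 5 child links, a.html through e.html. After index.html is parsed, its 5 links are written to the file and num++ runs (num is now 0). Lines 1 to 5 of the file hold those 5 links. On the second call the link used is line num of the file, counting from 0, which is a.html.
Next a.html is parsed, and every hyperlink found in a.html is appended to the file.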
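The example above can be simulated without any network access. The following is only a minimal sketch under stated assumptions: `FrontierDemo`, the in-memory `SITE` map, and the extra page `f.html` are all made up for illustration, and it needs Java 11 or later. It also uses a plain loop instead of the recursive call in the article's code, since one recursive call per page could overflow the stack on a large site, and it skips links back to the start page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FrontierDemo {
    // Hypothetical "site": page -> links found on that page (stands in for fetch + parse).
    static final Map<String, List<String>> SITE = Map.of(
            "index.html", List.of("a.html", "b.html", "c.html", "d.html", "e.html"),
            "a.html", List.of("index.html", "f.html", "b.html"),
            "f.html", List.of("a.html"));

    /** Crawls SITE from start, using linkFile as the queue; returns the file's lines. */
    static List<String> crawl(String start, Path linkFile) throws IOException {
        String current = start;
        int num = -1; // cursor into the link file, as in the article
        while (current != null) {
            List<String> known =
                    new ArrayList<>(Files.readAllLines(linkFile, StandardCharsets.UTF_8));
            for (String link : SITE.getOrDefault(current, List.of())) {
                // Append only links that are not the start page and not yet in the file.
                if (!link.equals(start) && !known.contains(link)) {
                    Files.writeString(linkFile, link + "\r\n",
                            StandardCharsets.UTF_8, StandardOpenOption.APPEND);
                    known.add(link);
                }
            }
            num++; // move the cursor to the next line
            current = num < known.size() ? known.get(num) : null;
        }
        return Files.readAllLines(linkFile, StandardCharsets.UTF_8);
    }
}
```

Calling `FrontierDemo.crawl("index.html", Files.createTempFile("links", ".txt"))` leaves a.html through e.html on the first five lines and appends f.html as the sixth once a.html has been processed. The crawl stops when the cursor reaches the end of the file.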

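Figure:

The traversal pattern in the figure looks a bit like a Wi-Fi signal icon lying on its side.

Now for the code:

First create two classes: HttpUtil.java (the network request class, used to fetch a page's source) and Spider.java (the main crawler code).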

The HttpUtil.java class

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import okhttp3.Call;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

/**
 * Created by XieTiansheng on 2018/3/7.
 */

public class HttpUtil {
    private static OkHttpClient okHttpClient;

    static{
        // Short 1-second timeouts: slow pages fail fast and count as errors
        okHttpClient = new OkHttpClient.Builder()
                .readTimeout(1, TimeUnit.SECONDS)
                .connectTimeout(1, TimeUnit.SECONDS)
                .build();
    }


    /**
     * Fetches a page with a GET request.
     * @return the response body, or null if the request failed or was not successful
     */
    public static String get(String path){
        // Build the request
        Request request = new Request.Builder()
                .url(path)
                .build();
        // Create the "call" object
        Call call = okHttpClient.newCall(request);
        // try-with-resources closes the response so the connection is released
        try (Response response = call.execute()) {  // execute synchronously
            if (response.isSuccessful()) {
                return response.body().string();
            }
        } catch (IOException e) {
            System.out.println("Could not fetch link: "+path);
        }
        return null;
    }

}

I will not comment this one much; there are plenty of OkHttp tutorials on Baidu.
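Since the choice of HTTP client is a matter of taste, here is a sketch of the same helper written against the `java.net.http` client that ships with JDK 11 and later, which needs no extra jars. `JdkHttpUtil` and the 5-second timeouts are my own assumptions for illustration, not part of the original project.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JdkHttpUtil {
    // One shared client; the timeout values are illustrative.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Returns the body of a 2xx response, or null otherwise (same contract as HttpUtil.get). */
    static String get(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            int status = response.statusCode();
            return status >= 200 && status < 300 ? response.body() : null;
        } catch (IOException | IllegalArgumentException e) {
            // IllegalArgumentException covers malformed or non-http(s) URLs
            System.out.println("Could not fetch link: " + url);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }
}
```

Because it returns null on any failure, it could stand in for `HttpUtil.get(path)` in the crawler without changing the error handling there.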

The Spider.java class

First define the home page of the site to crawl and the File objects that store the links.

    public static String path = "http://www.yada.com.cn/";  // Yada company's official site
    public static int num = -1,sum = 0;
    /**
     * Four files: page links, image links, downloadable-file links, and error links
     */
    public static File aLinkFile,imgLinkFile,docLinkFile,errorLinkFile;

The method that parses an HTML page

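    /**
     * Parses one page, records its links, then moves on to the next link in the file.
     * @param path      target URL
     */
    public static void getAllLinks(String path){
        Document doc = null;
        try{
            doc = Jsoup.parse(HttpUtil.get(path));
        }catch (Exception e){
            // A bad link (e.g. a 404 page): fetching or parsing failed
            writeTxtFile(errorLinkFile, path+"\r\n");   // record it in the error-link file
            num++;
            if(sum>num){    // more links in the file (sum) than the current read position (num): keep going
                getAllLinks(getFileLine(aLinkFile, num));
            }
            return;
        }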
        // Select every <a> tag that has an href attribute, and every image
        Elements aLinks = doc.select("a[href]");
        Elements imgLinks = doc.select("img[src]");
        System.out.println("Crawling: "+path);
        for(Element element:aLinks){
            String url =element.attr("href");
            // Does the link contain either scheme prefix?
            if(!url.contains("http://")&&!url.contains("https://")){
                // If not, e.g. <a href="xitongshow.php?cid=67&id=113" />,
                // it needs the prefix: http://www.yada.com.cn/xitongshow.php?cid=67&id=113
                // otherwise parsing this link on a later pass would give a 404
                url = Spider.path+url;  // site home page + the relative link
            }
            // Continue only if the file does not already hold this link and the link
            // does not contain "javascript" (some links navigate via JS)
            if(!readTxtFile(aLinkFile).contains(url)
                    &&!url.contains("javascript")){
                // The URL must contain the site root, to avoid wandering onto other sites
                if(url.contains(Spider.path)){
                    // Is this <a> a downloadable file or a sub-page link?
                    if(url.contains(".doc")||url.contains(".xls")
                            ||url.contains(".exe")||url.contains(".apk")
                            ||url.contains(".mp3")||url.contains(".mp4")){
                        // Write the link text (used as the file name) followed by the file URL
                        writeTxtFile(docLinkFile, element.text()+"\r\n\t"+url+"\r\n");
                    }else{
                        // Write the link to the link file
                        writeTxtFile(aLinkFile, url+"\r\n");
                        sum++;  // one more link in total
                    }
                    System.out.println("\t"+element.text()+":\t"+url);
                }
            }
        }
        // Collect the image links on this page as well
        for(Element element:imgLinks){
            String srcStr = element.attr("src");
            if(!srcStr.contains("http://")&&!srcStr.contains("https://")){ // neither scheme prefix present
                srcStr = Spider.path+srcStr;
            }
            if(!readTxtFile(imgLinkFile).contains(srcStr)){
                // Write the image link to the image file
                writeTxtFile(imgLinkFile, srcStr+"\r\n");
            }
        }
        num++;
        if(sum>num){    // more links in the file (sum) than the current read position (num): keep going
            getAllLinks(getFileLine(aLinkFile, num));
        }
    }

This method parses an HTML page, extracts all of its links, and writes them to the files.
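One thing worth noting about the method above: it builds absolute URLs by gluing the href onto the site root, which only works for hrefs like `xitongshow.php?cid=67&id=113`. An href such as `/about.html` or `../img/a.png` would come out wrong. A possible alternative, sketched here with the JDK's `java.net.URI`, is to resolve each href against the URL of the page it was found on. `LinkResolver` and both method names are hypothetical, and the page URLs in the usage note are only examples.

```java
import java.net.URI;

class LinkResolver {
    /**
     * Resolves href against the URL of the page it appeared on.
     * Returns null for anything that is not an http(s) link (javascript:, mailto:, malformed).
     * Assumes pageUrl has a path component (at least "/").
     */
    static String resolve(String pageUrl, String href) {
        try {
            URI resolved = URI.create(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
                return null;
            }
            return resolved.toString();
        } catch (IllegalArgumentException e) {
            return null; // href is not a syntactically valid URI
        }
    }

    /** True if url is on the same host as rootUrl (compares hosts instead of using contains). */
    static boolean sameSite(String rootUrl, String url) {
        String rootHost = URI.create(rootUrl).getHost();
        String host = URI.create(url).getHost();
        return rootHost != null && rootHost.equalsIgnoreCase(host);
    }
}
```

For instance, `resolve("http://www.yada.com.cn/news/list.php", "/about.html")` gives `http://www.yada.com.cn/about.html`, and `resolve("http://www.yada.com.cn/", "javascript:void(0)")` gives null. Jsoup can also do the resolving itself: if the page is parsed with `Jsoup.parse(html, baseUri)`, then `element.absUrl("href")` returns the absolute URL.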

Two file helper methods (read / write)

    /**
     * Reads a whole file.
     * @param file  the file to read
     * @return  the file content, each line terminated by "\n"
     */
    public static String readTxtFile(File file){
        StringBuilder result = new StringBuilder();     // accumulated content
        String thisLine;    // the line read on each pass
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while((thisLine=reader.readLine())!=null){
                result.append(thisLine).append("\n");
            }
        } catch (IOException e) {   // also covers FileNotFoundException
            e.printStackTrace();
        }
        return result.toString();
    }

    /**
     * Appends text to a file.
     * @param file  the file to write to
     * @param urlStr    the text to append
     */
    public static void writeTxtFile(File file,String urlStr){
        // The second FileWriter argument (true) means append rather than overwrite
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file,true))) {
            writer.write(urlStr);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Simple file helpers, used to store the links found on each pass.
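The crawler checks for duplicates by reading the whole file into one string and calling `contains(url)`. That is a substring test, so a link such as `http://x/a` would be treated as already stored as soon as `http://x/ab` is in the file. One way around this, shown below purely as a sketch, is to compare whole lines. `LinkStore` is a hypothetical helper that still appends every link to the file but keeps a `HashSet` alongside it for exact-match lookups. This does hold the links in memory, which the article chose to avoid, so treat it as a trade-off rather than a drop-in fix.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

class LinkStore {
    private final Path file;
    private final Set<String> seen = new HashSet<>();

    /** Opens a store backed by file, loading any links already written to it. */
    LinkStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    /** Appends url on its own line if it is new; returns true if it was written. */
    boolean addIfNew(String url) throws IOException {
        if (!seen.add(url)) {
            return false; // exact match already stored
        }
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write(url);
            writer.write("\r\n"); // one link per line, as in the article
        }
        return true;
    }

    /** Number of distinct links stored so far. */
    int size() {
        return seen.size();
    }
}
```

With this, `addIfNew("http://x/ab")` followed by `addIfNew("http://x/a")` stores both links, and a second `addIfNew("http://x/a")` returns false.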

Getting a specific line of the file

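    /**
     * Returns the given line of the file; the crawler uses it to get the next link to crawl.
     * @param file  the file to read
     * @param num   the line index, counting from 0
     * @return  that line, or "" if the file has no such line
     */
    public static String getFileLine(File file,int num){
        String thisLine = "";
        int thisNum = 0 ;
        // try-with-resources closes the reader on every return path
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while((thisLine = reader.readLine())!=null){
                if(num == thisNum){
                    return thisLine;
                }
                thisNum++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }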

This method matters: it fetches the link at a given position in the file.
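For comparison, the same lookup can be written with `java.nio.file.Files.lines`, which streams the file lazily. This is only a sketch; `LineReader` and `lineAt` are names I made up, and returning an `Optional` instead of an empty string is a design choice of mine, so that "no such line" cannot be confused with a blank line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

class LineReader {
    /** Returns the line at the given zero-based index, or empty if the file is shorter. */
    static Optional<String> lineAt(Path file, long index) throws IOException {
        if (index < 0) {
            return Optional.empty();
        }
        // The stream holds the file open, so it is closed with try-with-resources.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.skip(index).findFirst();
        }
    }
}
```

Like `getFileLine`, this still scans from the top of the file on every call, so fetching every line this way costs quadratic time overall. That is fine for the small sites this crawler targets.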

Here is the complete code for this class

package com.xietiansheng.shangmao.cn;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Spider {

    public static String path = "http://www.yada.com.cn/";  // Yada company's official site
    public static int num = -1,sum = 0;
    /**
     * Four files: page links, image links, downloadable-file links, and error links
     */
    public static File aLinkFile,imgLinkFile,docLinkFile,errorLinkFile;
    /**
     * Parses one page, records its links, then moves on to the next link in the file.
     * @param path      target URL
     */
    public static void getAllLinks(String path){
        Document doc = null;
        try{
            doc = Jsoup.parse(HttpUtil.get(path));
        }catch (Exception e){
            // A bad link (e.g. a 404 page): fetching or parsing failed
            writeTxtFile(errorLinkFile, path+"\r\n");   // record it in the error-link file
            num++;
            if(sum>num){    // more links in the file (sum) than the current position (num): keep going
                getAllLinks(getFileLine(aLinkFile, num));
            }
            return;
        }
        Elements aLinks = doc.select("a[href]");
        Elements imgLinks = doc.select("img[src]");
        System.out.println("Crawling: "+path);
        for(Element element:aLinks){
            String url =element.attr("href");
            // Does the link contain either scheme prefix?
            if(!url.contains("http://")&&!url.contains("https://")){
                // If not, e.g. <a href="xitongshow.php?cid=67&id=113" />,
                // it needs the prefix: http://www.yada.com.cn/xitongshow.php?cid=67&id=113
                // otherwise the result is a 404
                url = Spider.path+url;
            }
            // Continue only if the file does not already hold this link and the link
            // does not contain "javascript" (some links navigate via JS)
            if(!readTxtFile(aLinkFile).contains(url)
                    &&!url.contains("javascript")){
                // The URL must contain the site root, to avoid wandering onto other sites
                if(url.contains(Spider.path)){
                    // Is this <a> a downloadable file or a sub-page link?
                    if(url.contains(".doc")||url.contains(".xls")
                            ||url.contains(".exe")||url.contains(".apk")
                            ||url.contains(".mp3")||url.contains(".mp4")){
                        // Write the link text (used as the file name) followed by the file URL
                        writeTxtFile(docLinkFile, element.text()+"\r\n\t"+url+"\r\n");
                    }else{
                        // Write the link to the link file
                        writeTxtFile(aLinkFile, url+"\r\n");
                        sum++;  // one more link in total
                    }
                    System.out.println("\t"+element.text()+":\t"+url);
                }
            }
        }
        // Collect the image links on this page as well
        for(Element element:imgLinks){
            String srcStr = element.attr("src");
            if(!srcStr.contains("http://")&&!srcStr.contains("https://")){ // neither scheme prefix present
                srcStr = Spider.path+srcStr;
            }
            if(!readTxtFile(imgLinkFile).contains(srcStr)){
                // Write the image link to the image file
                writeTxtFile(imgLinkFile, srcStr+"\r\n");
            }
        }
        num++;
        if(sum>num){    // more links left in the file: keep going
            getAllLinks(getFileLine(aLinkFile, num));
        }
    }

    /**
     * Reads a whole file.
     * @param file  the file to read
     * @return  the file content, each line terminated by "\n"
     */
    public static String readTxtFile(File file){
        StringBuilder result = new StringBuilder();     // accumulated content
        String thisLine;    // the line read on each pass
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while((thisLine=reader.readLine())!=null){
                result.append(thisLine).append("\n");
            }
        } catch (IOException e) {   // also covers FileNotFoundException
            e.printStackTrace();
        }
        return result.toString();
    }

    /**
     * Appends text to a file.
     * @param file  the file to write to
     * @param urlStr    the text to append
     */
    public static void writeTxtFile(File file,String urlStr){
        // The second FileWriter argument (true) means append rather than overwrite
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file,true))) {
            writer.write(urlStr);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

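    /**
     * Returns the given line of the file; the crawler uses it to get the next link to crawl.
     * @param file  the file to read
     * @param num   the line index, counting from 0
     * @return  that line, or "" if the file has no such line
     */
    public static String getFileLine(File file,int num){
        String thisLine = "";
        int thisNum = 0 ;
        // try-with-resources closes the reader on every return path
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while((thisLine = reader.readLine())!=null){
                if(num == thisNum){
                    return thisLine;
                }
                thisNum++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }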

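    /**
     * Counts the lines in a file (that is, how many links it holds).
     * @param file  the file to count
     * @return  the total number of lines
     */
    public static int getFileCount(File file){
        int count = 0;
        // try-with-resources closes the reader when counting is done
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while(reader.readLine()!=null){ // walk through the file line by line
                count++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return count;
    }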


    public static void main(String[] args) {
        aLinkFile = new File("D:/Spider/ALinks.txt");
        imgLinkFile = new File("D:/Spider/ImgLinks.txt");
        docLinkFile = new File("D:/Spider/DocLinks.txt");
        errorLinkFile = new File("D:/Spider/ErrorLinks.txt");
        // Keep the four File objects in an array so they can all be handled the same way
        File[] files = new File[]{aLinkFile,imgLinkFile,docLinkFile,errorLinkFile};
        try {
            for(File file: files){
                file.getParentFile().mkdirs();  // make sure D:/Spider exists
                if(file.exists())   // if the file already exists
                    file.delete();  // delete it first
                file.createNewFile();   // then create it fresh
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        long startTime = System.currentTimeMillis();    // start time
        Spider.getAllLinks(path);   // start crawling the target site
        // Each entry in docLinkFile spans two lines (name, then URL), hence the division by 2
        int docCount = getFileCount(docLinkFile)/2;
        System.out.println(""
                + "------------------ Crawl finished ------------------"
                + "\nTarget site: "+path
                + "\nTotal links: "+sum
                + "\nTotal images: "+getFileCount(imgLinkFile)
                + "\nTotal files: "+docCount);
        writeTxtFile(aLinkFile, "Total links: "+getFileCount(aLinkFile));
        writeTxtFile(imgLinkFile, "Total images: "+getFileCount(imgLinkFile));
        writeTxtFile(docLinkFile, "Total files: "+docCount);
        writeTxtFile(errorLinkFile, "Total problem links: "+getFileCount(errorLinkFile));
        long endTime = System.currentTimeMillis();    // end time
        System.out.println("\nRunning time: " + (endTime - startTime) + "ms");    // print the elapsed time
    }
}

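Wrapping up

The code is fairly basic.
It is good enough for crawling small sites.
It is purely for fun.
If you have questions, leave me a message or comment below.
It can also be combined on the server side with an Android client to get the effect you want.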
