[Java Crawler] HttpClient Quick Start

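Apache HttpClient is a Java HTTP client library that is often used for writing crawlers. It is fairly convenient to use, and the typical steps can be summarized as:

1) Instantiate an HttpClient object

2) Instantiate an HttpGet object

3) Set the request headers

4) Execute the GET request to fetch the data

5) Get the response entity

6) Parse the entity

7) Process the parsed content

8) Close the resources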

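The Maven coordinates for HttpClient are as follows (httpcore is declared explicitly here, although httpclient also brings it in as a transitive dependency):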

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.10</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.6</version>
        </dependency>

The basic usage looks like this:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpTest {
    public static void main(String[] args) {
        // Create the HttpGet request
        HttpGet httpGet = new HttpGet("https://www.tuicool.com/");
        // Disguise the request as coming from a browser
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36");
        // try-with-resources closes the response and then the client,
        // even when execute() throws, so no null check is needed in a finally block
        try (CloseableHttpClient request = HttpClients.createDefault();
             CloseableHttpResponse response = request.execute(httpGet)) {
            // Parse the response only when the status code is 200
            if (response.getStatusLine().getStatusCode() == 200) {
                // "GBK" is the default charset, used when the response does not declare one
                String content = EntityUtils.toString(response.getEntity(), "GBK");
                System.out.println(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
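
The example above passes "GBK" to EntityUtils.toString. In HttpClient 4.x that argument acts as a default: the charset declared in the entity's Content-Type is used when present, and the default applies only otherwise. To make that rule concrete, here is a minimal sketch that applies the same idea by hand to a raw Content-Type header value. It uses only the JDK; the class name CharsetPicker and the method pick are hypothetical helpers made up for illustration, not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an HttpClient class.
final class CharsetPicker {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or the fallback when none is usable.
    static Charset pick(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(pick("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(pick("text/html", StandardCharsets.UTF_8));
    }
}
```

In a real crawler the header value would come from the response, and pages that declare their encoding only in an HTML meta tag would still need separate handling.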

A crawler can of course also download images. We can use commons-io to simplify the I/O; its Maven coordinates are as follows:

<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>

The code for downloading an image is as follows:

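import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

public class pictureSpider {
    public static void main(String[] args) throws IOException {
        String url = "https://c-ssl.duitang.com/uploads/item/201504/15/20150415H2548_4vjy3.jpeg";
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36");
        // try-with-resources closes the response and the client even if the download fails
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(httpGet)) {
            HttpEntity entity = response.getEntity();

            // Download the image
            if (entity != null) {
                // getContentType() returns null when the header is missing
                if (entity.getContentType() != null) {
                    System.out.println("ContentType:" + entity.getContentType().getValue());
                }
                // Stream the response body straight into a local file
                try (InputStream input = entity.getContent()) {
                    FileUtils.copyToFile(input, new File("pic.jpg"));
                }
            }
        }
    }
}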
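
The download example always saves to the fixed name "pic.jpg", which would overwrite earlier files as soon as more than one image is fetched. One possible approach, sketched below under that assumption, is to derive the local file name from the last path segment of the URL. It uses only the JDK; the class name FileNameFromUrl and the method fileNameOf are hypothetical names made up for this illustration.

```java
import java.net.URI;

// Hypothetical helper for illustration only.
final class FileNameFromUrl {

    // Returns the last path segment of the URL, or the fallback when the
    // path is empty or ends with a slash. The query string is ignored.
    // URI.create throws IllegalArgumentException for a malformed URL.
    static String fileNameOf(String url, String fallback) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            return fallback;
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? fallback : name;
    }

    public static void main(String[] args) {
        String url = "https://c-ssl.duitang.com/uploads/item/201504/15/20150415H2548_4vjy3.jpeg";
        System.out.println(fileNameOf(url, "pic.jpg")); // prints 20150415H2548_4vjy3.jpeg
    }
}
```

The returned name could then be passed to new File(...) in place of the hard-coded "pic.jpg". Names taken from untrusted URLs should still be sanitized before they are used on disk.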

 
