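Learning Java: Web Crawlers
0x00 Preface
With the Java basics summarized, writing a crawler is good practice, and there is a lot to learn from it.
0x01 Crawler structure and concepts
A crawler, more formally called data collection and usually known in English as a spider, is a program that collects data from the internet fully automatically.
What a crawler has to do is simulate normal network requests; clicking a link on a website, for example, is one network request.
Crawlers are also useful in penetration testing, for example for bulk-collecting the external links on a site, or the usernames and phone numbers of forum posters. Collecting these by hand would be far too slow.
Overall, a crawler's workflow is: send the request, filter the response (that is, extract the data), then store the extracted content.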
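The three stages can be sketched with the JDK alone. This is only an illustration under stated assumptions: CrawlPipelineSketch, extractTitle and the fake fetcher are made-up names, the regex stands in for a real HTML parser, and the "store" here is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the request -> extract -> store flow; not a real crawler.
class CrawlPipelineSketch {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Extraction step: return the trimmed <title> text, or "" if there is none.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Runs the three stages for every URL and returns the stored rows.
    static List<String> crawl(List<String> urls, Function<String, String> fetcher) {
        List<String> store = new ArrayList<>();
        for (String url : urls) {
            String html = fetcher.apply(url);   // 1. request (injected, so no network is needed here)
            String title = extractTitle(html);  // 2. extract
            store.add(url + "\t" + title);      // 3. store (in memory; a file or database in practice)
        }
        return store;
    }

    public static void main(String[] args) {
        // Fake fetcher standing in for a real HTTP request.
        Function<String, String> fake =
                url -> "<html><head><title>Page " + url + "</title></head></html>";
        List<String> urls = Arrays.asList("https://example.com/1", "https://example.com/2");
        for (String row : crawl(urls, fake)) {
            System.out.println(row);
        }
    }
}
```

Passing the fetcher in as a function keeps the extraction and storage steps testable without touching the network; the real request code is covered in the next section.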
0x02 Crawler requests
The Xianzhi forum (xz.aliyun.com) is used for the demonstrations here.
GET request
package is.text;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
public class http1get {
    public static void main(String[] args) {
        HttpGet httpGet = new HttpGet("https://xz.aliyun.com/?page=1"); // create the GET request object
        // try-with-resources closes the response and the client even if execute() throws;
        // the original finally block called response.close() on a possibly null response
        try (CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient object
             CloseableHttpResponse response = client.execute(httpGet)) { // send the GET request
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
                System.out.println(httpGet);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Method notes:
createDefault
public static CloseableHttpClient createDefault()
Creates a CloseableHttpClient instance with the default configuration.
GET request with parameters:
package is.text;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.net.URISyntaxException;
public class http1get {
    public static void main(String[] args) throws URISyntaxException {
        URIBuilder uriBuilder = new URIBuilder("https://xz.aliyun.com/"); // set the address with URIBuilder
        uriBuilder.setParameter("page", "2"); // set the query parameter
        HttpGet httpGet = new HttpGet(uriBuilder.build()); // create the GET request object
        // resulting URI: https://xz.aliyun.com/?page=2
        // try-with-resources closes the response and the client even if execute() throws
        try (CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient object
             CloseableHttpResponse response = client.execute(httpGet)) { // send the GET request
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
                System.out.println(httpGet);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
POST request
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
public class httppost {
    public static void main(String[] args) {
        HttpPost httpPost = new HttpPost("https://xz.aliyun.com/"); // create the POST request object
        // try-with-resources closes the response and the client, which the original never did
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(httpPost)) { // send the POST request
            String s = EntityUtils.toString(response.getEntity());
            System.out.println(s);
            System.out.println(httpPost);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
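When no parameters are sent, GET and POST requests are made in almost the same way. The only difference is the request object: a GET request is created with HttpGet, a POST request with HttpPost.
POST request with parameters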
package is.text;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class httpparams {
    public static void main(String[] args) throws IOException {
        HttpPost httpPost = new HttpPost("https://xz.aliyun.com/"); // create the request object
        List<NameValuePair> params = new ArrayList<NameValuePair>(); // list holding the parameters to send
        params.add(new BasicNameValuePair("page", "3"));
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf-8"); // create the form entity from params
        httpPost.setEntity(formEntity); // attach the form content to the POST request
        // try-with-resources closes the response and the client when done
        try (CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient object
             CloseableHttpResponse response = client.execute(httpPost)) {
            String s = EntityUtils.toString(response.getEntity());
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(httpPost);
        }
    }
}
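It can help to see what such a form entity carries. As its name suggests, UrlEncodedFormEntity sends the pairs as an application/x-www-form-urlencoded body; the sketch below builds a body of that shape with the JDK only. FormBodySketch and its encode method are made up for illustration, and the exact escaping HttpClient applies may differ in details.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing how name/value pairs become a form-encoded body.
class FormBodySketch {
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&'); // pairs are separated by '&'
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("page", "3");
        params.put("q", "java crawler");
        System.out.println(encode(params)); // prints page=3&q=java+crawler
    }
}
```

Reserved characters such as & and = inside a value are escaped, so a value can never be mistaken for an extra parameter.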
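Connection pool
Creating a new HttpClient for every request means connections are constantly created and destroyed; a connection pool solves this.
Create a connection pool object:
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
Common methods:
PoolingHttpClientConnectionManager class
public void setMaxTotal(int max)
Sets the maximum total number of connections.
public void setDefaultMaxPerRoute(int max)
Sets the maximum number of concurrent connections per route (per target host).
HttpClients class
Common methods:
createDefault()
Creates a CloseableHttpClient instance with the default configuration.
custom()
Creates a builder object for constructing a customized CloseableHttpClient instance.
Connection pool code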
package is.text;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
public class PoolHttpGet {
    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100); // maximum total number of connections
        cm.setDefaultMaxPerRoute(100); // maximum concurrent connections per host
        doGet(cm);
        doGet(cm);
    }
    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // the client is not closed here, since closing it would also shut down the shared connection manager
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.baidu.com"); // the URL must include the scheme
        // close the response so that its connection is released
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            String s = EntityUtils.toString(response.getEntity(), "utf-8");
            System.out.println(s.length());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
HttpClient request configuration
package is.text;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;
public class gethttp1params {
    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // maximum time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the connection manager
                .setSocketTimeout(500)            // maximum time to wait for data during transfer
                .build();
        httpGet.setConfig(config);
        // try-with-resources closes the response and the client when done
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(httpGet)) {
            String s = EntityUtils.toString(response.getEntity());
            System.out.println(s);
        }
    }
}
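0x03 Crawler data extraction
jsoup
jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.
Its main features are:
- parsing HTML from a URL, a file, or a string;
- finding and extracting data with DOM traversal or CSS selectors;
- manipulating HTML elements, attributes, and text.
Here is some code that fetches the forum's title: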
package Jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;
import java.net.URL;
public class JsoupTest1 {
    @Test
    public void testUrl() throws Exception {
        Document doc = Jsoup.parse(new URL("https://xz.aliyun.com/"), 10000); // request URL and timeout in milliseconds
        String title = doc.getElementsByTag("title").first().text(); // get the content of the title tag
        System.out.println(title);
    }
}
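Here first() returns the first matching element, and text() returns the text inside the tag.
DOM traversal
getElementById: find an element by id
getElementsByTag: find elements by tag name
getElementsByClass: find elements by class
getElementsByAttribute: find elements by attribute
Crawling a Xianzhi forum article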
package Jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;
import java.io.IOException;
import java.net.URL;
public class HttpDomTest {
    @Test
    public void TestDom() throws IOException {
        Document doc = Jsoup.parse(new URL("https://xz.aliyun.com/t/8091"), 10000);
        String topic_content = doc.getElementById("topic_content").text(); // article body
        String title = doc.getElementsByClass("content-title").first().text(); // article title
        System.out.println("title: " + title);
        System.out.println("topic_content: " + topic_content);
    }
}
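Crawling all articles on 10 pages
Getting data from an element:
1. Get the element's id: id
2. Get the element's class name: className
3. Get the value of an attribute: attr
4. Get all attributes: attributes
5. Get the text content: text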
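The five accessors listed above can be tried on a small inline fragment, without any network access. The HTML string and the ElementDataSketch class are invented for illustration; only the jsoup methods themselves (id, className, attr, attributes, text) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example; the HTML fragment is made up for illustration.
class ElementDataSketch {
    static final String HTML =
            "<div><a id=\"first\" class=\"topic-title\" href=\"/t/1\">Hello jsoup</a></div>";

    // Parse the fragment and return the <a> element by its id.
    static Element link() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementById("first");
    }

    public static void main(String[] args) {
        Element a = link();
        System.out.println(a.id());          // 1. the id attribute
        System.out.println(a.className());   // 2. the class attribute
        System.out.println(a.attr("href"));  // 3. the value of one attribute
        System.out.println(a.attributes());  // 4. all attributes of the element
        System.out.println(a.text());        // 5. the text inside the tag
    }
}
```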
package Jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;
import java.net.URL;
import java.util.List;
public class HttpDomTest10 {
    @Test
    public void xianzhi10test() throws Exception {
        String url = "https://xz.aliyun.com";
        Document doc = Jsoup.parse(new URL(url), 10000);
        Elements element = doc.getElementsByClass("topic-title"); // article links on the list page
        List<String> href = element.eachAttr("href"); // href of every link
        for (String s : href) {
            try {
                Document requests = Jsoup.parse(new URL(url + s), 100000); // request each article
                String topic_content = requests.getElementById("topic_content").text();
                String title = requests.getElementsByClass("content-title").first().text();
                System.out.println(title);
                System.out.println(topic_content);
            } catch (Exception e) {
                System.out.println("Failed to crawl " + url + s + ": " + e);
            }
        }
    }
}
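This successfully crawls one page of articles.
Since one page works, a for loop over the page numbers can send the same requests for 10 pages.
That is all it takes to crawl 10 pages: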
package Jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;
import java.net.URL;
import java.util.List;
public class HttpdomTEST2 {
    @Test
    public void xianzhi10test() throws Exception {
        String url = "https://xz.aliyun.com"; // no trailing slash, so url + href does not produce "//"
        for (int i = 1; i <= 10; i++) { // pages 1 to 10 inclusive
            String requesturl = url + "/?page=" + i;
            Document doc = Jsoup.parse(new URL(requesturl), 10000);
            Elements element = doc.getElementsByClass("topic-title");
            List<String> href = element.eachAttr("href");
            for (String s : href) {
                try {
                    Document requests = Jsoup.parse(new URL(url + s), 100000);
                    String topic_content = requests.getElementById("topic_content").text();
                    String title = requests.getElementsByClass("content-title").first().text();
                    System.out.println(title);
                    System.out.println(topic_content);
                } catch (Exception e) {
                    System.out.println("Failed to crawl " + url + s + ": " + e);
                }
            }
        }
    }
}
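The crawl code builds each article address by gluing the base URL and the href together, which only works while the two happen to fit (no doubled or missing slash). A sturdier option is to resolve the href against the page URL. The sketch below uses the JDK's java.net.URI for this; LinkResolveSketch is a made-up name, and it assumes the hrefs are site-relative paths such as /t/8091, as the article URL used earlier suggests.

```java
import java.net.URI;

// Hypothetical helper: resolve an href found on a page against that page's URL.
class LinkResolveSketch {
    // URI.create throws IllegalArgumentException for hrefs that are not valid URIs
    // (for example ones containing spaces), so real input may need cleaning first.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "https://xz.aliyun.com/";
        // Plain concatenation doubles the slash when the base ends with "/".
        System.out.println(base + "/t/8091");
        // Resolution handles the slash and ignores the query string of the page URL.
        System.out.println(resolve(base, "/t/8091"));
        System.out.println(resolve(base + "?page=2", "/t/8091"));
        // An href that is already absolute is returned unchanged.
        System.out.println(resolve(base, "https://example.com/a"));
    }
}
```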
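The crawler collects the links on each page and then requests every one of them. With a dozen or so articles per page, ten pages add up to well over a hundred requests, which is far too slow on a single thread, so multiple threads are needed.
Multithreaded article crawling with a configurable thread count and page count
Implementation class: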
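package Jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
public class Climbimpl implements Runnable {
    private final String url;
    private final int pages;
    // Shared page counter: each thread claims the next page that has not been crawled yet.
    // A single lock held around the whole loop would make the threads run one after another,
    // with every thread crawling all pages again, so the pages are handed out here instead.
    private final AtomicInteger nextPage = new AtomicInteger(1);
    public Climbimpl(String url, int pages) {
        this.url = url;
        this.pages = pages;
    }
    public void run() {
        int page;
        while ((page = nextPage.getAndIncrement()) <= this.pages) { // pages 1..pages inclusive
            System.out.println(page);
            try {
                String requesturl = this.url + "?page=" + page;
                Document doc = Jsoup.parse(new URL(requesturl), 10000);
                Elements element = doc.getElementsByClass("topic-title");
                List<String> href = element.eachAttr("href");
                for (String s : href) {
                    try {
                        URL link = new URL(new URL(this.url), s); // resolve the href against the base URL
                        Document requests = Jsoup.parse(link, 100000);
                        String topic_content = requests.getElementById("topic_content").text();
                        String title = requests.getElementsByClass("content-title").first().text();
                        System.out.println(title);
                        System.out.println(topic_content);
                    } catch (Exception e) {
                        System.out.println("Failed to crawl " + s + ": " + e);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}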
Main class:
package Jsoup;
public class TestClimb {
    public static void main(String[] args) {
        int Threadlist_num = 50; // number of threads
        String url = "https://xz.aliyun.com/"; // base URL
        int pages = 10; // number of pages to read
        Climbimpl climbimpl = new Climbimpl(url, pages); // one task object shared by all threads
        for (int i = 0; i < Threadlist_num; i++) {
            new Thread(climbimpl).start();
        }
    }
}
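Starting raw threads works, but a fixed-size thread pool is a common alternative that makes it easy to crawl each page exactly once and to collect the results. The following is a sketch of that alternative, not the approach used above: PagePoolSketch and crawlPages are made-up names, and the per-page work is passed in as a function so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical sketch: one task per page on a fixed-size thread pool.
class PagePoolSketch {
    static List<String> crawlPages(int pages, int threads, IntFunction<String> crawlPage) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int page = 1; page <= pages; page++) {
                final int p = page;
                Callable<String> task = () -> crawlPage.apply(p); // each page is submitted once
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // waits for the task; results stay in page order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a page task failed", e.getCause());
        } finally {
            pool.shutdown(); // no new tasks are accepted; submitted tasks may finish
        }
    }

    public static void main(String[] args) {
        // Stand-in for "download and parse page N"; a real crawler would call jsoup here.
        List<String> results = crawlPages(10, 4, page -> "page " + page + " done");
        results.forEach(System.out::println);
    }
}
```

With a pool, the thread count and the page count are independent: four threads can work through ten pages, and no page is requested twice.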
Selectors (select)
tagname: find elements by tag name, e.g. span
#id: find elements by id, e.g. #city_bj
.class: find elements by class name, e.g. .class_a
[attribute]: find elements that have an attribute, e.g. [abc]
[attr=value]: find elements by attribute value, e.g. [class=s_name]
Code examples:
Find elements by tag:
Elements span = document.select("span");
Find an element by id:
String str = document.select("#city_bj").text();
Find elements by class name:
str = document.select(".class_a").text();
Find elements by attribute:
str = document.select("[abc]").text();
Find elements by attribute value:
str = document.select("[class=s_name]").text();
Tag + attribute combination:
str = document.select("span[abc]").text();
Any combination:
str = document.select("span[abc].s_name").text();
Find specific direct children under a parent element:
str = document.select(".city_con > ul > li").text();
Find all direct children of a parent element:
str = document.select(".city_con > *").text();
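The snippets above assume a document variable that is never shown. A self-contained version might look like the following; the HTML fragment and the SelectorSketch class are invented so that each selector has something to match, and they are not taken from any real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example; the HTML is made up so that every selector above matches something.
class SelectorSketch {
    static final String HTML =
            "<div class=\"city_con\"><ul>"
            + "<li id=\"city_bj\">Beijing</li>"
            + "<li class=\"class_a\">Shanghai</li>"
            + "<li><span abc=\"1\" class=\"s_name\">Guangzhou</span></li>"
            + "</ul></div>";

    static Document document() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = document();
        System.out.println(document.select("span").size());                // elements by tag
        System.out.println(document.select("#city_bj").text());            // element by id
        System.out.println(document.select(".class_a").text());            // elements by class
        System.out.println(document.select("[abc]").text());               // elements that have an attribute
        System.out.println(document.select("[class=s_name]").text());      // elements by attribute value
        System.out.println(document.select("span[abc].s_name").text());    // tag + attribute + class
        System.out.println(document.select(".city_con > ul > li").size()); // li children of the ul child
        System.out.println(document.select(".city_con > *").size());       // all direct children of the div
    }
}
```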
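0x04 Conclusion
The Java crawlers in this post rely mainly on jsoup, which bundles most of what a basic crawler needs: fetching a page, parsing it, and extracting data.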