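Scenario: the page's data format is fairly simple. The goal is only to crawl the novel's text and save it locally, and no anti-crawling measures were encountered.
The dependencies used are as follows: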
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
Page source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>第十一章 末代皇帝&最後一個克格勃(3)-龍族3·黑月之潮(中)</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="第十一章 末代皇帝&最後一個克格勃(3)-龍族3·黑月之潮(中)" />
<meta name="description" content="第十一章 末代皇帝&最後一個克格勃(3)-龍族3·黑月之潮(中)" />
<!--[if lt IE 9]>
<script src=/css3-mediaqueries.js></script>
<![endif]-->
<link rel="stylesheet" type="text/css" media="screen and (max-width: 900px)" href="/wap.css" />
<link rel="stylesheet" type="text/css" media="screen and (min-width: 900px)" href="/dcy.css" />
<link rel="alternate" type="application/rss+xml" href="http://www.********.cc/longzu3heiyuezhichaozhong/feed.asp?cmt=371" title="Comments Feed for 第十一章 末代皇帝&最後一個克格勃(3)" />
<script src="http://www.********.cc/longzu3heiyuezhichaozhong/script/common.js" type="text/javascript"></script>
<script src="http://www.********.cc/longzu3heiyuezhichaozhong/function/c_html_js_add.asp" type="text/javascript"></script>
</head>
<body><div class="v"><h1 align="center" class="STYLE1">龍族3·黑月之潮(中)</h1></div>
<div class="site clearfix"><span style="float:right;"> <a href="http://www.********.cc/longzu3heiyuezhichaozhong/" >返回首頁</a></span><a href="http://www.********.cc/longzu3heiyuezhichaozhong/">龍族3·黑月之潮(中)</a> > 第十一章 末代皇帝&最後一個克格勃(3)</div>
<div class="chaptertitle clearfix">
<h1>第十一章 末代皇帝&最後一個克格勃(3)</h1>
</div>
<div id="p_adtop" class="clearfix">
<div id="p_ad_t1"><script language="javascript" type="text/javascript" src="/ad1.js"></script></div>
<div id="p_ad_t2"><script language="javascript" type="text/javascript" src="/ad1.js"></script></div>
<div id="p_ad_t4"></div>
</div>
<div class="bookcontent clearfix" id="BookText"> 御神刀斬落,帶着大片的弧光。橘正宗血光飛濺,戰慄着倒地。<br/><br/> 懷刃插在地上,橘正宗用來握刀的右手五指盡落,因此他沒能把懷劍插進自己的肚子裏。<br/><br/> 源稚生面無表情地收刀回鞘,從懷裏抽出手帕沿着斷指根部紮緊來止血。他的刀術極精,一刀斬斷橘正宗的五指,卻還留下短短的指根來止血。<br/><br/> <br/><br/> 1937年12月,南京被攻克,之後的六個星期中。城裏有三十萬平民被屠殺。南京城裏西方橋民的證詞是審判戰犯的關鍵證據,一位法國天主教堂的修女說,日軍甚至衝進西方教堂開設的育嬰堂。強暴藏身在裏面的中國女人。老嬤嬤讓中國女人們穿上修女的衣服,祕密地帶他們出城。他們在江邊被日本軍隊攔截,藤原勝少校發現他們都是假修女,於是所有女人都遭到了強暴,反抗者被用刺刀刨開了肚子。沒有遭到侵害的只有帶隊的那位老嬤嬤,但她目睹了那血腥殘酷的一幕後無法忍受,於是開槍自殺。死前她詛咒說神會懲罰罪人,用雷電用火焰……”<br/><br/> 【THEEND】<br/><br/><div id="p_ad_t3"><script language="javascript" type="text/javascript" src="/xm.js"></script></div></div>
<!--content-->
<div id="p_ad_b1" class="clearfix">
</div>
<div class="bottomlink clearfix">
<div class="linkbtn clearfix"> <h2><a href="http://www.********.cc/longzu3heiyuezhichaozhong/370.html"><span>(快捷鍵:←)上一頁</span></a> <a href="http://www.********.cc/longzu3heiyuezhichaozhong/"><span>返回章節目錄(快捷鍵:回車)</span></a> <a href=""><span>下一頁(快捷鍵:→)</span></a></h2> </div>
</div>
<div class="bottomlink clearfix">
<div style="display:none;" id="divAjaxComment"></div>
<div class="post" id="divCommentPost">
<p class="posttop"><a name="comment">發表評論:</a></p>
<form id="frmSumbit" target="_self" method="post" action="http://www.********.cc/longzu3heiyuezhichaozhong/cmd.asp?act=cmt&key=32c3ee99" >
<input type="hidden" name="inpId" id="inpId" value="371" />
<input type="hidden" name="inpArticle" id="inpArticle" value="" />
<input type="hidden" name="inpLocation" id="inpLocation" value="" />
<p><input type="text" name="inpName" id="inpName" class="text" value="" size="28" tabindex="1" /> <label for="inpName">名稱(必填)</label></p>
<p><input type="text" name="inpEmail" id="inpEmail" class="text" value="" size="28" tabindex="2" /> <label for="inpEmail">郵箱(可以不填寫)</label></p>
<!--<p><input type="text" name="inpHomePage" id="inpHomePage" class="text" value="" size="28" tabindex="3" /> <label for="inpHomePage">網站鏈接</label></p>-->
<p><label for="txaArticle">正文(留言最長字數:1000)</label></p>
<p>
<textarea name="txaArticle" id="txaArticle" onchange="GetActiveText(this.id);" onclick="GetActiveText(this.id);" onfocus="GetActiveText(this.id);" class="text" cols="50" rows="4" tabindex="5" style="width:80%;resize:none;" ></textarea>
</p>
<p><input name="btnSumbit" type="submit" tabindex="6" value="提交" onclick="JavaScript:return VerifyMessage()" class="button" /> <input type="checkbox" name="chkRemember" value="1" id="chkRemember" /> <label for="chkRemember">記住我,下次回覆時不用重新輸入個人信息</label></p>
<script language="JavaScript" type="text/javascript">objActive="txaArticle";ExportUbbFrame();</script>
</form>
<p class="postbottom">◎歡迎參與討論,請在這裏發表您的看法、交流您的觀點。</p>
<script language="JavaScript" type="text/javascript">LoadRememberInfo();</script>
</div>
</div>
<div id="p_ad_b2" class="clearfix">
</div>
<!--頁腳-->
<div class="footer clearfix"> <span class="page-comment">
</span> <span class="fright">
<div id="pagebottom">
</div>
</span> <span class="fleft gray-link"></script>Copyright 2015-2017 <a href="http://www.********.cc/longzu3heiyuezhichaozhong/">龍族3·黑月之潮(中)</a> all rights reserved <script language="javascript" type="text/javascript" src="//js.users.51.la/19241152.js"></script>
</span></div>
<div id="allbottom">
</div>
</body>
</html>
The site address is withheld and replaced with ***. The code follows directly:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

/*
    Crawl a novel from a website
 */
public class CaptureDemo {

    public static void main(String[] args) {
        File file = new File("E:\\龍族3-黑月之潮(下).txt");
        for (int page = 345; page <= 360; page++) {
            String url = "http://www.********.cc/longzu3heiyuezhichaoxia/" + page + ".html";
            String bookContent = getBookContent(url);
            // getBookContent returns null on failure; skip the page instead of writing null
            if (bookContent == null) {
                System.out.println(url + " failed, skipped.");
                continue;
            }
            System.out.println(bookContent);
            saveToLocal(bookContent, file);
            System.out.println(url + " is over.");
        }
    }

    // Save the data to a local file
    private static String saveToLocal(String bookContent, File file) {
        // Append if the file exists, create it otherwise.
        // try-with-resources closes the writer even when write() throws.
        try (FileWriter fw = new FileWriter(file, true)) {
            fw.write(bookContent);
            return "success";
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "failed";
    }

    // Fetch the target content
    private static String getBookContent(String url) {
        // Both the client and the response are closed automatically
        try (CloseableHttpClient closeableHttpClient = HttpClients.createDefault();
             CloseableHttpResponse closeableHttpResponse = closeableHttpClient.execute(new HttpGet(url))) {
            // Print the response status
            System.out.println(closeableHttpResponse.getStatusLine());
            // Get the response entity
            HttpEntity entity = closeableHttpResponse.getEntity();
            if (entity == null) {
                return null;
            }
            // Read the page data as UTF-8
            String html = EntityUtils.toString(entity, "utf8");
            // Parse the page with Jsoup
            Document document = Jsoup.parse(html);
            // Target content
            Element bookText = document.getElementById("BookText");
            if (bookText == null) {
                return null;
            }
            // Chapter title
            Elements chaptertitle = document.getElementsByClass("chaptertitle");
            String headTitle = chaptertitle.text();
            // Replace each space in the extracted text with a line break
            String content = bookText.text().replaceAll(" ", "\n");
            return "\n" + headTitle + "\n" + content + "\n\n";
        } catch (IOException e) {
            // ClientProtocolException is a subclass of IOException
            e.printStackTrace();
        }
        return null;
    }
}
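One thing to watch in the save step: the page is decoded as UTF-8, but `FileWriter` writes with the platform default charset on JDKs before 18, so on a Windows machine the saved file may end up in a different encoding. Below is a minimal sketch of an append that always writes UTF-8, using only the standard library. The class name `Utf8AppendDemo`, the method `appendUtf8`, and the temporary file are assumptions for illustration and are not part of the original code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class Utf8AppendDemo {

    // Hypothetical helper: append text to the file as UTF-8,
    // creating the file first if it does not exist yet.
    static void appendUtf8(Path file, String text) throws IOException {
        Files.write(file,
                text.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the real target path used in the post
        Path file = Files.createTempFile("novel", ".txt");
        appendUtf8(file, "第一章\n");
        appendUtf8(file, "第二章\n");
        // Read the file back as UTF-8 to check both chapters were appended
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

If this approach is adopted, `saveToLocal` could call such a helper with `file.toPath()` in place of the `FileWriter`; the rest of the crawler stays the same.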
For study notes only.