The two crawler HtmlBeans are as follows.
The first HtmlBean fetches the novel's content:
@Gecco(
matchUrl="http://www.xs2345.com/read/18/18914/([^0{1}]|{index}).html",
pipelines="xybwPipeline"
)
/**
* Fetches the content of one chapter of the novel
*/
public class XYBW implements HtmlBean{
/**
*
*/
private static final long serialVersionUID = 2833184596055251729L;
@RequestParameter
private Long index;
@Text
@HtmlField(cssPath=".read_m > h1:nth-child(2) > a:nth-child(1)")
private String bookName;
@Text
@HtmlField(cssPath=".ydleft > h2:nth-child(2)")
private String chapterName;
@Html
@HtmlField(cssPath=".yd_text2")
private String content;
public Long getIndex() {
return index;
}
public void setIndex(Long index) {
this.index = index;
}
public String getBookName() {
return bookName;
}
public void setBookName(String bookName) {
this.bookName = bookName;
}
public String getChapterName() {
return chapterName;
}
public void setChapterName(String chapterName) {
this.chapterName = chapterName;
}
public String getContent() {
return content;
}
public void setContent(String content) {
if (content != null && !content.isEmpty()) {
content = content.replaceAll(" ", "");
content = content.replaceAll(" ", "");
content = content.replaceAll("<br/>", "");
content = content.replaceAll("<br>", "");
content = content.replaceAll("\\n{2}", "\n");
this.content = content;
}else{
this.content = "";
}
}
}
The second HtmlBean fetches the novel's table of contents:
@Gecco(
matchUrl="http://www.xs2345.com/read/18/18914/0.html",
pipelines="xybwIndexPipeline"
)
public class XYBWIndex implements HtmlBean{
private static final long serialVersionUID = 6065963771104230481L;
@Text
@HtmlField(cssPath=".ml_title > h1:nth-child(1)")
private String bookName;
@Text
@HtmlField(cssPath=".ml_main > dl > dd > a")
private List<String> chapterNameList;
@Href(click=true)
@HtmlField(cssPath=".ml_main > dl > dd > a")
private List<String> chapterList;
public String getBookName() {
return bookName;
}
public void setBookName(String bookName) {
this.bookName = bookName;
}
public List<String> getChapterNameList() {
return chapterNameList;
}
public void setChapterNameList(List<String> chapterNameList) {
this.chapterNameList = chapterNameList;
}
public List<String> getChapterList() {
return chapterList;
}
public void setChapterList(List<String> chapterList) {
this.chapterList = chapterList;
}
}
Note that each bean needs a corresponding Pipeline to process it; those are omitted here.
Starting the crawl:
HttpRequest request_xybw = new HttpGetRequest();
request_xybw.setUrl("http://www.xs2345.com/read/18/18914/0.html");
request_xybw.setCharset("gbk");
GeccoEngine.create()
.classpath("com.xfire")
.start(request_xybw)
.thread(1)
.interval(1000)
.mobile(false)
.start();
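Analysis:
The problem first appeared when the two beans were declared like this:
XYBW: matchUrl="http://www.xs2345.com/read/18/18914/{index}.html"
XYBWIndex: matchUrl="http://www.xs2345.com/read/18/18914/0.html"
At run time, the start URL http://www.xs2345.com/read/18/18914/0.html was matched first by the first HtmlBean's pattern, http://www.xs2345.com/read/18/18914/{index}.html, and the spider run then ended.
As a result, the HtmlBean that was supposed to fetch the table of contents was never processed.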
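The overlap can be reproduced outside the crawler. The following is a minimal sketch, not Gecco's actual matching code: `toPattern` and `firstMatch` are hypothetical helpers, and treating `{index}` as "one path segment" is an assumption made only for illustration. It shows that when dispatch stops at the first matching bean, the result for 0.html depends on which bean is checked first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstMatchDemo {
    // Hypothetical helper: replace each {name} placeholder with a capture
    // group for one path segment and quote everything else literally.
    static Pattern toPattern(String template) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile("\\{[^}]+\\}").matcher(template);
        int pos = 0;
        while (m.find()) {
            sb.append(Pattern.quote(template.substring(pos, m.start())));
            sb.append("([^/]+)");
            pos = m.end();
        }
        sb.append(Pattern.quote(template.substring(pos)));
        return Pattern.compile(sb.toString());
    }

    // Returns the name of the first registered bean whose template matches
    // the URL, or null if none does. Iteration follows insertion order.
    static String firstMatch(Map<String, String> beans, String url) {
        for (Map.Entry<String, String> e : beans.entrySet()) {
            if (toPattern(e.getValue()).matcher(url).matches()) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> beans = new LinkedHashMap<>();
        beans.put("XYBW", "http://www.xs2345.com/read/18/18914/{index}.html");
        beans.put("XYBWIndex", "http://www.xs2345.com/read/18/18914/0.html");
        // Both templates match 0.html; only the first one checked is reported.
        System.out.println(firstMatch(beans, "http://www.xs2345.com/read/18/18914/0.html"));
    }
}
```

With XYBW checked first, the table-of-contents URL is handed to XYBW and XYBWIndex never sees it, which is the behaviour described above.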
Changing XYBW's matchUrl to the following solved the problem:
matchUrl="http://www.xs2345.com/read/18/18914/([^0{1}]|{index}).html"
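Still, I think a better solution would be to process every HtmlBean that matches: change the Spider so that, instead of fetching a single match, it fetches an array of all matches. The line in question is:
// match the SpiderBean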
currSpiderBeanClass = engine.getSpiderBeanFactory().matchSpider(request);
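The idea of returning every match can be sketched independently of Gecco. The code below is only an illustration under assumptions: `REGISTRY` and `matchAll` are hypothetical names, the regular expressions are hand-written stand-ins for the two matchUrl values, and nothing here is Gecco's API. A real change would have to be made inside the framework's Spider and bean factory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class AllMatchesDemo {
    // Hypothetical registry: bean name -> URL pattern (hand-written regexes).
    static final Map<String, Pattern> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put("XYBW",
            Pattern.compile("http://www\\.xs2345\\.com/read/18/18914/([^/]+)\\.html"));
        REGISTRY.put("XYBWIndex",
            Pattern.compile("http://www\\.xs2345\\.com/read/18/18914/0\\.html"));
    }

    // Collects every bean whose pattern matches, instead of stopping at the first.
    static List<String> matchAll(String url) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : REGISTRY.entrySet()) {
            if (e.getValue().matcher(url).matches()) {
                matched.add(e.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // prints [XYBW, XYBWIndex]
        System.out.println(matchAll("http://www.xs2345.com/read/18/18914/0.html"));
        // prints [XYBW]
        System.out.println(matchAll("http://www.xs2345.com/read/18/18914/5.html"));
    }
}
```

One consequence of this design is that 0.html would also be rendered as an XYBW bean, so the chapter pipeline would need to tolerate a page that has no chapter content.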