This series of posts analyzes the webmagic framework; there is no hands-on tutorial content, but feel free to discuss practical problems, and I can offer technical support as well.
You are welcome to join QQ group 313557283 (newly created) so beginners can learn from each other.
Scheduler
Let's start with the interface:
package us.codecraft.webmagic.scheduler;

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;

/**
 * Scheduler is the part of url management.<br>
 * You can implement interface Scheduler to do:
 * manage urls to fetch
 * remove duplicate urls
 *
 * @author [email protected] <br>
 * @since 0.1.0
 */
public interface Scheduler {

    /**
     * add a url to fetch
     *
     * @param request request
     * @param task task
     */
    public void push(Request request, Task task);

    /**
     * get an url to crawl
     *
     * @param task the task of spider
     * @return the url to crawl
     */
    public Request poll(Task task);
}
Very simple: just two methods, one to put a url in and one to take one out.
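To see why these two methods are enough, here is a minimal self-contained sketch of the same contract. It is a hypothetical class, not part of webmagic, and it uses plain String urls instead of webmagic's Request so it compiles without the framework:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A stripped-down model of the Scheduler contract: push stores a url,
// poll hands one back in FIFO order, and an empty queue yields null,
// which is how a spider can detect "no more urls to crawl".
public class SimpleQueueScheduler {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void push(String url) {
        queue.add(url);
    }

    public String poll() {
        // poll() (unlike take()) does not block: it returns null when empty
        return queue.poll();
    }
}
```

The null-on-empty behavior matters: webmagic's Spider uses the empty poll result to decide whether to wait for new urls or shut down.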
Now let's look at QueueScheduler, the default Scheduler implementation:
package us.codecraft.webmagic.scheduler;

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Basic Scheduler implementation.<br>
 * Store urls to fetch in LinkedBlockingQueue and remove duplicate urls by HashMap.
 *
 * @author [email protected] <br>
 * @since 0.1.0
 */
public class QueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        queue.add(request);
    }

    @Override
    public Request poll(Task task) {
        return queue.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
Not much to see here on its own. The interesting part is the class it extends, DuplicateRemovedScheduler:
package us.codecraft.webmagic.scheduler;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.scheduler.component.DuplicateRemover;
import us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover;
import us.codecraft.webmagic.utils.HttpConstant;

/**
 * Remove duplicate urls and only push urls which are not duplicate.<br><br>
 *
 * @author [email protected]
 * @since 0.5.0
 */
public abstract class DuplicateRemovedScheduler implements Scheduler {

    protected Logger logger = LoggerFactory.getLogger(getClass());

    private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

    public DuplicateRemover getDuplicateRemover() {
        return duplicatedRemover;
    }

    public DuplicateRemovedScheduler setDuplicateRemover(DuplicateRemover duplicatedRemover) {
        this.duplicatedRemover = duplicatedRemover;
        return this;
    }

    @Override
    public void push(Request request, Task task) {
        logger.trace("get a candidate url {}", request.getUrl());
        if (shouldReserved(request) || noNeedToRemoveDuplicate(request) || !duplicatedRemover.isDuplicate(request, task)) {
            logger.debug("push to queue {}", request.getUrl());
            pushWhenNoDuplicate(request, task);
        }
    }

    // a retried request (CYCLE_TRIED_TIMES set in its extras) is always kept
    protected boolean shouldReserved(Request request) {
        return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
    }

    // POST requests are never deduplicated
    protected boolean noNeedToRemoveDuplicate(Request request) {
        return HttpConstant.Method.POST.equalsIgnoreCase(request.getMethod());
    }

    protected void pushWhenNoDuplicate(Request request, Task task) {
    }
}
In short: duplicate GET requests are dropped, while POST requests are never deduplicated, and retried requests always get back into the queue. The default DuplicateRemover is backed by a HashSet, not a Bloom filter. There is also a MonitorableScheduler interface that adds monitoring: it reports how many urls are still waiting to be crawled and how many have been seen in total.
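The three-way check in push() above can be sketched without the framework. The following hypothetical DedupScheduler is not webmagic code; it mirrors the same logic (retried requests pass, POSTs pass, GETs go through set-based dedup) using plain Strings:

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of DuplicateRemovedScheduler.push(): the set plays
// the role of HashSetDuplicateRemover, and the boolean flags stand in for
// shouldReserved() and noNeedToRemoveDuplicate().
public class DedupScheduler {

    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // returns true if the request was accepted into the queue
    public boolean push(String url, String method, boolean isRetry) {
        // retries skip dedup; POST bodies can differ per request, so POSTs skip too;
        // Set.add returns true only the first time a GET url is seen
        if (isRetry || "POST".equalsIgnoreCase(method) || seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false; // duplicate GET, dropped
    }

    public String poll() {
        return queue.poll();
    }
}
```

Note that `Set.add` doubles as the duplicate test, exactly the trick HashSetDuplicateRemover relies on: it returns false when the element was already present.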
There is also PriorityScheduler, which supports request priorities.
Extensions
BloomFilterDuplicateRemover uses a Bloom filter for deduplication; duplicate POSTs are supposedly filtered as well, though I haven't tested that.
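For intuition about the trade-off, here is a minimal, hypothetical Bloom filter; the real BloomFilterDuplicateRemover in webmagic-extension is implemented differently, but the idea is the same: k hash probes per url into a bit array, so membership tests can give false positives (a url wrongly skipped) but never false negatives (a seen url forgotten), while using far less memory than a HashSet of url strings:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: each url sets `hashes` bits; a url is
// "possibly seen" only if all of its bits are set.
public class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int hashes;

    public SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // derive k different probe positions from two base hash values
    private int probe(String url, int i) {
        int h1 = url.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String url) {
        for (int i = 0; i < hashes; i++) {
            bits.set(probe(url, i));
        }
    }

    public boolean mightContain(String url) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(probe(url, i))) {
                return false; // at least one bit unset: definitely never added
            }
        }
        return true; // all bits set: probably added (false positives possible)
    }
}
```

For a crawler this trade-off is usually acceptable: occasionally skipping a page by mistake is cheaper than keeping millions of url strings in memory.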
FileCacheQueueScheduler stores urls in files and is mainly for incremental crawling. For example: a site has 100 pages, you crawl 20 of them and shut the spider down at the end of the day; when you restart the next day, those 20 urls are loaded and deduplicated first, so they are not crawled again.
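That restart-and-skip behavior can be modeled with a tiny file-backed set. This is a hypothetical sketch, far simpler than FileCacheQueueScheduler itself: every accepted url is appended to a file, and a new instance reloads the file so previously seen urls are rejected:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical file-backed dedup: the in-memory set is the working copy,
// the file is the durable record that survives a crawler restart.
public class FileBackedDedup {

    private final Path file;
    private final Set<String> seen = new HashSet<>();

    public FileBackedDedup(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            // "restart" case: reload every url recorded by a previous run
            seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    // returns true if the url is new; new urls are appended to the file
    public boolean push(String url) throws IOException {
        if (!seen.add(url)) {
            return false; // already crawled in this run or a previous one
        }
        Files.write(file,
                (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }
}
```

Appending one line per url keeps writes cheap; the cost is paid once at startup when the whole file is read back into memory.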
RedisScheduler keeps the queue in Redis, so multiple spider instances can share it.
RedisPriorityScheduler keeps the queue in Redis and adds priority support.
Summary
At this point we have analyzed almost every module: we know how the framework works, we have read the source, and we know how to use it correctly. Next up is the final post in the series.