I've been using Heritrix to crawl websites lately. One night I left Heritrix running, but it was strangely slow: after more than 8 hours it still hadn't finished crawling a single site. So I dug into the logs to find out why.
```
 -----===== SNOOZED QUEUES =====-----
SNOOZED#0:
    Queue us,imageshack,img245,+2 (p1)
      1 items
        wakes in: 99m19s74ms
      last enqueued: http://img245.xxx.us/img245/596/193183637x01ss500sclzzzbx0.jpg
        last peeked: http://img245.xxxx.us/img245/596/193183637x01ss500sclzzzbx0.jpg
     total expended: 12 (total budget: -1)
     active balance: 2988
     last(avg) cost: 1(1)
     totalScheduled fetchSuccesses fetchFailures fetchDisregards fetchResponses robotsDenials successBytes totalBytes fetchNonResponses
     2 1 0 0 1 0 59 59 12
     SimplePrecedenceProvider
     1
```
The SNOOZED queue held a few image URIs that had been sitting there for a very long time. Opening one in a browser showed the image no longer existed, so the URI simply stayed in the queue.
Next I looked at the Heritrix source. WorkQueueFrontier contains the following code; since the image doesn't exist, processing enters the needsRetrying block:
```java
if (needsRetrying(curi)) {
    // Consider errors which can be retried, leaving uri atop queue
    if (curi.getFetchStatus() != S_DEFERRED) {
        wq.expend(curi.getHolderCost()); // all retries but DEFERRED cost
    }
    long delay_sec = retryDelayFor(curi);
    curi.processingCleanup(); // lose state that shouldn't burden retry
    wq.unpeek(curi);
    // TODO: consider if this should happen automatically inside unpeek()
    wq.update(this, curi); // rewrite any changes
    if (delay_sec > 0) {
        long delay_ms = delay_sec * 1000;
        snoozeQueue(wq, now, delay_ms);
    } else {
        reenqueueQueue(wq);
    }
    // Let everyone interested know that it will be retried.
    appCtx.publishEvent(
        new CrawlURIDispositionEvent(this, curi, DEFERRED_FOR_RETRY));
    doJournalRescheduled(curi);
    return;
}
```
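The key call above is snoozeQueue: the whole queue is parked until now + delay_ms, and nothing else in that queue is fetched while it sleeps. A minimal sketch (not Heritrix code; names are illustrative) of that scheduling:

```java
// Illustrative sketch of snooze scheduling: a queue whose head URI keeps
// failing is asleep for the full retry delay on every attempt.
public class SnoozeSketch {
    // Wake time is simply "now" plus the retry delay, converted to ms.
    static long wakeTimeMs(long nowMs, long delaySec) {
        return nowMs + delaySec * 1000;
    }

    public static void main(String[] args) {
        // With the 900s default, each failed attempt parks the queue
        // for 15 minutes before the next try.
        System.out.println(wakeTimeMs(0, 900)); // 900000 ms = 15 minutes
    }
}
```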
The retryDelayFor method computes how long to wait before retrying a failed fetch:
```java
/**
 * Return a suitable value to wait before retrying the given URI.
 *
 * @param curi
 *            CrawlURI to be retried
 * @return millisecond delay before retry
 */
protected long retryDelayFor(CrawlURI curi) {
    int status = curi.getFetchStatus();
    return (status == S_CONNECT_FAILED || status == S_CONNECT_LOST ||
            status == S_DOMAIN_UNRESOLVABLE) ? getRetryDelaySeconds() : 0;
    // no delay for most
}

public int getRetryDelaySeconds() {
    return (Integer) kp.get("retryDelaySeconds");
}
```
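So only connection-level failures (connect failed, connection lost, unresolvable domain) incur the delay; everything else retries immediately. A self-contained sketch of that policy, with placeholder status values (the real constants live in Heritrix's FetchStatusCodes; the numbers below are assumptions for illustration only):

```java
// Sketch of the status-based retry-delay policy. Constant values are
// placeholders, NOT the real Heritrix FetchStatusCodes values.
public class RetryDelaySketch {
    static final int S_CONNECT_FAILED = -2;       // placeholder value
    static final int S_CONNECT_LOST = -3;         // placeholder value
    static final int S_DOMAIN_UNRESOLVABLE = -6;  // placeholder value

    static int retryDelaySeconds = 900; // Heritrix default

    // Mirrors retryDelayFor: only connection-level failures wait.
    static long retryDelayFor(int status) {
        return (status == S_CONNECT_FAILED || status == S_CONNECT_LOST
                || status == S_DOMAIN_UNRESOLVABLE) ? retryDelaySeconds : 0;
    }

    public static void main(String[] args) {
        System.out.println(retryDelayFor(S_CONNECT_FAILED)); // 900
        System.out.println(retryDelayFor(200));              // 0
    }
}
```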
Heritrix's default is to wait 900 seconds, i.e. 15 minutes. A URI that keeps failing can therefore be retried only 4 times per hour, or 32 times in 8 hours. No wonder the crawl never finished.
```java
/** for retryable problems, seconds to wait before a retry */
{
    setRetryDelaySeconds(900);
}
```
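The arithmetic behind the numbers above can be written out directly (method name is mine, not a Heritrix API):

```java
// How many retry attempts fit in a crawl window for a given retry delay.
public class RetryBudget {
    static long retriesWithin(long windowSeconds, long retryDelaySeconds) {
        return windowSeconds / retryDelaySeconds;
    }

    public static void main(String[] args) {
        // Default 900s delay: 32 attempts in 8 hours.
        System.out.println(retriesWithin(8 * 3600, 900)); // 32
        // After lowering the delay to 90s: 320 attempts in the same window.
        System.out.println(retriesWithin(8 * 3600, 90));  // 320
    }
}
```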
Once the cause was known the fix was simple: change the frontier configuration.
```xml
<!-- FRONTIER: Record of all URIs discovered and queued-for-collection -->
<bean id="frontier"
      class="org.archive.crawler.frontier.BdbFrontier">
    <!-- <property name="holdQueues" value="true" /> -->
    <!-- <property name="queueTotalBudget" value="-1" /> -->
    <!-- <property name="balanceReplenishAmount" value="3000" /> -->
    <!-- <property name="errorPenaltyAmount" value="100" /> -->
    <!-- <property name="precedenceFloor" value="255" /> -->
    <!-- <property name="queuePrecedencePolicy">
        <bean class="org.archive.crawler.frontier.precedence.BaseQueuePrecedencePolicy" />
    </property> -->
    <!-- <property name="snoozeLongMs" value="300000" /> -->
    <property name="retryDelaySeconds" value="90" />
    <!-- <property name="maxRetries" value="30" /> -->
    <!-- <property name="recoveryDir" value="logs" /> -->
    <!-- <property name="recoveryLogEnabled" value="true" /> -->
    <!-- <property name="maxOutlinks" value="6000" /> -->
    <!-- <property name="outboundQueueCapacity" value="50" /> -->
    <!-- <property name="inboundQueueMultiple" value="3" /> -->
    <!-- <property name="dumpPendingAtClose" value="false" /> -->
</bean>
```
This is a Heritrix 3 configuration; it lowers the retry delay to 90 seconds, i.e. only a minute and a half of waiting. In Heritrix 1 the same setting can be changed through the admin web UI.
After the change, crawl speed improved dramatically: a site that used to take 8 hours now finishes in about 2. If you also use Heritrix's incremental crawling, the next crawl of the same site is faster still. Problem solved.