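Most users on the education network cannot reach websites outside the country directly (mainly because their schools block them), and if they really need to, the only way is through a proxy. Today I needed to crawl some foreign websites and found that every fetch failed. After some checking, it turned out that a proxy had to be configured. The setup is as follows:
Add the following inside the <configuration> element of /conf/nutch-site.xml: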
<property>
<name>http.proxy.host</name>
<value>***.***.***.***</value>
<description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
<description>The proxy port.</description>
</property>
<property>
<name>http.proxy.username</name>
<value></value>
<description>Username for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
NOTE: For NTLM authentication, do not prefix the username with the
domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
</description>
</property>
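The description of http.proxy.username says that 'protocol-httpclient' must be listed in 'plugin.includes' before proxy authentication can work. A minimal sketch of that override, also placed in nutch-site.xml, might look like the fragment below. The plugin list shown is an assumption for illustration only: the default value differs between Nutch versions, so copy the 'plugin.includes' value from your own nutch-default.xml and only swap protocol-http for protocol-httpclient.

```xml
<!-- Assumed plugin list, for illustration only. Start from the
     plugin.includes value in your own nutch-default.xml and
     replace protocol-http with protocol-httpclient. -->
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming the plugins to load.
protocol-httpclient is included so that the proxy username can be
used for basic, digest or NTLM authentication.
</description>
</property>
```

If the proxy does not ask for authentication, this override should not be needed, and setting http.proxy.host and http.proxy.port is enough.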