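Some websites block web crawlers, JavaEye (ITeye) for example. A request is answered with an error page like this:
Your request has been denied
You may be using a web crawler to fetch pages from the ITeye website!
ITeye does not allow malicious page scraping by web crawlers. Please stop immediately!
If your crawler is not malicious and you would like ITeye to permit it, please contact the ITeye administrator to obtain authorization: webmasteriteye.com
If you really are using a browser and were mistakenly identified as a web crawler, please send us the "User Agent" string your browser sends so we can fix the error: webmasteriteye.com
This can be solved by setting a request header on the connection.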
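// pathString is the address of the page to fetch
URL url = new URL(pathString);
URLConnection con = url.openConnection();
// setDoOutput(true) is only needed when a request body is written; a plain GET does not require it
con.setDoOutput(true);
// Overrides the default "Java/<version>" User-Agent; the value is empty here, and a browser-like string can be used instead
con.setRequestProperty("User-Agent", "");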
Adding the setRequestProperty("User-Agent", ...) line above (shown in red in the original post) is all it takes.
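For a fuller picture of the URLConnection approach, here is a minimal sketch that fetches a page while sending a browser-like User-Agent. The class name UserAgentFetch, the fetch helper, the 5-second timeouts, and the choice to decode the body as UTF-8 are illustrative assumptions; the User-Agent string is the one used in the HttpClient snippets in this post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {
    // Assumed browser-like value; any realistic browser string should work.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) Gecko/20090803 Fedora/3.5";

    // Fetches an http(s) URL with the given User-Agent and returns the body decoded as UTF-8.
    static String fetch(String pathString, String userAgent) throws IOException {
        URL url = new URL(pathString);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        // Must be set before the connection is opened by getInputStream().
        con.setRequestProperty("User-Agent", userAgent);
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        try (InputStream in = con.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the page URL as the first command-line argument.
        System.out.println(fetch(args[0], USER_AGENT));
    }
}
```

Whether a given site accepts the request still depends on its own rules; the header only changes how the client identifies itself.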
Alternatively, when using HttpClient, add:
HttpClient httpClient=new HttpClient();
httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(50000);
httpClient.getParams().setParameter(HttpMethodParams.USER_AGENT,"Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) Gecko/20090803 Fedora/3.5");
That is all that is needed.
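The HttpClient used above is Apache Commons HttpClient 3.x. On Java 11 or later, the same idea can be sketched with the JDK's built-in java.net.http.HttpClient. Note that this is a substituted API, not the library this post uses, and the class name JdkClientFetch and the fetch helper are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JdkClientFetch {
    // Sends a GET request with the given User-Agent and returns the response body as text.
    static String fetch(String url, String userAgent) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(50000)) // same 50 s connect timeout as above
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Here the header is set per request rather than once on the client, so a shared User-Agent would have to be applied wherever requests are built.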
--------------------
Part of my code, for reference:
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class SuNingHtmlparseUtil {
    // Browser-like User-Agent sent with every request
    private static final String USER_AGENT =
            "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.1.2) Gecko/20090803 Fedora/3.5";

    private final HttpClient httpClient = new HttpClient();

    public SuNingHtmlparseUtil(String host) {
        httpClient.getHostConfiguration().setHost(host, 80, "http");
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(50000);
        httpClient.getParams().setParameter(HttpMethodParams.USER_AGENT, USER_AGENT);
    }

    public SuNingHtmlparseUtil() {
        // The host name is blank here; supply the target site's host
        this("");
    }

    public String getConnectAsString(String url, String charset) throws Exception {
        GetMethod get = new GetMethod(url);
        try {
            httpClient.executeMethod(get);
            // Decode the raw bytes with the requested charset instead of
            // re-encoding an already decoded string, which depends on the
            // platform default charset and can corrupt the text
            return new String(get.getResponseBody(), charset);
        } finally {
            // Return the connection to the connection manager
            get.releaseConnection();
        }
    }
}
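The charset used to decode a response body matters. The following sketch, with an assumed class name CharsetRoundTrip and an arbitrary sample string, shows why decoding the raw bytes with the right charset is the robust option: a round trip through ISO-8859-1 happens to preserve every byte, while a round trip through another charset (US-ASCII here) loses them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Decodes UTF-8 bytes with the wrong charset first, then tries to
    // repair the text by re-encoding with that same charset.
    static String repair(byte[] raw, Charset wrongCharset) {
        String misdecoded = new String(raw, wrongCharset);
        return new String(misdecoded.getBytes(wrongCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // two CJK characters, three UTF-8 bytes each
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Decoding the raw bytes directly is always correct.
        System.out.println(new String(raw, StandardCharsets.UTF_8).equals(original)); // true

        // ISO-8859-1 maps each byte to one char, so the bytes survive the round trip.
        System.out.println(repair(raw, StandardCharsets.ISO_8859_1).equals(original)); // true

        // US-ASCII replaces every byte above 0x7F, so the original bytes are lost.
        System.out.println(repair(raw, StandardCharsets.US_ASCII).equals(original)); // false
    }
}
```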