Preface
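What is PhantomJS? A headless browser. What is it good for? Some pages load dynamically and rely on JS to build their HTML tags, so crawling them directly does not return the text. Other scenarios need a screenshot of a page, for example for image review. PhantomJS handles both.
Download it from the official site. The Windows and Linux builds are different packages, so check which one you are downloading.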
Crawling text
In the method below, the crawlTextCommand argument passed in on Windows looks like this:
F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\crawlText.js
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * @param url              link of the site to crawl
 * @param crawlTextCommand text-crawling command
 *                         e.g. F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\crawlText.js
 * @return the crawled text content
 * @throws IOException if the process cannot be started or its output cannot be read
 */
public static String crawlText(String url, String crawlTextCommand) throws IOException {
    InputStream inputStream = null;
    Process process = null;
    try {
        Runtime runtime = Runtime.getRuntime();
        // Put a space between the command and the URL; without it the URL is glued onto the script path.
        String command = crawlTextCommand + " " + url;
        process = runtime.exec(command);
        inputStream = process.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
        StringBuilder builder = new StringBuilder();
        String content;
        // Lines are appended without a separator, so line breaks in the output are dropped.
        while ((content = reader.readLine()) != null) {
            builder.append(content);
        }
        return builder.toString();
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (process != null) {
            process.destroy();
        }
    }
}
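`Runtime.exec(String)` splits the command string on whitespace, so a URL or path containing a space ends up as several arguments. A minimal sketch of one alternative, assuming you can pass the executable and script paths separately: build the argument list yourself and hand it to `ProcessBuilder`. The class name `PhantomCommand`, its two methods, and the UTF-8 output charset are assumptions for illustration, not part of the original utility.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original utility class.
final class PhantomCommand {

    // Builds the argument list: executable, script, then the script's own arguments.
    // Each element is passed to the process as one argument, spaces included.
    static List<String> build(String phantomPath, String scriptPath, String... scriptArgs) {
        List<String> command = new ArrayList<>();
        command.add(phantomPath);
        command.add(scriptPath);
        command.addAll(Arrays.asList(scriptArgs));
        return command;
    }

    // Runs the command and returns everything it printed, one '\n' per line.
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout so only one pipe needs draining
        Process process = builder.start();
        // Assumption: the script prints UTF-8.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder out = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
            process.waitFor();
            return out.toString();
        } finally {
            process.destroy();
        }
    }
}
```

With this in place a call could look like `PhantomCommand.run(PhantomCommand.build(phantomPath, scriptPath, url))`, where `phantomPath` and `scriptPath` are the two halves of the command string shown earlier.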
Taking screenshots
In the method below, the screensHotCommand argument passed in on Windows looks like this:
F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\screensHot.js
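// Uses java.io.BufferedReader, InputStream, InputStreamReader and IOException.
/**
 * Note: whenever an external tool is invoked as a command, concurrency has to be
 * considered; otherwise some threads take their screenshot and others do not.
 *
 * @param url               link of the site to capture
 * @param path              image path plus file name, e.g. F:\\pic\\9.png
 * @param screensHotCommand screenshot command
 *                          e.g. F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\screensHot.js
 * @return the path the image was saved to
 * @throws IOException          if the process cannot be started or its output cannot be read
 * @throws InterruptedException if the thread is interrupted while waiting for the process to exit
 */
public static String screenshot(String url, String path, String screensHotCommand) throws IOException, InterruptedException {
    InputStream inputStream = null;
    Process process = null;
    try {
        Runtime runtime = Runtime.getRuntime();
        // Put a space between the command and the URL; without it the URL is glued onto the script path.
        String command = screensHotCommand + " " + url + " " + path;
        process = runtime.exec(command);
        inputStream = process.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
        // Drain the output so the process cannot stall on a full pipe.
        while (reader.readLine() != null) {
            // output is discarded
        }
        reader.close();
        // Wait for the process to exit before reporting the path.
        process.waitFor();
        return path;
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (process != null) {
            process.destroy();
        }
    }
}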
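Reading the output until end of stream blocks for as long as the process keeps running, so a page that never finishes loading would hold the caller indefinitely. As a sketch under that concern, the helper below waits for a bounded time and then checks the exit code. `ProcessWait` and `finishedCleanly` are hypothetical names, and treating exit code 0 as success assumes the screenshot script reports failures through `phantom.exit` with a nonzero code, which the script used here (not shown) may or may not do.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original utility class.
final class ProcessWait {

    // Returns true only if the process exits within the timeout with exit code 0.
    // A process that is still running when the timeout expires is killed.
    static boolean finishedCleanly(Process process, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!process.waitFor(timeout, unit)) {
            process.destroyForcibly(); // timed out: do not leave the process behind
            return false;
        }
        // Assumption: the script signals failure with a nonzero exit code.
        return process.exitValue() == 0;
    }
}
```

The process output still has to be consumed or redirected while waiting; otherwise a script that prints a lot can stall on a full pipe before it ever exits.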
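Problems in practice
When I simulated multi-threaded crawling earlier, some threads ended up without a screenshot. I later added synchronization at the point where the service makes the call.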
synchronized (this) {
    String path = SystemCallUtil.screenshot(
            url, name + suffix, phantom.getScreensHotCommand()
    );
    File file;
    int tmpTry = retry;
    do {
        Thread.sleep(sleep); // pause before checking
        file = new File(path);
    } while (!file.exists() && (--tmpTry) > 0); // check again, up to `retry` times in total
    if (!file.exists()) {
        // log the failure
        return false;
    }
}
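The polling loop inside the synchronized block can be pulled out into a small helper. The sketch below is one way to write it; `FilePoller` and `waitForFile` are hypothetical names, and unlike the loop above it checks for the file before the first sleep.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original service.
final class FilePoller {

    // Checks for the file up to `retries` times, sleeping `sleepMillis` after each miss,
    // then does one final check. Returns true as soon as the file exists.
    static boolean waitForFile(Path path, int retries, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < retries; attempt++) {
            if (Files.exists(path)) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return Files.exists(path);
    }
}
```

Inside the synchronized block this would replace the do/while loop with something like `if (!FilePoller.waitForFile(Paths.get(path), retry, sleep)) { return false; }`.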