Problems encountered when fetching a page's title and description

Original

2020-07-03 09:19

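A few days ago my company asked me to fill in the title and summary columns of the pages table in our database. A crawler had already populated the table with each page's URL, so all I had to do was take each URL, fetch the page content, and extract the fields I needed. That should have been easy, but a few problems came up along the way.
1. Code for fetching the page: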

To use the classes below, add the following dependencies to pom.xml:


	<dependency>
		<groupId>nekohtml</groupId>
		<artifactId>nekohtml</artifactId>
		<version>0.9.5</version>
	</dependency>
	<dependency>
		<groupId>commons-httpclient</groupId>
		<artifactId>commons-httpclient</artifactId>
		<version>3.0.1</version>
	</dependency>
	<dependency>
		<groupId>com.ibm.icu</groupId>
		<artifactId>icu4j</artifactId>
		<version>3.8</version>
	</dependency>


Method for detecting the page's character set:
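import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

	/**
	 * Determine the page encoding from the binary stream.
	 * @param is the stream holding the complete page content
	 * @return the name of the most likely charset, or null if none could be detected
	 */
	public String getCharset(InputStream is) {
		CharsetDetector detector = new CharsetDetector();
		// Ignore HTML/XML markup when analysing the bytes.
		detector.enableInputFilter(true);
		try {
			// setText(InputStream) needs a stream that supports mark/reset.
			detector.setText(new BufferedInputStream(is));
		} catch (IOException e) {
			e.printStackTrace();
			// Without input there is nothing to detect.
			return null;
		}

		// detect() returns the best match, or null when nothing matches.
		CharsetMatch match = detector.detect();
		return match == null ? null : match.getName();
	}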
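The CharsetDetector class analyses the bytes of the page and returns the most likely result. For example, Baidu Baike pages are all encoded as "gb2312", but the detected result is GB18030, which is a superset of that encoding.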
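The superset relationship means that a GB18030 answer is still usable for decoding a gb2312 page. Here is a minimal sketch of that idea using only the JDK; the class and method names are made up for illustration, and it assumes the JDK in use ships both charsets.

```java
import java.nio.charset.Charset;

final class CharsetSupersetDemo {
    /** Encodes the text as GB2312, then decodes the same bytes as GB18030. */
    static String roundTrip(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // "Baidu Baike" in Chinese, written with escapes to avoid source-encoding issues.
        String text = "\u767e\u5ea6\u767e\u79d1";
        // The text should survive the round trip unchanged.
        System.out.println(text.equals(roundTrip(text)));
    }
}
```

So when the detector reports GB18030 for a page that declares gb2312, decoding with the reported name should give the same text.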


Code for fetching the page content:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public static byte[] downloadContent(String url) {
		// Scratch buffer for a single read; the whole page accumulates in baos.
		byte[] buffer = new byte[1024 * 100];

		HttpClient httpClient = new HttpClient();
		GetMethod getMethod = new GetMethod(url);
		try {
			int rt = httpClient.executeMethod(getMethod);
			if (rt == HttpStatus.SC_OK) {
				int count;
				ByteArrayOutputStream baos = new ByteArrayOutputStream();
				InputStream responseBodyAsStream = getMethod.getResponseBodyAsStream();
				// Append every chunk; read() returns -1 at the end of the stream.
				while ((count = responseBodyAsStream.read(buffer, 0, buffer.length)) > -1) {
					baos.write(buffer, 0, count);
				}

				return baos.toByteArray();
			} else {
				return null;
			}
		} catch (Exception e) {
			// logger is assumed to be a logger field of the enclosing class.
			logger.error("error occurred while downloading page at: " + url, e);
			return null;
		} finally {
			// Releasing the connection also closes the response stream.
			getMethod.releaseConnection();
		}
	}

Now take a look at the API documentation for ByteArrayOutputStream:


ByteArrayOutputStream
This class implements an output stream in which the data is written into a byte array. The buffer automatically grows as data is written to it. The data can be retrieved using toByteArray() and toString(). 

Closing a ByteArrayOutputStream has no effect. The methods in this class can be called after the stream has been closed without generating an IOException.

[quote]
The buffer automatically grows as data is written to it.
[/quote]
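Because the buffer of this class grows automatically, it can hold the data even when the page is larger than the preset buffer size (unless the page is larger than the remaining memory, in which case an OOM error occurs).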

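At first the method returned an InputStream directly, namely getMethod.getResponseBodyAsStream(). But when another class read that stream, it failed with "attempted to read a closed stream". My guess was that by the time the other class used the InputStream, the reference held by the called method's stack had already been released, and that this caused the error. A more likely explanation: if the method releases the connection in its finally block, as the code above does, getMethod.releaseConnection() closes the response stream before the caller gets a chance to read it. Because of this, I changed the method to read the page content itself and return a reference to a buffer.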
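The closed-stream error can be reproduced without any HTTP code. This is a minimal sketch using only JDK streams: the hypothetical openThenRelease method stands in for a download method that returns the stream itself and cleans up in finally, and a BufferedInputStream plays the role of the response stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ClosedStreamDemo {
    /** Hypothetical stand-in for a download method that returns the stream itself. */
    static InputStream openThenRelease(byte[] data) {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        try {
            return in; // the caller receives this reference...
        } finally {
            try {
                in.close(); // ...but the cleanup runs before the caller can read
            } catch (IOException ignored) {
                // closing an in-memory stream is not expected to fail
            }
        }
    }

    /** Returns true if reading the returned stream fails with an IOException. */
    static boolean readFails(byte[] data) {
        try {
            openThenRelease(data).read();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(readFails(new byte[] {1, 2, 3}));
    }
}
```

The reference itself stays valid; it is the underlying resource that has been closed. That is why copying the bytes out before the cleanup runs avoids the problem.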
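Returning the buffer then caused a problem that puzzled me for a long time. Page sizes vary a lot, so the 100 KB buffer sometimes holds a whole page in one read and sometimes has to be filled several times before the whole page has been read. At first I simply did return buffer. For small pages the result was completely correct, but when a page was larger than the buffer, the buffer only ever held what the last read returned and everything read earlier had been overwritten, so the title was often missing. Charset detection with the method above was also sometimes right and sometimes wrong, because the data was incomplete. Whenever you read data whose size is not known in advance, watch out for this problem.
Later I found a similar post: http://dengyin2000.iteye.com/blog/47417.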
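The overwrite problem described above is easy to reproduce with a deliberately tiny buffer. This is a minimal sketch using only JDK classes: the class and method names are made up for illustration, and a ByteArrayInputStream stands in for the HTTP response stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ChunkedReadDemo {
    /** Buggy variant: returns the scratch buffer, so earlier chunks are lost. */
    static byte[] readReturningBuffer(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        try {
            while (in.read(buffer, 0, buffer.length) > -1) {
                // every read starts at offset 0 and overwrites the previous chunk
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer;
    }

    /** Fixed variant: appends each chunk to a growing ByteArrayOutputStream. */
    static byte[] readAccumulating(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            int count;
            while ((count = in.read(buffer, 0, buffer.length)) > -1) {
                baos.write(buffer, 0, count); // copy only the bytes actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = "0123456789".getBytes(StandardCharsets.US_ASCII);
        // 10 bytes through a 4-byte buffer: chunks "0123", "4567", "89".
        String buggy = new String(readReturningBuffer(new ByteArrayInputStream(page), 4), StandardCharsets.US_ASCII);
        String fixed = new String(readAccumulating(new ByteArrayInputStream(page), 4), StandardCharsets.US_ASCII);
        System.out.println(buggy); // 8967
        System.out.println(fixed); // 0123456789
    }
}
```

The buggy result also shows leftovers from the previous chunk ("67") after the final short read, which is why the symptom looks random on real pages.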
I also have a question. To get the description, my code uses regular expressions, as follows:


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String getDescription(String page){
		String description = null;

		// Step 1: find a <meta> tag whose name attribute is "description".
		// [^>]* keeps the match inside one tag; a greedy .* could span several tags on a line.
		Pattern pattern = Pattern.compile("<meta[^>]*name *= *[\"'] *description *[\"'][^>]*>", Pattern.CASE_INSENSITIVE);
		// Step 2: pull the content attribute out of that tag.
		// \1 requires the closing quote to be the same character as the opening one.
		Pattern patternDesc = Pattern.compile("content *= *([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);
		Matcher matcher = pattern.matcher(page);

//		boolean found = false;//flag indicating whether the page has a description.
		while(matcher.find()){
			String resultStr = matcher.group();
			Matcher matcherDesc = patternDesc.matcher(resultStr);
			while(matcherDesc.find()){
				description = matcherDesc.group(2);
			}
//			found = true;
		}

		//If no description found, then set title as its description
		/*if(!found){
			description = title;
		}*/

		return description;
	}

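In HTML, the position and number of attributes inside a tag are not fixed. To get the description, I first search for a <meta/> string that contains name="description", and once it is found I search inside it for the content attribute. That takes two regular-expression searches. Does anyone know a better way, so that a single regular expression can find it?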
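One possible answer is to use a lookahead, so that the name check does not consume any input and the content attribute can appear before or after it. This is a sketch rather than a definitive solution: the class name is made up for illustration, and it assumes the attribute values are quoted and contain no > character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaDescriptionDemo {
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\b"
            // Lookahead: somewhere inside this tag there is name="description".
            + "(?=[^>]*\\bname\\s*=\\s*[\"']\\s*description\\s*[\"'])"
            // Capture the content value; \1 matches the same quote that opened it.
            + "[^>]*\\bcontent\\s*=\\s*([\"'])(.*?)\\1"
            + "[^>]*>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the description of the first matching meta tag, or null if there is none. */
    static String getDescription(String page) {
        Matcher m = DESCRIPTION.matcher(page);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(getDescription("<meta name=\"description\" content=\"Hello world\"/>"));
        System.out.println(getDescription("<meta content='Other order' name='description'>"));
        System.out.println(getDescription("<meta name=\"keywords\" content=\"a,b\">"));
    }
}
```

Regular expressions remain fragile on real-world HTML (unquoted attributes, > inside values, attributes such as data-name). Since nekohtml is already among the dependencies, parsing the page and reading the attributes of the meta element is probably the more robust route.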
