Nutch: highlighting and increasing the summary length

Highlighting is fairly simple, and plenty of sample code for it is available online. The change is as follows:

Change line 54 of org.apache.nutch.searcher.Summary to:

public String toString() { return "<span style='color:red'>" + super.toString() + "</span>"; }
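To make the effect of this one-line change concrete, here is a minimal, self-contained sketch. The classes below are hypothetical stand-ins written for illustration, not the real Nutch Summary classes: a plain fragment prints its text unchanged, while a highlighted fragment overrides toString() to wrap the same text in a red span.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Nutch's summary fragments; not the real classes.
class HighlightSketch {

    /** A plain piece of summary text. */
    static class Fragment {
        private final String text;
        Fragment(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    /** A matched query term: same text, wrapped in a red span. */
    static class Highlight extends Fragment {
        Highlight(String text) { super(text); }
        @Override public String toString() {
            return "<span style='color:red'>" + super.toString() + "</span>";
        }
    }

    /** Concatenates the fragments in order, as a rendered summary would. */
    static String render(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f); // calls the fragment's own toString()
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> summary = new ArrayList<>();
        summary.add(new Fragment("open source "));
        summary.add(new Highlight("search"));
        summary.add(new Fragment(" engine"));
        System.out.println(render(summary));
    }
}
```

Because only the highlighted fragment type overrides toString(), the surrounding text is left untouched and the markup appears only around matched terms.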

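Increasing the summary length took me quite a while, but I later found that two parameters exist specifically for adjusting it; I had overlooked them when I first read the code. Around line 36 of org.apache.nutch.searcher.Summarizer you will find: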

/** The number of context terms to display preceding and following matches.*/
private static final int SUM_CONTEXT =
  NutchConf.get().getInt("searcher.summary.context", 5);

/** The total number of terms to display in a summary.*/
private static final int SUM_LENGTH =
  NutchConf.get().getInt("searcher.summary.length", 100);

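Both values are counted in terms. The first, SUM_CONTEXT, is the number of context terms shown before and after each highlighted match in the summary, 5 by default. (Note: the second argument of NutchConf.get().getInt(), here 5, is the default value, which is used when searcher.summary.context is not set.)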
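The "default value" behaviour of getInt() can be sketched in a few lines. ConfSketch below is a hypothetical stand-in backed by java.util.Properties, not the real NutchConf (which loads its values from the Nutch configuration files); it only illustrates that the second argument is returned when the property has not been set.

```java
import java.util.Properties;

// Hypothetical stand-in for NutchConf: a property lookup with a fallback value.
class ConfSketch {
    private final Properties props = new Properties();

    /** Sets a property, as a configuration file entry would. */
    void set(String name, String value) {
        props.setProperty(name, value);
    }

    /** Returns the property as an int, or defaultValue when it is not set. */
    int getInt(String name, int defaultValue) {
        String value = props.getProperty(name);
        if (value == null) {
            return defaultValue;
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        ConfSketch conf = new ConfSketch();
        // Not configured: the caller's default (5) is returned.
        System.out.println(conf.getInt("searcher.summary.context", 5));
        // Configured: the configured value wins over the default.
        conf.set("searcher.summary.length", "150");
        System.out.println(conf.getInt("searcher.summary.length", 100));
    }
}
```

In other words, the 5 and 100 in the Summarizer source only apply when the configuration does not override them.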

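The second, SUM_LENGTH, is the maximum number of terms shown in a summary, 100 by default. A term here is a unit produced by word segmentation (tokenization), and the excerpt-selection algorithm below depends on these terms. Alternatively, Lucene can store the positions of terms at index time, which allows fast highlighting of the matched keywords. The advantage is that the text no longer has to be tokenized at query time, which reduces query response time.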
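The idea of highlighting from stored positions can be illustrated without Lucene at all. The following is a toy sketch under stated assumptions: the offsets are kept in a plain list rather than in a Lucene index, String.indexOf stands in for the tokenizer, and all class and method names are made up for this example. It shows only the principle: offsets are recorded once, and highlighting later reuses them instead of tokenizing the text again.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: offsets live in a plain list here, not in a Lucene index.
class OffsetHighlightSketch {

    /** Start (inclusive) and end (exclusive) character offsets of one occurrence. */
    static class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** "Index time": record where each occurrence of term sits in text. */
    static List<Span> recordOffsets(String text, String term) {
        List<Span> spans = new ArrayList<>();
        if (term.isEmpty()) {
            return spans; // nothing to record for an empty term
        }
        int at = text.indexOf(term);
        while (at >= 0) {
            spans.add(new Span(at, at + term.length()));
            at = text.indexOf(term, at + term.length());
        }
        return spans;
    }

    /** "Query time": wrap the recorded spans without tokenizing text again. */
    static String highlight(String text, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        int offset = 0;
        for (Span s : spans) { // spans are in ascending order and do not overlap
            sb.append(text, offset, s.start);
            sb.append("<span style='color:red'>")
              .append(text, s.start, s.end)
              .append("</span>");
            offset = s.end;
        }
        sb.append(text.substring(offset));
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "nutch uses lucene; lucene indexes";
        List<Span> spans = recordOffsets(text, "lucene");
        System.out.println(highlight(text, spans));
    }
}
```

The trade-off is extra storage for the offsets in exchange for skipping the tokenizer on every query.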

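However, if the word segmenter is dictionary-based, problems can arise once the dictionary grows. I will cover that in a separate post.

The modified algorithm is pasted below. It displays roughly 150 characters; if you need more, modify the corresponding code.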

/** Returns a summary for the given pre-tokenized text. */
public Summary getSummary(String text, Query query) throws IOException {

  // Simplistic implementation. Finds the first fragments in the document
  // containing any query terms.
  // TODO: check that phrases in the query are matched in the fragment

  Token[] tokens = getTokens(text);      // parse text to token array

  if (tokens.length == 0)
    return new Summary();

  String[] terms = query.getTerms();
  HashSet highlight = new HashSet();     // put query terms in table
  for (int i = 0; i < terms.length; i++)
    highlight.add(terms[i]);

  // Create a SortedSet that ranks excerpts according to
  // how many query terms are present. An excerpt is
  // a Vector full of Fragments and Highlights
  SortedSet excerptSet = new TreeSet(new Comparator() {
    public int compare(Object o1, Object o2) {
      Excerpt excerpt1 = (Excerpt) o1;
      Excerpt excerpt2 = (Excerpt) o2;

      if (excerpt1 == null && excerpt2 != null) {
        return -1;
      } else if (excerpt1 != null && excerpt2 == null) {
        return 1;
      } else if (excerpt1 == null && excerpt2 == null) {
        return 0;
      }

      int numToks1 = excerpt1.numUniqueTokens();
      int numToks2 = excerpt2.numUniqueTokens();

      if (numToks1 < numToks2) {
        return -1;
      } else if (numToks1 == numToks2) {
        return excerpt1.numFragments() - excerpt2.numFragments();
      } else {
        return 1;
      }
    }
  });

  // Iterate through all terms in the document
  int lastExcerptPos = 0;
  for (int i = 0; i < tokens.length; i++) {

    // If we find a term that's in the query...
    if (highlight.contains(tokens[i].termText())) {

      // Start searching at a point SUM_CONTEXT terms back,
      // and move SUM_CONTEXT*20 terms into the future.
      int startToken = (i > SUM_CONTEXT) ? i-SUM_CONTEXT : 0;
      int endToken = Math.min(i+SUM_CONTEXT*20, tokens.length);
      int offset = tokens[startToken].startOffset();
      int j = startToken;

      // Iterate from the start point to the finish, adding
      // terms all the way. The end of the passage is fixed at
      // endToken (the per-hit extension is commented out below).
      Excerpt excerpt = new Excerpt();
      if (i != 0) {
        excerpt.add(new Summary.Ellipsis());
      }

      // Iterate through as long as we're before the end of
      // the document and we haven't hit the max-number-of-items
      // -in-a-summary.
      while ((j < endToken) && (j - startToken < SUM_LENGTH)) {

        // Now grab the hit-element, if present
        Token t = tokens[j];
        if (highlight.contains(t.termText())) {
          excerpt.addToken(t.termText());
          excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
          excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
          offset = t.endOffset();
          //endToken = Math.min(j+SUM_LENGTH, tokens.length);
        }
        j++;
      }

      // We found the series of search-term hits and added
      // them (with intervening text) to the excerpt. Now
      // we need to add the trailing edge of text, up to the
      // token that ends the window, provided that token and its
      // end offset still lie inside the document.
      int tailToken = Math.min(endToken, i + SUM_LENGTH);
      if (offset < text.length() && tailToken < tokens.length
          && tokens[tailToken].endOffset() < text.length()) {
        excerpt.add(new Fragment(text.substring(offset, tokens[tailToken].endOffset())));
      }

      lastExcerptPos = endToken;

      // Previous version of the trailing-edge step:
      // if (j < tokens.length) {
      //   excerpt.add(new Fragment(text.substring(offset,offset+tokens[j].endOffset())));
      // }

      // Remember how many terms are in this excerpt
      excerpt.setNumTerms(j - startToken);

      // Store the excerpt for later sorting
      excerptSet.add(excerpt);

      // Start SUM_CONTEXT places away. The next
      // search for relevant excerpts begins at i-SUM_CONTEXT
      i = j+SUM_CONTEXT;
    }
  }

  // If the target text doesn't appear, then we just
  // excerpt the first SUM_LENGTH words from the document.
  if (excerptSet.size() == 0) {
    Excerpt excerpt = new Excerpt();
    int excerptLen = Math.min(SUM_LENGTH, tokens.length);
    lastExcerptPos = excerptLen;

    excerpt.add(new Fragment(text.substring(tokens[0].startOffset(), tokens[excerptLen-1].startOffset())));
    excerpt.setNumTerms(excerptLen);
    excerptSet.add(excerpt);
  }

  // Now choose the best items from the excerpt set.
  // Stop when our Summary grows too large.
  double tokenCount = 0;
  Summary s = new Summary();
  while (tokenCount <= SUM_LENGTH && excerptSet.size() > 0) {
    Excerpt excerpt = (Excerpt) excerptSet.last();
    excerptSet.remove(excerpt);

    double tokenFraction = (1.0 * excerpt.getNumTerms()) / excerpt.numFragments();
    for (Enumeration e = excerpt.elements(); e.hasMoreElements(); ) {
      Fragment f = (Fragment) e.nextElement();

      // Don't add fragments if it takes us over the max-limit
      if ((int)(tokenCount + tokenFraction) <= SUM_LENGTH) {
        s.add(f);
      }
      tokenCount += tokenFraction;
    }
  }

  if (tokenCount > 0 && lastExcerptPos < tokens.length)
    s.add(new Ellipsis());

  return s;
}
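The part of the method that controls how long each excerpt can get is the window arithmetic around a hit. The sketch below isolates just that arithmetic so the effect of the multiplier can be tried out. It is an illustration with made-up names (ExcerptWindowSketch, window, widen), not code from Nutch: the window starts up to SUM_CONTEXT tokens before the hit, extends SUM_CONTEXT times a multiplier after it, and the scan is additionally capped at SUM_LENGTH tokens, mirroring the loop condition in the method above.

```java
// Sketch of the excerpt-window arithmetic only; names are hypothetical.
class ExcerptWindowSketch {

    /**
     * Returns {start, endExclusive} of the token range scanned for a hit at
     * token index i, mirroring startToken, endToken and the loop condition
     * (j < endToken) && (j - startToken < SUM_LENGTH).
     */
    static int[] window(int i, int tokenCount, int context, int widen, int sumLength) {
        int start = (i > context) ? i - context : 0;
        int end = Math.min(i + context * widen, tokenCount);
        int scanEnd = Math.min(end, start + sumLength);
        return new int[] { start, scanEnd };
    }

    public static void main(String[] args) {
        // Hit at token 50 of 1000, context 5, summary length 100.
        int[] narrow = window(50, 1000, 5, 1, 100);
        int[] wide = window(50, 1000, 5, 20, 100);
        System.out.println("[" + narrow[0] + ", " + narrow[1] + ")"); // prints [45, 55)
        System.out.println("[" + wide[0] + ", " + wide[1] + ")");     // prints [45, 145)
    }
}
```

With the multiplier of 20 the window would reach token 150, but the SUM_LENGTH cap stops the scan at 145, which suggests that making excerpts longer still requires raising searcher.summary.length as well as the multiplier.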


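jaddy0302 posted on 2006-09-06 22:50:00 IP: 221.219.255.*

The code is rather brute-force, especially the part that handles running past the limits set by the parameters. If you need a longer summary, you can change int endToken = Math.min(i+SUM_CONTEXT*20, tokens.length);
Alternatively, searcher.summary.context and searcher.summary.length can be adjusted in nutch-default.xml.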

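dengyf posted on 2006-09-07 23:37:00 IP: 221.222.76.*

How do I add Chinese word segmentation to nutch-0.8? Which files need to be modified? Thanks.

dengyf posted on 2006-09-07 23:40:00 IP: 221.222.76.*

How do I add Chinese word segmentation in nutch-0.8, and which files need to be changed? Thanks.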

jaddy0302 posted on 2006-09-08 21:00:00 IP: 125.96.24.*

Only one place needs to change:
in org.apache.nutch.analysis.NutchAnalysis,
around line 85 of the method final public Query parse() throws ParseException,
change the code to:
org.apache.lucene.analysis.TokenStream tokenizer = new com.xdtech.util.lucene.XDChineseTokenizer(input);
