Because so much of the work (from the query user's point of view) happens in a series of offline activities, the Nutch retrieval system itself is comparatively simple. During secondary development, the main task is to adjust Nutch's user interface and the data it displays.
1 Summary Extraction
1.1 Summary Extraction Source Code Analysis
/**
 * Low level api to get the most relevant (formatted) sections of the document.
 * This method has been made public to allow visibility of score information held in TextFragment objects.
 * Thanks to Jason Calabrese for help in redefining the interface.
 * @param tokenStream
 * @param text
 * @param maxNumFragments
 * @param mergeContiguousFragments
 * @throws IOException
 */
public final TextFragment[] getBestTextFragments(
    TokenStream tokenStream,
    String text,
    boolean mergeContiguousFragments,
    int maxNumFragments)
    throws IOException
{
    ArrayList docFrags = new ArrayList();
    StringBuffer newText = new StringBuffer();
    TextFragment currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
    fragmentScorer.startFragment(currentFrag);
    docFrags.add(currentFrag);
    FragmentQueue fragQueue = new FragmentQueue(maxNumFragments);
    try
    {
        org.apache.lucene.analysis.Token token;
        String tokenText;
        int startOffset;
        int endOffset;
        int lastEndOffset = 0;
        textFragmenter.start(text);
        TokenGroup tokenGroup = new TokenGroup();
        token = tokenStream.next();
        while ((token != null) && (token.startOffset() < maxDocBytesToAnalyze))
        {
            if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct(token)))
            {
                // the current token is distinct from previous tokens -
                // markup the cached token group info
                startOffset = tokenGroup.matchStartOffset;
                endOffset = tokenGroup.matchEndOffset;
                tokenText = text.substring(startOffset, endOffset);
                String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
                // store any whitespace etc from between this and last group
                if (startOffset > lastEndOffset)
                    newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
                newText.append(markedUpText);
                lastEndOffset = Math.max(endOffset, lastEndOffset);
                tokenGroup.clear();
                // check if current token marks the start of a new fragment
                if (textFragmenter.isNewFragment(token))
                {
                    currentFrag.setScore(fragmentScorer.getFragmentScore());
                    // record stats for a new fragment
                    currentFrag.textEndPos = newText.length();
                    currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
                    fragmentScorer.startFragment(currentFrag);
                    docFrags.add(currentFrag);
                }
            }
            tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));
            // if (lastEndOffset > maxDocBytesToAnalyze)
            // {
            //     break;
            // }
            token = tokenStream.next();
        }
        currentFrag.setScore(fragmentScorer.getFragmentScore());
        if (tokenGroup.numTokens > 0)
        {
            // flush the accumulated text (same code as in above loop)
            startOffset = tokenGroup.matchStartOffset;
            endOffset = tokenGroup.matchEndOffset;
            tokenText = text.substring(startOffset, endOffset);
            String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
            // store any whitespace etc from between this and last group
            if (startOffset > lastEndOffset)
                newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
            newText.append(markedUpText);
            lastEndOffset = Math.max(lastEndOffset, endOffset);
        }
        // Test what remains of the original text beyond the point where we stopped analyzing
        if (
            // if there is text beyond the last token considered..
            (lastEndOffset < text.length())
            &&
            // and that text is not too large...
            (text.length() < maxDocBytesToAnalyze)
           )
        {
            // append it to the last fragment
            newText.append(encoder.encodeText(text.substring(lastEndOffset)));
        }
        currentFrag.textEndPos = newText.length();
        // sort the most relevant sections of the text
        for (Iterator i = docFrags.iterator(); i.hasNext();)
        {
            currentFrag = (TextFragment) i.next();
            // If you are running with a version of Lucene before 11th Sept 03
            // you do not have PriorityQueue.insert() - so uncomment the code below
            /*
            if (currentFrag.getScore() >= minScore)
            {
                fragQueue.put(currentFrag);
                if (fragQueue.size() > maxNumFragments)
                { // if hit queue overfull
                    fragQueue.pop(); // remove lowest in hit queue
                    minScore = ((TextFragment) fragQueue.top()).getScore(); // reset minScore
                }
            }
            */
            // The above code caused a problem as a result of Christoph Goller's 11th Sept 03
            // fix to PriorityQueue. The correct method to use here is the new "insert" method
            // USE ABOVE CODE IF THIS DOES NOT COMPILE!
            fragQueue.insert(currentFrag);
        }
        // return the most relevant fragments
        TextFragment frag[] = new TextFragment[fragQueue.size()];
        for (int i = frag.length - 1; i >= 0; i--)
        {
            frag[i] = (TextFragment) fragQueue.pop();
        }
        // merge any contiguous fragments to improve readability
        if (mergeContiguousFragments)
        {
            mergeContiguousFragments(frag);
            ArrayList fragTexts = new ArrayList();
            for (int i = 0; i < frag.length; i++)
            {
                if ((frag[i] != null) && (frag[i].getScore() > 0))
                {
                    fragTexts.add(frag[i]);
                }
            }
            frag = (TextFragment[]) fragTexts.toArray(new TextFragment[0]);
        }
        return frag;
    }
    finally
    {
        if (tokenStream != null)
        {
            try
            {
                tokenStream.close();
            }
            catch (Exception e)
            {
            }
        }
    }
}
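The control flow above is easier to see stripped of the Lucene types. Below is a minimal, self-contained sketch (not the Lucene API; the class and method names are invented for illustration): split the text into fixed-size token fragments, mark up query-term hits the way highlightTerm does, score each fragment by its hit count, and keep the top-N fragments, which is the role FragmentQueue plays above.

```java
import java.util.*;

// Simplified sketch of the fragmenting/scoring loop above (not the Lucene API):
// split text into fixed-size token fragments, score each by query-term hits,
// and keep the top-N fragments by score.
public class FragmentSketch {
    public static List<String> bestFragments(String text, Set<String> queryTerms,
                                             int fragSize, int maxNumFragments) {
        String[] tokens = text.split("\\s+");
        List<String> frags = new ArrayList<>();
        List<Integer> scores = new ArrayList<>();
        for (int start = 0; start < tokens.length; start += fragSize) {
            int end = Math.min(start + fragSize, tokens.length);
            int score = 0;
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++) {
                String t = tokens[i];
                if (queryTerms.contains(t.toLowerCase())) {
                    score++;                                    // a query-term hit
                    sb.append("<b>").append(t).append("</b>");  // markup, like highlightTerm
                } else {
                    sb.append(t);
                }
                if (i < end - 1) sb.append(' ');
            }
            frags.add(sb.toString());
            scores.add(score);
        }
        // rank fragments by score, descending (stands in for FragmentQueue)
        Integer[] order = new Integer[frags.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> scores.get(b) - scores.get(a));
        List<String> best = new ArrayList<>();
        for (int i = 0; i < Math.min(maxNumFragments, order.length); i++) {
            if (scores.get(order[i]) > 0) best.add(frags.get(order[i]));
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "nutch is an open source crawler built on lucene "
                    + "this sentence has nothing relevant in it at all";
        List<String> best = bestFragments(text,
                new HashSet<>(Arrays.asList("nutch", "lucene")), 9, 1);
        System.out.println(best.get(0));
        // -> <b>nutch</b> is an open source crawler built on <b>lucene</b>
    }
}
```

The real implementation is offset-based rather than whitespace-tokenized, which is why getBestTextFragments has to carry startOffset/endOffset/lastEndOffset bookkeeping to preserve the original whitespace between token groups.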
1.2 Changing the Summary Length
The summary length in Nutch's search results can be changed. The change is made through a configuration file, nutch-site.xml:
<configuration>
...
<property>
  <name>searcher.summary.length</name>
  <value>50</value> <!-- the default is 20 -->
  <description>
    The total number of terms to display in a hit summary.
  </description>
</property>
...
</configuration>
Nutch's default settings live in nutch-default.xml; to override any of them, just add the corresponding property to nutch-site.xml.
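The override works because of the order in which the two files are read: defaults first, site file second. A minimal sketch of that layering, with plain Maps standing in for Nutch's real Configuration class (an assumption about nothing more than load order):

```java
import java.util.*;

// Sketch of Nutch's configuration override order: nutch-default.xml is loaded
// first, then nutch-site.xml is applied on top, so any property repeated in
// nutch-site.xml wins. (Plain Maps stand in for the real Configuration class.)
public class ConfigOverrideSketch {
    public static Map<String, String> resolve(Map<String, String> defaults,
                                              Map<String, String> site) {
        Map<String, String> conf = new HashMap<>(defaults); // nutch-default.xml
        conf.putAll(site);                                  // nutch-site.xml overrides
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("searcher.summary.length", "20"); // shipped default
        Map<String, String> site = new HashMap<>();
        site.put("searcher.summary.length", "50");     // our override
        System.out.println(resolve(defaults, site).get("searcher.summary.length")); // prints 50
    }
}
```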
2 Page Snapshots (Cached Pages)
A page snapshot (cached page) is the copy of a web page stored on the search engine's servers. When Nutch searches by keyword, it returns the information associated with that keyword, such as title, url, content, and so on. The URL links to the live page, while the snapshot is the page content as fetched by the Nutch crawler. So when a snapshot is clicked, we look up the original page content in the index by the document's ID. The relevant source is cached.jsp in the search front end; the key lines are:
Hit hit = new Hit(Integer.parseInt(request.getParameter("idx")),
                  request.getParameter("id"));
HitDetails details = bean.getDetails(hit);
...
String content = new String(bean.getContent(details));
There is also the question of Chinese text in Nutch page snapshots; for Chinese content, decoding the bytes as UTF-8 is all that is needed. In cached.jsp, change

...
else
    content = new String(bean.getContent(details));

to

content = new String(bean.getContent(details), "utf-8");

If you also want to change how the content is displayed, that can be done in the same page.
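The reason the explicit charset matters: bean.getContent(details) returns the raw fetched bytes, and the one-argument String constructor decodes them with the JVM's platform default charset. If the page bytes are UTF-8 but the platform default is something like ISO-8859-1, Chinese text turns into mojibake. A small self-contained demonstration:

```java
import java.nio.charset.StandardCharsets;

// Why the explicit "utf-8" matters: new String(bytes) uses the platform
// default charset, which may not match the encoding of the fetched bytes.
public class CharsetSketch {
    public static void main(String[] args) {
        byte[] pageBytes = "網頁快照".getBytes(StandardCharsets.UTF_8); // raw content bytes
        String wrong = new String(pageBytes, StandardCharsets.ISO_8859_1); // garbled
        String right = new String(pageBytes, StandardCharsets.UTF_8);      // correct
        System.out.println(right.equals("網頁快照")); // true
        System.out.println(wrong.equals("網頁快照")); // false
    }
}
```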
Source: http://hi.baidu.com/zhumulangma/blog/item/7b39adc294d13c130ff477e2.html