Lucene in Action, Chapter 6 Notes (I): Custom Sorting, Filter, and HitCollector

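When you search with Lucene, the order in which results are displayed matters a great deal. Lucene's built-in sort orders do not fit every application, and when they do not fit yours, the only option is to define your own. This section shows how to implement custom sorting in Lucene.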

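Custom sorting in Lucene works much like custom sorting of Java collections: in both cases you implement a comparison interface. In plain Java, implementing Comparable is enough. In Lucene you implement two interfaces, SortComparatorSource and ScoreDocComparator. Before getting to the implementation, let's look at how these two interfaces are defined.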
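As a reminder of the plain-Java half of that comparison, here is a minimal sketch. The Place class and its fields are hypothetical (they come from neither Lucene nor the book); the point is only that compareTo plays the role that ScoreDocComparator.compare plays in Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical example class, not from Lucene or the book.
final class Place implements Comparable<Place> {
    final String name;
    final int distance;

    Place(String name, int distance) {
        this.name = name;
        this.distance = distance;
    }

    // Natural ordering: a smaller distance sorts first.
    @Override
    public int compareTo(Place other) {
        return Integer.compare(distance, other.distance);
    }

    // Returns the names of the given places in natural (distance) order.
    static List<String> sortedNames(List<Place> places) {
        List<Place> copy = new ArrayList<Place>(places);
        Collections.sort(copy);
        List<String> names = new ArrayList<String>();
        for (Place p : copy) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Place> places = Arrays.asList(
                new Place("far", 9), new Place("near", 1), new Place("middle", 5));
        System.out.println(sortedNames(places)); // prints [near, middle, far]
    }
}
```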

The SortComparatorSource interface returns a comparator used to sort ScoreDocs (Expert: returns a comparator for sorting ScoreDocs). It defines a single method:

public ScoreDocComparator newComparator(IndexReader reader, String fieldname) throws IOException

Creates a comparator for the field in the given index.

Parameters:
    reader - Index to create comparator for.
    fieldname - Field to create comparator for.
Returns:
    Comparator of ScoreDoc objects.
Throws:
    IOException - If an error occurs reading the index.

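This method only creates the ScoreDocComparator instance that does the actual sorting, so we also have to implement the ScoreDocComparator interface. Its job is to compare two ScoreDoc objects for sorting (Compares two ScoreDoc objects for sorting). It defines two static instances that Lucene already implements:

public static final ScoreDocComparator RELEVANCE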

Special comparator for sorting hits according to computed relevance (document score).

public static final ScoreDocComparator INDEXORDER

Special comparator for sorting hits according to index order (document number).

Three methods are involved in sorting, and we have to implement each of them:

public int compare(ScoreDoc i,ScoreDoc j)

Compares two ScoreDoc objects and returns a result indicating their sort order.

Parameters:
    i - First ScoreDoc
    j - Second ScoreDoc
Returns:
    -1 if i should come before j
    1 if i should come after j
    0 if they are equal

public Comparable sortValue(ScoreDoc i)

Returns the value used to sort the given document. The object returned must implement the java.io.Serializable interface. This is used by multisearchers to determine how to collate results from their searchers.

Parameters:
    i - Document
Returns:
    Serializable object

public int sortType()

Returns the type of sort. Should return SortField.SCORE, SortField.DOC, SortField.STRING, SortField.INTEGER, SortField.FLOAT or SortField.CUSTOM. It is not valid to return SortField.AUTO. This is used by multisearchers to determine how to collate results from their searchers.

Returns:
    One of the constants in SortField.

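Let's look at an example.

The example is an implementation from Lucene in Action that finds the name of the restaurant nearest to you. Each restaurant's coordinates are stored as a string of the form "x,y", as in the book's figure: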

Figure 6.1 Which Mexican restaurant is closest to home (at 0,0) or work (at 10,10)?
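The arithmetic behind that question can be sketched without any Lucene classes. The block below is an illustrative stand-alone sketch, not the book's code: the class and method names are my own, while the four restaurant names and coordinates are the ones used in the test later in this post. It parses each "x,y" string, precomputes one distance per entry, and orders the entries by that distance.

```java
import java.util.Arrays;
import java.util.Comparator;

// Stand-alone sketch: no Lucene classes are used here.
final class DistanceSketch {
    // Parses "x,y" and returns the Euclidean distance to (x0, y0).
    static float distance(String location, int x0, int y0) {
        String[] xy = location.split(",");
        int dx = Integer.parseInt(xy[0]) - x0;
        int dy = Integer.parseInt(xy[1]) - y0;
        return (float) Math.sqrt(dx * dx + dy * dy);
    }

    // Returns the indexes of locations ordered from nearest to farthest.
    static Integer[] nearestFirst(String[] locations, int x0, int y0) {
        // Precompute one distance per entry so the comparison itself is a cheap array lookup.
        final float[] distances = new float[locations.length];
        Integer[] order = new Integer[locations.length];
        for (int i = 0; i < locations.length; i++) {
            distances[i] = distance(locations[i], x0, y0);
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Float.compare(distances[a], distances[b]);
            }
        });
        return order;
    }

    public static void main(String[] args) {
        String[] names = {"El Charro", "Cafe Poca Cosa", "Los Betos", "Nico's Taco Shop"};
        String[] locations = {"1,2", "5,9", "9,6", "3,8"};
        // Work is at (10,10); Los Betos at (9,6) is sqrt(1 + 16) = sqrt(17) away and comes first.
        for (int i : nearestFirst(locations, 10, 10)) {
            System.out.println(names[i] + " -> " + distance(locations[i], 10, 10));
        }
    }
}
```

From home at (0,0) the same method puts El Charro first and Los Betos last.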

In this situation Lucene's built-in sorting will not do, so let's see how to implement the sort ourselves.

    package lia.extsearch.sorting;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.ScoreDocComparator;
    import org.apache.lucene.search.SortComparatorSource;
    import org.apache.lucene.search.SortField;

    import java.io.IOException;

    // DistanceComparatorSource implements the SortComparatorSource interface.
    public class DistanceComparatorSource implements SortComparatorSource {
        // x and y hold the reference point.
        private int x;
        private int y;

        public DistanceComparatorSource(int x, int y) {
            this.x = x;
            this.y = y;
        }

        // Returns the ScoreDocComparator that performs the sorting.
        public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
                throws IOException {
            return new DistanceScoreDocLookupComparator(reader, fieldname, x, y);
        }

        // DistanceScoreDocLookupComparator implements ScoreDocComparator.
        private static class DistanceScoreDocLookupComparator implements
                ScoreDocComparator {
            // Distance from each restaurant to the reference point, indexed by document number.
            private float[] distances;

            // The constructor does almost all of the preparation work.
            public DistanceScoreDocLookupComparator(IndexReader reader,
                    String fieldname, int x, int y) throws IOException {

                final TermEnum enumerator = reader.terms(new Term(fieldname, ""));
                try {
                    distances = new float[reader.maxDoc()];
                    if (distances.length > 0) {
                        TermDocs termDocs = reader.termDocs();
                        try {
                            if (enumerator.term() == null) {
                                throw new RuntimeException("no terms in field "
                                        + fieldname);
                            }
                            int i = 0, j = 0; // counters used only for the trace output
                            do {
                                System.out.println("in do-while :" + i++);

                                Term term = enumerator.term(); // current term
                                // Stop once the enumeration moves past the requested field.
                                // equals() is used so the check does not rely on interned strings.
                                if (!term.field().equals(fieldname))
                                    break;
                                // Position termDocs on the documents that contain the
                                // current term (see the TermDocs Javadoc).
                                termDocs.seek(enumerator);
                                while (termDocs.next()) {
                                    System.out.println("    in while :" + j++);
                                    System.out.println("    in while ,Term :" + term.toString());

                                    String[] xy = term.text().split(","); // extract x and y
                                    int deltax = Integer.parseInt(xy[0]) - x;
                                    int deltay = Integer.parseInt(xy[1]) - y;
                                    // Compute the distance.
                                    distances[termDocs.doc()] = (float) Math
                                            .sqrt(deltax * deltax + deltay * deltay);
                                }
                            } while (enumerator.next());
                        } finally {
                            termDocs.close();
                        }
                    }
                } finally {
                    enumerator.close(); // the original listing never closed the TermEnum
                }
            }

            // With the distances prepared by the constructor, comparing is simple.
            public int compare(ScoreDoc i, ScoreDoc j) {
                if (distances[i.doc] < distances[j.doc])
                    return -1;
                if (distances[i.doc] > distances[j.doc])
                    return 1;
                return 0;
            }

            // Returns the distance as the sort value.
            public Comparable sortValue(ScoreDoc i) {
                return new Float(distances[i.doc]);
            }

            // Declares the sort type.
            public int sortType() {
                return SortField.FLOAT;
            }
        }

        public String toString() {
            return "Distance from (" + x + "," + y + ")";
        }
    }

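These are the two classes that implement the two interfaces above, with detailed comments. As you can see, custom sorting is not very hard. To check that the implementation is correct, let's see whether the test code passes.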

    package lia.extsearch.sorting;

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FieldDoc;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopFieldDocs;
    import org.apache.lucene.store.RAMDirectory;

    import java.io.IOException;

    // Tests the custom sort implementation.
    public class DistanceSortingTest extends TestCase {
        private RAMDirectory directory;

        private IndexSearcher searcher;

        private Query query;

        // Sets up the test fixture.
        protected void setUp() throws Exception {
            directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory,
                    new WhitespaceAnalyzer(), true);
            addPoint(writer, "El Charro", "restaurant", 1, 2);
            addPoint(writer, "Cafe Poca Cosa", "restaurant", 5, 9);
            addPoint(writer, "Los Betos", "restaurant", 9, 6);
            addPoint(writer, "Nico's Taco Shop", "restaurant", 3, 8);

            writer.close();

            searcher = new IndexSearcher(directory);

            query = new TermQuery(new Term("type", "restaurant"));
        }

        private void addPoint(IndexWriter writer, String name, String type, int x,
                int y) throws IOException {
            Document doc = new Document();
            doc.add(Field.Keyword("name", name));
            doc.add(Field.Keyword("type", type));
            doc.add(Field.Keyword("location", x + "," + y));
            writer.addDocument(doc);
        }

        public void testNearestRestaurantToHome() throws Exception {
            // Build a SortField from a DistanceComparatorSource.
            Sort sort = new Sort(new SortField("location",
                    new DistanceComparatorSource(0, 0)));

            Hits hits = searcher.search(query, sort); // search

            // Verify the order.
            assertEquals("closest", "El Charro", hits.doc(0).get("name"));
            assertEquals("furthest", "Los Betos", hits.doc(3).get("name"));
        }

        public void testNeareastRestaurantToWork() throws Exception {
            Sort sort = new Sort(new SortField("location",
                    new DistanceComparatorSource(10, 10))); // work is at (10,10)

            // The previous test exercises the custom sort but cannot see the
            // values it sorted by. TopFieldDocs gives access to them.
            TopFieldDocs docs = searcher.search(query, null, 3, sort);

            assertEquals(4, docs.totalHits);
            assertEquals(3, docs.scoreDocs.length);

            // A FieldDoc carries the sort values; see the FieldDoc Javadoc.
            FieldDoc fieldDoc = (FieldDoc) docs.scoreDocs[0];

            assertEquals("(10,10) -> (9,6) = sqrt(17)", new Float(Math.sqrt(17)),
                    fieldDoc.fields[0]);

            Document document = searcher.doc(fieldDoc.doc);
            assertEquals("Los Betos", document.get("name"));

            dumpDocs(sort, docs); // print the sort details
        }

        // Prints information about the sort.
        private void dumpDocs(Sort sort, TopFieldDocs docs) throws IOException {
            System.out.println("Sorted by: " + sort);
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            for (int i = 0; i < scoreDocs.length; i++) {
                FieldDoc fieldDoc = (FieldDoc) scoreDocs[i];
                Float distance = (Float) fieldDoc.fields[0];
                Document doc = searcher.doc(fieldDoc.doc);
                System.out.println(" " + doc.get("name") + " @ ("
                        + doc.get("location") + ") -> " + distance);
            }
        }
    }

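All tests pass.

The output is shown below for anyone who wants to study the details. The trace appears twice because each of the two tests builds its own comparator, and the last "in do-while :4" line of each run is the pass that reaches a term from another field and breaks out of the loop.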

in do-while :0
    in while :0
    in while ,Term :location:1,2
in do-while :1
    in while :1
    in while ,Term :location:3,8
in do-while :2
    in while :2
    in while ,Term :location:5,9
in do-while :3
    in while :3
    in while ,Term :location:9,6
in do-while :4
in do-while :0
    in while :0
    in while ,Term :location:1,2
in do-while :1
    in while :1
    in while ,Term :location:3,8
in do-while :2
    in while :2
    in while ,Term :location:5,9
in do-while :3
    in while :3
    in while ,Term :location:9,6
in do-while :4
Sorted by: <custom:"location": Distance from (10,10)>
Los Betos @ (9,6) -> 4.1231055
Cafe Poca Cosa @ (5,9) -> 5.0990195
Nico's Taco Shop @ (3,8) -> 7.28011

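To see how the detailed sort information is obtained, refer to the implementation of the testNeareastRestaurantToWork method.

As the example shows, implementing a custom sort is not very hard.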

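Next, let's look at HitCollector.

Normally a search displays only the top few results, but sometimes the user wants every matching result without accessing the documents' contents. In that case a custom HitCollector is an efficient way to do it.

Here is a test example. In it we implement BookLinkCollector, a custom HitCollector that keeps a Map of the URLs and corresponding book titles that match the query. HitCollector has one method to implement, collect, documented as follows: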

public abstract void collect(int doc, float score)

Called once for every non-zero scoring document, with the document number and its score.

If, for example, an application wished to collect all of the hits for a query in a BitSet, then it might:

   Searcher searcher = new IndexSearcher(indexReader);
   final BitSet bits = new BitSet(indexReader.maxDoc());
   searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
         bits.set(doc);
       }
     });

Note: This is called in an inner search loop. For good search performance, implementations of this method should not call Searchable.doc(int) or IndexReader.document(int) on every document number encountered. Doing so can slow searches by an order of magnitude or more.

Note: The score passed to this method is a raw score. In other words, the score will not necessarily be a float whose value is between 0 and 1.
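The first note suggests a two-phase pattern: record only document numbers inside the callback, then do the expensive lookups after the search loop has finished. The following is a stand-alone sketch of that pattern under stated assumptions. DocCollector and search are hypothetical stand-ins for the shape of HitCollector and the search loop, not Lucene classes, and the scores and titles are made-up data.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Stand-alone sketch; DocCollector is a hypothetical stand-in, not a Lucene class.
final class DeferredCollectSketch {
    interface DocCollector {
        void collect(int doc, float score);
    }

    // Simulates the inner search loop: reports every document with a non-zero score.
    static void search(float[] scores, DocCollector collector) {
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] != 0.0f) {
                collector.collect(doc, scores[doc]);
            }
        }
    }

    // Collects matching ids cheaply, then resolves titles after the loop finishes.
    static List<String> matchingTitles(float[] scores, String[] titles) {
        final BitSet bits = new BitSet(scores.length);
        search(scores, new DocCollector() {
            public void collect(int doc, float score) {
                bits.set(doc); // cheap: no document is loaded here
            }
        });
        List<String> result = new ArrayList<String>();
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            result.add(titles[doc]); // the lookup happens once per hit, outside the loop
        }
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.0f, 1.7f, 0.0f, 0.4f};
        String[] titles = {"A", "B", "C", "D"};
        System.out.println(matchingTitles(scores, titles)); // prints [B, D]
    }
}
```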

Now let's look at the implementation of BookLinkCollector:

    package lia.extsearch.hitcollector;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;

    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    // A custom HitCollector; the implementation is fairly simple.
    public class BookLinkCollector extends HitCollector {
        private IndexSearcher searcher;
        // Maps each URL to its book title.
        private HashMap documents = new HashMap();

        public BookLinkCollector(IndexSearcher searcher) {
            this.searcher = searcher;
        }

        // The one method a HitCollector must implement.
        public void collect(int id, float score) {
            try {
                // Loading the document on every hit keeps the example short, but the
                // Javadoc note above warns that this slows down large searches.
                Document doc = searcher.doc(id);
                documents.put(doc.get("url"), doc.get("title"));
                System.out.println(doc.get("title") + ":" + score);
            } catch (IOException e) {
                // ignore
            }
        }

        public Map getLinks() {
            return Collections.unmodifiableMap(documents);
        }
    }

Test code:

    package lia.extsearch.hitcollector;

    import lia.common.LiaTestCase;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    import java.util.Map;

    public class HitCollectorTest extends LiaTestCase {

        public void testCollecting() throws Exception {
            TermQuery query = new TermQuery(new Term("contents", "junit"));
            IndexSearcher searcher = new IndexSearcher(directory);

            // BookLinkCollector takes the searcher as its only argument.
            BookLinkCollector collector = new BookLinkCollector(searcher);
            searcher.search(query, collector); // search

            Map linkMap = collector.getLinks();
            // Verify the collected link.
            assertEquals("Java Development with Ant", linkMap
                    .get("http://www.manning.com/antbook"));

            Hits hits = searcher.search(query);
            dumpHits(hits);

            searcher.close();
        }
    }

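This implementation is fairly simple. For more on how to use it, see Lucene in Action or my blog.

III. Implementing a Custom Filter

With the Sort code above in hand, implementing a custom Filter is just as simple. Filter is an abstract class, and you only need to implement its one abstract method: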

public abstract BitSet bits(IndexReader reader)
                     throws IOException

Returns a BitSet with true for documents which should be permitted in search results, and false for those that should not.
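To make the bit-set contract concrete, here is a stand-alone sketch using only java.util.BitSet. The class, method names, and document numbers are illustrative assumptions, not Lucene code: one bit per document number is set for permitted documents, and a hit survives only if its bit is set.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Stand-alone sketch of the bit-per-document idea; not a Lucene class.
final class BitSetFilterSketch {
    // Builds a bit set sized for maxDoc documents with the permitted numbers set.
    static BitSet permit(int maxDoc, int[] permittedDocs) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : permittedDocs) {
            bits.set(doc);
        }
        return bits;
    }

    // Keeps only the hits whose bit is set, preserving their order.
    static List<Integer> applyFilter(int[] hits, BitSet bits) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int doc : hits) {
            if (bits.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet bits = permit(8, new int[] {2, 5});
        int[] hits = {0, 2, 3, 5, 7};
        System.out.println(applyFilter(hits, bits)); // prints [2, 5]
    }
}
```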

Here is an example:

    package lia.extsearch.filters;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    import java.io.IOException;
    import java.util.BitSet;

    public class SpecialsFilter extends Filter {
        // The ISBNs come through an interface, which decouples the filter
        // from their source and makes it reusable.
        private SpecialsAccessor accessor;

        public SpecialsFilter(SpecialsAccessor accessor) {
            this.accessor = accessor;
        }

        // Override this method to implement a custom Filter.
        /**
         * Returns a BitSet with true for documents which should be permitted in
         * search results, and false for those that should not
         */
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());

            String[] isbns = accessor.isbns();

            int[] docs = new int[1];
            int[] freqs = new int[1];

            for (int i = 0; i < isbns.length; i++) {
                String isbn = isbns[i];
                if (isbn != null) {
                    TermDocs termDocs = reader.termDocs(new Term("isbn", isbn));
                    try {
                        // Read at most one document for this ISBN.
                        int count = termDocs.read(docs, freqs);
                        if (count == 1) {
                            bits.set(docs[0]);
                        }
                    } finally {
                        termDocs.close(); // the original listing left this open
                    }
                }
            }

            return bits;
        }

        public String toString() {
            return "SpecialsFilter";
        }
    }

It uses the following interface:

    package lia.extsearch.filters;

    // Supplies the ISBNs that the filter should let through.
    public interface SpecialsAccessor {
        String[] isbns();
    }

and a mock object implementation:

    package lia.extsearch.filters;

    // A mock object implementation.
    public class MockSpecialsAccessor implements SpecialsAccessor {
        private String[] isbns;

        public MockSpecialsAccessor(String[] isbns) {
            this.isbns = isbns;
        }

        public String[] isbns() {
            return isbns;
        }
    }

The test code is as follows:

    package lia.extsearch.filters;

    import lia.common.LiaTestCase;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.FilteredQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.WildcardQuery;

    // Tests the custom Filter.
    public class SpecialsFilterTest extends LiaTestCase {
        private Query allBooks;

        private IndexSearcher searcher;

        // Sets up the test fixture.
        protected void setUp() throws Exception {
            super.setUp();

            allBooks = new RangeQuery(new Term("pubmonth", "190001"), new Term(
                    "pubmonth", "200512"), true);
            searcher = new IndexSearcher(directory);
        }

        // Filter applied to the whole query.
        public void testCustomFilter() throws Exception {
            String[] isbns = new String[] { "0060812451", "0465026567" };

            SpecialsAccessor accessor = new MockSpecialsAccessor(isbns);
            Filter filter = new SpecialsFilter(accessor);
            Hits hits = searcher.search(allBooks, filter);
            assertEquals("the specials", isbns.length, hits.length());
        }

        // Using the new FilteredQuery, though, you can apply a
        // Filter to a particular query clause of a BooleanQuery.
        // FilteredQuery was added in Lucene 1.4; for details see
        // Lucene in Action and the FilteredQuery Javadoc.
        public void testFilteredQuery() throws Exception {
            String[] isbns = new String[] { "0854402624" }; // Steiner

            SpecialsAccessor accessor = new MockSpecialsAccessor(isbns);
            Filter filter = new SpecialsFilter(accessor);

            WildcardQuery educationBooks = new WildcardQuery(new Term("category",
                    "*education*"));
            FilteredQuery edBooksOnSpecial = new FilteredQuery(educationBooks,
                    filter);

            TermQuery logoBooks = new TermQuery(new Term("subject", "logo"));

            // Both clauses are optional (not required, not prohibited), i.e. OR.
            BooleanQuery logoOrEdBooks = new BooleanQuery();
            logoOrEdBooks.add(logoBooks, false, false);
            logoOrEdBooks.add(edBooksOnSpecial, false, false);

            Hits hits = searcher.search(logoOrEdBooks);
            System.out.println(logoOrEdBooks.toString());
            assertEquals("Papert and Steiner", 2, hits.length());
        }
    }
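The difference between filtering one clause and filtering the whole query can be sketched with plain bit sets. This is an illustrative model only: the class, method names, and document numbers are hypothetical, and real Lucene queries also compute scores, which the sketch ignores. It contrasts "logo OR (education AND specials)" with "(logo OR education) AND specials".

```java
import java.util.BitSet;

// Stand-alone sketch of the set logic; document numbers are made up.
final class FilteredClauseSketch {
    // Builds a bit set with the given document numbers set.
    static BitSet of(int... docs) {
        BitSet bits = new BitSet();
        for (int doc : docs) {
            bits.set(doc);
        }
        return bits;
    }

    // clauseA OR (clauseB AND filter): the filter narrows only clauseB.
    static BitSet orWithFilteredClause(BitSet clauseA, BitSet clauseB, BitSet filter) {
        BitSet filtered = (BitSet) clauseB.clone();
        filtered.and(filter);
        BitSet result = (BitSet) clauseA.clone();
        result.or(filtered);
        return result;
    }

    // (clauseA OR clauseB) AND filter: the filter narrows the whole query.
    static BitSet filterWholeQuery(BitSet clauseA, BitSet clauseB, BitSet filter) {
        BitSet result = (BitSet) clauseA.clone();
        result.or(clauseB);
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        BitSet logo = of(1);            // matches for the "logo" clause
        BitSet education = of(3, 4, 6); // matches for the education clause
        BitSet specials = of(4);        // documents the filter permits
        System.out.println(orWithFilteredClause(logo, education, specials)); // prints {1, 4}
        System.out.println(filterWholeQuery(logo, education, specials));     // prints {4}
    }
}
```

With the filter attached to a single clause, the logo book still matches even though it is not on special, which is why the test above expects two hits.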
