The demo that ships with Lucene 4.2

There is plenty of material online about what Lucene is, so I won't repeat it here. I do want to cover the following points.

1. Why use Lucene instead of querying the database directly

Lucene is a full-text search engine, so it excels at the reverse lookup: given a word, find the documents that contain it. With a database, running a LIKE query against a 2000-character column is very inefficient. Lucene instead builds an inverted index that maps each term back to the documents containing it, which makes this kind of query fast.
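
To make the idea concrete, here is a toy sketch of an inverted index in plain Java. This only illustrates the concept; it is not Lucene's actual data structure:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = { "lucene is a search library", "mysql is a database" };
        // term -> ids of the documents that contain it
        Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].split(" ")) {
                Set<Integer> postings = index.get(term);
                if (postings == null) {
                    postings = new TreeSet<Integer>();
                    index.put(term, postings);
                }
                postings.add(id);
            }
        }
        // a lookup is one map access instead of a LIKE scan over every row
        System.out.println(index.get("search"));  // prints [0]
        System.out.println(index.get("is"));      // prints [0, 1]
    }
}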

2. The main classes and methods involved in building an index

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. Their roles are described below.

 

Document

A Document describes a document to be indexed: an HTML page, an e-mail message, or a text file, for example. A Document object is composed of multiple Field objects. You can think of a Document as a record in a database table, with each Field being one column of that record.

 

Field

A Field describes one attribute of a document. For example, the subject and body of an e-mail message can be described by two separate Field objects.
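
For example, here is a minimal sketch of describing an e-mail as a Document using the Lucene 4.2 field types (the field names are arbitrary):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class MailDocument {
    public static Document build(String id, String subject, String body) {
        Document doc = new Document();
        // StringField: indexed as a single untokenized token; good for ids and exact matching
        doc.add(new StringField("messageId", id, Field.Store.YES));
        // TextField: tokenized by the Analyzer; good for full-text content
        doc.add(new TextField("subject", subject, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        return doc;
    }
}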

 

Analyzer

Before a document can be indexed, its content must first be tokenized; that is the Analyzer's job. Analyzer is an abstract class with many implementations, and you should pick the one that suits your language and application. The Analyzer hands the tokenized content to the IndexWriter, which builds the index.

Different requirements call for different analyzers; for an overview of available analyzers see http://approximation.iteye.com/blog/345885
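
As a rough sketch of what an Analyzer does, the snippet below runs one by hand and prints the terms it produces (Lucene 4.2 API):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("Lucene is a search library"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // prints: lucene, search, library (stop words "is"/"a" are dropped, text is lowercased)
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}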

 

IndexWriter

IndexWriter is the core class Lucene uses to create an index; its job is to add Document objects to the index one by one.
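
A minimal indexing sketch against the Lucene 4.2 API (the index path here is an arbitrary example):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MinimalIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42,
                new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("D:\\test\\bb\\index")), iwc);
        Document doc = new Document();
        doc.add(new TextField("contents", "hello lucene", Field.Store.YES));
        writer.addDocument(doc);  // add the document to the index
        writer.close();           // commits and releases the write lock
    }
}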

 

Directory

This class represents the storage location of a Lucene index. It is an abstract class; its two most commonly used implementations are FSDirectory, which stores the index on the file system, and RAMDirectory, which holds the index in memory.
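
Switching between the two is a one-line change, for example:

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryChoices {
    public static void main(String[] args) throws Exception {
        // on-disk index
        Directory fsDir = FSDirectory.open(new File("D:\\test\\bb\\index"));
        // in-memory index (lost when the JVM exits; handy for tests)
        Directory ramDir = new RAMDirectory();
        System.out.println(fsDir + " / " + ramDir);
    }
}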

 

Now that we are familiar with the classes needed to build an index, we can index the text files under a directory; the full source appears in the indexing code later in this post.

3. The classes and methods involved in searching

Searching with Lucene is just as convenient as building the index. In the previous part we built an index for the text documents under a directory; now we search that index for documents containing a given keyword or phrase. Lucene provides a few basic classes for this: IndexSearcher, Term, Query, TermQuery, and Hits. Their roles are described below.

 

Query

Query is an abstract class with many implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to wrap the user's query string in a form Lucene can work with.
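
For example, a small sketch of combining several Query implementations with the Lucene 4.x API (the field name and terms are arbitrary):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {
    public static Query buildQuery() {
        // documents must contain the exact term "lucene" in "contents"...
        TermQuery exact = new TermQuery(new Term("contents", "lucene"));
        // ...and some term starting with "index"
        PrefixQuery prefix = new PrefixQuery(new Term("contents", "index"));
        BooleanQuery bq = new BooleanQuery(); // in Lucene 5.x+ this becomes BooleanQuery.Builder
        bq.add(exact, BooleanClause.Occur.MUST);
        bq.add(prefix, BooleanClause.Occur.MUST);
        return bq;
    }
}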

 

Term

Term is the basic unit of search. A Term object is composed of two String fields and can be created with a single statement: Term term = new Term("fieldName", "queryWord"); where the first argument names the document Field to search in and the second is the keyword to search for.

For example, if I delete a record from my database and want to remove the matching document from the index (note that Term only accepts String values, and in Lucene 4.x deletions go through the IndexWriter):

  Term term = new Term("userid", "11110");
  writer.deleteDocuments(term);

 

TermQuery

TermQuery is a subclass of the abstract class Query and the most basic query type Lucene supports. One is created with: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); its constructor takes a single argument, a Term object.

 

IndexSearcher

IndexSearcher performs searches over an index that has already been built. It opens the index read-only, so multiple IndexSearcher instances can operate on the same index at once.

 

Hits

Hits holds search results. (Note: the Hits class was removed in Lucene 3.0; in Lucene 4.x a search returns a TopDocs object whose scoreDocs array contains the hits, and that is what the demo code below uses.)
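
Putting these classes together, here is a minimal search sketch against the Lucene 4.2 API (the index path and field names are assumptions matching the demo code below):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SimpleSearch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = DirectoryReader.open(
                FSDirectory.open(new File("D:\\test\\bb\\index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // look for the exact term "lucene" in the "contents" field
        Query query = new TermQuery(new Term("contents", "lucene"));
        TopDocs results = searcher.search(query, 10);   // top 10 hits
        System.out.println(results.totalHits + " matching documents");
        for (ScoreDoc hit : results.scoreDocs) {
            Document doc = searcher.doc(hit.doc);       // load the stored fields
            System.out.println(doc.get("path") + "  score=" + hit.score);
        }
        reader.close();
    }
}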

Below is the example I downloaded from the Lucene website; the Maven configuration is as follows:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queries</artifactId>
    <version>4.2.0</version>
</dependency>
<!-- lucene-analyzers was split into per-module artifacts in 4.x; this 3.6.2
     artifact is probably redundant next to lucene-analyzers-common 4.2.0 -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers</artifactId>
    <version>3.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.2.0</version>
</dependency>
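
One more note: the code below also uses the IK Analyzer (org.wltea.analyzer.lucene.IKAnalyzer), a Chinese analyzer that is not covered by the dependencies above; as far as I know its jar (the IK Analyzer release built for Lucene 4.x) is not on Maven Central and has to be added to the project separately.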

The code to build the index:

package com.my.lucene2;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

public class IndexFiles {
  
  private IndexFiles() {}

  public static void main(String[] args) {
	  
    // where the index will be written
    String indexPath = "D:\\test\\bb\\index";
    // the directory of files to index
    String docsPath = "D:\\test\\aa\\";

    final File docDir = new File(docsPath);
    if (!docDir.exists() || !docDir.canRead()) {
      System.out.println("Document directory '" + docDir.getAbsolutePath()
          + "' does not exist or is not readable, please check the path");
      System.exit(1);
    }
    
    Date start = new Date();
    try {
      System.out.println("Indexing to directory '" + indexPath + "'...");
      // Directory dir = new RAMDirectory(); // keep the index in memory instead
      // store the index on disk
      Directory dir = FSDirectory.open(new File(indexPath));
      // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42); // Lucene's built-in standard analyzer
      Analyzer analyzer = new IKAnalyzer(); // IK Analyzer, a dictionary-based Chinese analyzer
      // configuration for building the index
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
      boolean create = true;
      if (create) {
        // create a fresh index, replacing any existing one
        iwc.setOpenMode(OpenMode.CREATE);
      } else {
        // add to an existing index (or create one if none exists)
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      }

      // Optional: for better indexing performance, if you
      // are indexing many documents, increase the RAM
      // buffer.  But if you do this, increase the max heap
      // size to the JVM (eg add -Xmx512m or -Xmx1g):
      //
      // iwc.setRAMBufferSizeMB(256.0);
      // create the index writer
      IndexWriter writer = new IndexWriter(dir, iwc);
      indexDocs(writer, docDir);

      // NOTE: if you want to maximize search performance,
      // you can optionally call forceMerge here.  This can be
      // a terribly costly operation, so generally it's only
      // worth it when your index is relatively static (ie
      // you're done adding documents to it):
      //
      // writer.forceMerge(1);

      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() + " total milliseconds");

    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
       "\n with message: " + e.getMessage());
    }
  }
  static void indexDocs(IndexWriter writer, File file)
    throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {

        FileInputStream fis;
        try {
          fis = new FileInputStream(file);
        } catch (FileNotFoundException fnfe) {
          // at least on windows, some temporary files raise this exception with an "access denied" message
          // checking if the file can be read doesn't help
          return;
        }

        try {

          // make a new, empty document
          Document doc = new Document();

          // Add the path of the file as a field named "path".  Use a
          // field that is indexed (i.e. searchable), but don't tokenize 
          // the field into separate words and don't index term frequency
          // or positional information:
          Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
          System.out.println("sss " + pathField);
          doc.add(pathField);

          // Add the last modified date of the file a field named "modified".
          // Use a LongField that is indexed (i.e. efficiently filterable with
          // NumericRangeFilter).  This indexes to milli-second resolution, which
          // is often too fine.  You could instead create a number based on
          // year/month/day/hour/minutes/seconds, down the resolution you require.
          // For example the long value 2011021714 would mean
          // February 17, 2011, 2-3 PM.
          doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

          // Add the contents of the file to a field named "contents".  Specify a Reader,
          // so that the text of the file is tokenized and indexed, but not stored.
          // The file is read as GBK here; if your files use a different encoding,
          // searching for non-ASCII characters will fail.
          doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "gbk"))));
          doc.add(new StringField("test", "雪含心", Field.Store.YES)); // extra stored field printed by the search demo
          // You can add fields as needed: for example, add a userid field so that whenever
          // a user record is deleted, the matching document in the index can be updated or removed.
          if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            // New index, so we just add the document (no old document can be there):
            System.out.println("adding " + file);
            writer.addDocument(doc);
          } else {
            // Existing index (an old copy of this document may have been indexed) so 
            // we use updateDocument instead to replace the old one matching the exact 
            // path, if present:
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.getPath()), doc);
          }
          
        } finally {
          fis.close();
        }
      }
    }
  }
}

The search code:

package com.my.lucene2;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

/** Simple command-line based search demo. */
public class SearchFiles {

  private SearchFiles() {}

  /** Simple command-line based search demo. */
  public static void main(String[] args) throws Exception {
    String usage =
      "Usage:\tjava org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
    if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
      System.out.println(usage);
      System.exit(0);
    }

    String index = "D:\\test\\bb\\index\\";
    String field = "contents";    // the field to search
    String queries = "D:\\test\\bb\\index\\bb.txt";
    int repeat = 0;
    boolean raw = false;
    String queryString = null;
    int hitsPerPage = 10;    // page size used for paging
    // open the index for reading
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    IndexSearcher searcher = new IndexSearcher(reader);
    // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42); // Lucene's standard analyzer
    Analyzer analyzer = new IKAnalyzer();
    // read the query terms from a text file; they could just as well be hard-coded
    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "gbk"));
    } else {
      in = new BufferedReader(new InputStreamReader(System.in, "gbk"));
    }
    // create the query parser
    QueryParser parser = new QueryParser(Version.LUCENE_42, field, analyzer);
    while (true) {
      if (queries == null && queryString == null) {                        // prompt the user
        System.out.println("Enter query: ");
      }
     // read the next query line (or use queryString if one was supplied)
      String line = queryString != null ? queryString : in.readLine();

      if (line == null) {
        break;
      }

      line = line.trim();
      if (line.length() == 0) {
        break;
      }
      // parse and run the query
      Query query = parser.parse(line);
      System.out.println("Searching for: " + query.toString(field));
      // if repeat > 0, rerun the search as a crude benchmark (kept as-is from the official demo)
      if (repeat > 0) {                           // repeat & time as benchmark
        Date start = new Date();
        for (int i = 0; i < repeat; i++) {
          searcher.search(query, null, 100);
        }
        Date end = new Date();
        System.out.println("Time: "+(end.getTime()-start.getTime())+"ms");
      }

      doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);

      if (queryString != null) {
        break;
      }
    }
    reader.close();
  }

  /**
   * This demonstrates a typical paging search scenario, where the search engine presents 
   * pages of size n to the user. The user can then go to the next page if interested in
   * the next hits.
   * 
   * When the query is executed for the first time, then only enough results are collected
   * to fill 5 result pages. If the user wants to page beyond this limit, then the query
   * is executed another time and all hits are collected.
   * 
   */
  public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query, 
                                     int hitsPerPage, boolean raw, boolean interactive) throws IOException {
 
    // Collect enough docs to show 5 pages
    TopDocs results = searcher.search(query, 5 * hitsPerPage);
    // the matching documents (up to 5 pages' worth)
    ScoreDoc[] hits = results.scoreDocs;
    // total number of hits
    int numTotalHits = results.totalHits;
    System.out.println(numTotalHits + " total matching documents");

    int start = 0;
    int end = Math.min(numTotalHits, hitsPerPage);
        
    while (true) {
      if (end > hits.length) {
        System.out.println("Only results 1 - " + hits.length +" of " + numTotalHits + " total matching documents collected.");
        System.out.println("Collect more (y/n) ?");
        String line = in.readLine();
        if (line.length() == 0 || line.charAt(0) == 'n') {
          break;
        }
        
        hits = searcher.search(query, numTotalHits).scoreDocs;
      }
      
      end = Math.min(hits.length, start + hitsPerPage);
      
      for (int i = start; i < end; i++) {
        if (raw) {                              // output raw format
          System.out.println("doc="+hits[i].doc+" score="+hits[i].score);
          continue;
        }

        Document doc = searcher.doc(hits[i].doc);
        // load the matching document's stored fields
        String path = doc.get("path");
        // print the stored "test" field ("雪含心") that was added at indexing time
        System.out.println("the content is ....." + doc.get("test"));
        if (path != null) {
          System.out.println((i+1) + ". " + path);
          String title = doc.get("title");
          if (title != null) {
            System.out.println("   Title: " + doc.get("title"));
          }
        } else {
          System.out.println((i+1) + ". " + "No path for this document");
        }
                  
      }

      if (!interactive || end == 0) {
        break;
      }

      if (numTotalHits >= end) {
        boolean quit = false;
        while (true) {
          System.out.print("Press ");
          if (start - hitsPerPage >= 0) {
            System.out.print("(p)revious page, ");  
          }
          if (start + hitsPerPage < numTotalHits) {
            System.out.print("(n)ext page, ");
          }
          System.out.println("(q)uit or enter number to jump to a page.");
          
          String line = in.readLine();
          if (line.length() == 0 || line.charAt(0)=='q') {
            quit = true;
            break;
          }
          if (line.charAt(0) == 'p') {
            start = Math.max(0, start - hitsPerPage);
            break;
          } else if (line.charAt(0) == 'n') {
            if (start + hitsPerPage < numTotalHits) {
              start+=hitsPerPage;
            }
            break;
          } else {
            int page = Integer.parseInt(line);
            if ((page - 1) * hitsPerPage < numTotalHits) {
              start = (page - 1) * hitsPerPage;
              break;
            } else {
              System.out.println("No such page");
            }
          }
        }
        if (quit) break;
        end = Math.min(numTotalHits, start + hitsPerPage);
      }
    }
  }
}

The flow of this code: read every file under a directory and build an index from them, then read a term from a file and search the index for it. The index could just as well be built from records read out of a database.
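
For instance, here is a minimal sketch of building the index from database rows instead of files (Lucene 4.2 API; the JDBC URL and the hypothetical user table with its userid/username/profile columns are placeholders, and a JDBC driver would be needed on the classpath):

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFromDatabase {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42,
                new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("D:\\test\\bb\\index")), iwc);
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "root", "");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT userid, username, profile FROM user");
        while (rs.next()) {
            Document doc = new Document();
            // store the primary key untokenized so the document can be updated or deleted later
            doc.add(new StringField("userid", rs.getString("userid"), Field.Store.YES));
            doc.add(new StringField("username", rs.getString("username"), Field.Store.YES));
            doc.add(new TextField("contents", rs.getString("profile"), Field.Store.NO));
            writer.addDocument(doc);
        }
        rs.close(); st.close(); conn.close();
        writer.close();
    }
}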

Summary: my understanding of Lucene is not especially deep, and I hope to deepen it. Later I want to study Solr; Lucene's application scenarios and tokenization techniques still deserve serious study.

 
