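Getting Started with Lucene

Lucene is an open-source full-text search toolkit from the Apache Software Foundation. It provides a complete query engine and indexing engine, along with part of a text-analysis engine. Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system.

How Lucene implements full-text search

In the flow diagram, green marks the indexing process: the raw content to be searched is indexed to build an index. The indexing process is: identify the raw content to be searched --> acquire documents --> create documents --> analyze documents --> index documents

Red marks the search process, which looks up content in the index. The search process is: user enters a query in the search interface --> create the query --> execute the search against the index --> render the results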

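I. Creating the index

1. Acquire the raw documents

Raw documents are the content to be indexed and searched, such as web pages on the internet or data in a database. Software that gathers information from the internet is usually called a crawler or spider. Lucene does not provide an acquisition tool; you have to write one yourself or use an open-source project.

2. Create document objects

Before indexing, the raw content has to be turned into documents (Document). A document consists of fields (Field, comparable to file attributes such as file name, file size, file content and file path), and the fields hold the content. Each Document can have several Fields, different Documents can have different Fields, and one Document can contain the same Field more than once (same field name and same field value). Every document has a unique number, the document id, which is assigned by Lucene and cannot be changed by us.

3. Analyze the documents

Once the raw content has become a Document made of Fields, the content of the fields is analyzed. Analysis extracts words from the raw text, converts letters to lower case, removes punctuation, removes stop words and so on, and produces the final tokens; a token can be thought of as a single word. Each word is called a term, and a term has two parts: the name of the field and the text of the word. For example, apache in the file name and apache in the file content are different terms.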
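The analysis steps described above can be sketched in plain Java. This is only a toy illustration and not Lucene's Analyzer API: the class name ToyAnalyzer, its methods, the stop-word list and the "field:text" notation for a term are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain; the names here are illustrative and not part of Lucene.
class ToyAnalyzer {
    // A tiny, made-up stop-word list for the demonstration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    // Splits text on anything that is not a letter or digit, lowercases each
    // piece, and drops empty pieces and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A term pairs the field name with the token text, written here as "field:text".
    static List<String> toTerms(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            terms.add(field + ":" + token);
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [fileName:apache, fileName:lucene, fileName:readme]
        System.out.println(toTerms("fileName", "The Apache Lucene README."));
        // prints [fileContent:apache, fileContent:foundation]
        System.out.println(toTerms("fileContent", "Apache is a foundation."));
    }
}
```

The two outputs both contain the word apache, but fileName:apache and fileContent:apache are distinct strings, which mirrors the point that the same word in different fields is a different term.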

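4. Create the index

The tokens produced by analyzing all the documents are indexed. The purpose of indexing is search: in the end, a search only looks at the indexed tokens and uses them to find the Document.

Indexing works on tokens and finds documents by way of words. This structure is called an inverted index.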
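To make the "word to documents" direction concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene stores its index on disk; ToyInvertedIndex and its methods are hypothetical names, and terms are simply written as "field:text" strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents that contain it.
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Records that every given term occurs in the document with this id.
    void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Looks up a single term; unknown terms match no documents.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, Arrays.asList("fileName:apache", "fileContent:lucene"));
        index.addDocument(1, Arrays.asList("fileName:lucene", "fileContent:apache"));
        index.addDocument(2, Arrays.asList("fileName:apache", "fileContent:apache"));

        System.out.println(index.search("fileName:apache"));    // prints [0, 2]
        System.out.println(index.search("fileContent:apache")); // prints [1, 2]
        System.out.println(index.search("fileName:solr"));      // prints []
    }
}
```

A lookup never scans the documents themselves: it goes straight from the term to the list of matching document ids.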

II. Using Lucene in Java

1. Set up the development environment and import the jars

2. Code to create the index

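	// Imports for Lucene 4.10.3, IKAnalyzer and commons-io (the method itself goes inside a class)
	import java.io.File;
	import java.io.IOException;
	import org.apache.commons.io.FileUtils;
	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field;
	import org.apache.lucene.document.Field.Store;
	import org.apache.lucene.document.LongField;
	import org.apache.lucene.document.StoredField;
	import org.apache.lucene.document.TextField;
	import org.apache.lucene.index.IndexWriter;
	import org.apache.lucene.index.IndexWriterConfig;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;
	import org.apache.lucene.util.Version;
	import org.wltea.analyzer.lucene.IKAnalyzer;

	public void test1() throws IOException {
		// Step 1: create the IndexWriter
		/**
		 * 1) create the Directory the IndexWriter needs; it says where the index is stored
		 * 2) create the analyzer used to build the IndexWriterConfig (IKAnalyzer here)
		 * 3) create the IndexWriterConfig the IndexWriter needs
		 * 4) create the IndexWriter
		 */
		Directory directory = FSDirectory.open(new File("D:\\temp\\index"));
		Analyzer analyzer = new IKAnalyzer();
		// IndexWriterConfig takes the Lucene version and the analyzer
		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
		IndexWriter index = new IndexWriter(directory, config);
		// Step 2: create Field objects and add them to a Document
		// Walk the files in the source folder and index each one
		File f = new File("D:\\searchsource");
		File[] listfiles = f.listFiles();
		if (listfiles == null) {
			// listFiles() returns null when the folder does not exist or cannot be read
			listfiles = new File[0];
		}
		for (File file : listfiles) {
			if (!file.isFile()) {
				continue; // skip sub-directories, only regular files are indexed
			}
			// Step 3: create the Document
			Document document = new Document();
			// file name: analyzed, indexed, stored
			String file_name = file.getName();
			Field fileNameField = new TextField("fileName", file_name, Store.YES);
			// file size: indexed as a number, stored
			long file_size = FileUtils.sizeOf(file);
			Field fileSizeField = new LongField("fileSize", file_size, Store.YES);
			// file path: stored only, not searchable
			String file_path = file.getPath();
			Field filePathField = new StoredField("filePath", file_path);
			// file content: analyzed and indexed, but not stored
			String file_content = FileUtils.readFileToString(file, "UTF-8");
			Field fileContentField = new TextField("fileContent", file_content, Store.NO);
			// add the Fields to the Document
			document.add(fileNameField);
			document.add(fileSizeField);
			document.add(filePathField);
			document.add(fileContentField);
			// Step 4: write the Document with the IndexWriter; this builds the index
			// and writes both the index and the Document to the index directory
			index.addDocument(document);
		}
		// Step 5: close the IndexWriter
		index.close();
	}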

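Field attributes

Analyzed: whether the field's content is tokenized. This only matters if the field's content is going to be queried.

Indexed: whether the analyzed words, or the whole field value, are put in the index. Only indexed content can be searched.

For example, product names and product descriptions are analyzed and then indexed; order numbers and ID card numbers are not analyzed but are still indexed, because all of them will be used as query conditions.

Stored: whether the field value is stored in the document. Only stored Fields can be read back from a Document.

For example, product names and order numbers: any Field that will later be read from a Document has to be stored.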

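Field class	Data type	Analyzed	Indexed	Stored	Notes
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string Field that is not analyzed; the whole string is indexed as a single term (order number, name and so on). Store.YES or Store.NO decides whether it is stored in the document
LongField(FieldName, FieldValue, Store.YES)	long	Y	Y	Y or N	Builds a numeric long Field that is analyzed and indexed (a price, for example). Store.YES or Store.NO decides whether it is stored in the document
StoredField(FieldName, FieldValue)	Overloaded, supports several types	N	N	Y	Builds Fields of various types that are neither analyzed nor indexed but are stored in the document
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y or N	When given a Reader, Lucene assumes the content is large and does not store it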
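The difference between "indexed" and "stored" in the table above can be illustrated with a small plain-Java model. It is a sketch only: ToyFieldFlags, its two maps and its method names are invented for this example, analysis is left out (the whole value is indexed as one term), and the sample values are made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the "indexed" and "stored" flags; not Lucene's real classes.
class ToyFieldFlags {
    // Searchable side: "field:value" -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> indexed = new HashMap<>();
    // Retrievable side: document id -> (field name -> original value).
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String name, String value, boolean index, boolean store) {
        if (index) {
            indexed.computeIfAbsent(name + ":" + value, k -> new HashSet<>()).add(docId);
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(name, value);
        }
    }

    // True if the document can be found by searching this field for this value.
    boolean matches(int docId, String name, String value) {
        Set<Integer> ids = indexed.getOrDefault(name + ":" + value, Collections.emptySet());
        return ids.contains(docId);
    }

    // Returns the stored value, or null when the field was not stored.
    String get(int docId, String name) {
        Map<String, String> fields = stored.getOrDefault(docId, Collections.emptyMap());
        return fields.get(name);
    }

    public static void main(String[] args) {
        ToyFieldFlags docs = new ToyFieldFlags();
        docs.addField(0, "orderId", "A1001", true, true);         // like StringField with Store.YES
        docs.addField(0, "filePath", "/docs/a.txt", false, true); // like StoredField
        docs.addField(0, "fileContent", "lucene", true, false);   // like TextField with Store.NO

        System.out.println(docs.matches(0, "fileContent", "lucene"));   // prints true
        System.out.println(docs.get(0, "fileContent"));                 // prints null
        System.out.println(docs.matches(0, "filePath", "/docs/a.txt")); // prints false
        System.out.println(docs.get(0, "filePath"));                    // prints /docs/a.txt
    }
}
```

A field that is indexed but not stored can be searched yet reads back as null, and a field that is stored but not indexed can be read back yet never matches a search.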

The Luke tool can be used to inspect the contents of an index in detail.

3. Code to query the index

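	// Imports for Lucene 4.10.3 (the method itself goes inside a class)
	import java.io.File;
	import java.io.IOException;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.index.DirectoryReader;
	import org.apache.lucene.index.IndexReader;
	import org.apache.lucene.index.Term;
	import org.apache.lucene.search.IndexSearcher;
	import org.apache.lucene.search.Query;
	import org.apache.lucene.search.ScoreDoc;
	import org.apache.lucene.search.TermQuery;
	import org.apache.lucene.search.TopDocs;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;

	public void testIndexReader() throws IOException {
		// Step 1: create a Directory, i.e. the location of the index
		Directory directory = FSDirectory.open(new File("D:\\temp\\index"));

		/* If the index lives in memory instead:
		Directory directory2 = new RAMDirectory(); */

		// Step 2: create an IndexReader on the Directory
		IndexReader indexReader = DirectoryReader.open(directory);
		// Step 3: create an IndexSearcher on the IndexReader
		IndexSearcher searcher = new IndexSearcher(indexReader);
		// Step 4: create a TermQuery with the field to query and the keyword
		Term term = new Term("fileName","全文检索"); // field name and field value of the term
		Query query = new TermQuery(term);
		// Step 5: run the query and ask for the top 3 hits
		TopDocs topDocs = searcher.search(query,3);
		// Step 6: iterate over the hits and print them
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			int doc = scoreDoc.doc;  // document id
			// load the Document
			Document document = indexReader.document(doc);
			// file name
			String fileName = document.get("fileName");
			// file size
			String fileSize = document.get("fileSize");
			// file path
			String filePath = document.get("filePath");
			// file content: indexed with Store.NO above, so this is null
			String fileContent = document.get("fileContent");
			System.out.println(fileName+"...."+fileSize+"...."+filePath+"...."+fileContent);
		}
		// Step 7: close the IndexReader
		indexReader.close();
	}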

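IndexSearcher search methods

indexSearcher.search(query, n): searches by Query and returns the n highest-scoring hits

indexSearcher.search(query, filter, n): searches by Query with a filter applied and returns the n highest-scoring hits

indexSearcher.search(query, n, sort): searches by Query with a sort applied and returns the first n hits in that sort order

indexSearcher.search(query, filter, n, sort): searches by Query with both a filter and a sort applied and returns the first n hits in that sort order

Lucene's standard analyzer handles English well but Chinese poorly, so a third-party analyzer is needed. IKAnalyzer is used here.

Note: the analyzer used for searching must be the same as the one used to create the index.

4. Maintaining the index

getIndexWriter() in the snippets below is a helper this article does not list; presumably it builds an IndexWriter the same way as step 1 of the indexing code.

1) Delete everything in the index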

	public void test3() throws IOException {
		IndexWriter writer = getIndexWriter();
		writer.deleteAll();
		writer.close();
	}

2) Delete documents that match a condition

	public void test4() throws IOException {
		IndexWriter writer = getIndexWriter();
		Query query = new TermQuery(new Term("fileName","apache"));
		writer.deleteDocuments(query);
		writer.close();
	}

3) Update the index. Under the hood, an update is a delete followed by an add.

	public void test5() throws IOException {
		IndexWriter writer = getIndexWriter();
		Document document = new Document();
		TextField text1 = new TextField("fileN","test1",Store.YES);
		TextField text2 = new TextField("fileC","test2",Store.YES);
		document.add(text1);
		document.add(text2);
		writer.updateDocument(new Term("fileName","lucene"),document,new IKAnalyzer());
		writer.close();
	}
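As a rough illustration of "delete, then add", the following plain-Java sketch keeps documents as sets of "field:text" terms. ToyUpdate and its methods are hypothetical and only model the behaviour described here; they are not Lucene's IndexWriter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy document store used to model update as delete-then-add.
class ToyUpdate {
    private final List<Set<String>> docs = new ArrayList<>();

    // Adds one document made of the given terms.
    void add(String... terms) {
        docs.add(new HashSet<>(Arrays.asList(terms)));
    }

    // Deletes every document that contains the term; returns how many were removed.
    int delete(String term) {
        int before = docs.size();
        docs.removeIf(doc -> doc.contains(term));
        return before - docs.size();
    }

    // Update: first delete all documents matching the term, then add the new one.
    void update(String term, String... newTerms) {
        delete(term);
        add(newTerms);
    }

    // Counts the documents that contain the term.
    int count(String term) {
        int n = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyUpdate index = new ToyUpdate();
        index.add("fileName:lucene", "fileContent:a");
        index.add("fileName:lucene", "fileContent:b");
        index.add("fileName:apache");

        index.update("fileName:lucene", "fileN:test1", "fileC:test2");

        System.out.println(index.size());                   // prints 2
        System.out.println(index.count("fileName:lucene")); // prints 0
        System.out.println(index.count("fileN:test1"));     // prints 1
    }
}
```

In this model both documents that contained fileName:lucene are removed and a single new document takes their place, so an update keyed on a non-unique term can shrink the index. The new document is added even when nothing matched the term.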

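5. Querying the index

The snippets below use getIndexSearcher() and printResult(searcher, query), helper methods this article does not list; presumably they open an IndexSearcher on the index and print the hits the way the loop in testMatchAllDocsQuery does.

1) Querying with subclasses of Query

** MatchAllDocsQuery matches every document in the index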

	public void testMatchAllDocsQuery() throws Exception {
		IndexSearcher searcher = getIndexSearcher();
		Query query = new MatchAllDocsQuery();
		TopDocs topDocs = searcher.search(query, 10);
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			int doc = scoreDoc.doc;
			Document document = searcher.getIndexReader().document(doc);
			String fileName = document.get("fileName");
			System.out.println(fileName);
			String filePath = document.get("filePath");
			System.out.println(filePath);
			String fileSize = document.get("fileSize");
			System.out.println(fileSize);
			String fileContent = document.get("fileContent");
			System.out.println(fileContent);
			System.out.println("--------------------");
		}
		searcher.getIndexReader().close();
	}

** TermQuery takes the field to query and the keyword to look for.

		Term term = new Term("fileName","全文检索"); // field name and field value of the term
		Query query = new TermQuery(term);

** NumericRangeQuery queries by numeric range

	public void test6() throws Exception {
		IndexSearcher searcher = getIndexSearcher();
		Query query = NumericRangeQuery.newLongRange("fileSize", 47L, 200L, false, true);
		printResult(searcher, query);
	}

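This queries by file-size range. When creating the query, the first argument is the field name, the second is the lower bound, the third is the upper bound, the fourth says whether the lower bound is included, and the fifth says whether the upper bound is included.

** BooleanQuery combines several conditions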
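The effect of the two inclusive flags can be checked with a few lines of plain Java. ToyLongRange.inRange is a hypothetical helper written for this illustration; it only mirrors the argument order described above and does not call Lucene.

```java
// Toy range check mirroring the (min, max, minInclusive, maxInclusive) arguments.
class ToyLongRange {
    static boolean inRange(long value, long min, long max,
                           boolean minInclusive, boolean maxInclusive) {
        boolean aboveMin = minInclusive ? value >= min : value > min;
        boolean belowMax = maxInclusive ? value <= max : value < max;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        // Same bounds and flags as the query above: 47 excluded, 200 included
        System.out.println(inRange(47L, 47L, 200L, false, true));  // prints false
        System.out.println(inRange(48L, 47L, 200L, false, true));  // prints true
        System.out.println(inRange(200L, 47L, 200L, false, true)); // prints true
        System.out.println(inRange(201L, 47L, 200L, false, true)); // prints false
    }
}
```

With false, true the query covers file sizes greater than 47 and up to and including 200.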

	public void testBooleanQuery() throws Exception {
		IndexSearcher searcher = getIndexSearcher();
		BooleanQuery query = new BooleanQuery();
		Query query2 = new TermQuery(new Term("fileName","apache"));
		Query query3 = new TermQuery(new Term("fileName","lucene"));
		query.add(query2,Occur.SHOULD);
		query.add(query3,Occur.MUST);
		printResult(searcher, query);
	}

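Occur.MUST: the condition must be satisfied, like AND

Occur.SHOULD: the condition should be satisfied but does not have to be, like OR. When the query also contains a MUST clause, a SHOULD clause is optional and only affects the score.

Occur.MUST_NOT: the condition must not be satisfied, like NOT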
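How the three Occur values combine can be sketched in plain Java. This reflects my understanding of BooleanQuery's default matching rules (no minimum-should-match configured) and ignores scoring entirely; ToyBooleanQuery and its nested Occur enum are stand-ins written for this example, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy boolean matching over a document's set of terms; scoring is ignored.
class ToyBooleanQuery {
    enum Occur { MUST, SHOULD, MUST_NOT }

    private static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    ToyBooleanQuery add(String term, Occur occur) {
        clauses.add(new Clause(term, occur));
        return this;
    }

    boolean matches(Set<String> docTerms) {
        boolean hasMust = false;
        boolean anyShouldMatched = false;
        for (Clause clause : clauses) {
            boolean present = docTerms.contains(clause.term);
            switch (clause.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false; // a missing MUST term rules the document out
                    break;
                case MUST_NOT:
                    if (present) return false;  // a present MUST_NOT term rules it out
                    break;
                default:                         // SHOULD
                    anyShouldMatched |= present;
                    break;
            }
        }
        // With a MUST clause, SHOULD clauses are optional; without one,
        // at least one SHOULD clause has to match.
        return hasMust || anyShouldMatched;
    }

    public static void main(String[] args) {
        // Same shape as the query above: apache SHOULD, lucene MUST
        ToyBooleanQuery query = new ToyBooleanQuery()
                .add("apache", Occur.SHOULD)
                .add("lucene", Occur.MUST);

        System.out.println(query.matches(new HashSet<>(Arrays.asList("lucene"))));           // prints true
        System.out.println(query.matches(new HashSet<>(Arrays.asList("apache"))));           // prints false
        System.out.println(query.matches(new HashSet<>(Arrays.asList("apache", "lucene")))); // prints true
    }
}
```

A query made only of MUST_NOT clauses matches nothing in this sketch, because there is no positive clause to select documents in the first place.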

2) Querying with QueryParser

Syntax: fieldName:value

The queryparser jar has to be added to the project

	public void testQueryParser() throws Exception {
		// argument 1: the default field to query
		// argument 2: the analyzer to use
		QueryParser queryParse = new QueryParser("fileName", new IKAnalyzer());
		// *:* is parsed into a MatchAllDocsQuery
		Query query = queryParse.parse("*:*");
		IndexSearcher searcher = getIndexSearcher();
		printResult(searcher, query);
		// release resources
		searcher.getIndexReader().close();
	}

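QueryParser can also create a Query: its parse method builds the query directly from a query-syntax string. To see the query syntax a Query object will execute, print it with System.out.println(query);.

An analyzer is required. The analyzer used when querying should be the same as the one used when the index was created.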

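Query syntax

1. Basic syntax, keyword query:

field name + ":" + keyword to search for

For example: content:java

2. Range query

field name + ":" + [min TO max]

For example: size:[1 TO 1000]

Note that the classic QueryParser turns range syntax into a TermRangeQuery, which compares terms as strings, so it does not match numeric fields such as a LongField. For numeric ranges in Lucene, build a NumericRangeQuery in code. Solr can handle numeric ranges through the query syntax because it knows the field types from its schema.

3. Combined conditions

1) +condition1 +condition2: both conditions must hold (AND)

For example: +filename:apache +content:apache

2) +condition1 condition2: the first condition must hold, the second should hold

For example: +filename:apache content:apache

3) condition1 condition2: either condition is enough

For example: filename:apache content:apache

4) -condition1 condition2: condition1 must not hold, condition2 has to hold

For example: -filename:apache content:apache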

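Occur.MUST: the condition must be satisfied, like AND; written with + (plus sign)

Occur.SHOULD: the condition is optional, like OR; written with no symbol

Occur.MUST_NOT: the condition must not be satisfied, like NOT; written with - (minus sign)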
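The mapping from the + and - prefixes to MUST, SHOULD and MUST_NOT can be sketched with a tiny plain-Java parser. It is far simpler than Lucene's QueryParser: it only splits on whitespace, does no analysis, and knows nothing about ranges, quotes or AND/OR/NOT. ToyQuerySyntax.parse and its output format are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy parser for the prefix syntax: +term, -term, term, with optional "field:" part.
class ToyQuerySyntax {
    // Returns one entry per clause, formatted as "OCCUR field:value".
    static List<String> parse(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            String occur = "SHOULD"; // no prefix means the clause is optional
            String body = token;
            if (token.startsWith("+")) {
                occur = "MUST";
                body = token.substring(1);
            } else if (token.startsWith("-")) {
                occur = "MUST_NOT";
                body = token.substring(1);
            }
            if (body.isEmpty()) {
                continue; // nothing left after the prefix, or an empty query
            }
            // A clause without an explicit field falls back to the default field.
            String clause = body.contains(":") ? body : defaultField + ":" + body;
            clauses.add(occur + " " + clause);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // prints [MUST filename:apache, MUST_NOT content:apache, SHOULD content:java]
        System.out.println(parse("+filename:apache -content:apache java", "content"));
    }
}
```

The last clause shows the role of the default field: java carries no field name, so it is looked up in content, the field passed as the parser's default.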

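Alternative notation:

condition1 AND condition2

condition1 OR condition2

condition1 NOT condition2

Querying with MultiFieldQueryParser

Compared with QueryParser, MultiFieldQueryParser takes an array as the default field, so there can be several default fields. (This is of limited use, since the same result can be achieved with a plain QueryParser by naming the fields in the query string.)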

	public void testMultiFieldQueryParser() throws Exception {
		// argument 1: the default fields to query (an array)
		// argument 2: the analyzer to use
		String[] fields = {"fileName","fileContent"};
		MultiFieldQueryParser multiFieldQuery = new MultiFieldQueryParser(fields, new IKAnalyzer());
		Query query = multiFieldQuery.parse("java");
		IndexSearcher searcher = getIndexSearcher();
		printResult(searcher, query);
		// release resources
		searcher.getIndexReader().close();
	}
