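Solr 8.0.0: Everything You Need to Learn Solr (Standalone \ Cluster \ Chinese Tokenization \ Spring Data Solr)

Reposting is welcome; please credit the source URL: https://blog.csdn.net/qq_41910280

Summary: an introductory tutorial for Solr 8.0.0, covering how to install, deploy, and use SolrCloud, how to configure a Chinese tokenizer, and how to use Spring Data Solr with Spring Boot.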

1. Environment

  OS: Windows 10
  solr: 8.0.0
  IK-Analyzer: 7.7.1 (based on the 2012 release, adds dynamic dictionary support, supports Lucene 7.7.1)
  spring data solr: 4.0.5
  spring boot: 2.1.3.RELEASE

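2. Using SolrCloud

Deploying and starting SolrCloud
  After extracting Solr, go to the bin directory and run the start command: solr.cmd start -e cloud

When you see the following output, the Solr instances have started successfully, but the SolrCloud setup is not finished yet.

Next, the script prints the cluster configuration options, including starting ZooKeeper for us (port 9983) and registering the nodes, plus the choices for shards and replicas. If you have used Kafka, this will feel familiar.
The only option where we do not take the default is the configuration: here we choose sample_techproducts_configs.

Next, open http://localhost:8983/solr/#/
It is the familiar admin page, so no screenshot here.
If you open http://localhost:8983/solr/#/~cloud?view=graph you will see:

Green means active, and N means NRT.

SolrCloud provides three replica types: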

NRT: This is the default. A NRT replica (NRT = NearRealTime) maintains a transaction log and writes new documents to its indexes locally. Any replica of this type is eligible to become a leader. Traditionally, this was the only type supported by Solr.

TLOG: This type of replica maintains a transaction log but does not index document changes locally. This type helps speed up indexing since no commits need to occur in the replicas. When this type of replica needs to update its index, it does so by replicating the index from the leader. This type of replica is also eligible to become a shard leader; it would do so by first processing its transaction log. If it does become a leader, it will behave the same as if it was a NRT type of replica.

PULL: This type of replica does not maintain a transaction log nor index document changes locally. It only replicates the index from the shard leader. It is not eligible to become a shard leader and doesn’t participate in shard leader election at all.

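The main differences between the three types: an NRT replica can be a leader and supports near-real-time indexing (soft commits), and its index is kept in sync by forwarding documents; a TLOG replica can also be a leader and behaves like NRT while it is the leader, but as a follower it does not support near-real-time indexing and syncs by copying index files from the leader; a PULL replica cannot be a leader and only copies index files from the leader.
When a collection is created, replicas use the NRT type by default.

Recommended replica type combinations for SolrCloud:

1. All NRT: suitable for small to medium clusters, or for larger clusters where the update throughput is not too high;

2. All TLOG: real-time indexing is not needed; each shard has many replicas; and every replica should still be able to become the leader;

3. TLOG+PULL: real-time indexing is not needed; each shard has many replicas; query capacity matters more, and briefly outdated results can be tolerated.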
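To make the combinations above concrete, here is a minimal sketch that builds a Collections API CREATE request asking for a TLOG+PULL layout. It only assembles the URL; sending it is left to curl or a browser. The collection name products, the replica counts, and the class name are assumptions for illustration, while nrtReplicas, tlogReplicas, and pullReplicas are the Collections API parameters as I understand them for Solr 7 and later.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds a Collections API CREATE URL with explicit replica types.
final class CreateCollectionUrl {

    // URL-encodes a single parameter value as UTF-8.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String build(String solrBaseUrl, String collection, int numShards,
                        int nrtReplicas, int tlogReplicas, int pullReplicas) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "CREATE");
        params.put("name", collection);
        params.put("numShards", String.valueOf(numShards));
        params.put("nrtReplicas", String.valueOf(nrtReplicas));
        params.put("tlogReplicas", String.valueOf(tlogReplicas));
        params.put("pullReplicas", String.valueOf(pullReplicas));

        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + enc(e.getValue()))
                .collect(Collectors.joining("&"));
        return solrBaseUrl + "/admin/collections?" + query;
    }

    public static void main(String[] args) {
        // Assumed layout: 2 shards, each with 2 TLOG replicas and 1 PULL replica.
        System.out.println(build("http://localhost:8983/solr", "products", 2, 0, 2, 1));
    }
}
```

Keeping at least one TLOG replica per shard matters in this layout, because a PULL replica can never become the leader.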

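Creating the collection and indexing data
On Linux, run: bin/post -c gettingstarted example/exampledocs/*
On Windows, run: java -jar -Dc=gettingstarted -Dauto example\exampledocs\post.jar example\exampledocs\*

Our Solr now has data!
You can open http://localhost:8983/solr/#/gettingstarted/query to query from the UI. Click the Execute Query button and you get the following result:

There are 52 records ("numFound":52).
Alternatively, you can request http://localhost:8983/solr/gettingstarted/select?q=*%3A* directly (%3A is the URL encoding of ":"). The parts marked in red in the screenshot above are the query condition and the paging parameters. For example, if we change q to foundation, the result is as follows:

If you enter id in fl, only the id field is kept in the results.
If you set q to name:*, you get 31 results; the other 21 (52-31) records are filtered out.
If you set q to name:game, the result is as follows (note that matching is case-insensitive):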

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":11,
    "params":{
      "q":"name:Game",
      "_":"1553663260779"}},
  "response":{"numFound":2,"start":0,"maxScore":1.1696943,"docs":[
      {
        "id":"0812550706",
        "cat":["book"],
        "name":"Ender's Game",
        "price":6.99,
        "price_c":"6.99,USD",
        "inStock":true,
        "author":"Orson Scott Card",
        "author_s":"Orson Scott Card",
        "series_t":"Ender",
        "sequence_i":1,
        "genre_s":"scifi",
        "_version_":1629127895195582464,
        "price_c____l_ns":699},
      {
        "id":"0553573403",
        "cat":["book"],
        "name":"A Game of Thrones",
        "price":7.99,
        "price_c":"7.99,USD",
        "inStock":true,
        "author":"George R.R. Martin",
        "author_s":"George R.R. Martin",
        "series_t":"A Song of Ice and Fire",
        "sequence_i":1,
        "genre_s":"fantasy",
        "_version_":1629127895133716480,
        "price_c____l_ns":799}]
  }}
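The same queries can be issued without the UI by building the select URL yourself. The sketch below shows how q, fl, start, and rows end up URL-encoded in the request; it only builds the string and does not contact Solr. The class and method names are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a /select URL for a collection.
final class SelectUrl {

    // URL-encodes a single parameter value as UTF-8 (":" becomes %3A, "," becomes %2C).
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String build(String solrBaseUrl, String collection,
                        String q, String fl, int start, int rows) {
        return solrBaseUrl + "/" + collection + "/select"
                + "?q=" + enc(q)        // query condition
                + "&fl=" + enc(fl)      // fields to return
                + "&start=" + start     // paging offset
                + "&rows=" + rows;      // page size
    }

    public static void main(String[] args) {
        // Same query as above, but returning only id and name.
        System.out.println(build("http://localhost:8983/solr", "gettingstarted",
                "name:game", "id,name", 0, 10));
    }
}
```

Paste the printed URL into a browser or pass it to curl to get the JSON response.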

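You can use spaces to search for multiple terms; try the difference between name:"Samsung+Game" and name:"Samsung".
You can also try the difference between +electronics +music and electronics+music. Note that no field such as name is specified here; + means the term must be present, and - means it must not be present.
For more, see https://lucene.apache.org/solr/guide/7_7/query-syntax-and-parsing.html#query-syntax-and-parsing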

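Other important commands that we will use later
Delete a collection:
bin/solr delete -c yourCollection
Create a new collection:
bin/solr create -c <yourCollection> -s 2 -rf 2
To stop both of the Solr nodes we started, issue this command:
bin/solr stop -all
Start the nodes and connect to ZooKeeper:
solr start -c -p 8983 -s …/example/cloud/node1/solr
solr start -c -p 7574 -s …/example/cloud/node2/solr -z localhost:9983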
Delete a specific document:
bin/post -c yourCollection -d "<delete><id>123</id></delete>"
The "delete-by-query" command:
bin/post -c localDocs -d "<delete><query>*:*</query></delete>"

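Next, let's learn how to modify the schema!
(Solr has two important configuration files: schema.xml and solrconfig.xml)
Use the following command to create a new collection with the given number of shards and replicas. Because no configset is specified, the default _default is used:
bin/solr create -c films -s 2 -rf 2
Open example\films\films.json and you will see that the name of the first film is ".45". Solr's field-type guessing may treat the name field as a floating-point number, and documents inserted afterwards would then fail. To prevent this, it is best to specify the type manually (you can also do this through the configuration file, from the command line, or from a Java client; some of these are covered later, and the configuration file is worth reading up on), as follows: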

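Create a copy field as follows: it takes all data from all fields and indexes it into a field named _text_.
When we queried the earlier documents, we did not have to specify a field to search, because the configuration we used, sample_techproducts_configs, copies fields into a text field, and that field is the default field when no other field is specified in the query.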
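Both schema changes can also be sent to the Schema API as JSON. Here is a minimal sketch, assuming the films collection is reachable at http://localhost:8983/solr/films/schema and that name should be a single-valued, stored text_general field (the text_general type is my assumption, taken from the _default configset). The helper names are invented for this example, and the JSON is built by plain string formatting with no escaping, which is fine only for simple names like these.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds and posts Schema API commands.
final class SchemaCommands {

    // JSON body for the Schema API "add-field" command.
    static String addField(String name, String type, boolean multiValued, boolean stored) {
        return String.format(
                "{\"add-field\":{\"name\":\"%s\",\"type\":\"%s\",\"multiValued\":%b,\"stored\":%b}}",
                name, type, multiValued, stored);
    }

    // JSON body for the Schema API "add-copy-field" command.
    static String addCopyField(String source, String dest) {
        return String.format(
                "{\"add-copy-field\":{\"source\":\"%s\",\"dest\":\"%s\"}}", source, dest);
    }

    // POSTs a JSON command to the schema endpoint and returns the HTTP status code.
    static int post(String schemaUrl, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(schemaUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        String schemaUrl = "http://localhost:8983/solr/films/schema"; // assumed endpoint
        // Pin the type of "name" before indexing so ".45" is not guessed as a float.
        System.out.println(post(schemaUrl, addField("name", "text_general", false, true)));
        // Copy every field into the catch-all _text_ field.
        System.out.println(post(schemaUrl, addCopyField("*", "_text_")));
    }
}
```

Run it once before importing the films data; main needs a running Solr, while the two builder methods are pure string functions.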

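Now let's import the data:
java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.json
Wow, 1100 documents.
Enter comedy as q and you will find 417 records.

Faceting

Tick facet, set field to genre_str, and run the query. You will see how many films use each genre. Part of the result: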

"genre_str":[
        "Drama",552,
        "Comedy",389,
        "Romance Film",270,
        "Thriller",259,
        "Action Film",196,
        "Crime Fiction",170,
        "World cinema",167,
        "Indie film",144,
        "Horror",120,

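This is an important feature. Besides faceting on discrete values, we can also facet on ranges (not supported in the UI; you can use curl), which is more commonly used. A Java implementation appears later.
You can learn more about faceting at https://lucene.apache.org/solr/guide/7_7/faceting.html#faceting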
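A range facet request is driven by facet.range, facet.range.start, facet.range.end, and facet.range.gap. To show what the resulting buckets mean, here is a local illustration in plain Java. It is not how Solr computes facets internally, and it assumes lower-inclusive, upper-exclusive buckets, which I believe matches the default facet.range.include=lower. The class name and the sample prices are made up.

```java
// Hypothetical illustration of range-facet bucketing; runs locally, no Solr involved.
final class RangeFacetSketch {

    /**
     * Counts values into buckets [start + i*gap, start + (i+1)*gap).
     * Values outside [start, end) are ignored.
     */
    static int[] bucketCounts(double[] values, double start, double end, double gap) {
        int buckets = (int) Math.ceil((end - start) / gap);
        int[] counts = new int[buckets];
        for (double v : values) {
            if (v >= start && v < end) {
                counts[(int) ((v - start) / gap)]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] prices = {998.0, 3324.33, 9973.0, 12000.0}; // sample data, assumed
        int[] counts = bucketCounts(prices, 0, 10000, 1000);
        for (int i = 0; i < counts.length; i++) {
            // Print each bucket's lower bound and its count, similar to a facet response.
            System.out.println((i * 1000) + " : " + counts[i]);
        }
    }
}
```

With start 0, end 10000, and gap 1000 there are ten buckets, and the 12000.0 sample falls outside all of them.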

…

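3. Chinese tokenizer

Download link for the Chinese tokenizer
Configuration steps:
1. Put the jar into the webapp/WEB-INF/lib/ directory of the Jetty or Tomcat that runs Solr;
2. Put the 5 configuration files from the resources directory into the webapp/WEB-INF/classes/ directory of the Jetty or Tomcat that runs Solr;
① IKAnalyzer.cfg.xml
② ext.dic
③ stopword.dic
④ ik.conf
⑤ dynamicdic.txt
3. Configure Solr's managed-schema and add the IK tokenizer, for example: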

<!-- IK tokenizer -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

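So should I modify managed-schema in the earlier sample_techproducts_configs, delete the previous gettingstarted collection, then recreate it and re-import the data?
That certainly works, but there may be a smarter way. Below I describe how to update the schema without deleting anything or even restarting.

First, a look at how to use the ZooKeeper bundled with Solr on Windows. An example:
Run solr zk -z localhost:9983 ls /configs

(We can also operate on ZooKeeper through server\scripts\cloud-scripts\zkcli.bat)
The command to upload a configuration is as follows: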

D:\develop\solr\solr-8.0.0\bin>solr zk upconfig -n gettingstarted -d D:\develop\solr\solr-8.0.0\server\solr\configsets\sample_techproducts_configs -z localhost:9983
INFO  - 2019-03-27 17:51:06.912; org.apache.solr.util.configuration.SSLCredentialProviderFactory; Processing SSL Credential Provider chain: env;sysprop
Uploading D:\develop\solr\solr-8.0.0\server\solr\configsets\sample_techproducts_configs\conf for config gettingstarted to ZooKeeper at localhost:9983

It worked, with no reload and certainly no restart! (Adding jars and configuration files to the servlet container still requires a restart.)

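4. Spring Data Solr

SolrJ tutorial: https://lucene.apache.org/solr/guide/7_7/using-solrj.html#using-solrj
Start Solr with solr start, then add a core as follows:

Adding it failed... After a lot of debugging, I found that it can only be created from the command line:
solr create -c mycore
Then add the following to server\solr\mycore\conf\managed-schema (we will use these fields shortly):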

<!-- Fields for testing Spring Data Solr -->
	<field name="item_name" type="text_ik" indexed="true" stored="true"/>
	<field name="item_price" type="pdouble" indexed="true" stored="true"/>
	<field name="item_category" type="string" indexed="true" stored="true" />
	<field name="item_brand" type="string" indexed="true" stored="true" />
	
	<field name="item_keywords" type="text_ik" indexed="true" stored="false" multiValued="true"/>
	<copyField source="item_category" dest="item_keywords"/>
	<copyField source="item_brand" dest="item_keywords"/>
	<copyField source="item_name" dest="item_keywords"/>
	
	<dynamicField name="item_spec_*" type="string" indexed="true" stored="true" />	
	
	<!-- IK tokenizer -->
	 <fieldType name="text_ik" class="solr.TextField">
	  <analyzer type="index">
		  <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
		  <filter class="solr.LowerCaseFilterFactory"/>
	  </analyzer>
	  <analyzer type="query">
		  <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
		  <filter class="solr.LowerCaseFilterFactory"/>
	  </analyzer>
	</fieldType>

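Next, let's check whether this core has the text_ik type we added:

Compare it with another IK build I tried. Clearly the IK we chose tokenizes better; let's cheer for our wise choice!

Create the project. The versions were listed at the start, so I won't repeat them.
Domain class
The two annotations on specMap mark the field as a dynamic field.
Another important point: in addition to the field configuration in the Solr schema, the @Indexed annotation on domain fields can be used to redefine field attributes.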

    package com.example.domain;
    
    import lombok.AllArgsConstructor;
    import lombok.Data;
    import lombok.NoArgsConstructor;
    import org.apache.solr.client.solrj.beans.Field;
    import org.springframework.data.solr.core.mapping.Dynamic;
    import org.springframework.data.solr.core.mapping.SolrDocument;
    
    import java.io.Serializable;
    import java.util.Map;
    
    @AllArgsConstructor
    @Data
    @NoArgsConstructor
    public class Item implements Serializable {
        @Field
        private long id;
    
        @Field("item_name")
        private String name;
    
        @Field("item_price")
        private double price;
    
        @Field("item_category")
        private String category;
    
        @Field("item_brand")
        private String brand;
    
        @Dynamic
        @Field("item_spec_*")
        private Map<String, String> specMap;
    
    }

application.yml configuration

spring:
  data:
    solr:
      host: http://localhost:8983/solr

A way to delete all data, handy for testing; we may need it:

<delete><query>*:*</query></delete>
<commit/>

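Code
This shows part of how to use solrTemplate (solrClient is used similarly); just read the code and comments.
(CRUD + FACET + GROUP + highlight. Highlighting has a problem; if you know the cause, please tell me. Exchanging ideas is the way forward.)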

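    import com.example.domain.Item;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.data.domain.Page;
    import org.springframework.data.solr.core.SolrTemplate;
    import org.springframework.data.solr.core.query.Criteria;
    import org.springframework.data.solr.core.query.HighlightOptions;
    import org.springframework.data.solr.core.query.HighlightQuery;
    import org.springframework.data.solr.core.query.PartialUpdate;
    import org.springframework.data.solr.core.query.Query;
    import org.springframework.data.solr.core.query.SimpleHighlightQuery;
    import org.springframework.data.solr.core.query.SimpleQuery;
    import org.springframework.data.solr.core.query.result.HighlightEntry;
    import org.springframework.data.solr.core.query.result.HighlightPage;
    import org.springframework.test.context.junit4.SpringRunner;

    import java.util.*;

    @RunWith(SpringRunner.class)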
    @SpringBootTest
    public class SolrStartApplicationTests {
        @Autowired
        private SolrTemplate solrTemplate;
    
        @Test
        public void contextLoads() {
            Item item = new Item();
            item.setName("小米9 故宮特別版");
            item.setBrand("小米");
            item.setPrice(9973.0 / 3);
            item.setCategory("手機");
            item.setId(UUID.randomUUID().getLeastSignificantBits());
            Map<String, String> map = new HashMap<>();
            map.put("屏幕尺寸", "5寸,6寸");
            map.put("RAM大小", "8G,12G");
            map.put("ROM大小", "128G,256G");
            item.setSpecMap(map);
    
            Item item2 = new Item();
            item2.setName("華爲P30 pro 保時捷版");
            item2.setBrand("華而不實, 爲所欲爲");
            item2.setPrice(9973.0);
            item2.setCategory("手機");
            item2.setId(UUID.randomUUID().getLeastSignificantBits());
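            // Use a separate map so the first item's specs are not overwritten
            Map<String, String> map2 = new HashMap<>();
            map2.put("屏幕尺寸", "6寸,7寸");
            map2.put("RAM大小", "6G,8G");
            map2.put("ROM大小", "64G,128G");
            item2.setSpecMap(map2);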
    
            solrTemplate.saveBeans("mycore", Arrays.asList(item, item2));
            solrTemplate.commit("mycore");
        }
    
        @Test
        public void testCRUD() {
            // Query by id (this id comes from the author's run; use one that exists in your core)
            Optional<Item> itemOptional = solrTemplate.getById("mycore", "-5362329406904687672", Item.class);
            Item item = itemOptional.get();
            System.out.println(item);
            // Full update
            item.setPrice(998);
            solrTemplate.saveBean("mycore", item);
            solrTemplate.commit("mycore");
            // Partial update
            PartialUpdate update = new PartialUpdate("id", "-5329718594856453427");
            update.add("item_category", "除了手機外, 小米還有暖寶寶的功能哦 ~ ");
            solrTemplate.saveBean("mycore", update);
            solrTemplate.commit("mycore");
            // Query (or delete) by criteria
            Query query = new SimpleQuery();
            Criteria criteria = new Criteria("item_brand").contains("小米");
            criteria.and("item_name").contains("小米");
            query.addCriteria(criteria);
            Page<Item> items = solrTemplate.query("mycore", query, Item.class);
            for (Item item1 : items) {
                System.out.println(item1);
            }
            System.out.println("--------------");
            // Remember the copy field from earlier?
            Query query2 = new SimpleQuery();
            Criteria criteria2 = new Criteria("item_keywords").contains("手機");
            query2.addCriteria(criteria2);
            Page<Item> items2 = solrTemplate.query("mycore", query2, Item.class);
            for (Item item2 : items2) {
                System.out.println(item2);
            }
            // Delete by query
            solrTemplate.delete("mycore", query, Item.class);
            solrTemplate.commit("mycore");
        }
    
    
        @Test
        public void testFacetAndGroupAndHighLight() {
            // facet
            // discrete values (field facet)
    //        FacetQuery query = new SimpleFacetQuery(new Criteria(Criteria.WILDCARD).expression(Criteria.WILDCARD))
    //                .setFacetOptions(new FacetOptions().addFacetOnField("item_category").setFacetLimit(5));
    //        FacetPage<Item> page = solrTemplate.queryForFacetPage("mycore", query, Item.class);
    //        for (Page<FacetFieldEntry> facetResultPage : page.getFacetResultPages()) {
    //            facetResultPage.forEach(facetFieldEntry -> System.out.println(facetFieldEntry.getValue()
    //                    + " : " + facetFieldEntry.getValueCount()));
    //        }
            // range facet
    //        FacetQuery query = new SimpleFacetQuery(new SimpleStringCriteria("*:*"))
    //                .setFacetOptions(new FacetOptions().addFacetByRange(
    //                        new FacetOptions.FieldWithNumericRangeParameters("item_price", 0, 10000,
    //                                1000) .setHardEnd(true)
    //                                .setInclude(FacetParams.FacetRangeInclude.ALL)).setFacetMinCount(1).setFacetLimit(10));
    //        FacetPage<Item> page = solrTemplate.queryForFacetPage("mycore", query, Item.class);
    //        for (FacetFieldEntry facetFieldEntry : page.getRangeFacetResultPage("item_price")) {
    //            System.out.println(facetFieldEntry.getValue() + " : " + facetFieldEntry.getValueCount());
    //        }
    
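            // group (group returns the results already grouped, whereas facet returns the group counts separately from the results)
            /*
            Results can be grouped in the following 3 ways; several can be used together, but only one is used here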
            Field field = new SimpleField("popularity");
            Function func = ExistsFunction.exists("description");
            Query query = new SimpleQuery("inStock:true");
             */
    //        Field field = new SimpleField("item_category");
    //        SimpleQuery query = new SimpleQuery(new SimpleStringCriteria("*:*"));
    //        GroupOptions groupOptions =
    //                new GroupOptions().addGroupByField(field).setOffset(0).setLimit(10);// fetch only the first 10 per group
    //        query.setGroupOptions(groupOptions);
    //
    //        GroupPage<Item> page = solrTemplate.queryForGroupPage("mycore", query, Item.class);
    //        GroupResult<Item> fieldGroup = page.getGroupResult(field);
    //        Page<GroupEntry<Item>> groupEntries = fieldGroup.getGroupEntries();
    //        for (GroupEntry<Item> groupEntry : groupEntries) {
    //            System.out.println(groupEntry.getGroupValue() + "-----------------");
    //            groupEntry.getResult().forEach(System.out::println);
    //            System.out.println("----------------------------------");
    //        }
    
            // highlight: emphasize the matched keywords in the results
            String keyword = "手機真好";
    //        SimpleHighlightQuery query = new SimpleHighlightQuery(new SimpleStringCriteria(
    //                "item_keywords:" + keyword));
            HighlightQuery query = new SimpleHighlightQuery();
            Criteria criteria=new Criteria("item_keywords").is(keyword);
            query.addCriteria(criteria);
    
            HighlightOptions highlightOptions =
                    new HighlightOptions().addHighlightParameter("hl.highlightMultiTerm","true").addField("item_category")/*
                    .addFields(Arrays.asList
                    ("item_category","item_name","item_brand"))*/.setSimplePrefix(
                            "<b>").setSimplePostfix("</b>");
            // addHighlightParameter() can also be used to choose the fields to highlight
            query.setHighlightOptions(highlightOptions);
            HighlightPage<Item> page = solrTemplate.queryForHighlightPage("mycore", query,
                    Item.class);
            for (HighlightEntry<Item> h : page.getHighlighted()) {
                Item item = h.getEntity();
                List<HighlightEntry.Highlight> highlights = h.getHighlights();
                for (HighlightEntry.Highlight highlight : highlights) {
                    System.out.print(highlight.getField() + " === ");
                    highlight.getSnipplets().forEach(s -> System.out.print(s + " | "));
                }
                System.out.println(item);
            }
    
        }
    }

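Did you notice that even though the field names are already declared in annotations, we still had to write them by hand (copy fields and dynamic fields aside)? Next comes the Solr repository, which simplifies things; anyone who has used JPA should find it easy to understand.

First write an interface that extends SolrCrudRepository: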


public interface ItemRepository extends SolrCrudRepository<Item, Long> {
}

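Add @EnableSolrRepositories("com.example.dao") to the application class (optional; Spring Boot auto-configuration tries dozens of configurations at startup)
Add @SolrDocument(collection = "mycore") to the Item entity class

Test code
Add these methods to the interface: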

    @Query("item_keywords:*?0* AND item_price:[?1 TO ?2]")
    public List<Item> findByQueryAnnotation(String keyword, double bottomPrice, double topPrice,
                                            PageRequest pageRequest);
    
    @Highlight(prefix = "<b>", postfix = "</b>",fields = {"item_category","item_brand","item_name"})
    @Query("item_keywords:?0")
    public HighlightPage<Item> findByItemKeywods(String keyword, Pageable pageable);

@Test
public void testRepository() {
    // Paged query
    Page<Item> all = repository.findAll(PageRequest.of(0, 20));
    for (Item item : all) {
        System.out.println(item);
    }
    // Sorting: several properties can be given, and multiple Sorts can be combined with and()
    Sort sort = new Sort(Sort.Direction.ASC, "price");// the Solr repository uses the Java field name
    Iterable<Item> sortAll = repository.findAll(sort);
    for (Item item : sortAll) {
        System.out.println(item);
    }
    // Keyword match + range filter + paging + sorting (without a sort, results are ordered by query score)
    List<Item> querys = repository.findByQueryAnnotation("水果", 0, 1.0 / 0,// 1.0/0 is positive infinity
            PageRequest.of(1, 10, sort));
    for (Item aQuery : querys) {
        System.out.println(aQuery);
    }
    // Highlight query
    HighlightPage<Item> items = repository.findByItemKeywods("手機", PageRequest.of(0, 10));

    // Create, update, delete: omitted

}
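The 1.0 / 0 argument in the test above evaluates to Double.POSITIVE_INFINITY in Java. The source does not show how that value is rendered into the ?2 placeholder, so as a hedged alternative, here is a small sketch that builds the range clause by hand and uses Solr's * wildcard for an unbounded end. The class and method names are hypothetical, and this is not part of Spring Data Solr.

```java
// Hypothetical helper: renders a Solr range clause, using "*" for infinite bounds.
final class PriceRange {

    static String clause(String field, double lower, double upper) {
        return field + ":[" + bound(lower) + " TO " + bound(upper) + "]";
    }

    // An infinite bound (for example 1.0 / 0) is written as the open-ended wildcard.
    private static String bound(double value) {
        return Double.isInfinite(value) ? "*" : Double.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(clause("item_price", 0, 1.0 / 0));
        System.out.println(clause("item_price", 1000, 5000));
    }
}
```

A clause built this way could be passed to a plain string query such as the SimpleStringCriteria usage shown earlier.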

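To learn more about Spring Data Solr, read https://docs.spring.io/spring-data/solr/docs/4.0.5.RELEASE/reference/html/#solr.misc

5. Open issues

Chinese characters in dynamic field names are turned into underscores on insert
BigDecimal no longer seems to be supported (with Solr 4 it could be converted directly to double)
The highlight problem

I hope someone who knows the answers will kindly share them.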
