Elasticsearch Core Technology and Practice (Elasticsearch核心技術與實戰) Study Notes, Chapter 3, Lesson 13: Using Analyzers for Tokenization

I. Preface

  This post is part of my study-notes series for the Geektime (極客時間) course Elasticsearch Core Technology and Practice.

II. Theory

1. Analysis and Analyzer

  You can only search for terms that actually appear in the index, so the indexed text and the query string must be normalized into the same form. This process of tokenization and normalization is called analysis.

Analysis, i.e. text analysis, is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is performed by an Analyzer. Elasticsearch comes with multiple built-in analyzers, and if they don't meet your needs you can define custom analyzers. Besides converting text into terms when data is written, the same analyzer must also be applied to the query string when matching a Query.
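This matters for mappings too. As a minimal sketch (the index name my_index and the field title are made up for illustration): the analyzer configured on a text field is used when indexing that field, and by default the same one analyzes query strings against it; search_analyzer overrides just the search-time side.

#analyzer vs search_analyzer (hypothetical index and field)
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}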

2. The Components of an Analyzer

  • Character Filters (preprocess the raw text; for example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789))
  • Tokenizer (splits the text into individual terms according to rules; for example, it turns the text "Quick brown fox!" into the terms [Quick, brown, fox!], and it also records each term's position and character offsets)
  • Token Filters (post-process the emitted terms: lowercasing, removing stopwords, adding synonyms, and so on; a sketch that wires all three stages together follows below)
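To see the three stages working together, here is a minimal custom-analyzer sketch (the index name my_custom_index and the sample text are invented): the built-in html_strip character filter removes markup, the standard tokenizer splits words, and the lowercase and stop token filters normalize the terms.

#custom analyzer: char filter + tokenizer + token filters (hypothetical index)
PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}

GET my_custom_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>The 2 QUICK Brown-Foxes!</p>"
}

This should yield [2, quick, brown, foxes]: the markup and the stop word "the" are gone, and everything is lowercased.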

3. Built-in Analyzers

  Elasticsearch comes with a set of ready-made analyzers, among them Standard, Simple, Stop, Whitespace, Keyword, Pattern, and a family of language analyzers; each of these is exercised below.

III. Hands-on

Standard Analyzer

  • The default analyzer
  • Splits on word boundaries
  • Lowercases terms
#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

That's quite long, so for the analyzers below only the filtered token lists are shown.

Simple Analyzer

  • Splits on any non-letter character; the non-letters themselves are discarded
  • Lowercases terms
#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]

Stop Analyzer

  • Lowercases terms
  • Filters out stop words (the, a, is, ...)
#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[quick,brown,foxes,jumped,over,lazy,dog,s,bone]
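The stop word list is configurable. A minimal sketch (the index name my_stop_index and the stopword choices are invented):

#custom stop analyzer with a user-defined stopword list (hypothetical index)
PUT my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "the", "over", "lazy" ]
        }
      }
    }
  }
}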

Whitespace Analyzer

  • Splits on whitespace only, with no lowercasing
#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone.]

Keyword Analyzer

  • Does not tokenize; the whole input is emitted as a single term
#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]

Pattern Analyzer

  • Tokenizes with a regular expression
  • The default pattern is \W+ (split on non-word characters)
#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,2,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]
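The pattern is configurable as well. A minimal sketch of a comma-splitting variant (the index name my_pattern_index is invented):

#custom pattern analyzer splitting on commas (hypothetical index)
PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}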

Language Analyzer

  • Elasticsearch ships analyzers for many languages (e.g. english, french, german); the english analyzer removes English stop words and the possessive 's, and stems each term

#english
GET _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[2,quick,brown,fox,jump,over,lazi,dog,bone]

(Note the stemming: Foxes → fox, jumped → jump, lazy → lazi; the stop word "the" and the possessive 's are removed.)

Chinese tokenization is much harder than English tokenization: English words are delimited by spaces, while segmenting Chinese correctly usually requires understanding the context.

For example, [蘋果,不大好吃] ("the apple is not very tasty") and [蘋果,不大,好吃] ("the apple is not big, and it is tasty") are the same characters segmented two ways, and the two readings mean different things.

 

Let's try the IK analyzer.

[root@bad4478163c8 elasticsearch]# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
[=================================================] 100%
Exception in thread "main" java.lang.IllegalArgumentException: Plugin [analysis-ik] was built for Elasticsearch version 7.1.0 but version 7.2.0 is running
	at org.elasticsearch.plugins.PluginsService.verifyCompatibility(PluginsService.java:346)
	at org.elasticsearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:718)
	at org.elasticsearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:793)
	at org.elasticsearch.plugins.InstallPluginCommand.install(InstallPluginCommand.java:776)
	at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:231)
	at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:216)
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
	at org.elasticsearch.cli.Command.main(Command.java:90)
	at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)

Always check the version: the plugin version must match the Elasticsearch version exactly, otherwise you hit the error above.

For Elasticsearch running in Docker, you first need to get into the container: docker exec -it es72_01 bash, then run the install from inside.
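A sketch of the version-matched install, assuming the plugin project publishes a v7.2.0 artifact under the same GitHub releases path as the v7.1.0 one above:

docker exec -it es72_01 bash
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip
exit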

 

Since this is a cluster, the plugin has to be installed on the other nodes in the same way.

After the installation completes, restart the cluster: docker-compose restart

Visit:

http://localhost:9200/_cat/plugins

Output:

es72_01 analysis-ik 7.2.0
es72_02 analysis-ik 7.2.0

Now let's verify the analyzers:

POST _analyze
{
  "analyzer": "standard",
  "text": "他說的確實在理”"
}

Output:

[他,說,的,確,實,在,理]

The standard analyzer simply splits Chinese text into individual characters. Switching to the IK analyzer:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "他說的確實在理”"
}

{
  "tokens" : [
    {
      "token" : "他",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "說",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的確",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "實",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "在理",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
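Note that the IK plugin actually ships two analyzers: ik_smart, used above, which produces the coarsest single segmentation, and ik_max_word, which exhaustively emits every word it can recognize. Comparing them is a one-line change:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "他說的確實在理"
}

ik_max_word favours recall (more, possibly overlapping terms), while ik_smart favours a single clean reading.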

Either way this is clearly better than the character-by-character output of the standard analyzer, although ik_smart still picks 的確 where 確實 would be the natural word, which shows how ambiguous Chinese segmentation can be. There are many Chinese analyzers to choose from; the course itself demonstrates Chinese segmentation with the ICU Analyzer, so pick whichever suits you.
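For completeness, a sketch of the ICU route, assuming the official analysis-icu plugin has been installed on every node (bin/elasticsearch-plugin install analysis-icu) and the cluster restarted:

#icu
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他說的確實在理"
}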

IV. Summary

This post walked through the Analyzer in detail: what it is made of and how the built-in ES analyzers behave. Chinese analysis has much more worth studying; Elasticsearch supports custom tokenization, which can be tuned to fit your own business scenario.

 
