Elasticsearch Core Technology and Practice (Elasticsearch核心技術與實戰) Study Notes, Chapter 3, Lesson 13: Using Analyzers for Tokenization

I. Preface

  This post is part of my study-notes series for the Geektime (極客時間) course Elasticsearch Core Technology and Practice.

II. Theory

1. Analysis and Analyzer

  You can only search for terms that actually appear in the index, so the indexed text and the query string must be normalized into the same form. This process of tokenization and normalization is called analysis.

Analysis, i.e. text analysis, is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is performed by an Analyzer. Elasticsearch comes with multiple built-in analyzers, and if they don't meet your needs you can define custom analyzers. Besides converting text into terms when data is written, the same analyzer must also be applied to the query string when matching a Query.
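This matters for mappings too. As a minimal sketch (the index name my_index and the field title are made up for illustration): the analyzer configured on a text field is used when indexing that field, and by default the same one analyzes query strings against it; search_analyzer overrides just the search-time side.

#analyzer vs search_analyzer (hypothetical index and field)
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}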

2. The Components of an Analyzer

  • Character Filters (preprocess the raw text; for example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789))
  • Tokenizer (splits the text into individual terms according to rules; for example, it turns the text "Quick brown fox!" into the terms [Quick, brown, fox!], and it also records each term's position and character offsets)
  • Token Filters (post-process the emitted terms: lowercasing, removing stopwords, adding synonyms, and so on; a sketch that wires all three stages together follows below)
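To see the three stages working together, here is a minimal custom-analyzer sketch (the index name my_custom_index and the sample text are invented): the built-in html_strip character filter removes markup, the standard tokenizer splits words, and the lowercase and stop token filters normalize the terms.

#custom analyzer: char filter + tokenizer + token filters (hypothetical index)
PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}

GET my_custom_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>The 2 QUICK Brown-Foxes!</p>"
}

This should yield [2, quick, brown, foxes]: the markup and the stop word "the" are gone, and everything is lowercased.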

3. Built-in Analyzers

  Elasticsearch comes with a set of ready-made analyzers, among them Standard, Simple, Stop, Whitespace, Keyword, Pattern, and a family of language analyzers; each of these is exercised below.

III. Hands-on

Standard Analyzer

  • The default analyzer
  • Splits on word boundaries
  • Lowercases terms
#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

That's quite long, so for the analyzers below only the filtered token lists are shown.

Simple Analyzer

  • Splits on any non-letter character; the non-letters themselves are discarded
  • Lowercases terms
#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]

Stop Analyzer

  • Lowercases terms
  • Filters out stop words (the, a, is, ...)
#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[quick,brown,foxes,jumped,over,lazy,dog,s,bone]
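The stop word list is configurable. A minimal sketch (the index name my_stop_index and the stopword choices are invented):

#custom stop analyzer with a user-defined stopword list (hypothetical index)
PUT my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": [ "the", "over", "lazy" ]
        }
      }
    }
  }
}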

Whitespace Analyzer

  • Splits on whitespace only, with no lowercasing
#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone.]

Keyword Analyzer

  • Does not tokenize; the whole input is emitted as a single term
#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]

Pattern Analyzer

  • Tokenizes with a regular expression
  • The default pattern is \W+ (split on non-word characters)
#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[the,2,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]
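The pattern is configurable as well. A minimal sketch of a comma-splitting variant (the index name my_pattern_index is invented):

#custom pattern analyzer splitting on commas (hypothetical index)
PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}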

Language Analyzer

  • Elasticsearch ships analyzers for many languages (e.g. english, french, german); the english analyzer removes English stop words and the possessive 's, and stems each term

#english
GET _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[2,quick,brown,fox,jump,over,lazi,dog,bone]

(Note the stemming: Foxes → fox, jumped → jump, lazy → lazi; the stop word "the" and the possessive 's are removed.)

Chinese tokenization is much harder than English tokenization: English words are delimited by spaces, while segmenting Chinese correctly usually requires understanding the context.

For example, [蘋果,不大好吃] ("the apple is not very tasty") and [蘋果,不大,好吃] ("the apple is not big, and it is tasty") are the same characters segmented two ways, and the two readings mean different things.

 

Let's try the IK analyzer.

[root@bad4478163c8 elasticsearch]# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
[=================================================] 100%
Exception in thread "main" java.lang.IllegalArgumentException: Plugin [analysis-ik] was built for Elasticsearch version 7.1.0 but version 7.2.0 is running
	at org.elasticsearch.plugins.PluginsService.verifyCompatibility(PluginsService.java:346)
	at org.elasticsearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:718)
	at org.elasticsearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:793)
	at org.elasticsearch.plugins.InstallPluginCommand.install(InstallPluginCommand.java:776)
	at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:231)
	at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:216)
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
	at org.elasticsearch.cli.Command.main(Command.java:90)
	at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)

Always check the version: the plugin version must match the Elasticsearch version exactly, otherwise you hit the error above.

For Elasticsearch running in Docker, you first need to get into the container: docker exec -it es72_01 bash, then run the install from inside.
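A sketch of the version-matched install, assuming the plugin project publishes a v7.2.0 artifact under the same GitHub releases path as the v7.1.0 one above:

docker exec -it es72_01 bash
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip
exit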

 

Since this is a cluster, the plugin has to be installed on the other nodes in the same way.

After the installation completes, restart the cluster: docker-compose restart

Visit:

http://localhost:9200/_cat/plugins

Output:

es72_01 analysis-ik 7.2.0
es72_02 analysis-ik 7.2.0

Now let's verify the analyzers:

POST _analyze
{
  "analyzer": "standard",
  "text": "他說的確實在理”"
}

Output:

[他,說,的,確,實,在,理]

The standard analyzer simply splits Chinese text into individual characters. Switching to the IK analyzer:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "他說的確實在理”"
}

{
  "tokens" : [
    {
      "token" : "他",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "說",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的確",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "實",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "在理",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
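Note that the IK plugin actually ships two analyzers: ik_smart, used above, which produces the coarsest single segmentation, and ik_max_word, which exhaustively emits every word it can recognize. Comparing them is a one-line change:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "他說的確實在理"
}

ik_max_word favours recall (more, possibly overlapping terms), while ik_smart favours a single clean reading.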

Either way this is clearly better than the character-by-character output of the standard analyzer, although ik_smart still picks 的確 where 確實 would be the natural word, which shows how ambiguous Chinese segmentation can be. There are many Chinese analyzers to choose from; the course itself demonstrates Chinese segmentation with the ICU Analyzer, so pick whichever suits you.
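For completeness, a sketch of the ICU route, assuming the official analysis-icu plugin has been installed on every node (bin/elasticsearch-plugin install analysis-icu) and the cluster restarted:

#icu
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他說的確實在理"
}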

IV. Summary

This post walked through the Analyzer in detail: what it is made of and how the built-in ES analyzers behave. Chinese analysis has much more worth studying; Elasticsearch supports custom tokenization, which can be tuned to fit your own business scenario.

 
