1. Preface
This post is part of my study-notes series for the GeekTime (極客時間) course "Elasticsearch Core Technologies and Practice".
2. Theory
1. Analysis and Analyzers
You can only search for terms that actually exist in the index, so the indexed text and the query string must be normalized into the same form. This process of tokenization and normalization is called analysis.
Analysis is the process of converting full text into a series of terms (tokens); it is also referred to as tokenization. Analysis is carried out by an Analyzer. Elasticsearch ships with a number of built-in analyzers, and if none of them fits you can define a custom analyzer for your own needs. Besides converting text into terms at index time, the same analyzer must also be applied to the query string when matching query clauses.
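For instance, the analyzer used at index time and at search time can be declared per field in the mapping. A minimal sketch (the index name my_index and field title are placeholders, not from the course):

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}
```

If "search_analyzer" is omitted, Elasticsearch applies the index-time "analyzer" to queries as well, which is usually what you want.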
2. Components of an Analyzer
- Character Filters (pre-process the raw text; for example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789))
- Tokenizer (splits the text into terms according to some rules; it turns the text "Quick brown fox!" into the terms [Quick, brown, fox!], and also records each term's position and character offsets)
- Token Filters (post-process the emitted terms: lowercasing, removing stopwords, adding synonyms, and so on)
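These three components can be combined into a custom analyzer in the index settings. A minimal sketch (my_index and my_analyzer are placeholder names; the html_strip char filter and lowercase/stop token filters are standard built-ins):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```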
3. Built-in Analyzers
3. Hands-on
Standard Analyzer
- The default analyzer
- Splits on word boundaries
- Lowercases terms
#standard
GET _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
The full output is verbose, so the examples below show only the resulting token lists.
Simple Analyzer
- Splits on non-letter characters; non-letters are discarded
- Lowercases terms
#simple
GET _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
[the,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]
Stop Analyzer
- Lowercases terms
- Filters out stopwords (the, a, is, …)
GET _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
[quick,brown,foxes,jumped,over,lazy,dog,s,bone]
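The stop analyzer's stopword list is configurable in the index settings. A hedged sketch (my_index and my_stop_analyzer are placeholder names; the stopwords shown are arbitrary):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over", "lazy"]
        }
      }
    }
  }
}
```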
Whitespace Analyzer
- Splits on whitespace only
#whitespace
GET _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
[The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone.]
Keyword Analyzer
- No tokenization; the whole input is emitted as a single term
#keyword
GET _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
Pattern Analyzer
- Tokenizes using a regular expression
- The default pattern is \W+ (split on non-word characters)
GET _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
[the,2,quick,brown,foxes,jumped,over,the,lazy,dog,s,bone]
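The pattern is configurable; for example, a custom pattern analyzer that splits on commas could look like this (a sketch; my_index and comma_analyzer are placeholder names):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
```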
Language Analyzer
#english
GET _analyze
{
"analyzer": "english",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
[2,quick,brown,fox,jump,over,lazi,dog,bone]
Chinese is much harder to tokenize than English. English words are separated by spaces, while segmenting Chinese correctly usually requires understanding the context.
For example, [蘋果,不大好吃] ("apples are not very tasty") and [蘋果,不大,好吃] ("apples are small, and tasty") segment the same characters with different meanings.
Let's try the IK analyzer.
[root@bad4478163c8 elasticsearch]# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
[=================================================] 100%
Exception in thread "main" java.lang.IllegalArgumentException: Plugin [analysis-ik] was built for Elasticsearch version 7.1.0 but version 7.2.0 is running
at org.elasticsearch.plugins.PluginsService.verifyCompatibility(PluginsService.java:346)
at org.elasticsearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:718)
at org.elasticsearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:793)
at org.elasticsearch.plugins.InstallPluginCommand.install(InstallPluginCommand.java:776)
at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:231)
at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:216)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
at org.elasticsearch.cli.Command.main(Command.java:90)
at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)
Pay close attention to version compatibility, or you will hit the error above.
For Elasticsearch running in Docker, first enter the container: docker exec -it es72_01 bash
Since this is a cluster, install the plugin on the other nodes the same way.
After installation, restart the cluster: docker-compose restart
Then visit:
http://localhost:9200/_cat/plugins
Output:
es72_01 analysis-ik 7.2.0
es72_02 analysis-ik 7.2.0
Now let's verify the analyzer:
POST _analyze
{
"analyzer": "standard",
"text": "他說的確實在理"
}
Output:
[他,說,的,確,實,在,理]
Now switch to the IK analyzer:
POST _analyze
{
"analyzer": "ik_smart",
"text": "他說的確實在理"
}
{
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "說",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "的確",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "實",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "在理",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
}
]
}
This is clearly better; of course there are many Chinese analyzers to choose from. The course demonstrates the ICU Analyzer for Chinese, so pick whichever suits your needs.
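The IK plugin also ships a finer-grained mode, ik_max_word, which emits every plausible word rather than the coarsest segmentation that ik_smart produces; it is worth comparing the two on the same text:

```json
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "他說的確實在理"
}
```

ik_max_word is typically used at index time (to maximize recall) and ik_smart at search time (to keep queries precise).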
4. Summary
This post walked through Analyzers in detail and showed how the built-in Elasticsearch analyzers work. Chinese analysis has much more worth studying. Elasticsearch also supports custom analyzers, which you can tailor to your own business scenario.