A question that comes up again and again is why a particular document is not found by a search. In many cases the cause is a problem in the mapping definition or in the analysis configuration.
1: The analysis pipeline
The overall flow is roughly: text ==> Character Filter (pre-processing) ==> Tokenizer (tokenization) ==> Token Filter (further processing of the tokens).
- The text or document first passes through the Character Filters. Their job is to pre-process the raw text, for example replacing every "&" with "and", or stripping out "?" characters.
- Next comes the crucial Tokenizer stage. The tokenizer splits the text into terms. For example, given "tom is a good doctor .", and assuming a character filter has already removed the period ".", the tokenizer produces the tokens "tom", "is", "a", "good", "doctor".
- Finally, the resulting token stream passes through the Token Filters, which operate on the already-produced tokens, for example splitting "tom" further into "t", "o", "m". Whatever comes out of this stage is the final set of tokens.
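The three stages above can be sketched as a toy pipeline in plain Python. This is only a stand-in for illustration, not Elasticsearch's implementation: the character filter applies the "&" and "." substitutions from the examples above, the tokenizer splits on whitespace, and the token filter lowercases each token.

```python
def char_filter(text):
    # Character-filter stage: pre-process the raw text, e.g. replace
    # "&" with "and" and strip periods (the examples from the text above).
    return text.replace("&", "and").replace(".", "")

def tokenize(text):
    # Tokenizer stage: split the text into tokens (here: on whitespace,
    # a simple stand-in for a real tokenizer).
    return text.split()

def token_filter(tokens):
    # Token-filter stage: operate on the already-produced tokens
    # (here: lowercase each one, a typical token filter).
    return [t.lower() for t in tokens]

def analyze(text):
    # The full pipeline: char filter -> tokenizer -> token filter.
    return token_filter(tokenize(char_filter(text)))

print(analyze("Tom is a good doctor ."))
# ['tom', 'is', 'a', 'good', 'doctor']
```

Each stage is a plain function over text or tokens, which is exactly why Elasticsearch lets you mix and match them independently.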
2: Custom Analyzer
In short, a custom analyzer is one you define yourself: it combines exactly one tokenizer with zero or more token filters and zero or more char filters. The name of a custom analyzer must not begin with "_".

The following settings can be configured for a `custom` analyzer type:
Setting | Description |
---|---|
`tokenizer` | A built-in or registered tokenizer. |
`filter` | Built-in or registered token filters. |
`char_filter` | Built-in or registered character filters. |
`position_increment_gap` | Extra gap in token positions between the values of a multi-valued field, so that proximity queries do not match across values; defaults to 100. |
```yaml
settings:
  index:
    analysis:
      analyzer:
        myAnalyzer2:
          type: custom
          tokenizer: myTokenizer1
          filter: [myTokenFilter1, myTokenFilter2]
          char_filter: [my_html]
          position_increment_gap: 256
      tokenizer:
        myTokenizer1:
          type: standard
          max_token_length: 900
      filter:
        myTokenFilter1:
          type: stop
          stopwords: [stop1, stop2, stop3, stop4]
        myTokenFilter2:
          type: length
          min: 0
          max: 2000
      char_filter:
        my_html:
          type: html_strip
          escaped_tags: [xxx, yyy]
          read_ahead: 1024
```
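For reference, the same configuration expressed as JSON, as you would send it in the body of an index-creation request. This is a field-for-field transcription of the YAML above, not a new configuration:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "myAnalyzer2": {
            "type": "custom",
            "tokenizer": "myTokenizer1",
            "filter": ["myTokenFilter1", "myTokenFilter2"],
            "char_filter": ["my_html"],
            "position_increment_gap": 256
          }
        },
        "tokenizer": {
          "myTokenizer1": {
            "type": "standard",
            "max_token_length": 900
          }
        },
        "filter": {
          "myTokenFilter1": {
            "type": "stop",
            "stopwords": ["stop1", "stop2", "stop3", "stop4"]
          },
          "myTokenFilter2": {
            "type": "length",
            "min": 0,
            "max": 2000
          }
        },
        "char_filter": {
          "my_html": {
            "type": "html_strip",
            "escaped_tags": ["xxx", "yyy"],
            "read_ahead": 1024
          }
        }
      }
    }
  }
}
```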
3: Analyzers built into ES
| analyzer | logical name | description |
| ----------------------|:-------------:| :-----------------------------------------|
| standard analyzer | standard | standard tokenizer, standard filter, lower case filter, stop filter |
| simple analyzer | simple | lower case tokenizer |
| stop analyzer | stop | lower case tokenizer, stop filter |
| keyword analyzer | keyword | no tokenization; the whole input becomes a single token (not_analyzed) |
| pattern analyzer | pattern | splits on a regular expression; matches \W+ by default |
| language analyzers | [lang](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html) | analyzers for individual languages |
| snowball analyzer | snowball | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter |
| custom analyzer | custom | one tokenizer, zero or more token filters, zero or more char filters |
tokenizer: the tokenizers built into ES.
| tokenizer | logical name | description |
| ----------------------|:-------------:| :-------------------------------------|
| standard tokenizer | standard | |
| edge ngram tokenizer | edgeNGram | |
| keyword tokenizer | keyword | no tokenization |
| letter tokenizer | letter | splits on non-letter characters |
| lowercase tokenizer | lowercase | letter tokenizer, lower case filter |
| ngram tokenizer | nGram | |
| whitespace tokenizer | whitespace | splits on whitespace |
| pattern tokenizer | pattern | splits on a configurable regular expression |
| uax email url tokenizer| uax_url_email | keeps URLs and email addresses as single tokens |
| path hierarchy tokenizer| path_hierarchy| handles path-like strings such as `/path/to/something` |
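Two of the less obvious entries can be illustrated with rough pure-Python approximations (sketches, not the ES implementations): `path_hierarchy` emits every ancestor prefix of a path, and `pattern` splits on a regular expression that matches the separators (default `\W+`).

```python
import re

def path_hierarchy_tokenize(text, delimiter="/"):
    # Approximates the path_hierarchy tokenizer: emit each ancestor
    # prefix of the path, from shortest to the full path.
    parts = text.split(delimiter)
    # A leading delimiter yields an empty first part; skip the
    # token that would consist of that empty part alone.
    start = 2 if parts and parts[0] == "" else 1
    return [delimiter.join(parts[:i]) for i in range(start, len(parts) + 1)]

def pattern_tokenize(text, pattern=r"\W+"):
    # Approximates the pattern tokenizer: the regex matches the
    # separators between tokens, not the tokens themselves.
    return [t for t in re.split(pattern, text) if t]

print(path_hierarchy_tokenize("/path/to/something"))
# ['/path', '/path/to', '/path/to/something']
print(pattern_tokenize("foo, bar-baz!"))
# ['foo', 'bar', 'baz']
```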
token filter: the token filters built into ES.
| token filter | logical name | description |
| ----------------------|:-------------:| :-------------------------------------|
| standard filter | standard | |
| ascii folding filter | asciifolding | |
| length filter | length | removes tokens that are too long or too short |
| lowercase filter | lowercase | lowercases tokens |
| ngram filter | nGram | |
| edge ngram filter | edgeNGram | |
| porter stem filter | porterStem | Porter stemming algorithm |
| shingle filter | shingle | builds shingles (combinations of adjacent tokens) |
| stop filter | stop | removes stop words |
| word delimiter filter | word_delimiter| splits a token into sub-tokens |
| stemmer token filter | stemmer | |
| stemmer override filter| stemmer_override| |
| keyword marker filter | keyword_marker| |
| keyword repeat filter | keyword_repeat| |
| kstem filter | kstem | |
| snowball filter | snowball | |
| phonetic filter | phonetic | [plugin](https://github.com/elasticsearch/elasticsearch-analysis-phonetic) |
| synonym filter | synonym | handles synonyms |
| compound word filter | dictionary_decompounder, hyphenation_decompounder | decomposes compound words |
| reverse filter | reverse | reverses each token |
| elision filter | elision | removes elisions |
| truncate filter | truncate | truncates tokens |
| unique filter | unique | |
| pattern capture filter| pattern_capture| |
| pattern replace filter | pattern_replace| replaces via regular expression |
| trim filter | trim | trims surrounding whitespace |
| limit token count filter| limit | limits the number of tokens |
| hunspell filter | hunspell | Hunspell dictionary-based stemming |
| common grams filter | common_grams | |
| normalization filter | arabic_normalization, persian_normalization | |
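To make two of the entries above concrete, here is a rough pure-Python approximation (not the ES implementation) of the `edgeNGram` and `shingle` filters; parameter names mirror the ES settings but the behavior is simplified:

```python
def edge_ngram_filter(tokens, min_gram=1, max_gram=3):
    # Approximates the edge ngram token filter: emit prefixes of each
    # token with lengths between min_gram and max_gram.
    out = []
    for tok in tokens:
        for n in range(min_gram, min(max_gram, len(tok)) + 1):
            out.append(tok[:n])
    return out

def shingle_filter(tokens, size=2):
    # Approximates the shingle filter: combine adjacent tokens into
    # "word n-grams"; the original unigrams are kept as well.
    shingles = [" ".join(tokens[i:i + size])
                for i in range(len(tokens) - size + 1)]
    return tokens + shingles

print(edge_ngram_filter(["tom"]))
# ['t', 'to', 'tom']
print(shingle_filter(["good", "doctor"]))
# ['good', 'doctor', 'good doctor']
```

Edge ngrams are what make search-as-you-type prefix matching cheap at query time; shingles give phrase-like matching without a phrase query.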
character filter: the character filters built into ES.
| character filter | logical name | description |
| --------------------------|:-------------:| :-------------------------|
| mapping char filter | mapping | replaces characters according to a configured mapping |
| html strip char filter | html_strip | strips HTML elements |
| pattern replace char filter| pattern_replace| rewrites characters with a regular expression |
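The mapping char filter is simple enough to sketch in a few lines of Python (an illustration, not the ES code): every configured key is replaced by its value in the raw text before the tokenizer ever sees it.

```python
def mapping_char_filter(text, mappings):
    # Approximates the mapping char filter: apply each configured
    # replacement to the raw text ahead of tokenization.
    for old, new in mappings.items():
        text = text.replace(old, new)
    return text

print(mapping_char_filter("I :) you", {":)": "happy", ":(": "sad"}))
# I happy you
```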