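Elasticsearch ships with eight built-in analyzers; pick the one that fits your use case.
1. fingerprint analyzer
1.1 The fingerprint analyzer and its output
The fingerprint analyzer implements the fingerprinting algorithm used by the OpenRefine project. Input text is lowercased, normalized to remove extended characters, then sorted, deduplicated, and concatenated into a single token. If a stopword list is configured, the stopwords are removed as well.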
// Test the default output of the fingerprint analyzer
// Request
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu is this yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
After analysis, the sentence above yields a single term:
[dejavu is this yes]
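The stages described in 1.1 can be approximated outside Elasticsearch. The following is a minimal Python sketch, not the Lucene implementation: the names `fold_to_ascii` and `fingerprint_steps` are hypothetical, the `\w+` regular expression only approximates the standard tokenizer, and NFKD decomposition only approximates the asciifolding filter.

```python
import re
import unicodedata


def fold_to_ascii(token):
    """Approximate ASCII folding: decompose accented characters, drop the marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fingerprint_steps(text):
    """Return the intermediate result of every stage (illustrative only)."""
    tokens = re.findall(r"\w+", text)          # rough stand-in for the standard tokenizer
    lowered = [t.lower() for t in tokens]      # lowercase filter
    folded = [fold_to_ascii(t) for t in lowered]  # asciifolding filter (approximation)
    unique_sorted = sorted(set(folded))        # fingerprint filter: dedupe and sort
    return {
        "tokens": tokens,
        "lowered": lowered,
        "folded": folded,
        "unique_sorted": unique_sorted,
        "fingerprint": " ".join(unique_sorted),  # joined with the default separator
    }


if __name__ == "__main__":
    for stage, value in fingerprint_steps("Yes yes,is this d\u00e9j\u00e0vu?").items():
        print(stage, "->", value)
```

Printing each stage makes it easier to see where the duplicate "yes" disappears and where the accents are folded.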
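1.2 Configurable parameters of the fingerprint analyzer
No. | Parameter | Description |
---|---|---|
1 | separator | Character used to join the terms; defaults to a space |
2 | max_output_size | Maximum size of the emitted token; a token larger than this is discarded. Defaults to 255 |
3 | stopwords | A predefined stopword list such as _english_, or an array of stopwords (which may be empty). Defaults to _none_ |
4 | stopwords_path | Path to a file containing stopwords |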
// Define a custom fingerprint analyzer with stopwords
PUT custom_fingerprint_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}
// Request
POST custom_fingerprint_stop_index/_analyze
{
  "analyzer": "fingerprint_analyzer",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
After analysis, the sentence above yields a single term:
[dejavu yes]
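To see how the parameters from 1.2 interact, here is a small Python sketch under stated assumptions: the function `fingerprint` and the `STOPWORDS` set are hypothetical (the set is a tiny illustrative subset, not the real _english_ list), tokenization and folding are approximations, and `max_output_size` is measured in characters here, which may differ from what Elasticsearch does internally.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword subset for illustration only.
STOPWORDS = frozenset({"is", "this", "at", "the"})


def fingerprint(text, separator=" ", max_output_size=255, stopwords=frozenset()):
    """Approximate the fingerprint analyzer; returns None when the token is discarded."""
    tokens = re.findall(r"\w+", text.lower())
    folded = [
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]
    # Stopwords are removed after lowercasing and folding, before dedupe and sort.
    kept = sorted({t for t in folded if t and t not in stopwords})
    joined = separator.join(kept)
    # An oversized fingerprint is dropped entirely rather than truncated.
    return joined if len(joined) <= max_output_size else None


if __name__ == "__main__":
    text = "Yes yes,is this d\u00e9j\u00e0vu?"
    print(fingerprint(text, stopwords=STOPWORDS))
    print(fingerprint(text, separator="+"))
    print(fingerprint(text, max_output_size=5))
```

The last call returns `None` because the joined fingerprint is longer than five characters, mirroring the "discarded, not truncated" behaviour described for max_output_size.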
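1.3 Composition of the fingerprint analyzer
No. | Component | Description |
---|---|---|
1 | Tokenizer | standard tokenizer |
2 | Token Filters | lowercase token filter, ascii folding, stop token filter (disabled by default), fingerprint |
To build a custom analyzer that behaves like fingerprint, recreate it from the components above and change only the parts you need to configure; everything else can be copied from the fingerprint definition, as in the following example: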
// Rebuild the fingerprint analyzer as a custom analyzer
PUT custom_redefine_fingerprint_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
// Request
POST custom_redefine_fingerprint_index/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu is this yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
After analysis, the sentence above yields a single term:
[dejavu is this yes]
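2. language analyzers
2.1 Language analyzers and their output
Language analyzers are analyzers tailored to the text of one specific language. Elasticsearch ships analyzers for many languages, for example english, french, italian, russian, and turkish.
2.2 Configurable parameters of language analyzers
Every language analyzer supports stopwords, and some also support excluding words from stemming, so the following three parameters are available:
No. | Parameter | Description |
---|---|---|
1 | stopwords | A predefined stopword list such as _english_, or an array of stopwords (which may be empty) |
2 | stopwords_path | Path to a file containing stopwords |
3 | stem_exclusion | Supported by some languages only: an array of lowercase words that should not be stemmed |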
1) Rebuilding the english analyzer as a custom analyzer
// Rebuild the english analyzer as a custom analyzer
PUT custom_redefine_english_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "example"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
// Request
POST custom_redefine_english_index/_analyze
{
  "analyzer": "rebuilt_english",
  "text": "look at this example"
}
// Response
{
  "tokens" : [
    {
      "token" : "look",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
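The response above jumps from position 0 to position 3 because the stop filter drops "at" and "this" but leaves their positions reserved. The Python sketch below illustrates that bookkeeping only. It is an assumption-laden approximation: `analyze` and `STOPWORDS` are hypothetical names, the stopword set is a tiny subset rather than the real _english_ list, the regular expression stands in for the standard tokenizer, and the keyword_marker and stemmer stages are omitted entirely.

```python
import re

# Hypothetical, deliberately tiny stopword subset for illustration only.
STOPWORDS = frozenset({"at", "this"})


def analyze(text, stopwords=STOPWORDS):
    """Tokenize, strip possessives, lowercase, and drop stopwords, keeping positions."""
    result = []
    for position, match in enumerate(re.finditer(r"[\w']+", text)):
        token = match.group()
        # Possessive stemming: remove a trailing 's before lowercasing.
        if token.lower().endswith("'s"):
            token = token[:-2]
        token = token.lower()
        if token in stopwords:
            # The token is dropped, but the position counter has already advanced.
            continue
        result.append(
            {
                "token": token,
                "start_offset": match.start(),  # offsets refer to the original text
                "end_offset": match.end(),
                "position": position,
            }
        )
    return result


if __name__ == "__main__":
    for entry in analyze("look at this example"):
        print(entry)
```

Keeping the gaps matters for phrase queries, which compare token positions rather than the order of the surviving tokens.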
2) Rebuilding the french analyzer as a custom analyzer
// Rebuild the french analyzer as a custom analyzer
PUT custom_redefine_french_index
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": [
            "l",
            "m",
            "t",
            "qu",
            "n",
            "s",
            "j",
            "d",
            "c",
            "jusqu",
            "quoiqu",
            "lorsqu",
            "puisqu"
          ]
        },
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "french_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "Example"
          ]
        },
        "french_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "rebuilt_french": {
          "tokenizer": "standard",
          "filter": [
            "french_elision",
            "lowercase",
            "french_stop",
            "french_keywords",
            "french_stemmer"
          ]
        }
      }
    }
  }
}
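The first filter in the chain above, french_elision, strips elided articles such as "l'" or "qu'" from the start of a token. The following Python sketch shows the idea under stated assumptions: `elide` and `ARTICLES` are hypothetical names, the article list is copied from the settings above, and the matching rule (look at the text before the first apostrophe) is a simplified reading of the filter, not its actual source code.

```python
# Article list copied from the french_elision filter definition above.
ARTICLES = frozenset(
    {"l", "m", "t", "qu", "n", "s", "j", "d", "c",
     "jusqu", "quoiqu", "lorsqu", "puisqu"}
)

# Both the ASCII apostrophe and the typographic one are treated as elision marks.
APOSTROPHES = ("'", "\u2019")


def elide(token, articles=ARTICLES, ignore_case=True):
    """Remove a leading elided article from a single token (illustrative only)."""
    for index, char in enumerate(token):
        if char in APOSTROPHES:
            prefix = token[:index]
            # ignore_case plays the role of "articles_case": true in the settings.
            candidate = prefix.lower() if ignore_case else prefix
            if candidate in articles:
                return token[index + 1:]
            break  # only the first apostrophe is considered
    return token


if __name__ == "__main__":
    for word in ["l'avion", "L'avion", "qu\u2019il", "aujourd'hui"]:
        print(word, "->", elide(word))
```

A token such as "aujourd'hui" is left alone because "aujourd" is not in the article list, and with `ignore_case=False` the capitalised "L'avion" would also pass through unchanged.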