ES built-in analyzers: fingerprint / language (8_2_4)

ES ships with eight built-in analyzers, each suited to different scenarios.

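1. fingerprint analyzer

1.1 The fingerprint analyzer and its output

The fingerprint analyzer implements the fingerprinting algorithm used by the OpenRefine project. The input text is lowercased, normalized to remove extended characters, then sorted, deduplicated, and concatenated into a single token. If a stopword list is configured, the stopwords are removed as well.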

// Test the default output of the fingerprint analyzer
// Request
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu is this yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}

The sentence above is analyzed into the following single term:
[dejavu is this yes]
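
The steps described above can be sketched in a few lines of Python. This is only an approximation for illustration, not how ES implements the analyzer: the regular expression stands in for the standard tokenizer, and the hypothetical helper `ascii_fold` approximates ASCII folding by dropping combining marks after Unicode decomposition.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text: str) -> str:
    # Compose characters first so that accented letters stay inside one word.
    text = unicodedata.normalize("NFC", text)
    # Rough stand-in for the standard tokenizer: runs of word characters.
    tokens = re.findall(r"\w+", text)
    # Lowercase each token, then fold accented characters to ASCII.
    tokens = [ascii_fold(token.lower()) for token in tokens]
    # Deduplicate, sort, and join everything into a single token.
    return " ".join(sorted(set(tokens)))


print(fingerprint("Yes yes,is this d\u00e9j\u00e0vu?"))  # prints: dejavu is this yes
```

Because the terms are sorted and deduplicated, inputs that differ only in word order, case, accents, or repetition end up with the same fingerprint.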

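1.2 Configurable parameters of the fingerprint analyzer

No.  Parameter        Description
1    separator        Character used to join the terms; defaults to a space
2    max_output_size  Maximum size of the emitted token; a token longer than this is discarded. Defaults to 255
3    stopwords        A predefined stopword list such as _english_, or an array of stopwords (which may be empty); defaults to _none_
4    stopwords_path   Path to a file containing stopwords

// Define a custom fingerprint analyzer with stopwords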
PUT custom_fingerprint_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fingerprint_analyzer":{
          "type":"fingerprint",
          "stopwords":"_english_"
        }
      }
    }
  }
}
// Request
POST custom_fingerprint_stop_index/_analyze
{
  "analyzer": "fingerprint_analyzer",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}

The sentence above is analyzed into the following single term:
[dejavu yes]
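
To make the effect of separator, max_output_size, and stopwords more concrete, here is a rough Python simulation. It is a sketch under assumptions, not the ES implementation: the tokenizer is approximated with a regular expression, `ENGLISH_SUBSET` is a small illustrative subset of the _english_ list, and returning `None` stands in for "no token is emitted".

```python
import re
import unicodedata

# Illustrative subset only; the real _english_ list contains more words.
ENGLISH_SUBSET = {"a", "an", "and", "is", "the", "this"}


def ascii_fold(text):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    text = unicodedata.normalize("NFC", text)
    tokens = [ascii_fold(t.lower()) for t in re.findall(r"\w+", text)]
    # Stopwords are removed before the terms are sorted and joined.
    stop = set(stopwords)
    unique = sorted({t for t in tokens if t not in stop})
    result = separator.join(unique)
    # An empty or oversized result is dropped instead of being truncated.
    if not unique or len(result) > max_output_size:
        return None
    return result


text = "Yes yes,is this d\u00e9j\u00e0vu?"
print(fingerprint(text, stopwords=ENGLISH_SUBSET))  # prints: dejavu yes
print(fingerprint(text, separator="+"))             # prints: dejavu+is+this+yes
print(fingerprint(text, max_output_size=10))        # prints: None
```

Note that max_output_size applies to the whole joined token, separators included, which is why the 18-character result is dropped when the limit is 10.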

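1.3 Components of the fingerprint analyzer

No.  Component      Description
1    Tokenizer      standard tokenizer
2    Token Filters  lowercase token filter, ASCII folding token filter, stop token filter (disabled by default), fingerprint token filter

To define a custom analyzer that behaves like fingerprint, recreate it from the same components and change only the parts you need to configure. Everything else can be copied from the fingerprint definition unchanged, as in the following example:

// Rebuild the fingerprint analyzer as a custom analyzer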
PUT custom_redefine_fingerprint_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
// Request
POST custom_redefine_fingerprint_index/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes,is this déjàvu?"
}

// Response
{
  "tokens" : [
    {
      "token" : "dejavu is this yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}

The sentence above is analyzed into the following single term:
[dejavu is this yes]
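
If the rebuilt analyzer also needs stopwords or a different separator, one option is to generate the settings body programmatically. The sketch below only builds the JSON for a `PUT <index>` request and does not talk to a cluster. The function name and the filter names `my_stop` and `my_fingerprint` are hypothetical; the `stop` and `fingerprint` filter types and their `stopwords`, `separator`, and `max_output_size` options are the ones assumed here.

```python
import json


def rebuilt_fingerprint_settings(stopwords=None, separator=" ", max_output_size=255):
    """Build index settings for a custom analyzer that mirrors fingerprint."""
    filters = {
        # Hypothetical name for a configured fingerprint token filter.
        "my_fingerprint": {
            "type": "fingerprint",
            "separator": separator,
            "max_output_size": max_output_size,
        }
    }
    chain = ["lowercase", "asciifolding"]
    if stopwords is not None:
        # The stop filter is disabled by default, so add it only on request.
        filters["my_stop"] = {"type": "stop", "stopwords": stopwords}
        chain.append("my_stop")
    chain.append("my_fingerprint")
    analyzer = {"type": "custom", "tokenizer": "standard", "filter": chain}
    return {
        "settings": {
            "analysis": {
                "filter": filters,
                "analyzer": {"rebuilt_fingerprint": analyzer},
            }
        }
    }


print(json.dumps(rebuilt_fingerprint_settings(stopwords="_english_"), indent=2))
```

The stop filter is placed after lowercasing and folding so that stopwords match regardless of the original case, and before the fingerprint filter so that removed words never reach the final token.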

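2. language analyzers

2.1 The language analyzers and their output

The language analyzers are aimed at text in a specific language. ES provides built-in analyzers for many languages, for example: english, french, italian, russian, turkish.

2.2 Configurable parameters of the language analyzers

Every language analyzer supports stopwords, and some also support excluding words from stemming, so up to three parameters can be configured:

No.  Parameter       Description
1    stopwords       A predefined stopword list such as _english_, or an array of stopwords (which may be empty)
2    stopwords_path  Path to a file containing stopwords
3    stem_exclusion  An array of lowercase words that should not be stemmed; supported by some languages only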
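
As an illustration of how these parameters fit together, the following Python sketch builds the settings body for a configured language analyzer. It only produces JSON and does not contact a cluster. The function name and the `my_<language>` analyzer name are hypothetical, and it is assumed that the chosen language supports stem_exclusion.

```python
import json


def language_analyzer_settings(language, stopwords=None, stem_exclusion=None):
    """Build index settings for a configured built-in language analyzer."""
    analyzer = {"type": language}
    if stopwords is not None:
        # Either a predefined list name such as "_english_" or a list of words.
        analyzer["stopwords"] = stopwords
    if stem_exclusion:
        # The words are expected in lowercase, so normalize them here.
        analyzer["stem_exclusion"] = [word.lower() for word in stem_exclusion]
    return {
        "settings": {
            "analysis": {"analyzer": {f"my_{language}": analyzer}}
        }
    }


body = language_analyzer_settings(
    "english", stopwords="_english_", stem_exclusion=["Example"]
)
print(json.dumps(body, indent=2))
```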

1) Rebuilding the english analyzer as a custom analyzer

// Rebuild the english analyzer as a custom analyzer
PUT custom_redefine_english_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "example"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

// Request
POST custom_redefine_english_index/_analyze
{
  "analyzer": "rebuilt_english",
  "text": "look at this example"
}

// Response
{
  "tokens" : [
    {
      "token" : "look",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
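
In the response above, "example" has position 3 rather than 1 because the stop filter removes "at" and "this" but keeps the position gaps they leave behind. The Python sketch below mimics only that part of the chain (tokenizing, lowercasing, stopword removal). Stemming and the keyword marker are deliberately left out, the regular expression is a stand-in for the standard tokenizer, and `STOPWORDS` is a small illustrative subset of the _english_ list.

```python
import re

# Illustrative subset only; the real _english_ list contains more words.
STOPWORDS = {"at", "this", "the", "is"}


def analyze(text, stopwords=STOPWORDS):
    """Return tokens with offsets and positions, skipping stopwords."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in stopwords:
            # The position counter still advances, which leaves a gap.
            continue
        tokens.append(
            {
                "token": term,
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            }
        )
    return tokens


for token in analyze("look at this example"):
    print(token)
```

Keeping the gaps matters for phrase queries: a search for the exact phrase "look example" should not match a document that actually says "look at this example".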

2) Rebuilding the french analyzer as a custom analyzer

// Rebuild the french analyzer as a custom analyzer
PUT custom_redefine_french_index
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": [
            "l",
            "m",
            "t",
            "qu",
            "n",
            "s",
            "j",
            "d",
            "c",
            "jusqu",
            "quoiqu",
            "lorsqu",
            "puisqu"
          ]
        },
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "french_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "Example"
          ]
        },
        "french_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "rebuilt_french": {
          "tokenizer": "standard",
          "filter": [
            "french_elision",
            "lowercase",
            "french_stop",
            "french_keywords",
            "french_stemmer"
          ]
        }
      }
    }
  }
}
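
The french_elision filter in the definition above strips an elided article from the start of a token, so that "l'avion" is indexed as "avion". The Python sketch below approximates that behavior for a single token. It is an illustration under assumptions rather than the actual filter code: the function name `elide` is hypothetical, and `articles_case=True` is taken to mean case-insensitive matching.

```python
ARTICLES = (
    "l", "m", "t", "qu", "n", "s", "j", "d", "c",
    "jusqu", "quoiqu", "lorsqu", "puisqu",
)

APOSTROPHES = ("'", "\u2019")


def elide(token, articles=ARTICLES, articles_case=True):
    """Remove a leading elided article such as l' or qu' from a token."""
    # Locate the first apostrophe, straight or typographic.
    index = next((i for i, ch in enumerate(token) if ch in APOSTROPHES), -1)
    if index < 0:
        return token
    prefix = token[:index]
    known = set(articles)
    if articles_case:
        # Case-insensitive matching: compare lowercase forms.
        prefix = prefix.lower()
        known = {article.lower() for article in known}
    if prefix in known:
        return token[index + 1:]
    return token


print(elide("l'avion"))
print(elide("L'avion", articles_case=False))
print(elide("aujourd'hui"))
```

With case-sensitive matching the capitalized "L'avion" is left untouched, and "aujourd'hui" is never shortened because "aujourd" is not in the articles list.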