ES Built-in Analyzers: fingerprint and language

ES ships with eight built-in analyzers; different scenarios call for different analyzers.

1. fingerprint analyzer

1.1 The fingerprint type and its tokenization output

The fingerprint analyzer implements the fingerprinting algorithm used by the OpenRefine project. With this analyzer, the text is lowercased and normalized to remove extended characters, then the terms are sorted, deduplicated, and concatenated into a single token. If stopwords are configured, they are removed as well.

// Test the default fingerprint analyzer output
// Request
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu is this yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}

After analysis, the sentence above yields the term:
[dejavu is this yes]
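The steps described above can be sketched in plain Python. This is a rough approximation, not the actual ES implementation: `re.findall(r"\w+", ...)` stands in for the standard tokenizer, and `unicodedata` NFKD folding stands in for the asciifolding filter.

```python
import re
import unicodedata

def fingerprint(text: str) -> str:
    # Tokenize roughly like the standard tokenizer
    tokens = re.findall(r"\w+", text)
    # Lowercase, then fold extended characters (é, à, ...) down to ASCII
    folded = [
        unicodedata.normalize("NFKD", t.lower())
        .encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]
    # Deduplicate, sort, and join into a single token
    return " ".join(sorted(set(folded)))

fingerprint("Yes yes,is this déjàvu?")  # "dejavu is this yes"
```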

1.2 Configurable parameters of the fingerprint type

No. Parameter Description
1 separator Character used to join the terms; defaults to a space
2 max_output_size Maximum size of the emitted token; an output exceeding this size is discarded entirely; defaults to 255
3 stopwords Predefined stopwords, zero or more, e.g. _english_ or an array of words; defaults to _none_
4 stopwords_path Path to a file containing stopwords
// Custom fingerprint analyzer with stopwords
PUT custom_fingerprint_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fingerprint_analyzer":{
          "type":"fingerprint",
          "stopwords":"_english_"
        }
      }
    }
  }
}
// Request
POST custom_fingerprint_stop_index/_analyze
{
  "analyzer": "fingerprint_analyzer",
  "text": "Yes yes,is this déjàvu?"
}
// Response
{
  "tokens" : [
    {
      "token" : "dejavu yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}

After analysis, the sentence above yields the term:
[dejavu yes]
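The effect of these parameters can be sketched in plain Python. Again a rough approximation of the analyzer, with an explicit stopword set standing in for `_english_`:

```python
import re
import unicodedata

def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    # Tokenize roughly like the standard tokenizer
    tokens = re.findall(r"\w+", text)
    # Lowercase and fold extended characters to ASCII
    folded = [
        unicodedata.normalize("NFKD", t.lower())
        .encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]
    # Drop stopwords, deduplicate, sort, and join with the separator
    out = separator.join(sorted(set(folded) - set(stopwords)))
    # An output token larger than max_output_size is discarded entirely
    return out if len(out) <= max_output_size else ""

fingerprint("Yes yes,is this déjàvu?", stopwords={"is", "this"})  # "dejavu yes"
fingerprint("Yes yes,is this déjàvu?", separator="+")             # "dejavu+is+this+yes"
```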

1.3 Component definition of the fingerprint analyzer

No. Component Description
1 Tokenizer standard tokenizer
2 Token Filters lowercase token filter, stop token filter (disabled by default), asciifolding token filter, fingerprint token filter

To define a custom analyzer that behaves like fingerprint, set only the configurable parameters you need and copy the rest of the fingerprint definition verbatim, as in the following example:

// Custom fingerprint analyzer
PUT custom_redefine_fingerprint_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
// Request
POST custom_redefine_fingerprint_index/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes,is this déjàvu?"
}

// Response
{
  "tokens" : [
    {
      "token" : "dejavu is this yes",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}

After analysis, the sentence above yields the term:
[dejavu is this yes]

2. language analyzer

2.1 language types and tokenization behavior

Language analyzers are analyzers tailored to specific languages. ES provides analyzers for many languages out of the box (the vast majority for Latin-script languages), for example english, french, italian, russian, and turkish.

2.2 Configurable parameters of the language types

Every language analyzer supports stopword removal, so the following parameters are configurable:

No. Parameter Description
1 stopwords Predefined stopwords, zero or more, e.g. _english_ or an array of words
2 stopwords_path Path to a file containing stopwords
3 stem_exclusion For languages that support it, a list of lowercase words to exclude from stemming

1) Customizing an english analyzer

// Custom english analyzer
PUT custom_redefine_english_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "example"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

// Request
POST custom_redefine_english_index/_analyze
{
  "analyzer": "rebuilt_english",
  "text": "look at this example"
}

// Response
{
  "tokens" : [
    {
      "token" : "look",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
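The filter chain above can be sketched in plain Python. This is a crude approximation under stated assumptions: a tiny explicit stopword set stands in for `_english_`, and a naive suffix stripper stands in for the english stemmer; the keyword_marker step protects marked words from stemming.

```python
import re

STOPWORDS = {"at", "this", "the", "a", "is"}  # tiny stand-in for _english_
KEYWORDS = {"example"}                         # words protected from stemming

def naive_stem(token: str) -> str:
    # Crude suffix stripping as a stand-in for the english stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def rebuilt_english(text: str):
    tokens = re.findall(r"[\w']+", text)
    # english_possessive_stemmer: strip trailing 's
    tokens = [t[:-2] if t.endswith("'s") else t for t in tokens]
    # lowercase
    tokens = [t.lower() for t in tokens]
    # english_stop: drop stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # english_keywords: marked words bypass the stemmer
    return [t if t in KEYWORDS else naive_stem(t) for t in tokens]

rebuilt_english("look at this example")  # ['look', 'example']
```

Note how "example" survives unstemmed because keyword_marker runs before the stemmer in the filter list, just as in the index settings above.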

2) Customizing a french analyzer

// Custom french analyzer
PUT custom_redefine_french_index
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": [
            "l",
            "m",
            "t",
            "qu",
            "n",
            "s",
            "j",
            "d",
            "c",
            "jusqu",
            "quoiqu",
            "lorsqu",
            "puisqu"
          ]
        },
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "french_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "Example"
          ]
        },
        "french_stemmer": {
          "type": "stemmer",
          "language": "light_french"
        }
      },
      "analyzer": {
        "rebuilt_french": {
          "tokenizer": "standard",
          "filter": [
            "french_elision",
            "lowercase",
            "french_stop",
            "french_keywords",
            "french_stemmer"
          ]
        }
      }
    }
  }
}
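The french_elision filter above strips a leading article plus apostrophe from each token (l'exemple becomes exemple). A minimal Python sketch of that behavior, using a subset of the articles from the config for illustration:

```python
import re

# Articles stripped by the elision filter (subset, for illustration)
ARTICLES = {"l", "m", "t", "qu", "n", "s", "j", "d", "c"}

def elide(token: str, articles_case: bool = True) -> str:
    """Strip a leading article + apostrophe, e.g. l'exemple -> exemple."""
    m = re.match(r"([a-zA-Z]+)'(.+)", token)
    if m:
        # articles_case=true makes the article match case-insensitive
        article = m.group(1).lower() if articles_case else m.group(1)
        if article in ARTICLES:
            return m.group(2)
    return token

[elide(t).lower() for t in ["J'aime", "l'exemple"]]  # ['aime', 'exemple']
```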