Elasticsearch Custom Filter Examples

HTML strip Character Filter

Strips HTML elements from the text and replaces HTML entities with their decoded values (for example, replacing &amp;amp; with &amp;). The html_strip character filter uses Lucene's HTMLStripCharFilter.

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

/* Output */
{
  "tokens" : [
    {
      "token" : """I'm so happy!""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}

Adding the analyzer

The following create-index API request configures a custom analyzer that uses the html_strip character filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
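
With the index in place, the custom analyzer can be tried out through the index-scoped _analyze API. This is a minimal sketch; the sample text is reused from the first example above:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}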

Parameters

escaped_tags (optional, array of strings): an array of HTML tags, written without the enclosing angle brackets (< >). The filter skips these HTML elements when stripping HTML from the text. For example, a value of [ "p" ] skips the <p> HTML tag.

Example request

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
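
To verify the escaped_tags setting, the same text can be passed through the custom analyzer. With "b" in escaped_tags, the <b> tags should be kept while the other HTML is stripped (a sketch, assuming the index above was just created):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

/* Expected token, approximately: I'm so <b>happy</b>! */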

Standard Tokenizer

The standard tokenizer provides grammar-based tokenization (following the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29) and works well for most languages.

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

Parameters

max_token_length (optional, integer): the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. The following example sets it to 5:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
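
With max_token_length set to 5, any token longer than five characters is split, so jumped becomes jumpe and d:

/* Output */
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]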

Lowercase Token Filter

Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]

Creating an analyzer

PUT lowercase_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "whitespace_lowercase" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase"]
                }
            }
        }
    }
}
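
The new analyzer can be checked with the index-scoped _analyze API (a minimal sketch; the sample text is mine):

POST lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "THE Quick FoX JUMPs"
}

/* Expected output: [ the, quick, fox, jumps ] */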

Parameters

language (optional, string): the language-specific lowercase token filter to use. Valid values include:
1. greek: uses Lucene's GreekLowerCaseFilter
2. irish: uses Lucene's IrishLowerCaseFilter
3. turkish: uses Lucene's TurkishLowerCaseFilter
If not specified, this defaults to Lucene's LowerCaseFilter.

Customizing

To customize the lowercase filter, duplicate it to create the basis for a new custom token filter, then modify the filter using its configurable parameters.
For example, the following request uses the lowercase filter to create a custom filter for Greek:

PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
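
As a quick check, Greek text can be run through the analyzer. A sketch; the sample word is mine, and the Greek filter applies language-specific rules such as sigma normalization rather than plain character lowercasing:

POST custom_lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "ΟΔΥΣΣΕΥΣ"
}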

Combining them

Setting type to custom declares a custom analyzer (type can also be set to a built-in analyzer type such as standard or simple). The example below uses a tokenizer, a token filter, and a character filter with their default configurations, but you can create configured versions of each and use them in a custom analyzer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
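
The html_strip character filter removes the <b> tags, lowercase lowercases each token, and asciifolding folds déjà to deja, so the request produces:

/* Output */
[ is, this, deja, vu ]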

A more complex example

  1. Character filter: a Mapping Character Filter that replaces strings; in the example below, :) becomes _happy_ and :( becomes _sad_
  2. Tokenizer: a Pattern Tokenizer, configured to split on punctuation characters
  3. Token filters: a Lowercase Token Filter and a Stop Token Filter (configured to use the predefined list of English stop words)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { 
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
/* Output */
[ i'm, _happy_, person, you ]