Elasticsearch Custom Filter Examples

HTML strip Character Filter

Strips HTML elements from the text and replaces HTML entities with their decoded values (for example, replacing &amp;amp; with &amp;). The html_strip character filter uses Lucene's HTMLStripCharFilter.

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

/* Output */
{
  "tokens" : [
    {
      "token" : """I'm so happy!""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}

Adding the analyzer

The following create-index API request configures a custom analyzer that uses the html_strip character filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
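
With the index in place, the custom analyzer can be tried out through the index-scoped _analyze API. This is a minimal sketch; the sample text is reused from the first example above:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}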

Parameters

escaped_tags (optional, array of strings): an array of HTML tags, written without the enclosing angle brackets (< >). The filter skips these HTML elements when stripping HTML from the text. For example, a value of [ "p" ] skips the <p> HTML tag.

Example request

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
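
To verify the escaped_tags setting, the same text can be passed through the custom analyzer. With "b" in escaped_tags, the <b> tags should be kept while the other HTML is stripped (a sketch, assuming the index above was just created):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

/* Expected token, approximately: I'm so <b>happy</b>! */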

Standard Tokenizer

The standard tokenizer provides grammar-based tokenization (following the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29) and works well for most languages.

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

Parameters

max_token_length (optional, integer): the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. The following example sets it to 5:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
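
With max_token_length set to 5, any token longer than five characters is split, so jumped becomes jumpe and d:

/* Output */
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]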

Lowercase Token Filter

Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]

Creating an analyzer

PUT lowercase_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "whitespace_lowercase" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase"]
                }
            }
        }
    }
}
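
The new analyzer can be checked with the index-scoped _analyze API (a minimal sketch; the sample text is mine):

POST lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "THE Quick FoX JUMPs"
}

/* Expected output: [ the, quick, fox, jumps ] */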

Parameters

language (optional, string): the language-specific lowercase token filter to use. Valid values include:
1. greek: uses Lucene's GreekLowerCaseFilter
2. irish: uses Lucene's IrishLowerCaseFilter
3. turkish: uses Lucene's TurkishLowerCaseFilter
If not specified, this defaults to Lucene's LowerCaseFilter.

Customizing

To customize the lowercase filter, duplicate it to create the basis for a new custom token filter, then modify the filter using its configurable parameters.
For example, the following request uses the lowercase filter to create a custom filter for Greek:

PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
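
As a quick check, Greek text can be run through the analyzer. A sketch; the sample word is mine, and the Greek filter applies language-specific rules such as sigma normalization rather than plain character lowercasing:

POST custom_lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "ΟΔΥΣΣΕΥΣ"
}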

Combining them

Setting type to custom declares a custom analyzer (type can also be set to a built-in analyzer type such as standard or simple). The example below uses a tokenizer, a token filter, and a character filter with their default configurations, but you can create configured versions of each and use them in a custom analyzer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
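
The html_strip character filter removes the <b> tags, lowercase lowercases each token, and asciifolding folds déjà to deja, so the request produces:

/* Output */
[ is, this, deja, vu ]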

A more complex example

  1. Character filter: a Mapping Character Filter that replaces strings; in the example below, :) becomes _happy_ and :( becomes _sad_
  2. Tokenizer: a Pattern Tokenizer, configured to split on punctuation characters
  3. Token filters: a Lowercase Token Filter and a Stop Token Filter (configured to use the predefined list of English stop words)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { 
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
/* Output */
[ i'm, _happy_, person, you ]