Elasticsearch Custom Filter Examples

HTML strip Character Filter

Strips HTML elements from the text and replaces HTML entities with their decoded values (for example, replacing &amp; with &). The html_strip character filter uses Lucene's HTMLStripCharFilter.

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

/* Output */
{
  "tokens" : [
    {
      "token" : """I'm so happy!""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
Adding to an analyzer

This create index API example configures a custom analyzer that uses the html_strip character filter.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
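
To try the new analyzer, run it through the analyze API of the index. A minimal sketch, reusing the sample text from the first example:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}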
Parameters

escaped_tags (optional, array of strings): An array of HTML tags, written without the enclosing angle brackets (< >). When stripping HTML from the text, the filter skips these HTML elements. For example, a value of [ "p" ] skips the <p> HTML tag.

Example request

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
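
To verify the escaped_tags setting, analyze a snippet that contains a <b> element. Because b is listed in escaped_tags, it should survive while the surrounding <p> tags are stripped. A sketch, with the expected token noted in a comment:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
/* Expected single token: I'm so <b>happy</b>! */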

Standard Tokenizer

The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
Parameters

max_token_length (optional, integer): The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
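
With max_token_length set to 5, any token longer than five characters is split, so jumped becomes jumpe and d. The request above should produce:

/* Output */
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]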

Lowercase Token Filter

Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]
Creating an analyzer
PUT lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
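
A quick usage check of the whitespace_lowercase analyzer defined above; the expected output matches the standalone filter example:

POST lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]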
Parameters

language (optional, string): The language-specific lowercase token filter to use. Valid values include:
1. greek, which uses Lucene's GreekLowerCaseFilter
2. irish, which uses Lucene's IrishLowerCaseFilter
3. turkish, which uses Lucene's TurkishLowerCaseFilter
If not specified, defaults to Lucene's LowerCaseFilter.

Customizing

To customize the lowercase filter, duplicate it to create the basis for a new custom token filter. You can then modify the filter with its configurable parameters.
For example, the following request uses the lowercase filter to create a filter for Greek.

PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
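
As a sketch, the Greek-specific filter can be exercised through the analyze API; the Greek sample text below is an arbitrary choice for illustration:

POST custom_lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "ΟΔΥΣΣΕΥΣ"
}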

Combining filters

Setting type to custom declares that this is a custom analyzer (type can also be set to a built-in analyzer type such as standard or simple). This example uses a tokenizer, a token filter, and a character filter with their default configurations, but it is possible to create configured versions of each and use them in a custom analyzer.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
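
Tracing the pipeline: html_strip removes the <b> tags, the standard tokenizer splits the text into words, lowercase lowercases them, and asciifolding folds déjà into deja. The request should produce:

/* Output */
[ is, this, deja, vu ]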

A more complex example

This example combines the following:
  1. Character filter: a Mapping Character Filter that replaces strings; in the example below, :) becomes _happy_ and :( becomes _sad_
  2. Tokenizer: a Pattern Tokenizer configured to split on punctuation characters
  3. Token filters: a Lowercase Token Filter, plus a Stop Token Filter configured to use the pre-defined list of English stop words
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { 
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
/* Output */
[ i'm, _happy_, person, you ]