Elasticsearch Custom Filter Examples

HTML strip Character Filter

Strips HTML elements from the text and replaces HTML entities with their decoded values (for example, replacing &amp; with &). The html_strip character filter uses Lucene's HTMLStripCharFilter.

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

/* Output */
{
  "tokens" : [
    {
      "token" : """I'm so happy!""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
Adding to an analyzer

This create index API example configures a custom analyzer that uses the html_strip character filter.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
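
To try the new analyzer, run it through the analyze API of the index. A minimal sketch, reusing the sample text from the first example:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}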
Parameters

escaped_tags (optional, array of strings): An array of HTML tags, written without the enclosing angle brackets (< >). When stripping HTML from the text, the filter skips these HTML elements. For example, a value of [ "p" ] skips the <p> HTML tag.

Example request

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
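
To verify the escaped_tags setting, analyze a snippet that contains a <b> element. Because b is listed in escaped_tags, it should survive while the surrounding <p> tags are stripped. A sketch, with the expected token noted in a comment:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
/* Expected single token: I'm so <b>happy</b>! */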

Standard Tokenizer

The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
Parameters

max_token_length (optional, integer): The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
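
With max_token_length set to 5, any token longer than five characters is split, so jumped becomes jumpe and d. The request above should produce:

/* Output */
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]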

Lowercase Token Filter

Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]
Creating an analyzer
PUT lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
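
A quick usage check of the whitespace_lowercase analyzer defined above; the expected output matches the standalone filter example:

POST lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]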
Parameters

language (optional, string): The language-specific lowercase token filter to use. Valid values include:
1. greek, which uses Lucene's GreekLowerCaseFilter
2. irish, which uses Lucene's IrishLowerCaseFilter
3. turkish, which uses Lucene's TurkishLowerCaseFilter
If not specified, defaults to Lucene's LowerCaseFilter.

Customizing

To customize the lowercase filter, duplicate it to create the basis for a new custom token filter. You can then modify the filter with its configurable parameters.
For example, the following request uses the lowercase filter to create a filter for Greek.

PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
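
As a sketch, the Greek-specific filter can be exercised through the analyze API; the Greek sample text below is an arbitrary choice for illustration:

POST custom_lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "ΟΔΥΣΣΕΥΣ"
}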

Combining filters

Setting type to custom declares that this is a custom analyzer (type can also be set to a built-in analyzer type such as standard or simple). This example uses a tokenizer, a token filter, and a character filter with their default configurations, but it is possible to create configured versions of each and use them in a custom analyzer.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
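
Tracing the pipeline: html_strip removes the <b> tags, the standard tokenizer splits the text into words, lowercase lowercases them, and asciifolding folds déjà into deja. The request should produce:

/* Output */
[ is, this, deja, vu ]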

A more complex example

This example combines the following:
  1. Character filter: a Mapping Character Filter that replaces strings; in the example below, :) becomes _happy_ and :( becomes _sad_
  2. Tokenizer: a Pattern Tokenizer configured to split on punctuation characters
  3. Token filters: a Lowercase Token Filter, plus a Stop Token Filter configured to use the pre-defined list of English stop words
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { 
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
/* Output */
[ i'm, _happy_, person, you ]