Elasticsearch custom filter examples
HTML strip Character Filter
Strips HTML elements from the text and replaces HTML entities with their decoded values (for example, replacing &amp; with &). The html_strip filter uses Lucene's HTMLStripCharFilter.
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
/* Output */
{
  "tokens" : [
    {
      "token" : """I'm so happy!""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
Adding to an analyzer
The following API example creates an index and uses html_strip to configure a custom analyzer.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
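Once the index exists, the analyzer can be exercised directly. This is a quick check, reusing the my_analyzer defined above; it should return the same single stripped token as the earlier GET /_analyze request.
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}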
Parameters
escaped_tags
(Optional, array of strings) Array of HTML tags, without the enclosing angle brackets (< >). When stripping HTML from the text, the filter skips these HTML elements. For example, a value of [ "p" ] skips the <p> HTML tag.
Example request
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
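A quick check against the index above shows the effect of escaped_tags: the <p> tags are stripped, while the <b> tags are kept. The expected token text below is an assumption based on the filter's documented behavior (the stripped block tags leave newlines around the token).
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
/* Expected token text: I'm so <b>happy</b>! */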
Standard tokenizer
The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
Parameters
max_token_length
The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
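Because max_token_length is set to 5, any token longer than five characters is split at that length, so jumped becomes jumpe and d:
/* Output */
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]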
Lowercase token filter
Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]
Creating an analyzer
PUT lowercase_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "whitespace_lowercase" : {
          "tokenizer" : "whitespace",
          "filter" : ["lowercase"]
        }
      }
    }
  }
}
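After creating the index, the custom analyzer can be tested in place. A quick check using the whitespace_lowercase analyzer defined above:
GET lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]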
Parameters
language
(Optional, string) Language-specific lowercase token filter to use. Valid values include:
1. greek: uses Lucene's GreekLowerCaseFilter
2. irish: uses Lucene's IrishLowerCaseFilter
3. turkish: uses Lucene's TurkishLowerCaseFilter
If not specified, this defaults to Lucene's LowerCaseFilter.
Customizing
To customize the lowercase filter, duplicate it to create the basis for a new custom token filter, then modify it with its configurable parameters. For example, the following request uses the lowercase filter to create a Greek-specific filter.
PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
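To see the Greek-specific filter in action, analyze some Greek text. This is a sketch: the sample words are illustrative, and the exact normalization is handled by Lucene's GreekLowerCaseFilter.
GET custom_lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "ΚΑΛΗΜΕΡΑ ΚΟΣΜΕ"
}
/* Expected output */
[ καλημερα, κοσμε ]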
Using them together
Setting type to custom declares a custom analyzer (type can also be set to a built-in analyzer such as standard or simple). This example uses a tokenizer, a token filter, and a character filter with their default configurations, but it is possible to create configured versions of each and use them in a custom analyzer.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
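The html_strip character filter removes the <b> tags, the standard tokenizer splits the text into words, the lowercase filter lowercases them, and asciifolding folds déjà to deja, giving:
/* Output */
[ is, this, deja, vu ]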
A more complex example
This analyzer combines the following components:
Character filter: Mapping Character Filter, which replaces strings; in this example, :) becomes _happy_ and :( becomes _sad_.
Tokenizer: Pattern Tokenizer, configured to split on punctuation characters.
Token filters: Lowercase Token Filter, and Stop Token Filter configured to use the predefined list of English stop words.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
/* Output */
[ i'm, _happy_, person, you ]