Elasticsearch Built-in Analyzer Reference

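Summary

The built-in analyzers are all assembled from a tokenizer and token filters, and their behavior may contain, overlap with, or exclude one another. Choose the analyzer that fits your requirements, or define a custom analyzer for the best results. The default standard analyzer meets most needs.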
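To make the composition concrete, here is a minimal Python sketch of an analyzer as one tokenizer followed by a chain of token filters. It is only a conceptual model, not Elasticsearch code, and every name in it (`build_analyzer`, `whitespace_tokenizer`, `lowercase_filter`, `stop_filter`) is hypothetical.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def build_analyzer(tokenizer: Tokenizer,
                   filters: Iterable[TokenFilter] = ()) -> Callable[[str], List[str]]:
    """Compose one tokenizer with zero or more token filters, applied in order."""
    filter_chain = list(filters)

    def analyze(text: str) -> List[str]:
        tokens = tokenizer(text)
        for token_filter in filter_chain:
            tokens = token_filter(tokens)
        return tokens

    return analyze


def whitespace_tokenizer(text: str) -> List[str]:
    # Hypothetical tokenizer: split on runs of whitespace.
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    # Hypothetical token filter: lowercase every token.
    return [t.lower() for t in tokens]


def stop_filter(stopwords: Iterable[str]) -> TokenFilter:
    # Hypothetical token filter factory: drop tokens found in the stop word set.
    stop = set(stopwords)
    return lambda tokens: [t for t in tokens if t not in stop]


analyzer = build_analyzer(whitespace_tokenizer,
                          [lowercase_filter, stop_filter({"the"})])
print(analyzer("The QUICK brown fox"))  # ['quick', 'brown', 'fox']
```

The order of the filters matters in this sketch: the stop filter only removes "The" because the lowercase filter runs first.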

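Standard analyzer

The default analyzer. It splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.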

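Configuration parameters

Parameter	Description
`max_token_length`	The maximum token length. A token longer than this is split at `max_token_length` intervals. Defaults to 255.
`stopwords`	A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_none_`.
`stopwords_path`	The path to a file containing stop words.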

Configuration example

Compare the output for "max_token_length": 5 and "max_token_length": 4 to see what the parameter does.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]

/* Output with max_token_length set to 4 */
[ 2, quic, k, brow, n, foxe, s, jump, ed, over, lazy, dog, s, bone ]
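The splitting rule behind `max_token_length` can be sketched in a few lines of Python. This is a simplified model that only cuts a single plain word into fixed-size pieces. The real tokenizer applies the limit while doing Unicode segmentation, so tokens containing punctuation (such as "dog's") may split differently. The function name `split_by_max_length` is hypothetical.

```python
def split_by_max_length(token: str, max_token_length: int = 255) -> list:
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    return [token[i:i + max_token_length]
            for i in range(0, len(token), max_token_length)]


print(split_by_max_length("jumped", 5))  # ['jumpe', 'd']
print(split_by_max_length("jumped", 4))  # ['jump', 'ed']
print(split_by_max_length("over", 4))    # ['over']
```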

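Definition

The standard analyzer consists of:

tokenizer: standard tokenizer
token filter: lowercase token filter, stop token filter (disabled by default)

If you need to customize the analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in standard analyzer and can serve as a starting point: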

PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
            /* add token filters here, after "lowercase" */
          ]
        }
      }
    }
  }
}

Simple analyzer

Splits a string into terms at every non-letter character and lowercases each term. The simple analyzer takes no configuration parameters.

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

/* Output */
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
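As a rough illustration of this behavior, the following Python sketch approximates the simple analyzer with a regular expression that keeps runs of letters and lowercases them. It is an approximation for experimenting outside Elasticsearch, not the actual implementation, and `simple_analyze` is a hypothetical name.

```python
import re

# [^\W\d_] matches a word character that is neither a digit nor an underscore,
# which is how Python's re module can express "a letter".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text: str) -> list:
    """Split at every non-letter character, then lowercase each term."""
    return [token.lower() for token in LETTER_RUN.findall(text)]


TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(simple_analyze(TEXT))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the digit "2" disappears entirely, because digits are not letters and therefore act as separators.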

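Definition

tokenizer: lowercase tokenizer

If you need to customize the analyzer, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in simple analyzer and can serve as a starting point for further customization: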

PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [
            /* add token filters here */
          ]
        }
      }
    }
  }
}

Whitespace analyzer

Splits a string into terms at any whitespace character. This analyzer takes no configuration parameters.

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

/* Output */
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
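For comparison outside Elasticsearch, the same splitting can be approximated in Python with `str.split()`, which also splits on runs of whitespace and leaves case and punctuation untouched. This is only an approximation, and `whitespace_analyze` is a hypothetical name.

```python
def whitespace_analyze(text: str) -> list:
    """Split on runs of whitespace; keep case and punctuation as they are."""
    return text.split()


TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(whitespace_analyze(TEXT))
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```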

Definition

tokenizer: whitespace tokenizer

PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": [
            /* add token filters here */
          ]
        }
      }
    }
  }
}

Stop analyzer

Builds on the simple analyzer by adding support for removing stop words. It uses the `_english_` stop words by default.

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

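Configuration parameters

Parameter	Description
`stopwords`	A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_english_`.
`stopwords_path`	The path to a file containing stop words.

Configuration example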

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
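The effect of a custom stop word list can be approximated in Python by tokenizing on non-letters, lowercasing, and then dropping any term that appears in the list. This is a sketch for illustration only. `stop_analyze` is a hypothetical name, and the stop words are assumed to be given in lowercase because the comparison happens after lowercasing.

```python
import re

LETTER_RUN = re.compile(r"[^\W\d_]+")  # runs of letters only


def stop_analyze(text: str, stopwords) -> list:
    """Split at non-letters, lowercase, then remove the given stop words."""
    stop = set(stopwords)
    tokens = [token.lower() for token in LETTER_RUN.findall(text)]
    return [token for token in tokens if token not in stop]


TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(stop_analyze(TEXT, ["the", "over"]))
# ['quick', 'brown', 'foxes', 'jumped', 'lazy', 'dog', 's', 'bone']
```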

Definition

The stop analyzer consists of:

tokenizer: lowercase tokenizer
token filter: stop token filter

PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop"
            /* add token filters here, after "english_stop" */
          ]
        }
      }
    }
  }
}

Keyword analyzer

Treats the entire text as a single term.

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

/* Output */
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

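Definition

tokenizer: keyword tokenizer

If you need to customize the analyzer, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in keyword analyzer and can serve as a starting point for further customization: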

PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": [         
          ]
        }
      }
    }
  }
}

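Pattern analyzer

Uses a regular expression to split a string into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+, which matches any run of non-word characters.

Beware of pathological regular expressions. The pattern analyzer uses Java regular expressions. A badly written regular expression can run very slowly, or even throw a StackOverflowError and cause the node it is running on to exit suddenly.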

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

/* Output */
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
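The idea that the pattern matches separators rather than tokens can be tried out with Python's `re.split`. The sketch below is an approximation, not the Java implementation: Python's `\W` is Unicode-aware by default while Java's is not, although the two agree on ASCII input like the sample text. `pattern_analyze` and its parameters are hypothetical names modeled on the analyzer's options.

```python
import re


def pattern_analyze(text: str, pattern: str = r"\W+",
                    lowercase: bool = True, stopwords=()) -> list:
    """Split text wherever the separator pattern matches, then post-process."""
    # Drop the empty strings produced by leading, trailing, or adjacent separators.
    tokens = [token for token in re.split(pattern, text) if token]
    if lowercase:
        tokens = [token.lower() for token in tokens]
    stop = set(stopwords)
    return [token for token in tokens if token not in stop]


TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(pattern_analyze(TEXT))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```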

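Configuration parameters

Parameter	Description
`pattern`	A Java regular expression. Defaults to `\W+`.
`flags`	Java regular expression flags. Flags should be pipe-separated.
`lowercase`	Whether terms should be lowercased. Defaults to `true`.
`stopwords`	A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_none_`.
`stopwords_path`	The path to a file containing stop words.

Configuration example

In this example, we configure the analyzer to split email addresses on non-word characters or underscores (\W|_) and to lowercase the result: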

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
/* Output */
[ john, smith, foo, bar, com ]

A more complex example, which splits CamelCase text into tokens:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}

/* Output */
[ moose, x, ftp, class, 2, beta ]

Explanation

 ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
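The same alternatives can be tried out in Python, with one caveat: Python's `re` module supports neither `\p{L}` nor character class intersection (`&&`), so the sketch below substitutes ASCII ranges (`[A-Za-z]`, `[a-z]`, `[A-Z]`). It is therefore an ASCII-only approximation of the Java pattern, and it needs Python 3.7 or later, where `re.split` splits on zero-width matches. `CAMEL_SPLIT` and `camel_analyze` are hypothetical names.

```python
import re

CAMEL_SPLIT = re.compile(
    r"[^A-Za-z\d]+"                 # swallow anything that is not a letter or digit,
    r"|(?<=\D)(?=\d)"               # or split between a non-digit and a digit,
    r"|(?<=\d)(?=\D)"               # or between a digit and a non-digit,
    r"|(?<=[a-z])(?=[A-Z])"         # or between lower case and upper case,
    r"|(?<=[A-Z])(?=[A-Z][a-z])"    # or before the last capital of an acronym.
)


def camel_analyze(text: str) -> list:
    """Split on the separator pattern, drop empty pieces, lowercase the rest."""
    return [token.lower() for token in CAMEL_SPLIT.split(text) if token]


print(camel_analyze("MooseX::FTPClass2_beta"))
# ['moose', 'x', 'ftp', 'class', '2', 'beta']
```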

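Definition

tokenizer: pattern tokenizer
token filter: lowercase token filter, stop token filter (disabled by default)

If you need to customize the analyzer, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in pattern analyzer and can serve as a starting point for further customization: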

PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" 
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}

Language analyzers

A set of analyzers aimed at analyzing text in a specific language. The following types are supported: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
