ES built-in analyzers: standard/simple (8.2.1)

ES ships with eight built-in analyzers, and different analyzers suit different scenarios.

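1. standard analyzer

1.1 The standard type and its tokenization output

When no analyzer is specified explicitly, the standard analyzer is the default. It performs grammar-based tokenization (based on the Unicode Text Segmentation algorithm, Unicode Standard Annex #29) and works well for most languages.

//Test the default tokenization of the standard analyzer
//Request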
POST _analyze
{
  "analyzer": "standard",
  "text": "transimission control protocol is a transport layer protocol"
}

//Response
{
  "tokens" : [
    {
      "token" : "transimission",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "control",
      "start_offset" : 14,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "protocol",
      "start_offset" : 22,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "transport",
      "start_offset" : 36,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "layer",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "protocol",
      "start_offset" : 52,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

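After tokenization, the sentence above yields the following terms (the misspelling in the input text is preserved as-is):
[transimission, control, protocol, is, a, transport, layer, protocol]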
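
When comparing several analyzers, it can help to reduce an _analyze response to its term list, as done by hand above. The following is a minimal Python sketch; the function name terms_of is made up for illustration, and the response (trimmed here to two tokens) is assumed to be available as a JSON string.

```python
import json

def terms_of(response):
    """Return the token strings of a parsed _analyze response, in order."""
    return [entry["token"] for entry in response["tokens"]]

# A trimmed copy of the response shown above (first two tokens only).
raw = '''
{
  "tokens" : [
    {"token" : "transimission", "start_offset" : 0, "end_offset" : 13,
     "type" : "<ALPHANUM>", "position" : 0},
    {"token" : "control", "start_offset" : 14, "end_offset" : 21,
     "type" : "<ALPHANUM>", "position" : 1}
  ]
}
'''
response = json.loads(raw)
print(terms_of(response))
```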

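1.2 Configurable parameters of the standard type

No. | Parameter | Description
1 | max_token_length | Maximum length allowed for a single token. A token longer than this is split at max_token_length intervals, and the remainder becomes further tokens. Defaults to 255.
2 | stopwords | A predefined stop word list such as _english_, or an array of stop words (zero or more). Defaults to _none_.
3 | stopwords_path | Path to a file containing stop words.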
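
The splitting rule for max_token_length can be illustrated in a few lines of Python. This is only a sketch of the behaviour described in the table, not the Lucene implementation, and split_long_token is a hypothetical helper name.

```python
def split_long_token(token, max_token_length=255):
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    return [token[i:i + max_token_length]
            for i in range(0, len(token), max_token_length)]

# A 13-character token is cut after every 5 characters.
print(split_long_token("transimission", 5))
# A token that already fits is returned unchanged.
print(split_long_token("layer", 5))
```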

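The following example configures max_token_length (and also enables the _english_ stop words):

//Define an index whose analyzer sets the standard parameters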
PUT standard_analyzer_token_length_conf_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer":{
          "type":"standard",
          "max_token_length":5,
          "stopwords":"_english_"
        }
      }
    }
  }
}

//Test the configured parameters
POST standard_analyzer_token_length_conf_index/_analyze
{
  "analyzer": "english_analyzer",
  "text": "transimission control protocol is a transport layer protocol"
}

//Response
{
  "tokens" : [
    {
      "token" : "trans",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "imiss",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "ion",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "contr",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "ol",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "proto",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "col",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "trans",
      "start_offset" : 36,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "port",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "layer",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "proto",
      "start_offset" : 52,
      "end_offset" : 57,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "col",
      "start_offset" : 57,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 13
    }
  ]
}

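After tokenization, the sentence above yields the following terms. The words is and a are removed by the _english_ stop words, which is why positions 7 and 8 are missing from the response:
[trans, imiss, ion, contr, ol, proto, col, trans, port, layer, proto, col]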
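
The position gap in the response can be reproduced with a short Python sketch. It assumes the tokens have already been split to at most five characters, and it uses a small illustrative subset of English stop words rather than the full _english_ list; remove_stopwords is a hypothetical name.

```python
# A small illustrative subset of English stop words (an assumption,
# not the full _english_ list used by Elasticsearch).
STOPWORDS = {"a", "an", "and", "is", "of", "the"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stop words but keep each surviving token's original position."""
    return [(token, position)
            for position, token in enumerate(tokens)
            if token not in stopwords]

# Tokens as they look after splitting at max_token_length = 5.
tokens = ["trans", "imiss", "ion", "contr", "ol", "proto", "col",
          "is", "a", "trans", "port", "layer", "proto", "col"]

for token, position in remove_stopwords(tokens):
    print(position, token)
```

Because positions are assigned before the stop words are dropped, the surviving tokens keep positions 0 to 6 and 9 to 13, matching the response above.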

1.3 Composition of the standard analyzer

No. | Component | Description
1 | Tokenizer | standard tokenizer
2 | Token Filters | lowercase token filter, stop token filter (disabled by default)
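
The idea that an analyzer is a tokenizer followed by a chain of token filters can be sketched in Python. Everything here is illustrative: the function names are made up, and the whitespace tokenizer is only a stand-in for the real standard tokenizer, which follows the Unicode segmentation rules.

```python
def whitespace_tokenizer(text):
    """Stand-in tokenizer: split on runs of whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter that lowercases every token."""
    return [token.lower() for token in tokens]

def make_stop_filter(stopwords):
    """Build a token filter that drops the given stop words."""
    def stop_filter(tokens):
        return [token for token in tokens if token not in stopwords]
    return stop_filter

def build_analyzer(tokenizer, filters):
    """Compose one tokenizer and an ordered list of token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

# Default composition: tokenizer + lowercase, stop filter disabled.
standard_like = build_analyzer(whitespace_tokenizer, [lowercase_filter])
# Same composition with the stop filter switched on.
with_stop = build_analyzer(whitespace_tokenizer,
                           [lowercase_filter, make_stop_filter({"is", "a"})])

print(standard_like("Transport Layer Protocol"))
print(with_stop("TCP is a protocol"))
```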

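To build a custom analyzer that behaves like standard, recreate its definition as a custom analyzer with the same tokenizer and token filters, then adjust whatever you need. For example:

//Define the custom analyzer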
PUT custom_rebuild_standard_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["lowercase"]
        }
      }
    }
  }
}

//Test request
POST custom_rebuild_standard_analyzer_index/_analyze
{
  "analyzer": "rebuild_analyzer",
  "text": "transimission control protocol is a transport layer protocol"
}


//Response
{
  "tokens" : [
    {
      "token" : "transimission",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "control",
      "start_offset" : 14,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "protocol",
      "start_offset" : 22,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "transport",
      "start_offset" : 36,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "layer",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "protocol",
      "start_offset" : 52,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

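If a user-defined analyzer is meant to accept the built-in standard analyzer's parameters, its type must be standard; otherwise the parameters have no effect, as the following example shows:

//Define the custom analyzer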
PUT custom_rebuild_standard_analyzer_index_1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_analyzer":{
        //If type here is standard, max_token_length takes effect; if it is custom, it does not
          "type":"custom",
          "tokenizer":"standard",
          "max_token_length":8,
          "filter":["lowercase"]
        }
      }
    }
  }
}

//Verify
POST custom_rebuild_standard_analyzer_index_1/_analyze
{
  "analyzer": "rebuild_analyzer", 
  "text": "transimission control protocol is a transport layer protocol"
}

All of the examples above can be run to verify the results yourself.
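
If the goal is a custom analyzer that still limits token length, one option is to set max_token_length on a tokenizer of type standard and reference that tokenizer from the custom analyzer. The sketch below builds such a settings body as a Python dict and prints it as JSON. The names my_standard_tokenizer and rebuild_analyzer are illustrative assumptions, and the printed body would be sent as the PUT body when creating an index of your choosing.

```python
import json

def custom_standard_settings(max_token_length):
    """Build index settings where the token length limit lives on the tokenizer."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Hypothetical tokenizer name; type "standard" accepts max_token_length.
                    "my_standard_tokenizer": {
                        "type": "standard",
                        "max_token_length": max_token_length,
                    }
                },
                "analyzer": {
                    "rebuild_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_standard_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            }
        }
    }

body = custom_standard_settings(8)
print(json.dumps(body, indent=2))
```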

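2. simple analyzer

2.1 The simple type and its tokenization output

The simple analyzer splits text at every non-letter character and converts all resulting terms to lowercase.

//Test the default tokenization of the simple analyzer
//Request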
POST _analyze
{
  "analyzer": "simple",
  "text": "Transimission Control Protocol is a transport layer protocol"
}

//Response
{
  "tokens" : [
    {
      "token" : "transimission",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "control",
      "start_offset" : 14,
      "end_offset" : 21,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "protocol",
      "start_offset" : 22,
      "end_offset" : 30,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "transport",
      "start_offset" : 36,
      "end_offset" : 45,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "layer",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "protocol",
      "start_offset" : 52,
      "end_offset" : 60,
      "type" : "word",
      "position" : 7
    }
  ]
}

After tokenization, the sentence above yields the following terms:
[transimission, control, protocol, is, a, transport, layer, protocol]
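
The rule "split at non-letter characters, then lowercase" can be approximated in Python. This is a rough sketch rather than the actual tokenizer: it treats a character as a letter when str.isalpha says so, and simple_analyze is a hypothetical name.

```python
from itertools import groupby

def simple_analyze(text):
    """Split at every non-letter character and lowercase each piece."""
    return ["".join(chars).lower()
            for is_letter, chars in groupby(text, key=str.isalpha)
            if is_letter]

print(simple_analyze("Transimission Control Protocol is a transport layer protocol"))
# Digits and punctuation act as separators and are discarded.
print(simple_analyze("TCP/IP v4"))
```

Unlike the standard analyzer, this rule discards digits entirely, so a term such as v4 is reduced to v.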

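2.2 Composition of the simple analyzer

No. | Component | Description
1 | Tokenizer | lowercase tokenizer

To build a custom analyzer that behaves like simple, set type to custom when defining the analyzer and copy the rest of the simple analyzer's configuration. For example: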

//Define the custom analyzer
PUT custom_rebuild_simple_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_simple":{
          "type":"custom",
          "tokenizer":"lowercase",
          "filter":[]
        }
      }
    }
  }
}

//Test request
POST custom_rebuild_simple_analyzer_index/_analyze
{
  "analyzer": "rebuild_simple",
  "text": "transimission control protocol is a transport layer protocol"
}


//Response
{
  "tokens" : [
    {
      "token" : "transimission",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "control",
      "start_offset" : 14,
      "end_offset" : 21,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "protocol",
      "start_offset" : 22,
      "end_offset" : 30,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "transport",
      "start_offset" : 36,
      "end_offset" : 45,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "layer",
      "start_offset" : 46,
      "end_offset" : 51,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "protocol",
      "start_offset" : 52,
      "end_offset" : 60,
      "type" : "word",
      "position" : 7
    }
  ]
}
