Index field type parameters_7_4_5

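norms parameter

norms stores the normalization factors that are used at query time to compute a document's relevance score for the queried field.
Although norms help with scoring, they take extra disk space (typically one byte per document per field, even for documents that have no value for that field). If a field never needs scoring, set norms to false on it, particularly when the field is used only for sorting or aggregations.
norms can be disabled on an existing field, but once disabled they cannot be re-enabled.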

PUT param_norms_index
{
  "mappings": {
    "properties": {
      "desc":{
        "type": "text",
        "norms":false
      }
    }
  }
}

//Fails: norms cannot be re-enabled once disabled. Error: Mapper for [desc] conflicts with existing mapping:\n[mapper [desc] has different [norms] values, cannot change from disable to enabled]
PUT param_norms_index/_mapping
{
  "properties":{
    "desc":{
      "type":"text",
      "norms":true
    }
  }
}

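norms are not removed immediately after being disabled; they are removed as old segments are merged into new segments while more documents are indexed. Until then, some documents no longer have norms while others still do, so scores computed on that field for the same document may be inconsistent over time.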

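null_value parameter

A null value cannot be indexed or searched in Elasticsearch. When a field is set to null (or to an empty array, or to an array of null values), it is treated as though the field has no values.
The null_value parameter lets you replace explicit null values with a specified value, so that the field can be indexed and searched.
Note that null_value must be of the same data type as the field, otherwise an exception is thrown.
null_value only affects how the data is indexed; it does not modify the JSON in _source.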

//Define the create_time field with a null_value
PUT param_null_value_index
{
  "mappings": {
    "properties": {
      "create_time":{
        "type": "date",
        "null_value": "2020-05-30"
      }
    }
  }
}

//create_time has an explicit value, so it is not replaced by null_value
PUT param_null_value_index/_doc/1
{
  "create_time":"2021-01-01"
}

//create_time is absent from the document, so null_value is not applied
PUT param_null_value_index/_doc/2
{
  "desc":"day day up"
}

//create_time is an explicit null, so it is indexed as the null_value
PUT param_null_value_index/_doc/3
{
  "desc":"歷史記錄",
  "create_time":null
}

//An empty array contains no explicit null, so null_value is not applied
PUT param_null_value_index/_doc/4
{
  "desc":"歷史記錄1",
  "create_time":[]
}

//Each null in the array is indexed as the null_value
PUT param_null_value_index/_doc/5
{
  "desc":"歷史記錄2",
  "create_time":[null,null]
}

GET param_null_value_index/_search
{
  "query": {
    "range": {
      "create_time": {
        "gte": "2020-05-10",
        "lte": "2021-05-10"
      }
    }
  }
}

Query result

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "param_null_value_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "create_time" : "2021-01-01"
        }
      },
      {
        "_index" : "param_null_value_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "desc" : "歷史記錄",
          "create_time" : null
        }
      },
      {
        "_index" : "param_null_value_index",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "desc" : "歷史記錄2",
          "create_time" : [
            null,
            null
          ]
        }
      }
    ]
  }
}
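
The result above (documents 1, 3 and 5) follows from a simple rule: only explicit nulls are replaced, while a missing field or an empty array contributes nothing to the index. The Python sketch below only simulates that rule for the five example documents; it is not how Elasticsearch implements it, and the helper names `indexed_values` and `in_range` are hypothetical.

```python
NULL_VALUE = "2020-05-30"  # the null_value configured for create_time in the mapping above

def indexed_values(doc, field="create_time", null_value=NULL_VALUE):
    """Return the values that would be indexed for `field` (illustrative only)."""
    if field not in doc:  # field absent: nothing to index
        return []
    value = doc[field]
    values = value if isinstance(value, list) else [value]
    # every explicit null is replaced by null_value; an empty list stays empty
    return [null_value if v is None else v for v in values]

def in_range(values, gte="2020-05-10", lte="2021-05-10"):
    # ISO yyyy-MM-dd strings compare correctly as plain strings
    return any(gte <= v <= lte for v in values)

docs = {
    1: {"create_time": "2021-01-01"},
    2: {"desc": "day day up"},
    3: {"desc": "歷史記錄", "create_time": None},
    4: {"desc": "歷史記錄1", "create_time": []},
    5: {"desc": "歷史記錄2", "create_time": [None, None]},
}

matching_ids = [doc_id for doc_id, doc in docs.items() if in_range(indexed_values(doc))]
print(matching_ids)  # [1, 3, 5]
```

Note that _source is untouched in this model too: the replacement happens only in the values handed to the index.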

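position_increment_gap parameter

After a text field is analyzed, Elasticsearch records the position (order) of each term in the field to support phrase queries. When a text field with multiple values is indexed, a fake gap is added between the values to prevent phrase queries from matching across values.
The position_increment_gap parameter sets the size of this gap; the default is 100.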

PUT param_increment_gap_index/_doc/1
{
  "names":["John Abraham","Lincoln Smith"]
}

//No match: the two terms belong to different values
GET param_increment_gap_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

//Matches, because slop is not smaller than the default position_increment_gap
GET param_increment_gap_index/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "Abraham Lincoln",
        "slop": 100
      }
    }
  }
}


//Another scenario: matching across values, skipping the middle value "Lincoln Smith" (this request overwrites document 1)
PUT param_increment_gap_index/_doc/1
{
  "names":["John Abraham","Lincoln Smith","Adware Kelin"]
}

GET param_increment_gap_index/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "Abraham Adware",
        "slop": 202
      }
    }
  }
}
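
The slop values used in the two queries above (100 and 202) can be checked with a small position calculation. This is a minimal sketch, assuming tokens are split on whitespace and that the first token of each later value lands at the previous position plus the gap plus one; `positions` and `min_slop` are hypothetical helpers, not Elasticsearch APIs.

```python
def positions(values, gap=100):
    """Assign a position to each whitespace-separated token of a multi-valued field."""
    result = []  # list of (token, position)
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # fake gap inserted between two consecutive values
        for token in value.lower().split():
            pos += 1
            result.append((token, pos))
    return result

def min_slop(values, first, second, gap=100):
    """Smallest slop for the in-order two-term phrase "first second"."""
    pos = dict(positions(values, gap))
    # an exact phrase expects `second` directly after `first`
    return pos[second] - (pos[first] + 1)

two_values = ["John Abraham", "Lincoln Smith"]
three_values = ["John Abraham", "Lincoln Smith", "Adware Kelin"]

print(min_slop(two_values, "abraham", "lincoln"))         # 100
print(min_slop(three_values, "abraham", "adware"))        # 202
print(min_slop(two_values, "abraham", "lincoln", gap=0))  # 0
```

With a gap of 0 the required slop drops to 0, which is the cross-value matching that the default gap is meant to prevent.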

position_increment_gap can be specified in the mapping:

//Set position_increment_gap to 0
PUT param_increment_gap_map_index
{
  "mappings": {
    "properties": {
      "names":{
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

PUT param_increment_gap_map_index/_doc/1
{
  "names":["John Abraham","Lincoln Smith"]
}

//Because position_increment_gap is 0, this query matches without specifying slop
GET param_increment_gap_map_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

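properties parameter

Type mappings, object fields and nested fields contain sub-fields, which are declared with the properties parameter. These sub-fields can be of any data type, including object and nested. properties can appear in the following places:
1) explicitly, when creating an index;
2) explicitly, when adding or updating a mapping with the mapping API;
3) dynamically, when indexing documents that contain new fields.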

//properties appears at the top level of the mapping and again inside the manager (object) and employees (nested) fields
PUT param_properties_index
{
  "mappings": {
    "properties": {
      "manager": {
        "properties": {
          "age": {
            "type": "integer"
          },
          "name": {
            "type": "text"
          }
        }
      },
      "employees": {
        "type": "nested",
        "properties": {
          "age": {
            "type": "integer"
          },
          "name": {
            "type": "text"
          }
        }
      }
    }
  }
}

PUT param_properties_index/_doc/1
{
  "region": "CHINA",
  "manager": {
    "age": 30,
    "name": "mana"
  },
  "employees": [
    {
      "age": 24,
      "name": "emp1"
    },
    {
      "age": 26,
      "name": "emp2"
    }
  ]
}


//Sub-fields can be used in queries, aggregations and so on, addressed with dot notation
GET param_properties_index/_search
{
  "query": {
    "match": {
      "manager.name": "mana"
    }
  },
  "aggs": {
    "employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "emp_age": {
          "histogram": {
            "field": "employees.age",
            "interval": 5
          }
        }
      }
    }
  }
}

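search_analyzer parameter

Usually the same analyzer should be applied at index time and at search time, so that the terms in the query have the same format as the terms in the inverted index.
Sometimes, though, it makes sense to use a different analyzer at search time, for example when using edge_ngram for autocomplete.
By default, queries use the analyzer defined in the field mapping, but this can be overridden with search_analyzer.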

//Define a custom filter (autocomplete_filter) and a custom analyzer (autocomplete); use autocomplete as the index-time analyzer and standard as the search-time analyzer
PUT param_search_analyzer_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter":{
          "type":"edge_ngram",
          "min_gram":1,
          "max_gram":20
        }
      },
      "analyzer": {
        "autocomplete":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["lowercase","autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text":{
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

//At index time the analyzer splits the text field into the terms [q,qu,qui,quic,quick,b,br,bro,brow,brown,f,fo,fox]
PUT param_search_analyzer_index/_doc/1
{
  "text":"Quick Brown Fox"
}

//Analyzed in the same way as above
PUT param_search_analyzer_index/_doc/2
{
  "text":"Quick to do"
}

//Analyzed in the same way as above
PUT param_search_analyzer_index/_doc/3
{
  "text":"Quick get brand"
}


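//At search time the standard analyzer produces [quick, br]; with operator and, both terms must match, so documents 1 and 3 are returned
GET param_search_analyzer_index/_search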
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br",
        "operator": "and"
      }
    }
  }
}

//Without operator and, matching either term is enough, so documents 1, 2 and 3 are returned
GET param_search_analyzer_index/_search
{
  "query": {
    "match": {
      "text": "Quick Br"
    }
  }
}
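
The effect of pairing an edge n-gram analyzer at index time with a plain analyzer at search time can be reproduced outside Elasticsearch. The Python sketch below is an approximation under stated assumptions: whitespace splitting stands in for the standard tokenizer, and `edge_ngrams`, `index_terms` and `search_terms` are hypothetical helpers.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of `token` from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_terms(text):
    # stand-in for the autocomplete analyzer: split, lowercase, edge n-grams
    return {gram for token in text.lower().split() for gram in edge_ngrams(token)}

def search_terms(text):
    # stand-in for the standard analyzer at search time: split, lowercase
    return text.lower().split()

docs = {1: "Quick Brown Fox", 2: "Quick to do", 3: "Quick get brand"}
index = {doc_id: index_terms(text) for doc_id, text in docs.items()}

query = search_terms("Quick Br")  # ['quick', 'br']
and_hits = [d for d, terms in index.items() if all(t in terms for t in query)]
or_hits = [d for d, terms in index.items() if any(t in terms for t in query)]

print(and_hits)  # [1, 3]
print(or_hits)   # [1, 2, 3]
```

Document 3 matches "br" because "brand" is indexed as b, br, bra, bran, brand. If the query were also run through the edge n-gram analyzer, it would expand to q, qu, ..., b, br and match far more loosely, which is why a different search_analyzer is useful here.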

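similarity parameter

Elasticsearch lets you configure a scoring algorithm, or similarity, per field. The similarity parameter is a simple way to choose a similarity other than the default BM25, such as TF/IDF or boolean.
Similarities are mostly useful for text fields, but can also be applied to other field types.
Custom similarities can be configured by tuning the parameters of the built-in similarities.
Elasticsearch provides the following similarities out of the box: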

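No. Algorithm Description
1 BM25 The Okapi BM25 algorithm; the default in Elasticsearch and Lucene.
2 classic The TF/IDF algorithm, formerly the default in Elasticsearch and Lucene; deprecated in 7.0.0.
3 boolean A simple boolean similarity for cases where full-text ranking is not needed; the score depends only on whether the query terms match, and each term's score equals its query boost.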

The similarity can be set at the field level when a field is first created:

//Define default_field and boolean_similarity_field; specifying the classic similarity in 7.x fails with:
//The [classic] similarity may not be used anymore. Please use the [BM25] similarity or build a custom [scripted] similarity instead.
PUT param_similarity_index
{
  "mappings": {
    "properties": {
      "default_field":{
        "type": "text"
      },
      "boolean_similarity_field":{
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}

//Index the same value into default_field and boolean_similarity_field, then query each field to compare the scores produced by the two similarities
PUT param_similarity_index/_doc/1
{
  "default_field":"Elasticsearch allows you to configure a scoring algorithm or similarity per field",
  "boolean_similarity_field":"Elasticsearch allows you to configure a scoring algorithm or similarity per field"
}

(1a) Request

GET param_similarity_index/_search
{
  "query": {
    "match": {
      "default_field": "Elasticsearch"
    }
  }
}

(1b) Response

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "param_similarity_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "default_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field",
          "boolean_similarity_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field"
        }
      }
    ]
  }
}

(2a) Request

GET param_similarity_index/_search
{
  "query": {
    "match": {
      "boolean_similarity_field": "Elasticsearch"
    }
  }
}

(2b) Response

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "param_similarity_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "default_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field",
          "boolean_similarity_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field"
        }
      }
    ]
  }
}
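
The two scores above (0.2876821 and 1.0) can be reproduced by hand. The sketch below assumes the common BM25 formulation with the default k1 = 1.2 and b = 0.75 and with the (k1 + 1) factor in the numerator; Elasticsearch's internal implementation may differ in detail, and `bm25_score` and `boolean_score` are hypothetical helpers written for this illustration.

```python
import math

def bm25_score(freq, doc_len, avg_doc_len, doc_count, doc_freq,
               k1=1.2, b=0.75, boost=1.0):
    """BM25 score of one term in one document, including the (k1 + 1) factor."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return boost * idf * tf

def boolean_score(matches, boost=1.0):
    """boolean similarity: the score is the query boost when the term matches."""
    return boost if matches else 0.0

# one document in the index, 12 terms long, containing "elasticsearch" once
default_score = bm25_score(freq=1, doc_len=12, avg_doc_len=12, doc_count=1, doc_freq=1)

print(round(default_score, 7))  # 0.2876821
print(boolean_score(True))      # 1.0
```

With a single document of average length the term-frequency part evaluates to 1, so the BM25 score reduces to the idf value ln(4/3), while the boolean similarity returns the boost (1.0 by default) regardless of term statistics.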

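store parameter

By default, field values are indexed to make them searchable, but they are not stored. This means the field can be queried, but its original value cannot be retrieved.
Usually this does not matter, because the field value is already part of the _source field, which is stored by default. If you want only some fields rather than the whole _source, use source filtering.
In certain situations it makes sense to store a field. For example, if a document has several fields and some of them are very large and not needed in search results, you can set store on just the fields you want returned:

tips: store cannot be changed once the field has been mapped; attempting to change it throws the exception Mapper for [content] conflicts with existing mapping:\n[mapper [content] has different [store] values]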

PUT param_store_index
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "store": true
      },
      "date":{
        "type": "date",
        "store": true
      },
      "content":{
        "type": "text"
      }
    }
  }
}

PUT param_store_index/_doc/1
{
  "title":"param_store_index",
  "date":"2020-05-31",
  "content":"A very long content field..."
}

PUT param_store_index/_doc/2
{
  "title":"param_store_index_1",
  "date":"2020-05-30",
  "content":"A very long content field..."
}

//Only title and date are returned; content is not stored, so it is omitted from the response
GET param_store_index/_search
{
  "stored_fields": ["title","date","content"]
}

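term_vector parameter

Term vectors contain information about the terms produced by the analysis process, including:
1) a list of terms;
2) the position (or order) of each term;
3) the start and end character offsets of each term in the original field value;
4) payloads: user-defined binary data associated with each term.
These term vectors can be stored so that they can be retrieved for a particular document.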
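
What a term vector with positions and offsets holds can be illustrated with a short Python sketch. It is only an approximation: a regular expression and lowercasing stand in for the analyzer, payloads are not modelled, and `term_vector` is a hypothetical helper rather than an Elasticsearch API.

```python
import re

def term_vector(text):
    """Collect position and character offsets for each lowercased word of `text`."""
    vector = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        vector.setdefault(term, []).append({
            "position": position,           # order of the term in the field
            "start_offset": match.start(),  # first character of the term
            "end_offset": match.end(),      # one past the last character
        })
    return vector

tv = term_vector("Quick brown fox")
for term, occurrences in sorted(tv.items()):
    print(term, occurrences)
```

For "Quick brown fox" this gives quick at position 0 with offsets 0 to 5, brown at position 1 with offsets 6 to 11, and fox at position 2 with offsets 12 to 15.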

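The term_vector parameter accepts the following values:

No. Value Description
1 no No term vectors are stored.
2 yes Only the terms in the field are stored.
3 with_positions Terms and positions are stored.
4 with_offsets Terms and character offsets are stored.
5 with_positions_offsets Terms, positions and character offsets are stored.
6 with_positions_payloads Terms, positions and payloads are stored.
7 with_positions_offsets_payloads Terms, positions, character offsets and payloads are stored.

Setting with_positions_offsets doubles the size of the field's index.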

PUT param_term_vector_index
{
  "mappings": {
    "properties": {
      "text":{
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}

PUT param_term_vector_index/_doc/1
{
  "text":"Quick brown fox"
}

//Because term_vector is configured on this field, highlighting can use the stored term vectors and run more efficiently
GET param_term_vector_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}