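norms parameter
norms store the normalization factors that are used at query time to compute a document's relevance score for the queried field.
Norms help with scoring, but they take extra disk space: usually one byte per document per field, even for documents that have no value for that field. If scoring on a particular field is not needed, set norms to false on that field, especially for fields used only for sorting or aggregations.
norms can be disabled on an existing field that already holds values, but once disabled they cannot be re-enabled.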
PUT param_norms_index
{
"mappings": {
"properties": {
"desc":{
"type": "text",
"norms":false
}
}
}
}
// Fails: norms that have been disabled cannot be re-enabled. Error: Mapper for [desc] conflicts with existing mapping:\n[mapper [desc] has different [norms] values, cannot change from disable to enabled]
PUT param_norms_index/_mapping
{
"properties":{
"desc":{
"type":"text",
"norms":true
}
}
}
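Norms are not removed immediately after being disabled. They are dropped only as more documents are indexed and old segments are merged into new ones. Because some documents then no longer carry norms while others still do, scores computed for the same documents may be inconsistent before and after the change.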
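To get a feel for the storage cost mentioned above, the sketch below applies the one-byte-per-document-per-field rule of thumb. The function name and the sample numbers are illustrative assumptions, not measurements from a real cluster.

```python
def estimate_norms_bytes(doc_count: int, fields_with_norms: int, bytes_per_norm: int = 1) -> int:
    """Rough estimate: one norm byte per document for every field that keeps norms."""
    if doc_count < 0 or fields_with_norms < 0:
        raise ValueError("counts must be non-negative")
    return doc_count * fields_with_norms * bytes_per_norm


# Hypothetical index: 50 million documents, 20 text fields with norms enabled.
total = estimate_norms_bytes(50_000_000, 20)
print(f"{total / 1024 ** 2:.1f} MiB")  # 953.7 MiB
```

Disabling norms on fields that are only sorted or aggregated on removes their share of this total once old segments have been merged away.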
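null_value parameter
A null value cannot be indexed or searched in ES. When a field is set to null (or to an empty array, or to an array containing only null values), it is treated as if the field had no value.
The null_value parameter replaces explicit null values with a specified value so that the field can be indexed and searched.
Note that null_value must be of the same data type as the field, otherwise an exception is raised.
null_value only affects how null is indexed; it does not modify the JSON in _source.
// Define the create_time field with a null_value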
PUT param_null_value_index
{
"mappings": {
"properties": {
"create_time":{
"type": "date",
"null_value": "2020-05-30"
}
}
}
}
// create_time has an explicit value, so null_value is not applied
PUT param_null_value_index/_doc/1
{
"create_time":"2021-01-01"
}
// create_time is absent from the document, so null_value is not applied
PUT param_null_value_index/_doc/2
{
"desc":"day day up"
}
// create_time is an explicit null, so it is indexed as the null_value
PUT param_null_value_index/_doc/3
{
"desc":"历史记录",
"create_time":null
}
// An empty array contains no explicit null, so null_value is not applied
PUT param_null_value_index/_doc/4
{
"desc":"历史记录1",
"create_time":[]
}
// The array contains explicit nulls, so they are indexed as the null_value
PUT param_null_value_index/_doc/5
{
"desc":"历史记录2",
"create_time":[null,null]
}
GET param_null_value_index/_search
{
"query": {
"range": {
"create_time": {
"gte": "2020-05-10",
"lte": "2021-05-10"
}
}
}
}
Query result (documents 1, 3 and 5 match; _source still shows the original null values)
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "param_null_value_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"create_time" : "2021-01-01"
}
},
{
"_index" : "param_null_value_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"desc" : "历史记录",
"create_time" : null
}
},
{
"_index" : "param_null_value_index",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"desc" : "历史记录2",
"create_time" : [
null,
null
]
}
}
]
}
}
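The null_value rules exercised by the five documents above can be summarized in a small model. This is a simplified sketch in Python, not how ES implements it; `MISSING` and `values_to_index` are hypothetical names introduced only for illustration.

```python
from typing import Any, List, Optional

# Sentinel for "the field does not appear in the document at all" (hypothetical helper).
MISSING = object()


def values_to_index(field_value: Any, null_value: Optional[str]) -> List[Any]:
    """Model which values get indexed for a field that has null_value configured.

    Only an explicit null is replaced by null_value. A missing field or an
    empty array contributes nothing to the index.
    """
    if field_value is MISSING:
        return []
    values = field_value if isinstance(field_value, list) else [field_value]
    indexed = []
    for value in values:
        if value is None:
            if null_value is not None:
                indexed.append(null_value)
        else:
            indexed.append(value)
    return indexed


# The create_time value of documents 1 to 5 from the example above.
cases = {1: "2021-01-01", 2: MISSING, 3: None, 4: [], 5: [None, None]}
for doc_id, value in cases.items():
    print(doc_id, values_to_index(value, "2020-05-30"))
```

Documents 2 and 4 end up with nothing indexed for create_time, which is consistent with the range query returning only documents 1, 3 and 5.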
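position_increment_gap parameter
After a text field is analyzed, ES records the position (order) of each term in the field so that phrase queries can be supported. When a text field with multiple values is indexed, a fake gap is added between the values to prevent phrase queries from matching across values.
The position_increment_gap parameter configures the size of this gap; the default is 100.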
PUT param_increment_gap_index/_doc/1
{
"names":["John Abraham","Lincoln Smith"]
}
// No match: the two terms belong to different values
GET param_increment_gap_index/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
// Matches, because slop is not smaller than the default position_increment_gap
GET param_increment_gap_index/_search
{
"query": {
"match_phrase": {
"names": {
"query": "Abraham Lincoln",
"slop": 100
}
}
}
}
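// Another case: matching across values while skipping the middle value "Lincoln Smith"; the phrase has to bridge two gaps plus the two terms of the middle value, hence slop 202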
PUT param_increment_gap_index/_doc/1
{
"names":["John Abraham","Lincoln Smith","Adware Kelin"]
}
GET param_increment_gap_index/_search
{
"query": {
"match_phrase": {
"names": {
"query": "Abraham Adware",
"slop": 202
}
}
}
}
position_increment_gap can be specified in the mapping
// Set position_increment_gap to 0
PUT param_increment_gap_map_index
{
"mappings": {
"properties": {
"names":{
"type": "text",
"position_increment_gap": 0
}
}
}
}
PUT param_increment_gap_map_index/_doc/1
{
"names":["John Abraham","Lincoln Smith"]
}
// Because position_increment_gap is 0, the phrase query matches without a slop
GET param_increment_gap_map_index/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
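The slop values used in this section (100, 202 and none at all) follow from how positions are assigned across values. The sketch below is a simplified Python model of that bookkeeping: it assumes whitespace tokenization, that every term advances the position by 1, and that the gap is added once between consecutive values. `assign_positions` and `required_slop` are hypothetical helper names, and the model only handles an in-order two-term phrase whose terms each occur once.

```python
from typing import List


def assign_positions(values: List[str], gap: int = 100) -> List[List[int]]:
    """Assign term positions to each value of a multi-valued text field."""
    positions = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # fake gap between consecutive values
        value_positions = []
        for _ in value.split():
            pos += 1
            value_positions.append(pos)
        positions.append(value_positions)
    return positions


def required_slop(values: List[str], first: str, second: str, gap: int = 100) -> int:
    """Smallest slop for the in-order phrase 'first second' to match (simplified)."""
    terms = [t for v in values for t in v.lower().split()]
    flat = [p for ps in assign_positions(values, gap) for p in ps]
    p1 = flat[terms.index(first.lower())]
    p2 = flat[terms.index(second.lower())]
    # An exact phrase expects the second term at p1 + 1.
    return p2 - (p1 + 1)


two = ["John Abraham", "Lincoln Smith"]
three = ["John Abraham", "Lincoln Smith", "Adware Kelin"]
print(assign_positions(two))                            # [[0, 1], [102, 103]]
print(required_slop(two, "Abraham", "Lincoln"))         # 100
print(required_slop(two, "Abraham", "Lincoln", gap=0))  # 0
print(required_slop(three, "Abraham", "Adware"))        # 202
```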
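properties parameter
Type mappings, object fields and nested fields contain sub-fields, which are declared with the properties parameter. These sub-fields can be of any data type, including object and nested. properties can be added in the following ways:
1) by defining them explicitly when creating an index;
2) by defining them explicitly when adding or updating a mapping with the mapping API;
3) dynamically, when indexing documents that contain new fields.
// properties at the top level of the mapping, plus properties under an object field (manager) and a nested field (employees)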
PUT param_properties_index
{
"mappings": {
"properties": {
"manager": {
"properties": {
"age": {
"type": "integer"
},
"name": {
"type": "text"
}
}
},
"employees": {
"type": "nested",
"properties": {
"age": {
"type": "integer"
},
"name": {
"type": "text"
}
}
}
}
}
}
PUT param_properties_index/_doc/1
{
"region": "CHINA",
"manager": {
"age": 30,
"name": "mana"
},
"employees": [
{
"age": 24,
"name": "emp1"
},
{
"age": 26,
"name": "emp2"
}
]
}
// Inner fields can be referenced in queries, aggregations, etc. using dot notation
GET param_properties_index/_search
{
"query": {
"match": {
"manager.name": "mana"
}
},
"aggs": {
"employees": {
"nested": {
"path": "employees"
},
"aggs": {
"emp_age": {
"histogram": {
"field": "employees.age",
"interval": 5
}
}
}
}
}
}
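The dot notation used in the query above reflects how inner object fields are addressed by their full path. The following Python sketch flattens an object into dotted paths to make that visible. It covers object fields only (nested fields are indexed differently and are not modeled here), and `flatten_object` is a hypothetical helper name.

```python
from typing import Any, Dict


def flatten_object(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten inner object fields into dotted field paths."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_object(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


doc = {"region": "CHINA", "manager": {"age": 30, "name": "mana"}}
print(flatten_object(doc))
# {'region': 'CHINA', 'manager.age': 30, 'manager.name': 'mana'}
```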
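search_analyzer parameter
Usually the same analyzer should be applied at index time and at search time, so that the terms in the query have the same format as the terms in the inverted index.
Sometimes, though, it makes sense to use a different analyzer at search time, for example when using edge_ngram for autocomplete.
By default, queries use the analyzer defined in the field mapping; this can be overridden with the search_analyzer setting.
// Define a custom filter (autocomplete_filter) and a custom analyzer (autocomplete); use autocomplete as the index-time analyzer and standard as the search-time analyzer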
PUT param_search_analyzer_index
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter":{
"type":"edge_ngram",
"min_gram":1,
"max_gram":20
}
},
"analyzer": {
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase","autocomplete_filter"]
}
}
}
},
"mappings": {
"properties": {
"text":{
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
// At index time the analyzer splits the text field into the terms [q,qu,qui,quic,quick,b,br,bro,brow,brown,f,fo,fox]
PUT param_search_analyzer_index/_doc/1
{
"text":"Quick Brown Fox"
}
// Analyzed the same way as above
PUT param_search_analyzer_index/_doc/2
{
"text":"Quick to do"
}
// Analyzed the same way as above
PUT param_search_analyzer_index/_doc/3
{
"text":"Quick get brand"
}
// At search time the query is analyzed with standard into the terms [quick, br]; with operator and, both terms must match
GET param_search_analyzer_index/_search
{
"query": {
"match": {
"text": {
"query": "Quick Br",
"operator": "and"
}
}
}
}
// With the default operator (or), matching either quick or br is enough
GET param_search_analyzer_index/_search
{
"query": {
"match": {
"text": "Quick Br"
}
}
}
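To see why a different analyzer at search time helps here, the following Python sketch imitates the two analysis chains. It is a rough stand-in, assuming whitespace splitting in place of the standard tokenizer; `edge_ngrams`, `index_terms`, `search_terms` and `match` are hypothetical helper names.

```python
from typing import Dict, List, Set


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Prefixes of the token, from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_terms(text: str) -> Set[str]:
    """Index-time analysis: lowercase, split, then edge n-grams of every token."""
    terms: Set[str] = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def search_terms(text: str) -> List[str]:
    """Search-time analysis: lowercase and split only, no n-grams."""
    return text.lower().split()


def match(query: str, docs: Dict[int, str], operator: str = "or") -> List[int]:
    """Ids of documents whose indexed terms contain all ('and') or any ('or') query terms."""
    combine = all if operator == "and" else any
    wanted = search_terms(query)
    return [doc_id for doc_id, text in docs.items()
            if combine(term in index_terms(text) for term in wanted)]


docs = {1: "Quick Brown Fox", 2: "Quick to do", 3: "Quick get brand"}
print(edge_ngrams("quick"))             # ['q', 'qu', 'qui', 'quic', 'quick']
print(match("Quick Br", docs, "and"))   # [1, 3]
print(match("Quick Br", docs))          # [1, 2, 3]
```

If the query were also run through the edge n-gram chain, it would expand to terms such as q and b, which nearly every document contains, so the results would be far less selective.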
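similarity parameter
ES allows a scoring algorithm (similarity) to be configured per field. The similarity parameter offers a simple way to choose an algorithm other than the default BM25, such as boolean; the older TF/IDF option (classic) is rejected on 7.x.
Similarities are mostly useful for text fields, but can also apply to other field types.
Custom similarities can be configured by tuning the parameters of the built-in similarities.
ES provides the following similarities out of the box:
No. | Algorithm | Description |
---|---|---|
1 | BM25 | The Okapi BM25 algorithm, the default in ES and Lucene. |
2 | classic | The TF/IDF algorithm, formerly the default in ES and Lucene; deprecated as of 7.0.0. |
3 | boolean | A simple boolean similarity for cases where full-text ranking is not needed. The score depends only on whether the query terms match; each matching term gets a score equal to its query boost. |
The similarity parameter is set at the field level when a field is first created:
// Define default_field and boolean_similarity_field. Specifying the classic similarity on 7.x fails with:
// The [classic] similarity may not be used anymore. Please use the [BM25] similarity or build a custom [scripted] similarity instead.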
PUT param_similarity_index
{
"mappings": {
"properties": {
"default_field":{
"type": "text"
},
"boolean_similarity_field":{
"type": "text",
"similarity": "boolean"
}
}
}
}
// Index the same value into default_field and boolean_similarity_field, then query each field to compare the scores produced by the two similarities
PUT param_similarity_index/_doc/1
{
"default_field":"Elasticsearch allows you to configure a scoring algorithm or similarity per field",
"boolean_similarity_field":"Elasticsearch allows you to configure a scoring algorithm or similarity per field"
}
(1a) Request
GET param_similarity_index/_search
{
"query": {
"match": {
"default_field": "Elasticsearch"
}
}
}
(1b) Response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "param_similarity_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"default_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field",
"boolean_similarity_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field"
}
}
]
}
}
(2a) Request
GET param_similarity_index/_search
{
"query": {
"match": {
"boolean_similarity_field": "Elasticsearch"
}
}
}
(2b) Response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "param_similarity_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"default_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field",
"boolean_similarity_field" : "Elasticsearch allows you to configure a scoring algorithm or similarity per field"
}
}
]
}
}
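The two scores above (0.2876821 and 1.0) can be reproduced with a short calculation. The Python sketch below is a hedged reconstruction, assuming the default BM25 parameters k1 = 1.2 and b = 0.75, the idf and tf formulas as reported by the explain API on 7.x, and a constant multiplier of k1 + 1 = 2.2. That multiplier and the helper names are assumptions, not taken from the text above.

```python
import math


def bm25_score(freq: float, doc_len: float, avg_doc_len: float,
               doc_count: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score of one term in one document (simplified reconstruction)."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    boost = k1 + 1  # assumed constant factor (2.2 with the default k1)
    return boost * idf * tf


def boolean_score(matches: bool, boost: float = 1.0) -> float:
    """Boolean similarity: a matching term scores exactly its query boost."""
    return boost if matches else 0.0


# One document in the index; "elasticsearch" occurs once in a 12-term field.
score = bm25_score(freq=1, doc_len=12, avg_doc_len=12, doc_count=1, doc_freq=1)
print(round(score, 7))      # 0.2876821
print(boolean_score(True))  # 1.0
```

With a single document the length normalization cancels out (doc_len equals avg_doc_len), so the BM25 score reduces to the idf term ln(4/3).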
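store parameter
By default, field values are indexed to make them searchable, but they are not stored. This means the field can be queried, but the original field value cannot be retrieved from the field itself.
Usually this is not a problem, because the field value is already part of the _source field, which is stored by default. To retrieve only some fields rather than the whole _source, use source filtering.
In certain situations it makes sense to store a field. For example, if a document has several fields, some of which are very large and not needed in search results, store can be set on just the fields that need to be returned:
tips: the store setting cannot be changed once it is set in the mapping; trying to do so raises Mapper for [content] conflicts with existing mapping:\n[mapper [content] has different [store] values]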
PUT param_store_index
{
"mappings": {
"properties": {
"title":{
"type": "text",
"store": true
},
"date":{
"type": "date",
"store": true
},
"content":{
"type": "text"
}
}
}
}
PUT param_store_index/_doc/1
{
"title":"param_store_index",
"date":"2020-05-31",
"content":"A very long content field..."
}
PUT param_store_index/_doc/2
{
"title":"param_store_index_1",
"date":"2020-05-30",
"content":"A very long content field..."
}
GET param_store_index/_search
{
"stored_fields": ["title","date","content"]
}
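The request above also asks for content, which is not stored. The Python sketch below models what can be expected to come back: only fields mapped with store set to true, each wrapped in an array. It is a simplified model with hypothetical names (`MAPPING`, `fetch_stored_fields`), and it does not model how ES formats values (a stored date, for instance, may be returned in a different textual form).

```python
from typing import Any, Dict, List

# Store flags copied from the mapping of param_store_index above.
MAPPING: Dict[str, Dict[str, Any]] = {
    "title": {"type": "text", "store": True},
    "date": {"type": "date", "store": True},
    "content": {"type": "text"},
}


def fetch_stored_fields(doc: Dict[str, Any], mapping: Dict[str, Dict[str, Any]],
                        requested: List[str]) -> Dict[str, List[Any]]:
    """Return the requested fields that are mapped with store: true."""
    result: Dict[str, List[Any]] = {}
    for name in requested:
        if mapping.get(name, {}).get("store", False) and name in doc:
            value = doc[name]
            # Stored fields come back as arrays, even for single values.
            result[name] = value if isinstance(value, list) else [value]
    return result


doc = {"title": "param_store_index", "date": "2020-05-31",
       "content": "A very long content field..."}
print(fetch_stored_fields(doc, MAPPING, ["title", "date", "content"]))
# {'title': ['param_store_index'], 'date': ['2020-05-31']}
```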
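term_vector parameter
Term vectors contain information about the terms produced by text analysis, including:
1) a list of terms;
2) the position (order) of each term;
3) the start and end character offsets that map each term back to the original field value;
4) payloads: user-defined binary data associated with each term;
These term vectors can be stored so that they can be retrieved for a particular document.
Values accepted by the term_vector parameter:
No. | Value | Description |
---|---|---|
1 | no | No term vectors are stored. |
2 | yes | Only the terms in the field are stored. |
3 | with_positions | Terms and positions are stored. |
4 | with_offsets | Terms and character offsets are stored. |
5 | with_positions_offsets | Terms, positions and character offsets are stored. |
6 | with_positions_payloads | Terms, positions and payloads are stored. |
7 | with_positions_offsets_payloads | Terms, positions, character offsets and payloads are stored. |
Setting with_positions_offsets doubles the size of the field's index.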
PUT param_term_vector_index
{
"mappings": {
"properties": {
"text":{
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
PUT param_term_vector_index/_doc/1
{
"text":"Quick brown fox"
}
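// Because term_vector is set to with_positions_offsets, highlighting on this field is more efficient (the fast vector highlighter can be used)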
GET param_term_vector_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {}
}
}
}
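To make positions and character offsets concrete, the Python sketch below computes them for the sample text "Quick brown fox". It is an approximation that assumes simple word splitting and lowercasing in place of the real analyzer; `term_vector` here is a hypothetical helper, not an ES API call.

```python
import re
from typing import Dict, List, Union


def term_vector(text: str) -> List[Dict[str, Union[str, int]]]:
    """List each term with its position and its start/end character offsets."""
    entries: List[Dict[str, Union[str, int]]] = []
    for position, m in enumerate(re.finditer(r"\w+", text)):
        entries.append({
            "term": m.group().lower(),
            "position": position,
            "start_offset": m.start(),
            "end_offset": m.end(),
        })
    return entries


for entry in term_vector("Quick brown fox"):
    print(entry)
# {'term': 'quick', 'position': 0, 'start_offset': 0, 'end_offset': 5}
# {'term': 'brown', 'position': 1, 'start_offset': 6, 'end_offset': 11}
# {'term': 'fox', 'position': 2, 'start_offset': 12, 'end_offset': 15}
```

The offsets are what a highlighter needs to wrap the matched terms in the original string without re-analyzing the text.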