编程界的小学生

I. An example

Let's use an example to show what a mapping actually is.

1. Preparing the data

PUT /blog/_doc/1
{
  "create_time": "2020-05-01",
  "title": "first article",
  "content": "xxxxxxx",
  "author_id": 123
}

PUT /blog/_doc/2
{
  "create_time": "2020-05-02",
  "title": "second article",
  "content": "xxxxxxxxxx",
  "author_id": 123
}

PUT /blog/_doc/3
{
  "create_time": "2020-05-03",
  "title": "third article",
  "content": "xxxxxxxxxxxxxxxx",
  "author_id": 123
}

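2. Searching

GET /blog/_search?q=2020
Returns 0 hits.

GET /blog/_search?q=2020-05-01
Returns 1 hit, the document with _id=1.

GET /blog/_search?q=create_time:2020-05-01
Returns 1 hit, the document with _id=1.

GET /blog/_search?q=create_time:2020
Returns 0 hits.

3. Analysis

Why do the results look like that? Ideally, shouldn't the first search return all three documents? The mapping is behind this. Every key has a data type (create_time, for example, is a date), and Elasticsearch analyzes and searches each data type differently. All of that is recorded in the mapping. Read on to see what a mapping looks like, how to set one, and how to view it.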

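II. Mapping

1. What it is

A mapping is the metadata that describes the type of each field in an index. When a document introduces a field that has no explicit mapping, dynamic mapping automatically picks one for it. The mapping records the field's type, how it is searched (exact match or full-text search), its analyzer, and so on.

2. How to view it

The syntax for viewing a mapping is simple:

GET /index/_mapping

For example:

GET /blog/_mapping

The response: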

{
  "blog" : {
    "mappings" : {
      "properties" : {
        "author_id" : {
          "type" : "long"
        },
        "content" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "create_time" : {
          "type" : "date"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

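What the fields in the response mean:

key	Explanation
blog	Name of the index
properties	The fields of the documents in the index
type	Data type of the field's values
keyword	Very useful: query field.keyword when you need an exact match. Dynamic mapping generates this sub-field for string fields automatically.
ignore_above	Strings longer than this limit are not indexed in that field. Here, a title or content value longer than 256 characters is left out of the keyword sub-field (the text field itself is still indexed in full).

So why does

GET /blog/_search?q=2020 find nothing? Because by default only text fields are analyzed; the other types are not. The value 2020-05-01 is therefore treated as one whole value and is never split into 2020, 05 and 01.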
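The difference between an exact-value field and an analyzed field can be sketched with a toy inverted index. This is only an illustration in Python, not how Elasticsearch is implemented: `tokenize` and `build_index` are made-up helpers, and real date fields are stored as numeric values rather than as string terms.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_index(docs, analyzed):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # An exact-value field contributes its whole value as one single term.
        terms = tokenize(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index

docs = {1: "2020-05-01", 2: "2020-05-02", 3: "2020-05-03"}

exact = build_index(docs, analyzed=False)
full_text = build_index(docs, analyzed=True)

print(sorted(exact.get("2020", set())))        # no document holds the exact term "2020"
print(sorted(exact.get("2020-05-01", set())))  # only document 1
print(sorted(full_text.get("2020", set())))    # all three documents
```

In this model, a lookup for 2020 only succeeds when the stored value was split into separate terms at index time.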

3. Creating a mapping

3.1 Syntax

PUT /index
{
  "mappings": {
    "properties": {
        "field": {
          "mapping_parameter": "parameter_value"
        }
      }
  }
}

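3.2 Demo

Before creating it, make sure the index does not exist yet, i.e. run DELETE /blog first. The mapping of an existing field cannot be changed once the index has been created, so the field types have to be defined up front; only new fields can be added to an existing index later (via PUT /index/_mapping).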

PUT /blog
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "english"
      },
      "content": {
        "type": "text", 
        "analyzer": "standard"
      },
      "create_date": {
        "type": "date"
      }
    }
  }
}

Now view the mapping again with GET /blog/_mapping: the analyzer of content is standard and the analyzer of title is english.

{
  "blog" : {
    "mappings" : {
      "properties" : {
        "author_id" : {
          "type" : "long"
        },
        "content" : {
          "type" : "text",
          "analyzer" : "standard"
        },
        "create_date" : {
          "type" : "date"
        },
        "title" : {
          "type" : "text",
          "analyzer" : "english"
        }
      }
    }
  }
}

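3.3 About the analyzer parameter

Value	Explanation
no	The field cannot be found by searching
not_analyzed	The whole value is stored as a single term and is not split further; commonly used for phrases, idioms, email addresses, and the like
A specific analyzer	For example english, standard

Note that no and not_analyzed are values of the legacy index parameter of the string type in Elasticsearch versions before 5.0, not values of analyzer. In current versions, analyzer only takes an analyzer name, "index": false makes a field unsearchable, and the keyword type stores the whole value as a single term.

Only the text type is analyzed by default, with the standard analyzer; all other data types are not analyzed.

3.4 Testing the mapping

We just set the english analyzer on title, so let's try it: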

GET /blog/_analyze
{
  "field": "title",
  "text": "Hello-WorlD"
}

Result:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

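The text was analyzed as expected. Nicely, the case was normalized and the - character was dropped; the analyzer did all of that for us. Analyzers are not the focus of this post, so I'll leave it at that.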
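To make the token fields in that response more concrete, here is a minimal Python sketch of an analyzer that only splits on non-alphanumeric characters and lowercases each token. The `analyze` function is a hypothetical helper, not Elasticsearch code, and it leaves out the stop-word removal and stemming that the real english analyzer also performs.

```python
import re

def analyze(text):
    """Toy analyzer: split on non-alphanumeric characters, lowercase each token."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[0-9A-Za-z]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the token's first character
            "end_offset": match.end(),      # index just past its last character
            "position": position,           # ordinal of the token in the stream
        })
    return tokens

for token in analyze("Hello-WorlD"):
    print(token)
```

For this particular input the sketch yields the same tokens, offsets, and positions as the _analyze response above.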

4. Modifying a mapping

Let's try to change author_id and create_date to a string type.

PUT /blog
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "text"
      },
      "create_date": {
        "type": "text"
      }
    }
  }
}

Result:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "resource_already_exists_exception",
        "reason" : "index [blog/cZEVhKIbS8GfHHmDS9rGYg] already exists",
        "index_uuid" : "cZEVhKIbS8GfHHmDS9rGYg",
        "index" : "blog"
      }
    ],
    "type" : "resource_already_exists_exception",
    "reason" : "index [blog/cZEVhKIbS8GfHHmDS9rGYg] already exists",
    "index_uuid" : "cZEVhKIbS8GfHHmDS9rGYg",
    "index" : "blog"
  },
  "status" : 400
}

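It reports that the index already exists: PUT /blog is the create-index request, so it can never modify an existing index. Updating the mapping through PUT /blog/_mapping would not help here either, because Elasticsearch refuses to change the type of an existing field. The reason is simple: once you have millions of documents indexed, changing the mapping would mean redoing the data types, the analysis, and everything else for all of them. The usual way out is to create a new index with the desired mapping and reindex the data into it.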

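5. Mapping parameters

Parameter	Explanation
analyzer	Specifies the analyzer
coerce	Whether type coercion is allowed. true: indexing "1" into a long field succeeds. false: indexing "1" into a long field fails with a type mismatch
doc_values	Speeds up sorting and aggregations; defaults to true. If you are sure you never need to sort or aggregate on a field, or access its value from a script, you can disable doc values to save disk space (not supported by `text` and `annotated_text`)
eager_global_ordinals	Used on fields that are aggregated on, to optimize aggregation performance
ignore_above	Strings longer than this limit are not indexed
fields	Creates multi-fields for a field, so the same value can serve different purposes (full-text search, or aggregations and sorting)
norms	Whether to store the normalization factors used for scoring (worth disabling on fields used only for filtering or aggregations)
search_analyzer	Sets a separate analyzer for query time
similarity	Sets the relevance algorithm for the field, e.g. BM25 or, in older versions, classic (TF/IDF); the default is BM25

For more, see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html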
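Two of these parameters, coerce and ignore_above, are easy to misread, so here is a rough Python model of what they do at index time. The functions `index_long` and `index_keyword` are invented for this sketch and only cover the simplest cases; Elasticsearch's own coercion rules handle more inputs than this.

```python
def index_long(value, coerce=True):
    """Toy model of a long field: with coerce on, numeric strings are converted."""
    if isinstance(value, bool):
        raise ValueError("booleans are not numbers")
    if isinstance(value, int):
        return value
    if coerce and isinstance(value, str):
        return int(value)  # raises ValueError if the string is not a whole number
    raise ValueError(f"cannot index {value!r} into a long field")

def index_keyword(value, ignore_above=256):
    """Toy model of a keyword field: strings longer than ignore_above are skipped."""
    return value if len(value) <= ignore_above else None

print(index_long("1"))                  # the string is coerced to a number
print(index_keyword("first article"))   # short enough, indexed as-is
print(index_keyword("x" * 300))         # too long, nothing is indexed

try:
    index_long("1", coerce=False)
except ValueError as err:
    print("rejected:", err)
```

The point of the sketch is that coerce decides whether a mismatched value is converted or rejected, while ignore_above silently skips overly long strings instead of failing the request.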

III. Customizing dynamic mapping

The key is the dynamic parameter. Here is the simplest example:

PUT /my_blog
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {
        "type": "text"
      },
      "tags": {
        "type": "object",
        "dynamic": "true"
      }
    }
  }
}

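This creates an index that can only contain the title and tags fields, because of dynamic: strict. tags is configured with dynamic: true, which means it can hold any additional sub-fields. dynamic: strict is strict mode: no other fields are allowed. It is like a MySQL table whose schema is fixed: if you insert into a column that does not exist, you get an unknown-column error.

Test: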

PUT /my_blog/_doc/1
{
  "title" : "first article",
  "content" : "xxxxxxx",
  "tags" : {
    "language" : "java c++ python"
  }
}

Result:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "strict_dynamic_mapping_exception",
        "reason" : "mapping set to strict, dynamic introduction of [content] within [_doc] is not allowed"
      }
    ],
    "type" : "strict_dynamic_mapping_exception",
    "reason" : "mapping set to strict, dynamic introduction of [content] within [_doc] is not allowed"
  },
  "status" : 400
}

As expected, it fails: in strict mode, extra fields cannot be added dynamically.

Testing tags:

PUT /my_blog/_doc/1
{
  "title" : "first article",
  "tags" : {
    "language" : "java c++ python",
    "stars": 10000
  }
}

Result:

{
  "_index" : "my_blog",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

It succeeds, because tags has "dynamic": "true", which allows extra fields to be added. This is what dynamic mapping means.
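The way the two dynamic settings interact can be sketched as a small recursive check. This is a Python toy model under the assumption that an inner object inherits its parent's dynamic setting unless it overrides it; `check` is a hypothetical helper, and the error text is shortened rather than copied from Elasticsearch.

```python
def check(doc, mapping, inherited="true", path=""):
    """Raise ValueError for any field that the toy `dynamic` rules reject."""
    dynamic = mapping.get("dynamic", inherited)  # inner objects inherit the setting
    properties = mapping.get("properties", {})
    for name, value in doc.items():
        full_name = path + name
        if name not in properties:
            if dynamic == "strict":
                raise ValueError(f"dynamic introduction of [{full_name}] is not allowed")
            continue  # dynamic "true": the new field is accepted
        if isinstance(value, dict):
            check(value, properties[name], dynamic, full_name + ".")

my_blog = {
    "dynamic": "strict",
    "properties": {
        "title": {"type": "text"},
        "tags": {"type": "object", "dynamic": "true"},
    },
}

check({"title": "first article",
       "tags": {"language": "java c++ python", "stars": 10000}}, my_blog)
print("accepted")

try:
    check({"title": "first article", "content": "xxxxxxx"}, my_blog)
except ValueError as err:
    print("rejected:", err)
```

The two calls mirror the two requests above: unknown sub-fields under tags pass, while the unknown top-level content field is rejected.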

For more, see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html

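IV. Data types

4.1 List

Data type	Example
long	123456
double	123.111
boolean	true false
date	2020-05-28
text/keyword	Strings (a single string type in versions before 5.0)
integer/short/byte/float	Basic numeric types
binary	Binary data
range	Range types, e.g. integer_range, float_range, long_range, double_range, date_range
object	A single JSON object
nested	An array of JSON objects

Why isn't 123456 mapped as integer? Dynamic mapping derives the type from the JSON value, and JSON has a single number type that does not distinguish integer from long. Since Elasticsearch cannot know how large later values will be, dynamic mapping picks the wider type, long.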
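The detection logic can be approximated in a few lines of Python. This is a simplified sketch: `guess_type` is a made-up function, it recognizes only the yyyy-MM-dd date format, and it ignores arrays and null values, all of which Elasticsearch's real dynamic mapping treats in more detail.

```python
import re
from datetime import datetime

def guess_type(value):
    """Toy version of dynamic field detection for a single JSON value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"             # whole numbers get the wide integer type
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            try:
                datetime.strptime(value, "%Y-%m-%d")
                return "date"
            except ValueError:
                pass              # looks like a date but is not a valid one
        return "text"
    raise TypeError(f"unsupported value: {value!r}")

doc = {"create_time": "2020-05-01", "title": "first article", "author_id": 123}
print({field: guess_type(value) for field, value in doc.items()})
```

Applied to the first blog document, the sketch arrives at the same types that GET /blog/_mapping reported earlier: date, text, and long.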

4.2 The object type

PUT /blog/_doc/4
{
  "tags" : {
    "language" : "java c++ python",
    "city" : "beijing",
    "stars": 10000
  },
  "create_time": "2020-05-03",
  "title": "third article",
  "content": "xxxxxxxxxxxxxxxx",
  "author_id": 123
}

tags is an object field with three sub-fields: language, city and stars.
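Object fields are indexed as a flat list of key-value pairs whose names are dotted paths such as tags.city. The following Python sketch shows that flattening; `flatten` is a hypothetical helper written for illustration, not part of any Elasticsearch client.

```python
def flatten(doc, prefix=""):
    """Flatten nested JSON objects into dotted field names."""
    flat = {}
    for name, value in doc.items():
        full_name = prefix + name
        if isinstance(value, dict):
            flat.update(flatten(value, full_name + "."))  # descend into the object
        else:
            flat[full_name] = value
    return flat

doc = {
    "tags": {"language": "java c++ python", "city": "beijing", "stars": 10000},
    "title": "third article",
    "author_id": 123,
}
for field, value in flatten(doc).items():
    print(field, "=", value)
```

This is why a sub-field of an object is addressed in queries by its full path, for example tags.city.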

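V. Mapping summary

When documents are indexed without an explicit mapping, ES creates the mapping it considers appropriate
A mapping can be roughly thought of as a "table schema": it defines the data types and so on
Different data types are analyzed differently: text fields use the standard analyzer by default, and other types are not analyzed by default
You can create an index's mapping by hand in advance to set the data type, analyzer, and so on for each field
The keyword sub-field is really handy. ES generates it automatically; text fields are analyzed by default, but querying field.keyword gives you an exact match
Learn from the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html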

What exactly is an Elasticsearch mapping?