Getting Started: Indexing Documents, Searching, and Aggregations

1 Talking to Elasticsearch

1.1 Request format

  curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
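
The template above maps directly onto an HTTP request. As a minimal sketch using only the Python standard library, the helper below assembles the same pieces (verb, protocol, host, port, path, query string, body) into a `urllib.request.Request` without sending it. The name `build_request` and its default values are assumptions made for illustration; they are not part of Elasticsearch.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(verb, path, query=None, body=None,
                  protocol="http", host="localhost", port=9200):
    """Assemble <VERB> <PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING> plus an optional JSON body."""
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        url += "?" + urlencode(query)
    # The body, when present, is sent as JSON, like curl's -d '<BODY>'.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=verb)


# Roughly: curl -XGET 'http://localhost:9200/_count?pretty=true'
req = build_request("GET", "_count", query={"pretty": "true"})
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without a cluster.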

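1.2 Example requests

  # Count the documents in the cluster
  curl -XGET 'http://localhost:9200/_count?pretty'
  # Same request, also printing the HTTP response headers
  curl -i -XGET 'http://localhost:9200/_count?pretty'
  # Cluster health. yellow: some replica shards are not allocated; red: some primary shards are not allocated
  curl -i -XGET "http://localhost:9200/_cluster/health"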

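2 Document oriented

Elasticsearch is document oriented: it stores whole objects, or documents, and indexes the contents of each document so that it can be searched.
Elasticsearch uses JSON as the serialization format for documents.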

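3 A simple example of how Elasticsearch works

3.1 Indexing employee documents

What "index" means in Elasticsearch:

Term	Description
Index (noun)	An "index" is like a database in a relational database: the place where related documents are stored.
Index (verb)	To index a document is to store it in an "index" so that it can be retrieved and queried. It is similar to an INSERT statement in SQL.
Inverted index	The structure Elasticsearch and Lucene use to make lookups fast: a mapping from each term to the documents that contain it. It plays the role that a B-tree index on a column plays in a relational database.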
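
To make the idea of an inverted index concrete, here is a minimal sketch that maps each term to the ids of the documents containing it, using the `about` texts of the three employees indexed in this section. The function name `build_inverted_index` is hypothetical, and the whitespace tokenizer is a deliberate simplification: the real analysis chain (tokenization, normalization, term positions) is far richer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
    3: "I like to build cabinets",
}
index = build_inverted_index(docs)
print(index["rock"])      # ids of the documents that contain "rock"
print(index["climbing"])  # ids of the documents that contain "climbing"
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is what makes full-text search fast.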

Create the index by indexing three employee documents:

PUT /megacorp/employee/1
curl -X PUT "localhost:9200/megacorp/employee/1?pretty" -H 'Content-Type: application/json' -d '
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
'

PUT /megacorp/employee/2
curl -X PUT "localhost:9200/megacorp/employee/2?pretty" -H 'Content-Type: application/json' -d '
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
'

PUT /megacorp/employee/3
curl -X PUT "localhost:9200/megacorp/employee/3?pretty" -H 'Content-Type: application/json' -d '
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
'

3.2 Searching (ad hoc)

3.2.1 Retrieving a document

GET /megacorp/employee/1
curl -X GET "localhost:9200/megacorp/employee/1?pretty"

3.2.2 Search lite

# Return all documents
GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty"

# Search with a query string
GET /megacorp/employee/_search?q=last_name:Smith
curl -X GET "localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty"

3.2.3 Search with the query DSL

GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
'

3.2.4 Search with a filter

# bool combines the clauses: must holds the scored match query,
# filter holds a range filter (age greater than 30)
GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "Smith"
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 }
                }
            }
        }
    }
}
'

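3.2.5 Full-text search

By default Elasticsearch sorts results by relevance score, that is, by how well each document matches the query.
A match query returns every document that contains any of the query terms, so the search below returns both the "rock climbing" document and the "rock albums" document.

GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}
'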

3.2.6 Phrase search

A phrase search matches an exact sequence of words.
The query below returns only documents that contain the phrase "rock climbing".

# highlight: the matching fragments of the about field come back wrapped in HTML tags
GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}
'

3.3 Searching (request body search)

3.3.1 Query DSL

The query DSL is a flexible, expressive query language. Elasticsearch uses it to expose most of the power of Lucene through a simple JSON interface.

# Empty search (the search API accepts both GET and POST)
GET /_search
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{}
'
# An empty search is equivalent to match_all
GET /_search
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_all": {}
    }
}
'
# Find documents whose tweet field contains elasticsearch
GET /_search
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    }
}
'

3.3.2 Combining query clauses

# Find documents whose title field matches "how to make millions" and that are not tagged spam.
# Documents tagged starred, or dated 2014-01-01 or later, rank higher than the others.
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}
==> If there is no must clause, at least one should clause has to match.
==> If there is at least one must clause, no should clause is required to match.
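
The two rules about `should` can be sketched as a tiny in-memory evaluator. This is a toy model under stated assumptions, not how Elasticsearch scores: the function name `bool_matches` is hypothetical, each clause is a plain Python predicate, the title check is reduced to a substring test, and the "score" simply counts matching `must` and `should` clauses.

```python
def bool_matches(doc, must=(), must_not=(), should=()):
    """Return (matched, score) for a simplified bool query.

    Each clause is a predicate taking the document. The score is a toy
    stand-in for relevance: one point per matching must or should clause.
    """
    if not all(clause(doc) for clause in must):
        return False, 0
    if any(clause(doc) for clause in must_not):
        return False, 0
    should_hits = sum(1 for clause in should if clause(doc))
    # Without any must clause, at least one should clause has to match.
    if not must and should and should_hits == 0:
        return False, 0
    return True, len(must) + should_hits


must = [lambda d: "millions" in d["title"]]
must_not = [lambda d: d["tag"] == "spam"]
should = [lambda d: d["tag"] == "starred", lambda d: d["date"] >= "2014-01-01"]

doc_a = {"title": "how to make millions", "tag": "starred", "date": "2014-06-01"}
doc_b = {"title": "how to make millions", "tag": "news", "date": "2013-01-01"}

print(bool_matches(doc_a, must, must_not, should))  # matches, boosted by both should clauses
print(bool_matches(doc_b, must, must_not, should))  # matches on must alone
print(bool_matches(doc_b, should=should))           # no must and no should hit: no match
```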

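3.3.3 Filtering

In a filtering context a query is said to be a "non-scoring" or "filtering" query. It simply asks: "Does this document match?" The answer is always a plain yes or no.
For exact-value lookups you will usually want a filter clause rather than a query clause, because filter results can be cached.

# Combined query with a filter
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
            "range": { "date": { "gte": "2014-01-01" }}
        }
    }
}
==> Finds documents whose title field matches "how to make millions" and that are not tagged spam. A document tagged starred gets a higher relevance score.
==> The range clause now sits in filter: only documents dated 2014-01-01 or later are returned, and the date does not contribute to the relevance score.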

# To filter on several criteria, a bool query can itself be used as a non-scoring query
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
            "bool": {
                "must": [
                    { "range": { "date": { "gte": "2014-01-01" }}},
                    { "range": { "price": { "lte": 29.99 }}}
                ],
                "must_not": [
                    { "term": { "category": "ebooks" }}
                ]
            }
        }
    }
}
==> Finds documents whose title field matches "how to make millions" and that are not tagged spam. A document tagged starred gets a higher relevance score.
==> The filter keeps only documents dated 2014-01-01 or later, priced at 29.99 or less, and whose category is not ebooks. None of these conditions affects the score.

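3.3.4 The most important queries

match_all: matches all documents.
match: usable both for full-text search and for exact-value lookups.
multi_match: runs the same match query on several fields.
range: finds numbers or dates that fall in a given range. Accepted operators: gt, gte, lt, lte.
term: matches exact values, which may be numbers, dates, Booleans, or not_analyzed strings.
terms: like term, but accepts several values; a document matches if the field contains any one of them. It is normally used as a non-scoring filter.
exists and missing: find documents in which the given field has a value (exists) or has none (missing). The missing query was removed in Elasticsearch 5.0; on later versions, wrap exists in a bool must_not clause instead.

3.3.5 Validating queries

# Checks whether a query is valid without running it; the query goes in the request body
GET /gb/tweet/_validate/query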

4 Sorting and relevance

4.1 Sorting

4.1.1 Sorting by field values

GET /_search
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}
'

4.1.2 Multilevel sorting

# Sort by date first, then by relevance score
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "bool" : {
            "must":   { "match": { "tweet": "manage text search" }},
            "filter" : { "term" : { "user_id" : 2 }}
        }
    },
    "sort": [
        { "date":   { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}
'

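4.1.3 Sorting on multivalue fields

# When a field holds several values, mode reduces them to the single value to sort on (min, max, avg, or sum)
"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}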
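
The effect of `mode` can be sketched in plain Python: each document's list of values is first reduced to one value, and the documents are then sorted on that value. The function `sort_by_multivalue`, the sample documents, and the integer-encoded dates are all made up for illustration.

```python
from statistics import mean

# Reducers for the sort modes covered by this sketch.
MODES = {"min": min, "max": max, "avg": mean, "sum": sum}


def sort_by_multivalue(docs, field, order="asc", mode="min"):
    """Sort docs on a list-valued field after reducing each list to one value."""
    reduce_values = MODES[mode]
    return sorted(docs, key=lambda d: reduce_values(d[field]),
                  reverse=(order == "desc"))


docs = [
    {"id": "a", "dates": [20140105, 20140301]},
    {"id": "b", "dates": [20140101, 20141231]},
    {"id": "c", "dates": [20140210]},
]

# Ascending on the earliest date of each document
print([d["id"] for d in sort_by_multivalue(docs, "dates", "asc", "min")])
# Ascending on the latest date of each document
print([d["id"] for d in sort_by_multivalue(docs, "dates", "asc", "max")])
```

The same documents come back in a different order depending only on `mode`, which is why it matters for multivalue fields.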

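4.2 String sorting and multifields

An analyzed string field is effectively a multivalue field of terms, so sorting on it rarely gives the intended order. A multifield mapping indexes the same string twice, once analyzed and once as a single not_analyzed value:

"tweet": {
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

The tweet field is used for searching and the tweet.raw field for sorting:

GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}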

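5 Analytics

Elasticsearch has a feature called aggregations, which lets us generate fine-grained analytics over our data.
Aggregations are similar to GROUP BY in SQL, but more powerful.
For example, we can aggregate on the interests field.

```
# First enable fielddata on the interests field through its mapping.
# Since 5.x, text fields have fielddata disabled by default; sorting and aggregating
# on a text field rely on this in-memory data structure.
PUT megacorp/_mapping/employee/
{
    "properties": {
        "interests": {
            "type":     "text",
            "fielddata": true
        }
    }
}

# Run the aggregation
GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "aggs": {
        "all_interests": {
            "terms": { "field": "interests" }
        }
    }
}
'
# Result: two employees are interested in music, one in forestry, and one in sports.
```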

Interests of the employees named Smith:

```
GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "query": {
        "match": {
            "last_name": "smith"
        }
    },
    "aggs": {
        "all_interests": {
            "terms": {
                "field": "interests"
            }
        }
    }
}
'
```

Nested aggregations: for example, the average age of the employees who share each interest.

```
GET /megacorp/employee/_search
curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d '
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}
'
```
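
To show what the terms aggregation with a nested avg computes, the sketch below reproduces it in memory over the three employee documents indexed in section 3.1. The function name `interests_with_avg_age` and the bucket layout are illustrative assumptions; the real response has more fields, and breaking ties alphabetically is a choice made in this sketch.

```python
from collections import defaultdict

employees = [
    {"last_name": "Smith", "age": 25, "interests": ["sports", "music"]},
    {"last_name": "Smith", "age": 32, "interests": ["music"]},
    {"last_name": "Fir", "age": 35, "interests": ["forestry"]},
]


def interests_with_avg_age(docs):
    """Bucket docs by each interest (terms) and average the age per bucket (avg)."""
    ages = defaultdict(list)
    for doc in docs:
        for interest in doc["interests"]:
            ages[interest].append(doc["age"])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg_age": sum(vals) / len(vals)}
        for key, vals in ages.items()
    ]
    # Largest buckets first; ties broken alphabetically in this sketch.
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))


for bucket in interests_with_avg_age(employees):
    print(bucket)
```

Restricting the input to the two Smith documents before aggregating corresponds to combining a query with the aggregation, as in the second request above.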

6 Documents

In Elasticsearch a document is a single record, serialized as a JSON object.

6.1 Indexing a document (storing it)

# Using our own ID
PUT /website/blog/123
curl -X PUT "localhost:9200/website/blog/123?pretty" -H 'Content-Type: application/json' -d '
{
    "title": "My first blog entry",
    "text":  "Just trying this out...",
    "date":  "2014/01/01"
}
'

# Letting Elasticsearch generate the ID
POST /website/blog/
curl -X POST "localhost:9200/website/blog/?pretty" -H 'Content-Type: application/json' -d '
{
    "title": "My second blog entry",
    "text":  "Still trying this out...",
    "date":  "2014/01/01"
}
'

6.2 Retrieving documents

# Retrieve a document
GET /website/blog/123?pretty
# Return only part of the document
GET /website/blog/123?_source=title,text
# Return only the _source field, without any other metadata
GET /website/blog/123/_source
# Retrieve several documents
# The mget API expects a docs array; each element holds the metadata of a document to retrieve: _index, _type, and _id.
# An element may also carry _source to retrieve specific fields only (here, views).
GET /_mget
curl -X GET "localhost:9200/_mget?pretty" -H 'Content-Type: application/json' -d'
{
    "docs" : [
        {
            "_index" : "website",
            "_type" :  "blog",
            "_id" :    2
        },
        {
            "_index" : "website",
            "_type" :  "pageviews",
            "_id" :    1,
            "_source": "views"
        }
    ]
}
'
# When the documents live in the same _index (and _type), give these in the URL; an element can still override _type
GET /website/blog/_mget
curl -X GET "localhost:9200/website/blog/_mget?pretty" -H 'Content-Type: application/json' -d'
{
    "docs" : [
        { "_id" : 2 },
        { "_type" : "pageviews", "_id" :   1 }
    ]
}
'
# If all the documents have the same _index and _type, an ids array is enough
GET /website/blog/_mget
{
    "ids" : [ "2", "1" ]
}

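6.3 Checking whether a document exists

# A HEAD request returns only headers: 200 OK if the document exists, 404 Not Found if it does not
curl -I "http://172.22.12.25:9200/megacorp/employee/1"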

6.4 Updating documents
- 6.4.1 Updating a whole document
Documents in Elasticsearch are immutable: they cannot be changed in place. To update an existing document we reindex or replace it instead, using the same index API.
```
PUT /website/blog/123
curl -X PUT "localhost:9200/website/blog/123?pretty" -H 'Content-Type: application/json' -d'
{
    "title": "My first blog entry",
    "text":  "I am starting to get the hang of this...",
    "date":  "2014/01/02"
}
'
```
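- 6.4.2 Partial updates to a document
  
  A partial update still retrieves, changes, and reindexes the document, but it all happens within a shard, which avoids the network overhead of multiple requests.
  The simplest form of the update request accepts a partial document as the doc parameter, which is merged with the existing document. Objects are merged together, existing fields are overwritten, and new fields are added.
```
POST /website/blog/1/_update
curl -X POST "localhost:9200/website/blog/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "doc" : {
        "tags" : [ "testing" ],
        "views": 0
    }
}
'
```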
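
The merge described above can be sketched as a small recursive function. This is only an in-memory illustration under stated assumptions: `merge_partial` is a hypothetical name, nested objects are merged key by key, and every other value (including a list) replaces the existing value outright.

```python
def merge_partial(existing, partial):
    """Return a new document with partial merged into existing.

    Nested objects are merged recursively; any other value, including a
    list, replaces the existing value outright.
    """
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = value
    return merged


blog = {"title": "My first blog entry", "text": "Just trying this out..."}
updated = merge_partial(blog, {"tags": ["testing"], "views": 0})
print(updated)
```

The original dictionary is left untouched, which mirrors the fact that the stored document is replaced by a new version rather than edited in place.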
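6.5 Creating a new document

If the request to create a new document succeeds, Elasticsearch returns the usual metadata and an HTTP response code of 201 Created.
If a document with the same _index, _type, and _id already exists, Elasticsearch returns a 409 Conflict response code instead.
```
# { ... } stands for the document body
PUT /website/blog/123/_create
curl -X PUT "localhost:9200/website/blog/123/_create" -H 'Content-Type: application/json' -d '{ ... }'
```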
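6.6 Deleting a document

If the document is not found, we get a 404 Not Found response code, and the response body reports that nothing was found.
Deleting a document does not remove it from disk immediately; it is only marked as deleted.
```
DELETE /website/blog/123
curl -X DELETE "localhost:9200/website/blog/123?pretty"
```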
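6.7 Dealing with conflicts

When a document is updated through the index API, the original document is read once, changed, and then reindexed as a whole. Elasticsearch keeps only the most recent indexing request: whichever write arrives last wins.
If several users change a document at the same time, some of their changes may be lost.
- Two approaches to concurrency control:
  Pessimistic concurrency control: widely used by relational databases. It assumes that conflicting changes are likely and blocks access to a resource to prevent them, for example by locking a row before reading it.
  Optimistic concurrency control: used by Elasticsearch. It assumes that conflicts are unlikely and does not block the operations being attempted. However, if the underlying data was modified between reading and writing, the update fails. The application then decides how to resolve the conflict: retry the update with fresh data, or report the situation to the user.
- Optimistic concurrency control
  
  Elasticsearch uses the _version field of a document to ensure that changes are applied in the correct order. If an older version of a document arrives after a newer one, it is simply ignored.
  Every API that updates or deletes a document accepts a version parameter, which lets you apply optimistic concurrency control in your own code.
  
  Create a new blog post:
```
PUT /website/blog/1/_create
curl -X PUT "localhost:9200/website/blog/1/_create?pretty" -H 'Content-Type: application/json' -d'
{
    "title": "My first blog entry",
    "text":  "Just trying this out..."
}
'
```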
  Retrieve the document:
```
# The response body contains the same _version number, 1
GET /website/blog/1
curl -X GET "localhost:9200/website/blog/1?pretty"

{
    "_index" :   "website",
    "_type" :    "blog",
    "_id" :      "1",
    "_version" : 1,
    "found" :    true,
    "_source" :  {
        "title": "My first blog entry",
        "text":  "Just trying this out..."
    }
}
```
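  Reindex the document to save our changes, passing version=1 so that the write is applied only if the current _version is 1. The request succeeds and _version becomes 2.
  If we rerun the same index request, still specifying version=1, Elasticsearch returns a 409 Conflict HTTP response code, because the current version is now 2.
```
PUT /website/blog/1?version=1
curl -X PUT "localhost:9200/website/blog/1?version=1&pretty" -H 'Content-Type: application/json' -d'
{
    "title": "My first blog entry",
    "text":  "Starting to get the hang of this..."
}
'
```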
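
The version check can be illustrated with a toy in-memory store. This is a sketch of the idea only, not of Elasticsearch internals: `VersionedStore` and `VersionConflict` are hypothetical names, and the store handles just the index-with-version case described above.

```python
class VersionConflict(Exception):
    """Raised where Elasticsearch would answer 409 Conflict."""


class VersionedStore:
    """Toy in-memory store that mimics the _version check on writes."""

    def __init__(self):
        self._docs = {}  # doc id -> (version, source)

    def index(self, doc_id, source, version=None):
        current = self._docs.get(doc_id, (0, None))[0]
        # When a version is given, the write succeeds only if it matches the current one.
        if version is not None and version != current:
            raise VersionConflict(f"current version is {current}, request said {version}")
        self._docs[doc_id] = (current + 1, source)
        return current + 1


store = VersionedStore()
print(store.index("1", {"text": "Just trying this out..."}))               # first write
print(store.index("1", {"text": "Starting to get the hang of this..."}, version=1))
try:
    store.index("1", {"text": "Starting to get the hang of this..."}, version=1)
except VersionConflict as err:
    print("409 Conflict:", err)                                            # stale version
```

A caller that receives the conflict can re-read the document, reapply its change, and retry with the new version number.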

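6.8 Using version numbers from an external system

  With version_type=external, Elasticsearch does not check that the current _version is the same as the version specified in the request; it checks that the current _version is less than the specified version.
  If the request succeeds, the external version number is stored as the document's new _version.
  ```
  PUT /website/blog/2?version=5&version_type=external
  PUT /website/blog/2?version=10&version_type=external    # succeeds: the current version 5 is less than 10
  PUT /website/blog/2?version=10&version_type=external    # 409 Conflict: the current version already equals the specified version
  ```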
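
The rule for external versions differs from the internal check in one comparison, which the sketch below isolates. The function `apply_external_version` is hypothetical and models only the comparison and the resulting stored version, using `None` to mean "the document does not exist yet".

```python
def apply_external_version(current_version, requested_version):
    """Return the new stored version, or raise if the request is not newer.

    current_version is None when the document does not exist yet.
    """
    if current_version is not None and requested_version <= current_version:
        raise ValueError(
            f"409 Conflict: current version {current_version} "
            f"is not lower than {requested_version}"
        )
    # On success the external number becomes the stored _version.
    return requested_version


version = apply_external_version(None, 5)       # document created with version=5
version = apply_external_version(version, 10)   # 5 < 10, accepted
print(version)
try:
    apply_external_version(version, 10)         # 10 is not lower than 10
except ValueError as err:
    print(err)
```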

6.9 Bulk operations (bulk)

The bulk API performs many create, index, update, or delete requests in a single step.

Example

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}

# Each sub-request is executed independently, so the failure of one does not affect the success of the others
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }

# Default index and type given in the URL; a sub-request can still override them
POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
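
The bulk body is newline-delimited JSON: one action line per sub-request, followed by a source line for actions that carry a document, with a newline after every line including the last. A minimal sketch of producing such a body with the standard library follows; the helper `build_bulk_body` and its tuple layout are assumptions made for illustration.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) triples into a newline-delimited bulk body.

    source is None for actions that carry no body line, such as delete.
    """
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, ends with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("delete", {"_index": "website", "_type": "blog", "_id": "123"}, None),
    ("create", {"_index": "website", "_type": "blog", "_id": "123"},
     {"title": "My first blog post"}),
    ("index", {"_index": "website", "_type": "blog"},
     {"title": "My second blog post"}),
])
print(body, end="")
```

Because `json.dumps` never emits raw newlines by default, each action and each document stays on a single line, which the format requires.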
