Elasticsearch Reference [5.2] » Getting Started

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.


Here are a few sample use-cases that Elasticsearch could be used for:

  • You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
  • You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
  • You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of the gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
  • You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.



For the rest of this tutorial, I will guide you through the process of getting Elasticsearch up and running, taking a peek inside it, and performing basic operations like indexing, searching, and modifying your data. At the end of this tutorial, you should have a good idea of what Elasticsearch is, how it works, and hopefully be inspired to see how you can use it to either build sophisticated search applications or to mine intelligence from your data.









1. Basic Concepts

There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.



Near Realtime (NRT)


Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
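
To see this in action, here is a rough sketch (the test index and message field are hypothetical): index a document, then search for it right away.

PUT /test/doc/1?pretty
{ "message": "hello" }

GET /test/_search?q=message:hello&pretty

A search issued immediately after the PUT may return zero hits; once the index refreshes (every second by default), the document becomes searchable.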



Cluster

A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.


Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.




Node


A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.



A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.



In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.



Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.


In a single cluster, you can define as many indexes as you want.



Type


Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.



Document

A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation) which is a ubiquitous internet data interchange format.


Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
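
For example, a trimmed-down customer document might look like the following (an illustrative sketch; the fields are arbitrary):

{
  "name": "John Doe",
  "city": "Amsterdam"
}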



Shards & Replicas


An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.



To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.


Sharding is important for two primary reasons:

  • It allows you to horizontally split/scale your content volume
  • It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput


The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and is transparent to you as the user.

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.


Replication is important for two primary reasons:

  • It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
  • It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.



To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime but you cannot change the number of shards after-the-fact.
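
As a sketch of what this looks like in practice (the index name my_index is hypothetical), shard and replica counts can be set when the index is created, and the replica count can be changed later through the _settings API:

PUT /my_index?pretty
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

PUT /my_index/_settings?pretty
{
  "number_of_replicas": 1
}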



By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.


Note

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards api.






2. Installation

Elasticsearch requires at least Java 8. Specifically as of this writing, it is recommended that you use
the Oracle JDK version 1.8.0_73. Java installation varies from platform to platform so we won’t go
into those details here. Oracle’s recommended installation documentation can be found on
Oracle’s website. Suffice to say, before you install Elasticsearch, please check your Java version
first by running (and then install/upgrade accordingly if needed):


java -version
echo $JAVA_HOME

Once we have Java set up, we can then download and run Elasticsearch. The binaries are available
from www.elastic.co/downloads along with all the releases that have been made in the past.
For each release, you have a choice among a zip or tar archive, or a DEB or RPM package.
For simplicity, let’s use the tar file.


Let’s download the Elasticsearch 5.2.0 tar as follows (Windows users should download the zip package):

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.2.0.tar.gz

Then extract it as follows (Windows users should unzip the zip package):


tar -xvf elasticsearch-5.2.0.tar.gz


It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:



cd elasticsearch-5.2.0/bin


And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):


./elasticsearch



If everything goes well, you should see a bunch of messages that look like below:


[2016-09-16T14:17:51,251][INFO ][o.e.n.Node               ] [] initializing ...
[2016-09-16T14:17:51,329][INFO ][o.e.e.NodeEnvironment    ] [6-bjhwl] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [317.7gb], net total_space [453.6gb], spins? [no], types [ext4]
[2016-09-16T14:17:51,330][INFO ][o.e.e.NodeEnvironment    ] [6-bjhwl] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-09-16T14:17:51,333][INFO ][o.e.n.Node               ] [6-bjhwl] node name [6-bjhwl] derived from node ID; set [node.name] to override
[2016-09-16T14:17:51,334][INFO ][o.e.n.Node               ] [6-bjhwl] version[5.2.0], pid[21261], build[f5daa16/2016-09-16T09:12:24.346Z], OS[Linux/4.4.0-36-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_60/25.60-b23]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [aggs-matrix-stats]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [ingest-common]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-expression]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-groovy]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-mustache]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-painless]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [percolator]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [reindex]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [transport-netty3]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [transport-netty4]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded plugin [mapper-murmur3]
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node               ] [6-bjhwl] initialized
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node               ] [6-bjhwl] starting ...
[2016-09-16T14:17:53,671][INFO ][o.e.t.TransportService   ] [6-bjhwl] publish_address {192.168.8.112:9300}, bound_addresses {{192.168.8.112:9300}
[2016-09-16T14:17:53,676][WARN ][o.e.b.BootstrapCheck     ] [6-bjhwl] max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
[2016-09-16T14:17:56,731][INFO ][o.e.h.HttpServer         ] [6-bjhwl] publish_address {192.168.8.112:9200}, bound_addresses {[::1]:9200}, {192.168.8.112:9200}
[2016-09-16T14:17:56,732][INFO ][o.e.g.GatewayService     ] [6-bjhwl] recovered [0] indices into cluster_state
[2016-09-16T14:17:56,748][INFO ][o.e.n.Node               ] [6-bjhwl] started

Without going too much into detail, we can see that our node named "6-bjhwl" (which will be a different set of characters in
your case) has started and elected itself as a master in a single cluster. Don’t worry yet at the moment what master means.
The main thing that is important here is that we have started one node within one cluster.



As mentioned previously, we can override either the cluster or node name. This can be done from the command line when
starting Elasticsearch as follows:


./elasticsearch -Ecluster.name=my_cluster_name -Enode.name=my_node_name

Also note the line marked http with information about the HTTP address (192.168.8.112) and port (9200) that our node is
reachable from. By default, Elasticsearch uses port 9200 to provide access to its REST API. This port is configurable if
necessary.
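
For example (a sketch; pick any free port), the HTTP port can be overridden the same way at startup, after which the REST API answers on the new port:

./elasticsearch -Ehttp.port=9201

# then, from another shell:
curl 'localhost:9201/?pretty'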










3. Exploring Your Cluster


The REST API


Now that we have our node (and cluster) up and running, the next step is to understand how to communicate with it. Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. Among the few things that can be done with the API are as follows:



  • Check your cluster, node, and index health, status, and statistics
  • Administer your cluster, node, and index data and metadata
  • Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
  • Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others












4. Cluster Health


Let’s start with a basic health check, which we can use to see how our cluster is doing. We’ll be using curl to do this but you can use any tool that allows you to make HTTP/REST calls. Let’s assume that we are still on the same node that we started Elasticsearch on, and open another command shell window.



To check the cluster health, we will be using the _cat API. You can run the command below in Kibana’s Console by clicking "VIEW IN CONSOLE" or with curl by clicking the "COPY AS CURL" link below and pasting it into a terminal.




GET /_cat/health?v


And the response:



epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1475247709 17:01:49  elasticsearch green           1         1      0   0    0    0        0             0                  -                100.0%



We can see that our cluster named "elasticsearch" is up with a green status.

Whenever we ask for the cluster health, we either get green, yellow, or red. Green means everything is good (cluster is fully functional), yellow means all data is available but some replicas are not yet allocated (cluster is fully functional), and red means some data is not available for whatever reason. Note that even if a cluster is red, it still is partially functional (i.e. it will continue to serve search requests from the available shards) but you will likely need to fix it ASAP since you have missing data.


Also from the above response, we can see a total of 1 node and that we have 0 shards since we have no data in it yet. Note that since we are using the default cluster name (elasticsearch) and since Elasticsearch uses unicast network discovery by default to find other nodes on the same machine, it is possible that you could accidentally start up more than one node on your computer and have them all join a single cluster. In this scenario, you may see more than 1 node in the above response.
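
If you are scripting against the cluster, the JSON variant of the health API can also block until a desired status is reached (a sketch using the standard _cluster/health parameters):

GET /_cluster/health?wait_for_status=yellow&timeout=50s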



We can also get a list of nodes in our cluster as follows:



GET /_cat/nodes?v


And the response:



ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           10           5   5    4.46                           mdi      *      PB2SGZY



Here, we can see our one node named "PB2SGZY", which is the single node that is currently in our cluster.











5. List All Indices

Now let’s take a peek at our indices:

GET /_cat/indices?v

And the response:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

Which simply means we have no indices yet in the cluster.

The next section shows what this output looks like once an index exists.











6. Create an Index


Now let’s create an index named "customer" and then list all the indexes again:


PUT /customer?pretty
GET /_cat/indices?v

The first command creates the index named "customer" using the PUT verb. We simply append pretty to the end of the call to tell it to pretty-print the JSON response (if any).



And the response:


health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer 95SQ4TSUT7mWBT7VNHH67A   5   1          0            0       260b           260b

The results of the second command tell us that we now have 1 index named customer and it has 5 primary shards and 1 replica (the defaults) and it contains 0 documents in it.



You might also notice that the customer index has a yellow health tagged to it. Recall from our previous discussion that yellow means that some replicas are not (yet) allocated. The reason this happens for this index is because Elasticsearch by default created one replica for this index. Since we only have one node running at the moment, that one replica cannot yet be allocated (for high availability) until a later point in time when another node joins the cluster. Once that replica gets allocated onto a second node, the health status for this index will turn to green.












7. Index and Query a Document

Let’s now put something into our customer index. Remember previously that in order to index a document, we must tell Elasticsearch which type in the index it should go to.


Let’s index a simple customer document into the customer index, "external" type, with an ID of 1 as follows:


PUT /customer/external/1?pretty
{
  "name": "John Doe"
}

And the response:

{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

From the above, we can see that a new customer document was successfully created inside the customer index and the external type. The document also has an internal id of 1 which we specified at index time.


It is important to note that Elasticsearch does not require you to explicitly create an index first before you can index documents into it. In the previous example, Elasticsearch will automatically create the customer index if it didn’t already exist beforehand.



Let’s now retrieve that document that we just indexed:


GET /customer/external/1?pretty

And the response:


{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : { "name": "John Doe" }
}

Nothing out of the ordinary here other than a field, found, stating that we found a document with the requested ID 1 and another field, _source, which returns the full JSON document that we indexed from the previous step.
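
If you only want the document body without the metadata, the _source endpoint returns just the stored JSON (a small sketch):

GET /customer/external/1/_source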












8. Delete an Index


Now let’s delete the index that we just created and then list all the indexes again:


DELETE /customer?pretty
GET /_cat/indices?v

And the response:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

Which means that the index was deleted successfully and we are now back to where we started with nothing in our cluster.



Before we move on, let’s take a closer look again at some of the API commands that we have learned so far:


PUT /customer
PUT /customer/external/1
{
  "name": "John Doe"
}
GET /customer/external/1
DELETE /customer

If we study the above commands carefully, we can actually see a pattern of how we access data in Elasticsearch. That pattern can be summarized as follows:


<REST Verb> /<Index>/<Type>/<ID>

This REST access pattern is so pervasive throughout all the API commands that if you can simply remember it, you will have a good head start at mastering Elasticsearch.













9. Modifying Your Data


Elasticsearch provides data manipulation and search capabilities in near real time. By default, you can expect a one second delay (refresh interval) from the time you index/update/delete your data until the time that it appears in your search results. This is an important distinction from other platforms like SQL wherein data is immediately available after a transaction is completed.




Indexing/Replacing Documents


We’ve previously seen how we can index a single document. Let’s recall that command again:


PUT /customer/external/1?pretty
{
  "name": "John Doe"
}

Again, the above will index the specified document into the customer index, external type, with the ID of 1. If we then executed the above command again with a different (or same) document, Elasticsearch will replace (i.e. reindex) a new document on top of the existing one with the ID of 1:


PUT /customer/external/1?pretty
{
  "name": "Jane Doe"
}

The above changes the name of the document with the ID of 1 from "John Doe" to "Jane Doe". If, on the other hand, we use a different ID, a new document will be indexed and the existing document(s) already in the index remains untouched.


PUT /customer/external/2?pretty
{
  "name": "Jane Doe"
}

The above indexes a new document with an ID of 2.



When indexing, the ID part is optional. If not specified, Elasticsearch will generate a random ID and then use it to index the document. The actual ID Elasticsearch generates (or whatever we specified explicitly in the previous examples) is returned as part of the index API call.



This example shows how to index a document without an explicit ID:


POST /customer/external?pretty
{
  "name": "Jane Doe"
}

Note that in the above case, we are using the POST verb instead of PUT since we didn’t specify an ID.

In fact, if no ID is specified, the corresponding PUT request fails with an error like:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No endpoint or operation is available at [external]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No endpoint or operation is available at [external]"
  },
  "status" : 400
}











10. Updating Documents


In addition to being able to index and replace documents, we can also update documents. Note though that Elasticsearch does not actually do in-place updates under the hood. Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.



This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe":


POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe" }
}

This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe" and at the same time add an age field to it:


POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe", "age": 20 }
}

Updates can also be performed by using simple scripts. This example uses a script to increment the age by 5:


POST /customer/external/1/_update?pretty
{
  "script" : "ctx._source.age += 5"
}

In the above example, ctx._source refers to the current source document that is about to be updated.
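
Scripts can also take parameters instead of hard-coding values; here is a sketch using the 5.x script syntax (the parameter name count is arbitrary):

POST /customer/external/1/_update?pretty
{
  "script": {
    "inline": "ctx._source.age += params.count",
    "params": { "count": 5 }
  }
}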



Note that as of this writing, updates can only be performed on a single document at a time. In the future, Elasticsearch might provide the ability to update multiple documents given a query condition (like an SQL UPDATE-WHERE statement).












11. Deleting Documents


Deleting a document is fairly straightforward. This example shows how to delete our previous customer with the ID of 2:


DELETE /customer/external/2?pretty

See the Delete By Query API to delete all documents matching a specific query. It is worth noting that it is much more efficient to delete a whole index instead of deleting all documents with the Delete By Query API.
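
As a sketch of that API (the match query here is purely illustrative):

POST /customer/_delete_by_query?pretty
{
  "query": { "match": { "name": "John" } }
}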











12. Batch Processing


In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations as fast as possible with as few network roundtrips as possible.



As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 - Jane Doe) in one bulk operation:


POST /customer/external/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }

This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:



POST /customer/external/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}

Note above that for the delete action, there is no corresponding source document after it since deletes only require the ID of the document to be deleted.



The Bulk API does not fail due to failures in one of the actions. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.
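
When the payload is large, it is usually easier to keep the actions in a file and send it with curl; a sketch (requests.json is a hypothetical file of bulk actions, one JSON object per line, ending with a newline):

curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' --data-binary "@requests.json"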












13. Exploring Your Data

Sample Dataset

Now that we’ve gotten a glimpse of the basics, let’s try to work on a more realistic dataset. I’ve prepared a sample of fictitious JSON documents of customer bank account information. Each document has the following schema:


{
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "[email protected]",
    "city": "Hobucken",
    "state": "CO"
}

For the curious, I generated this data from www.json-generator.com/ so please ignore the actual values and semantics of the data as these are all randomly generated.



Loading the Sample Dataset


You can download the sample dataset (accounts.json) from here. Extract it to our current directory and let’s load it into our cluster as follows:


curl -XPOST 'localhost:9200/bank/account/_bulk?pretty&refresh' --data-binary "@accounts.json"
curl 'localhost:9200/_cat/indices?v'

And the response:


health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  l7sSYV2cQXmu6_4rJWVIww   5   1       1000            0    128.6kb        128.6kb

Which means that we just successfully bulk indexed 1000 documents into the bank index (under the account type).












14. The Search API


Now let’s start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format. We’ll try one example of the request URI method but for the remainder of this tutorial, we will exclusively be using the request body method.



The REST API for search is accessible from the _search endpoint. This example returns all documents in the bank index:


GET /bank/_search?q=*&sort=account_number:asc&pretty

Let’s first dissect the search call. We are searching (_search endpoint) in the bank index, and the q=* parameter instructs Elasticsearch to match all documents in the index. The sort=account_number:asc parameter indicates to sort the results using the account_number field of each document in an ascending order. The pretty parameter, again, just tells Elasticsearch to return pretty-printed JSON results.


And the response (partially shown):


{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "0",
      "sort": [0],
      "_score" : null,
      "_source" : {"account_number":0,"balance":16623,"firstname":"Bradshaw",
"lastname":"Mckenzie","age":29,"gender":"F","address":"244 Columbus Place","employer":"Euron","email":"[email protected]","city":"Hobucken",
"state":"CO"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "sort": [1],
      "_score" : null,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,
"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"[email protected]","city":"Brogan","state":"IL"}
    }, ...
    ]
  }
}

As for the response, we see the following parts:

  • took – time in milliseconds for Elasticsearch to execute the search
  • timed_out – tells us if the search timed out or not
  • _shards – tells us how many shards were searched, as well as a count of the successful/failed searched shards
  • hits – search results
  • hits.total – total number of documents matching our search criteria
  • hits.hits – actual array of search results (defaults to first 10 documents)
  • hits.sort – sort key for results (e.g., the balance values when sorting by balance; missing if sorting by score)
  • hits._score and max_score - ignore these fields for now



Here is the same exact search above using the alternative request body method:


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

The difference here is that instead of passing q=* in the URI, we POST a JSON-style query request body to the _search API. We’ll discuss this JSON query in the next section.



It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL wherein you may initially get a partial subset of your query results up-front and then you have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.











15. Introducing the Query Language


Elasticsearch provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.



Going back to our last example, we executed this query:


GET /bank/_search
{
  "query": { "match_all": {} }
}

Dissecting the above, the query part tells us what our query definition is and the match_all part is simply the type of query that we want to run. The match_all query is simply a search for all documents in the specified index.



In addition to the query parameter, we also can pass other parameters to influence the search results. In the example in the section above we passed in sort, here we pass in size:


GET /bank/_search
{
  "query": { "match_all": {} },
  "size": 1
}

Note that if size is not specified, it defaults to 10.



This example does a match_all and returns documents 11 through 20:


GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}

The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter. This feature is useful when implementing paging of search results. Note that if from is not specified, it defaults to 0.



This example does a match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents.


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}











16. Executing Searches


Now that we have seen a few of the basic search parameters, let’s dig in some more into the Query DSL. Let’s first take a look at the returned document fields. By default, the full JSON document is returned as part of all searches. This is referred to as the source (_source field in the search hits). If we don’t want the entire source document returned, we have the ability to request only a few fields from within source to be returned.



This example shows how to return two fields, account_number and balance (inside of _source), from the search:


GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}

Note that the above example simply reduces the _source field. It will still only return one field named _source but within it, only the fields account_number and balance are included.

If you come from a SQL background, the above is somewhat similar in concept to the SQL SELECT FROM field list.


Now let’s move on to the query part. Previously, we’ve seen how the match_all query is used to match all documents. Let’s now introduce a new query called the match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).



This example returns the account numbered 20:



GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}

This example returns all accounts containing the term "mill" in the address:

(You can think of this as roughly analogous to SQL's LIKE.)

GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}

This example returns all accounts containing the term "mill" or "lane" in the address:


GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}

This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address:

(You can think of this as roughly analogous to = in SQL, matching the exact phrase.)

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

Let’s now introduce the bool(ean) query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.



This example composes two match queries and returns all accounts containing "mill" and "lane" in the address:


GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.

(Note: this is analogous to AND in SQL.)


In contrast, this example composes two match queries and returns all accounts containing "mill" or "lane" in the address:


GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the above example, the bool should clause specifies a list of queries either of which must be true for a document to be considered a match.

(This behaves much like the earlier {"query": {"match": {"address": "mill lane"}}} example.)


This example composes two match queries and returns all accounts that contain neither "mill" nor "lane" in the address:


GET /bank/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the above example, the bool must_not clause specifies a list of queries none of which must be true for a document to be considered a match.



We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.
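
For instance, here is a sketch of one level of nesting (the values are illustrative): accounts whose address contains "mill" and whose age is either 30 or 40:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        {
          "bool": {
            "should": [
              { "match": { "age": "30" } },
              { "match": { "age": "40" } }
            ]
          }
        }
      ]
    }
  }
}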



This example returns all accounts of anybody who is 40 years old but doesn’t live in ID(aho):


GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}












17. Executing Filters


In the previous section, we skipped over a little detail called the document score (_score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document is, the lower the score, the less relevant the document is.



But queries do not always need to produce scores, in particular when they are only used for "filtering" the document set. Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.



The bool query that we introduced in the previous section also supports filter clauses which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed. As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.



This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.


GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part). We can substitute any other queries into the query and the filter parts. In the above case, the range query makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.



In addition to the match_all, match, bool, and range queries, there are a lot of other query types that are available and we won’t go into them here. Since we already have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in learning and experimenting with the other query types.














18. Executing Aggregations


Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating it to the SQL GROUP BY and the SQL aggregate functions. In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.



To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default):


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

In SQL, the above aggregation is similar in concept to:


SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC

And the response (partially shown):

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets": [
        { "key": "ID", "doc_count": 27 },
        { "key": "TX", "doc_count": 27 },
        { "key": "AL", "doc_count": 25 },
        { "key": "MD", "doc_count": 25 },
        { "key": "TN", "doc_count": 23 },
        { "key": "MA", "doc_count": 21 },
        { "key": "NC", "doc_count": 21 },
        { "key": "ND", "doc_count": 21 },
        { "key": "ME", "doc_count": 20 },
        { "key": "MO", "doc_count": 20 }
      ]
    }
  }
}

We can see that there are 27 accounts in ID (Idaho), followed by 27 accounts in TX (Texas), followed by 25 accounts in AL (Alabama), and so forth.



Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.



Building on the previous aggregation, this example calculates the average account balance by state (again only for the top 10 states sorted by count in descending order):


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all the aggregations. You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.



Building on the previous aggregation, let’s now sort on the average balance in descending order:


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender:


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 20, "to": 30 },
          { "from": 30, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}

There are many other aggregation capabilities that we won’t go into detail here. The aggregations reference guide is a great starting point if you want to do further experimentation.












19. Conclusion


Elasticsearch is both a simple and complex product. We’ve so far learned the basics of what it is, how to look inside of it, and how to work with it using some of the REST APIs. I hope that this tutorial has given you a better understanding of what Elasticsearch is and more importantly, inspired you to further experiment with the rest of its great features!










