Here are a few sample use cases for Elasticsearch:
- You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
- You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
- You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of the gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found (a minimal Percolator sketch follows this list).
- You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.
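To make the price-alerting use case concrete, here is a minimal Percolator sketch (the index name, type, and field names are illustrative; the percolate query follows the 5.x API, where stored queries live in a field of type percolator):

PUT /alerts
{
  "mappings": {
    "alert": {
      "properties": {
        "query":   { "type": "percolator" },
        "product": { "type": "keyword" },
        "price":   { "type": "double" }
      }
    }
  }
}

PUT /alerts/alert/1?pretty
{
  "query": {
    "bool": {
      "must": [
        { "match": { "product": "gadget-x" } },
        { "range": { "price": { "lt": 300 } } }
      ]
    }
  }
}

GET /alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "alert",
      "document": { "product": "gadget-x", "price": 279.99 }
    }
  }
}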
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
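For instance, a minimal sketch of how those names could be set in config/elasticsearch.yml (the values shown are illustrative; cluster.name and node.name are the actual settings):

cluster.name: logging-prod
node.name: prod-node-1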
By default, each node is set up to join a cluster named elasticsearch, which means that if you start up a number of nodes on your network and, assuming they can discover each other, they will all automatically form and join a single cluster named elasticsearch.
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), a ubiquitous internet data interchange format.
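For example, a simple customer document expressed in JSON might look like this (the fields are illustrative):

{
  "name": "John Doe",
  "email": "[email protected]",
  "city": "Amsterdam"
}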
Sharding is important for two primary reasons:
- It allows you to horizontally split/scale your content volume
- It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
Replication is important for two primary reasons:
- It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
- It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel (how shard and replica counts are set is sketched after this list)
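Both knobs are set per index at creation time. A minimal sketch (the index name my_index is illustrative; number_of_shards and number_of_replicas are the actual settings):

PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}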
As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
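For example, the following request (the standard _cat shards endpoint) lists every shard along with its state, document count, and size:

GET /_cat/shards?v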
2、Installation
java -version
echo $JAVA_HOME
Elasticsearch can be downloaded from www.elastic.co/downloads along with all the releases that have been made in the past. For each release, you have a choice of a zip or tar archive, or a DEB or RPM package.

If everything goes well with installation, you should see a bunch of log messages that look like below:

[2016-09-16T14:17:51,251][INFO ][o.e.n.Node               ] [] initializing ...
[2016-09-16T14:17:51,329][INFO ][o.e.e.NodeEnvironment    ] [6-bjhwl] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [317.7gb], net total_space [453.6gb], spins? [no], types [ext4]
[2016-09-16T14:17:51,330][INFO ][o.e.e.NodeEnvironment    ] [6-bjhwl] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-09-16T14:17:51,333][INFO ][o.e.n.Node               ] [6-bjhwl] node name [6-bjhwl] derived from node ID; set [node.name] to override
[2016-09-16T14:17:51,334][INFO ][o.e.n.Node               ] [6-bjhwl] version[5.2.0], pid[21261], build[f5daa16/2016-09-16T09:12:24.346Z], OS[Linux/4.4.0-36-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_60/25.60-b23]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [aggs-matrix-stats]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [ingest-common]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-expression]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-groovy]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-mustache]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-painless]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [percolator]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [reindex]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [transport-netty3]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [transport-netty4]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded plugin [mapper-murmur3]
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node               ] [6-bjhwl] initialized
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node               ] [6-bjhwl] starting ...
[2016-09-16T14:17:53,671][INFO ][o.e.t.TransportService   ] [6-bjhwl] publish_address {192.168.8.112:9300}, bound_addresses {192.168.8.112:9300}
[2016-09-16T14:17:53,676][WARN ][o.e.b.BootstrapCheck     ] [6-bjhwl] max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
[2016-09-16T14:17:56,731][INFO ][o.e.h.HttpServer         ] [6-bjhwl] publish_address {192.168.8.112:9200}, bound_addresses {[::1]:9200}, {192.168.8.112:9200}
[2016-09-16T14:17:56,732][INFO ][o.e.g.GatewayService     ] [6-bjhwl] recovered [0] indices into cluster_state
[2016-09-16T14:17:56,748][INFO ][o.e.n.Node               ] [6-bjhwl] started
We can override the cluster or node name from the command line when starting Elasticsearch, as follows:

./elasticsearch -Ecluster.name=my_cluster_name -Enode.name=my_node_name
Also note the line marked http with information about the HTTP address (192.168.8.112) and port (9200) that our node is reachable from. By default, Elasticsearch uses port 9200 to provide access to its REST API. This port is configurable if necessary.

Now that we have our node (and cluster) up and running, the next step is to understand how to communicate with it. Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. Among the few things that can be done with the API are the following:

- Check your cluster, node, and index health, status, and statistics
- Administer your cluster, node, and index data and metadata
- Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
- Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others
To check the cluster health, we will be using the _cat API. You can run the command below in Kibana’s Console or with curl in a terminal:
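GET /_cat/health?v

A response from a quiet single-node cluster might look like this (the values are illustrative):

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1475247709 17:01:49  elasticsearch green           1         1      0   0    0    0        0             0                  -                100.0%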
clicking the "COPY AS CURL" link below and pasting it into a terminal.Whenever we ask for the cluster health, we either get green, yellow, or red. Green means everything is good (cluster is fully functional), yellow means all data is available but some replicas are not yet allocated (cluster is fully functional), and red means some data is not available for whatever reason. Note that even if a cluster is red, it still is partially functional (i.e. it will continue to serve search requests from the available shards) but you will likely need to fix it ASAP since you have missing data.
5、List All Indices
Now let’s take a peek at our indices:
GET /_cat/indices?v
And the response:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
Note that we append pretty to the end of the call to tell it to pretty-print the JSON response (if any).

Let’s now put something into our customer index. Remember previously that in order to index a document, we must tell Elasticsearch which type in the index it should go to. Let’s index a simple customer document into the customer index, "external" type, with an ID of 1 as follows:
PUT /customer/external/1?pretty
{
  "name": "John Doe"
}
And the response:
{ "_index" : "customer", "_type" : "external", "_id" : "1", "_version" : 1, "result" : "created", "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true }
From the above, we can see that a new customer document was successfully created inside the customer index and the external type. The document also has an internal id of 1 which we specified at index time.
Let’s now retrieve the document that we just indexed:

GET /customer/external/1?pretty

And the response:
{ "_index" : "customer", "_type" : "external", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "name": "John Doe" } }
Nothing out of the ordinary here other than a field, found, stating that we found a document with the requested ID 1, and another field, _source, which returns the full JSON document that we indexed in the previous step.

Now let’s delete the index that we just created and then list all the indexes again:

DELETE /customer?pretty
GET /_cat/indices?v
And the response:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
PUT /customer
PUT /customer/external/1
{
  "name": "John Doe"
}
GET /customer/external/1
DELETE /customer
<REST Verb> /<Index>/<Type>/<ID>
We’ve previously seen how we can index a single document. Let’s recall that command again:

PUT /customer/external/1?pretty
{
  "name": "John Doe"
}
If we execute the above command again with a different document, Elasticsearch will replace (i.e. reindex) the new document on top of the existing one with the ID of 1:

PUT /customer/external/1?pretty
{
  "name": "Jane Doe"
}
The above changes the name of the document with the ID of 1 from "John Doe" to "Jane Doe". If, on the other hand, we use a different ID, a new document will be indexed and the existing document(s) in the index remain untouched:

PUT /customer/external/2?pretty
{
  "name": "Jane Doe"
}
The above indexes a new document with an ID of 2.
When indexing, the ID part is optional. If an ID is not specified, Elasticsearch will generate a random ID and use it to index the document. This example shows how to index a document without an explicit ID:

POST /customer/external?pretty
{
  "name": "Jane Doe"
}
Note that in the above case, we are using the POST verb instead of PUT since we didn’t specify an ID.

This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe":

POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe" }
}
POST /customer/external/1/_update?pretty { "doc": { "name": "Jane Doe", "age": 20 } }
POST /customer/external/1/_update?pretty { "script" : "ctx._source.age += 5" }
In the above example, ctx._source refers to the current source document that is about to be updated.

Elasticsearch also provides the ability to update multiple documents given a query condition (like an SQL UPDATE-WHERE statement).

11、Deleting Documents
Deleting a document is fairly straightforward. This example shows how to delete our previous customer with the ID of 2:

DELETE /customer/external/2?pretty
Elasticsearch also provides the ability to perform many operations in a batch using the _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations as fast as possible with as few network roundtrips as possible.

As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 - Jane Doe) in one bulk operation:

POST /customer/external/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
POST /customer/external/_bulk?pretty {"update":{"_id":"1"}} {"doc": { "name": "John Doe becomes Jane Doe" } } {"delete":{"_id":"2"}}
"account_number": 0,
"balance": 16623,
"firstname": "Bradshaw",
"lastname": "Mckenzie",
"age": 29,
"gender": "F",
"address": "244 Columbus Place",
"employer": "Euron",
"email": "[email protected]",
"city": "Hobucken",
"state": "CO"
For the curious, this data was generated using www.json-generator.com/, so please ignore the actual values and semantics of the data as these are all randomly generated.

And the response:

health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  l7sSYV2cQXmu6_4rJWVIww   5   1       1000            0    128.6kb        128.6kb
The REST API for search is accessible from the _search endpoint. This example returns all documents in the bank index:

GET /bank/_search?q=*&sort=account_number:asc&pretty
Let’s first dissect the search call. We are searching (_search endpoint) in the bank index, and the q=* parameter instructs Elasticsearch to match all documents in the index. The sort=account_number:asc parameter indicates to sort the results using the account_number field of each document in ascending order. The pretty parameter, again, just tells Elasticsearch to return pretty-printed JSON results.

And the response (partially shown):
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "0",
      "sort": [0],
      "_score" : null,
      "_source" : { "account_number": 0, "balance": 16623, ... }
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "sort": [1],
      "_score" : null,
      "_source" : { ... }
    } ]
  }
}
As for the response, we see the following parts:
- took – time in milliseconds for Elasticsearch to execute the search
- timed_out – tells us if the search timed out or not
- _shards – tells us how many shards were searched, as well as a count of the successful/failed searched shards
- hits – search results
- hits.total – total number of documents matching our search criteria
- hits.hits – actual array of search results (defaults to first 10 documents)
- hits.sort – sort key for results (missing if sorting by score)
- hits._score and max_score – ignore these fields for now
GET /bank/_search { "query": { "match_all": {} }, "sort": [ { "account_number": "asc" } ] }
The difference here is that instead of passing q=* in the URI, we POST a JSON-style query request body to the _search API. We’ll discuss this JSON query in the next section.

GET /bank/_search
{
  "query": { "match_all": {} }
}
Dissecting the above, the query part tells us what our query definition is and the match_all part is simply the type of query that we want to run. The match_all query is simply a search for all documents in the specified index.

In addition to the query parameter, we can also pass other parameters to influence the search results. In the example in the section above we passed in sort; here we pass in size:

GET /bank/_search
{
  "query": { "match_all": {} },
  "size": 1
}
Note that if size is not specified, it defaults to 10.

This example does a match_all and returns documents 11 through 20:

GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}
The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter. This feature is useful when implementing paging of search results. Note that if from is not specified, it defaults to 0.

This example does a match_all, sorts the results by account balance in descending order, and returns the top 10 (default size) documents:

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}
By default, the full JSON document is returned as part of all searches. This is referred to as the source (the _source field in the search hits). If we don’t want the entire source document returned, we have the ability to request only a few fields from within the source to be returned.

This example shows how to return two fields, account_number and balance (inside of _source), from the search:

GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
Note that the above example simply reduces the _source field. It will still only return one field named _source but within it, only the fields account_number and balance are included. If you come from a SQL background, the above is somewhat similar in concept to the SQL SELECT FROM field list.
Previously, we saw how the match_all query is used to match all documents. Let’s now introduce a new query called the match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).

This example returns the account numbered 20:

GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}
GET /bank/_search { "query": { "match": { "address": "mill" } } }
GET /bank/_search { "query": { "match": { "address": "mill lane" } } }
This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address:

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}
Let’s now introduce the bool(ean) query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.

This example composes two match queries and returns all accounts containing "mill" and "lane" in the address:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.

In contrast, this example composes two match queries and returns all accounts containing "mill" or "lane" in the address:

GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
In the above example, the bool should clause specifies a list of queries either of which must be true for a document to be considered a match.

This example composes two match queries and returns all accounts that contain neither "mill" nor "lane" in the address:

GET /bank/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
In the above example, the bool must_not clause specifies a list of queries none of which must be true for a document to be considered a match.

We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.

This example returns all accounts of anybody who is 40 years old but doesn’t live in ID(aho):

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
In the previous section, we skipped over a little detail called the document score (the _score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document; the lower the score, the less relevant the document.

The bool query that we introduced in the previous section also supports filter clauses, which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed. As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.

This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part). We can substitute any other queries into the query and the filter parts. In the above case, the range query makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.

In addition to the match_all, match, bool, and range queries, there are a lot of other query types available that we won’t go into here. Since we already have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in learning and experimenting with the other query types.

Aggregations provide the ability to group and extract statistics from your data. To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count in descending order:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state.keyword" }
    }
  }
}
In SQL, the above aggregation is similar in concept to:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
And the response (partially shown):
{"took":29,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1000,"max_score":0.0,"hits":[]},"aggregations":{"group_by_state":{"doc_count_error_upper_bound":20,"sum_other_doc_count":770,"buckets":[{"key":"ID","doc_count":27},{"key":"TX","doc_count":27},{"key":"AL","doc_count":25},{"key":"MD","doc_count":25},{"key":"TN","doc_count":23},{"key":"MA","doc_count":21},{"key":"NC","doc_count":21},{"key":"ND","doc_count":21},{"key":"ME","doc_count":20},{"key":"MO","doc_count":20}]}}}
We can see that there are 27 accounts in ID (Idaho), followed by 27 accounts in TX (Texas), followed by 25 accounts in AL (Alabama), and so forth.

Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.

Building on the previous aggregation, this example calculates the average account balance by state (again only for the top 10 states sorted by count in descending order):

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state.keyword" },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all aggregations: you can nest aggregations inside aggregations arbitrarily to extract the pivoted summarizations that you require from your data.

Building on the previous aggregation, let’s now sort on the average balance in descending order:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": { "average_balance": "desc" }
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
GET /bank/_search {"size":0,"aggs":{"group_by_age":{"range":{"field":"age","ranges":[{"from":20,"to":30},{"from":30,"to":40},{"from":40,"to":50}]},"aggs":{"group_by_gender":{"terms":{"field":"gender.keyword"},"aggs":{"average_balance":{"avg":{"field":"balance"}}}}}}}}