queries and filters
Although we refer to the query DSL, in reality there are two DSLs: the query DSL and the filter DSL. Query clauses and filter clauses are similar in nature, but have slightly different purposes.
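filter: the answer is yes or no, execution is fast, and results can be cached. Generally used for exact-value lookups.
query: asks how relevant each result is to the search text, and results cannot be cached. Generally used for full-text search.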
most important queries and filters
{
"range": {
"age": {
"gte": 20,
"lt": 30
}
}
}
exists and missing filter
The exists and missing filters are used to find documents in which the specified field either has one or more values (exists) or doesn't have any values (missing). They are similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL.
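To make the exists/missing semantics concrete, here is a minimal Python sketch that mimics them over in-memory documents. It does not call Elasticsearch; the helper names and the sample posts are assumptions made for illustration, and it treats null values and empty arrays as "no value".

```python
def has_values(doc, field):
    """Return True if the field holds at least one non-null value."""
    value = doc.get(field)
    if value is None:
        return False
    if isinstance(value, list):
        # An empty list or a list of only nulls counts as "no value".
        return any(v is not None for v in value)
    return True


def exists_filter(docs, field):
    """Keep documents where the field has one or more values."""
    return [d for d in docs if has_values(d, field)]


def missing_filter(docs, field):
    """Keep documents where the field has no values."""
    return [d for d in docs if not has_values(d, field)]


# Hypothetical sample data, not taken from a real index.
posts = [
    {"id": 1, "tags": ["search"]},
    {"id": 2, "tags": ["search", "open_source"]},
    {"id": 3, "other_field": "some data"},
    {"id": 4, "tags": None},
    {"id": 5, "tags": ["search", None]},
]

exists_ids = [d["id"] for d in exists_filter(posts, "tags")]
missing_ids = [d["id"] for d in missing_filter(posts, "tags")]
```

Every document lands in exactly one of the two result sets, which is what makes the pair comparable to IS NOT NULL and IS NULL.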
bool filter
Used to combine multiple clauses into a compound query. It accepts three kinds of clauses: must, should, and must_not.
{
"query":{
"bool":{
"must":{
"query":{
"match":{
"text":"fadsfdasfds"
}
}
}
}
}
}
QUERIES:
MATCH
The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.
If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:
{ "match": { "tweet": "About Search" }}
If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value:
{ "match": { "age": 26 }}
{ "match": { "date": "2014-09-01" }}
{ "match": { "public": true }}
{ "match": { "tag": "full_text" }}
For exact-value searches, you probably want to use a filter instead of a query, as a filter will be cached.
MULTI_MATCH
bool query
combining queries with filters
GET /_search
{
"query": {
"filtered": {
"query": { "match": { "email": "business opportunity" }},
"filter": { "term": { "folder": "inbox" }}
}
}
}
just a filter
While in query context, if you need to use a filter without a query (for instance, to match all emails in the inbox), you can just omit the query:
GET /_search
{
"query": {
"filtered": {
"filter": { "term": { "folder": "inbox" }}
}
}
}
You seldom need to use a query as a filter, but we have included it for completeness' sake. The only time you may need it is when you need to use full-text matching while in filter context.
finding multiple exact values
GET /my_store/products/_search
{
"query" : {
"filtered" : {
"filter" : {
"terms" : {
"price" : [20, 30]
}
}
}
}
}
contains, but does not equal
GET /my_index/my_type/_search
{
"query": {
"filtered" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tag_count" : 1 } }
]
}
}
}
}
}
When used on date fields, the range filter supports date math operations. For example, if we want to find all documents that have a timestamp sometime in the last hour:
"range" : {
"timestamp" : {
"gt" : "now-1h"
}
}
Date math can also be applied to an actual date rather than to now, by appending a double pipe (||) and the math expression to the date. For example, "lt" : "2014-01-01 00:00:00||+1M" means less than January 1, 2014 plus one month.
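As a rough illustration of what these date math expressions evaluate to, the following Python sketch reproduces "now-1h" and a "+1M" offset with the standard library. The function names and the fixed value of now are assumptions chosen so the example is repeatable; Elasticsearch performs this arithmetic itself on the server.

```python
from datetime import datetime, timedelta


def matches_last_hour(timestamp, now):
    """Mimic {"range": {"timestamp": {"gt": "now-1h"}}} for one value."""
    return timestamp > now - timedelta(hours=1)


def plus_one_month(d):
    """Mimic a "+1M" offset. Day-of-month overflow (e.g. Jan 31) is not handled."""
    year = d.year + (1 if d.month == 12 else 0)
    month = 1 if d.month == 12 else d.month + 1
    return d.replace(year=year, month=month)


# A fixed "now" for the example; a real query uses the current time.
now = datetime(2014, 9, 1, 12, 0, 0)
recent = datetime(2014, 9, 1, 11, 30, 0)
old = datetime(2014, 9, 1, 10, 30, 0)

# Upper bound for "less than January 1, 2014 plus one month".
upper = plus_one_month(datetime(2014, 1, 1))
```

Because now changes on every call, two runs of the same "now-1h" filter compare against different instants.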
dealing with null values
GET /my_index/posts/_search
{
"query" : {
"filtered" : {
"filter" : {
"exists" : { "field" : "tags" }
}
}
}
}
GET /my_index/posts/_search
{
"query" : {
"filtered" : {
"filter": {
"missing" : { "field" : "tags" }
}
}
}
}
all about caching
Script filters cannot be cached because the meaning of the script is opaque to Elasticsearch.
Geo-filters
The geolocation filters, which we cover in more detail in Geolocation, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them.
Date ranges
Date ranges that use the now function (for example "now-1h") result in values accurate to the millisecond. Every time the filter is run, now returns a new time. Older filters will never be reused, so caching is disabled by default. However, when using now with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.
Sometimes the default caching strategy is not correct. Perhaps you have a complicated bool expression that is reused several times in the same query. Or you have a filter on a date field that will never be reused. The default caching strategy can be overridden on almost any filter by setting the _cache flag:
{
"range" : {
"timestamp" : {
"gt" : "2014-01-02 16:15:14"
},
"_cache": false
}
}
filter order
The more selective a filter is, the earlier it should be placed. For example, if filter a returns 10,000 results and filter b returns 10, place b before a.
full-text search
Term-based queries
Queries like the term or fuzzy queries are low-level queries that have no analysis phase. They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains the term.
It is important to remember that the term query looks in the inverted index for the exact term only; it won't match any variants like foo or FOO. It doesn't matter how the term came to be in the index, just that it is. If you were to index ["Foo","Bar"] into an exact value not_analyzed field, or Foo Bar into an analyzed field with the whitespace analyzer, both would result in having the two terms Foo and Bar in the inverted index.
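The exact-term behaviour can be sketched with a toy inverted index in Python. This is only an illustration of the idea, not how Lucene stores its index; build_index, whitespace_analyzer, and term_query are hypothetical helpers.

```python
from collections import defaultdict


def whitespace_analyzer(text):
    """Split on whitespace only; case is preserved."""
    return text.split()


def build_index(docs, analyze):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index, term):
    """Look up the exact term; no analysis is applied to the input."""
    return sorted(index.get(term, set()))


index = build_index({1: "Foo Bar"}, whitespace_analyzer)

exact_hits = term_query(index, "Foo")      # the exact term is in the index
lowercase_hits = term_query(index, "foo")  # a variant is a different term
uppercase_hits = term_query(index, "FOO")
```

Only the lookup for Foo finds the document, because foo and FOO were never written to the index.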
Queries like the match or query_string queries are high-level queries that understand the mapping of a field:
- If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.
- If you query an exact value (not_analyzed) string field, they will treat the whole query string as a single term.
- But if you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.
a single-word query
Our first example explains what happens when we use the match query to search within a full-text field for a single word:
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": "QUICK!"
}
}
}
Elasticsearch executes the preceding match query as follows:
- Check the field type. The title field is a full-text (analyzed) string field, which means that the query string should be analyzed too.
- Analyze the query string. The query string QUICK! is passed through the standard analyzer, which results in the single term quick. Because we have just a single term, the match query can be executed as a single low-level term query.
- Find matching docs. The term query looks up quick in the inverted index and retrieves the list of documents that contain that term; in this case, documents 1, 2, and 3.
- Score each doc. The term query calculates the relevance _score for each matching document, by combining the term frequency (how often quick appears in the title field of each document), with the inverse document frequency (how often quick appears in the title field in all documents in the index), and the length of each field (shorter fields are considered more relevant). See What Is Relevance?
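The four steps above can be sketched in Python. This is a simplified model under several assumptions: the four sample titles are hypothetical, simple_analyzer only approximates the standard analyzer (lowercasing and dropping punctuation), and the score is a reduced form of Lucene's classic TF/IDF (square root of term frequency, 1 + ln(numDocs / (docFreq + 1)) for IDF, and 1 / sqrt(field length) for the length norm) without the other factors a real _score includes.

```python
import math
import re


def simple_analyzer(text):
    """Rough stand-in for the standard analyzer: lowercase, keep word characters."""
    return re.findall(r"\w+", text.lower())


# Hypothetical documents; keys are document IDs.
titles = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}
analyzed = {doc_id: simple_analyzer(t) for doc_id, t in titles.items()}


def match_single_word(query_string):
    """Analyze the query, look up the single term, and score each hit."""
    terms = simple_analyzer(query_string)
    if len(terms) != 1:
        raise ValueError("this sketch only handles single-term queries")
    term = terms[0]

    # Find matching docs.
    hits = {d: toks for d, toks in analyzed.items() if term in toks}

    # Score each doc: term frequency * inverse document frequency * length norm.
    idf = 1 + math.log(len(analyzed) / (len(hits) + 1))
    scores = {}
    for doc_id, toks in hits.items():
        tf = math.sqrt(toks.count(term))
        norm = 1 / math.sqrt(len(toks))
        scores[doc_id] = tf * idf * norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


results = match_single_word("QUICK!")
ranked_ids = [doc_id for doc_id, _ in results]
```

In this model the short title of document 1 ranks first, and document 3 outranks document 2 because quick appears twice in a field of the same length.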
multiword queries
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "BROWN DOG!",
"operator": "and"
}
}
}
}
controlling precision
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "quick brown dog",
"minimum_should_match": "75%"
}
}
}
}
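A percentage in minimum_should_match has to be turned into a whole number of clauses. The sketch below assumes the documented behaviour that the computed number is rounded down; required_matches is a hypothetical helper, and negative values and combined expressions are deliberately not handled.

```python
def required_matches(optional_clauses, minimum_should_match):
    """Return how many optional clauses must match.

    Accepts an integer or a positive percentage string such as "75%".
    Percentages are rounded down, which is the assumed behaviour here.
    """
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return (optional_clauses * percent) // 100
    return int(minimum_should_match)


# "quick brown dog" analyzes to three terms.
needed_75 = required_matches(3, "75%")
needed_66 = required_matches(3, "66%")
needed_int = required_matches(3, 2)
```

With three terms, "75%" works out to two required matches, while "66%" rounds down to one.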
The preceding match query is equivalent to a bool query of this form:
GET /my_index/my_type/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "brown" }},
{ "match": { "title": "fox" }},
{ "match": { "title": "dog" }}
],
"minimum_should_match": 2
}
}
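}
minimum_should_match can also be written as a percentage, such as "66%". Percentages are rounded down, so with three should clauses "66%" requires only one match; "67%" is needed to require two.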
boosting query clauses
This is about scoring: if a particular clause matches, how can it be made to contribute more to the score? Use boost.
GET /_search
{
"query": {
"bool": {
"must": {
"match": {
"content": {
"query": "full text search",
"operator": "and"
}
}
},
"should": [
{ "match": {
"content": {
"query": "Elasticsearch",
"boost": 3
}
}},
{ "match": {
"content": {
"query": "Lucene",
"boost": 2
}
}}
]
}
}
}
The boost parameter is used to increase the relative weight of a clause (with a boost greater than 1) or decrease the relative weight (with a boost between 0 and 1), but the increase or decrease is not linear. In other words, a boost of 2 does not result in double the _score.
controlling analysis
The validate-query API can check whether a query is valid and show how the query string is analyzed.
GET /my_index/my_type/_validate/query?explain
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Foxes"}},
{ "match": { "english_title": "Foxes"}}
]
}
}
}
At index time, Elasticsearch looks for an analyzer in this order:
- The analyzer defined in the field mapping, else
- The analyzer defined in the _analyzer field of the document, else
- The default analyzer for the type, which defaults to
- The analyzer named default in the index settings, which defaults to
- The analyzer named default at node level, which defaults to
- The standard analyzer
At search time, the sequence is slightly different:
- The analyzer defined in the query itself, else
- The analyzer defined in the field mapping, else
- The default analyzer for the type, which defaults to
- The analyzer named default in the index settings, which defaults to
- The analyzer named default at node level, which defaults to
- The standard analyzer
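Both lookup sequences are "first one defined wins" chains, which can be sketched in Python as below. The function and parameter names are hypothetical and only model the order of precedence listed above; they are not part of any Elasticsearch API.

```python
def first_defined(*candidates):
    """Return the first candidate that is set, falling back to "standard"."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "standard"


def index_time_analyzer(field_mapping=None, doc_analyzer=None,
                        type_default=None, index_default=None,
                        node_default=None):
    """Order used when indexing: mapping, document, type, index, node."""
    return first_defined(field_mapping, doc_analyzer, type_default,
                         index_default, node_default)


def search_time_analyzer(query=None, field_mapping=None,
                         type_default=None, index_default=None,
                         node_default=None):
    """Order used when searching: query, mapping, type, index, node."""
    return first_defined(query, field_mapping, type_default,
                         index_default, node_default)


nothing_configured = index_time_analyzer()
index_level_only = index_time_analyzer(index_default="simple")
query_overrides_mapping = search_time_analyzer(query="english",
                                               field_mapping="whitespace")
```

The only difference between the two chains is the first link: the document's own _analyzer field is consulted at index time, while an analyzer given in the query takes priority at search time.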
configuring analyzers in practice
use index settings, not config files
The first thing to remember is that, even though you may start out using Elasticsearch for a single purpose or a single application such as logging, chances are that you will find more use cases and end up running several distinct applications on the same cluster. Each index needs to be independent and independently configurable. You don’t want to set defaults for one use case, only to have to override them for another use case later.
This rules out configuring analyzers at the node level. Additionally, configuring analyzers at the node level requires changing the config file on every node and restarting every node, which becomes a maintenance nightmare. It’s a much better idea to keep Elasticsearch running and to manage settings only via the API.
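Use index settings rather than editing the Elasticsearch config file. When several nodes are running, changing the node defaults on each of them is inconvenient, so index-level analyzers are recommended.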
relevance is broken!
However, for performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
Each shard scores its own query results independently. Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the foo documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term foo is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.
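In short, the conclusion is that you do not have enough data. With a very large amount of data, each shard becomes representative of the document distribution of the whole index (discrete mathematics, probability theory?), so it is enough to make sure Elasticsearch holds plenty of data.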
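The effect of a local IDF can be shown with a small calculation. The sketch assumes Lucene's classic formula, idf = 1 + ln(numDocs / (docFreq + 1)), and a hypothetical layout of ten documents per shard; only the five-versus-one split of the foo documents comes from the scenario above.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency."""
    return 1 + math.log(num_docs / (doc_freq + 1))


# Assumed shard sizes: 10 documents each.
shard_1_idf = idf(num_docs=10, doc_freq=5)   # foo is common on shard 1
shard_2_idf = idf(num_docs=10, doc_freq=1)   # foo is rare on shard 2

# What the IDF would be if it were computed across the whole index.
global_idf = idf(num_docs=20, doc_freq=6)
```

The same term gets a noticeably higher weight on shard 2 than on shard 1, and neither local value equals the index-wide one. As shards grow and documents spread evenly, the local values converge toward the global value.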