ES Study Notes 4: Query DSL

queries and filters

Although we refer to the query DSL, in reality there are two DSLs: the query DSL and the filter DSL. Query clauses and filter clauses are similar in nature, but have slightly different purposes.

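filter: answers yes or no for each document. Filters are fast, can be cached, and are generally used for exact-value lookups.

query: answers how relevant each document is to the search text. Query results are not cached, and queries are generally used for full-text search.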

most important queries and filters

term filter
{ "term": { "field": "value" }}
terms filter
{ "terms": { "field": ["a", "b"] }}
range filter
{
    "range": {
        "age": {
            "gte": 20,
            "lt":  30
        }
    }
}
exists and missing filter

The exists and missing filters are used to find documents in which the specified field either has one or more values (exists) or doesn’t have any values (missing). They are similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL.
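
A minimal Python sketch of the semantics described above. This is not Elasticsearch code: the helper names `exists` and `missing` and the sample documents are made up for illustration, and it assumes that a field holding null or an empty list counts as having no values.

```python
def exists(doc, field):
    """Return True if the field holds at least one non-null value."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)


def missing(doc, field):
    """Return True if the field is absent or holds no non-null values."""
    return not exists(doc, field)


# Hypothetical sample documents.
docs = [
    {"tags": ["search"]},
    {"tags": ["search", "open_source"]},
    {"other_field": "some data"},
    {"tags": None},
    {"tags": ["search", None]},
]

with_tags = [i for i, d in enumerate(docs) if exists(d, "tags")]
without_tags = [i for i, d in enumerate(docs) if missing(d, "tags")]
print(with_tags, without_tags)
```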

bool filter

Used to combine several conditions into one compound query. It accepts three kinds of clauses: must, should, and must_not.

{
    "query": {
        "bool": {
            "must": {
                "match": { "text": "fadsfdasfds" }
            }
        }
    }
}

QUERIES:

MATCH

The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

{ "match": { "tweet": "About Search" }}

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value:

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}
For exact-value searches, you probably want to use a filter instead of a query, as a filter will be cached.

MULTI_MATCH

bool query

combining queries with filters

GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match": { "email": "business opportunity" }},
            "filter": { "term": { "folder": "inbox" }}
        }
    }
}
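
A small Python sketch of how such a request body could be assembled in client code. The helper `filtered_query` is hypothetical (it is not part of any Elasticsearch client), and it targets the `filtered` syntax used in these notes, which later Elasticsearch versions replaced with a `filter` clause inside `bool`.

```python
import json


def filtered_query(query=None, filter_=None):
    """Build a search body that wraps an optional query and an optional filter."""
    inner = {}
    if query is not None:
        inner["query"] = query
    if filter_ is not None:
        inner["filter"] = filter_
    return {"query": {"filtered": inner}}


# Full-text query on email, restricted to the inbox folder.
body = filtered_query(
    query={"match": {"email": "business opportunity"}},
    filter_={"term": {"folder": "inbox"}},
)
print(json.dumps(body, indent=4))

# Filter only: the query part is simply left out.
filter_only = filtered_query(filter_={"term": {"folder": "inbox"}})
```

Leaving out `query` produces a filter-only body, which is the "just a filter" case.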

just a filter

While in query context, if you need to use a filter without a query (for instance, to match all emails in the inbox), you can just omit the query:

GET /_search
{
    "query": {
        "filtered": {
            "filter": { "term": { "folder": "inbox" }}
        }
    }
}

You seldom need to use a query as a filter, but we have included it for completeness' sake. The only time you may need it is when you need to use full-text matching while in filter context.

finding multiple exact values

GET /my_store/products/_search
{
    "query": {
        "filtered": {
            "filter": {
                "terms": {
                    "price": [20, 30]
                }
            }
        }
    }
}

contains, but does not equal

GET /my_index/my_type/_search
{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        { "term": { "tags": "search" } },
                        { "term": { "tag_count": 1 } }
                    ]
                }
            }
        }
    }
}

When used on date fields, the range filter supports date math operations. For example, if we want to find all documents that have a timestamp sometime in the last hour:

"range": {
    "timestamp": {
        "gt": "now-1h"
    }
}

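Date math can also be applied to an actual date rather than to now: append a double pipe (||) to the date, followed by the date math expression. For example, "lt": "2014-01-01 00:00:00||+1M" means less than January 1, 2014 plus one month.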

dealing with null values

GET /my_index/posts/_search
{
    "query": {
        "filtered": {
            "filter": {
                "exists": { "field": "tags" }
            }
        }
    }
}

GET /my_index/posts/_search
{
    "query": {
        "filtered": {
            "filter": {
                "missing": { "field": "tags" }
            }
        }
    }
}

all about caching

The cache is kept up to date in real time, so there is no need to worry about cached results expiring.
Leaf filters have to consult the inverted index on disk, so it makes sense to cache them. Compound filters, on the other hand, use fast bit logic to combine the bitsets resulting from their inner clauses, so it is efficient to recalculate them every time.
Certain leaf filters, however, are not cached by default, because it doesn’t make sense to do so. For example:

  • Script filters: The results from script filters cannot be cached because the meaning of the script is opaque to Elasticsearch.
  • Geo-filters: The geolocation filters, which we cover in more detail in Geolocation, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them.
  • Date ranges: Date ranges that use the now function (for example "now-1h") result in values accurate to the millisecond. Every time the filter is run, now returns a new time. Older filters will never be reused, so caching is disabled by default. However, when using now with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.

Sometimes the default caching strategy is not correct. Perhaps you have a complicated bool expression that is reused several times in the same query. Or you have a filter on a date field that will never be reused. The default caching strategy can be overridden on almost any filter by setting the _cache flag:
{
    "range": {
        "timestamp": {
            "gt": "2014-01-02 16:15:14"
        },
        "_cache": false
    }
}
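
To see why an unrounded now is a poor cache key while a rounded one is reusable, here is a rough Python sketch. The function `resolve` is hypothetical and understands only three expressions; it assumes rounding with /d can be modelled as truncating to the start of the day, which is a simplification of real date math.

```python
from datetime import datetime, timedelta


def resolve(expr, now):
    """Resolve a tiny subset of date math: 'now', 'now-<N>h' and 'now/d'."""
    if expr == "now":
        return now
    if expr == "now/d":
        # Assumption: model day rounding as truncation to midnight.
        return now.replace(hour=0, minute=0, second=0, microsecond=0)
    if expr.startswith("now-") and expr.endswith("h"):
        return now - timedelta(hours=int(expr[4:-1]))
    raise ValueError(f"unsupported expression: {expr}")


# Two requests that arrive five milliseconds apart.
first = datetime(2014, 1, 2, 16, 15, 14, 123000)
second = first + timedelta(milliseconds=5)

# The unrounded bound changes between requests, so a cached result is never reused.
unrounded_reusable = resolve("now-1h", first) == resolve("now-1h", second)
# The rounded bound stays the same for the whole day.
rounded_reusable = resolve("now/d", first) == resolve("now/d", second)
print(unrounded_reusable, rounded_reusable)
```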

filter order

The more selective a filter is, the earlier it should be placed. For example, if filter a returns 10,000 results and filter b returns 10, b should be placed before a.
Cached filters are very fast, so they should be placed before filters that are not cacheable.
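
A toy Python model of why the order matters. It is not how Elasticsearch evaluates filters internally; it only counts how many documents each filter has to look at when filters run one after another. The data set and the names `run_filters`, `broad` and `narrow` are made up for illustration.

```python
def run_filters(docs, filters):
    """Apply filters in order; return the survivors and the number of checks made."""
    checks = 0
    remaining = docs
    for predicate in filters:
        checks += len(remaining)  # every remaining doc is tested once
        remaining = [d for d in remaining if predicate(d)]
    return remaining, checks


# 20,000 hypothetical documents.
docs = [
    {"id": i, "folder": "inbox" if i % 2 == 0 else "spam", "starred": i < 10}
    for i in range(20000)
]


def broad(doc):
    """Filter a: matches 10,000 documents."""
    return doc["folder"] == "inbox"


def narrow(doc):
    """Filter b: matches 10 documents."""
    return doc["starred"]


result_a, checks_broad_first = run_filters(docs, [broad, narrow])
result_b, checks_narrow_first = run_filters(docs, [narrow, broad])
print(checks_broad_first, checks_narrow_first)
```

Both orders return the same documents, but putting the narrow filter first leaves far less work for the second filter.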

full-text search

Term-based queries

Queries like the term or fuzzy queries are low-level queries that have no analysis phase. They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains the term.

It is important to remember that the term query looks in the inverted index for the exact term only; it won’t match any variants like foo or FOO. It doesn’t matter how the term came to be in the index, just that it is. If you were to index ["Foo","Bar"] into an exact value not_analyzed field, or Foo Bar into an analyzed field with the whitespace analyzer, both would result in having the two terms Foo and Bar in the inverted index.
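
The exact-term behaviour can be illustrated with a very small inverted index in Python. This is a sketch with made-up helper names (`build_index`, `term_lookup`), not the actual Lucene data structure.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def term_lookup(index, term):
    """Look up the exact term only: no lowercasing, no analysis."""
    return sorted(index.get(term, set()))


docs = {
    1: ["Foo", "Bar"],     # values indexed as-is, like a not_analyzed field
    2: "Foo Bar".split(),  # split on whitespace only, like the whitespace analyzer
}
index = build_index(docs)

print(term_lookup(index, "Foo"))
print(term_lookup(index, "foo"))
```

Both documents end up with the terms Foo and Bar, and a lookup for foo finds nothing.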

Full-text queries

Queries like the match or query_string queries are high-level queries that understand the mapping of a field:

  • If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.
  • If you query an exact value (not_analyzed) string field, they will treat the whole query string as a single term.
  • But if you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

a single-word query

Our first example explains what happens when we use the match query to search within a full-text field for a single word:

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}

Elasticsearch executes the preceding match query as follows:

  1. Check the field type.

    The title field is a full-text (analyzed) string field, which means that the query string should be analyzed too.

  2. Analyze the query string.

    The query string QUICK! is passed through the standard analyzer, which results in the single term quick. Because we have just a single term, the match query can be executed as a single low-level term query.

  3. Find matching docs.

    The term query looks up quick in the inverted index and retrieves the list of documents that contain that term—in this case, documents 1, 2, and 3.

  4. Score each doc.

    The term query calculates the relevance _score for each matching document, by combining the term frequency (how often quick appears in the title field of each document), with the inverse document frequency (how often quick appears in the title field in all documents in the index), and the length of each field (shorter fields are considered more relevant). See What Is Relevance?.
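
The four steps above can be walked through with a rough Python sketch. Everything here is a simplification: `analyze` is only a stand-in for the standard analyzer, the four sample titles are assumed data, and the score is a bare tf × idf × length-norm product rather than the full Lucene formula.

```python
import math
import re


def analyze(text):
    """Stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Assumed sample documents (title field only).
docs = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}
analyzed = {doc_id: analyze(title) for doc_id, title in docs.items()}


def match_single_term(query_string):
    """Analyze the query, find matching docs, and score each one."""
    terms = analyze(query_string)  # step 2: analyze the query string
    if len(terms) != 1:
        raise ValueError("this sketch handles single-term queries only")
    term = terms[0]
    # Step 3: find the documents that contain the term.
    matching = [d for d, tokens in analyzed.items() if term in tokens]
    # Step 4: score each match.
    idf = 1 + math.log(len(docs) / (len(matching) + 1))
    scores = {}
    for doc_id in matching:
        tf = math.sqrt(analyzed[doc_id].count(term))
        norm = 1 / math.sqrt(len(analyzed[doc_id]))  # shorter fields score higher
        scores[doc_id] = tf * idf * norm
    return scores


scores = match_single_term("QUICK!")
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With this data, document 1 ranks first because its title is short, and document 3 beats document 2 because quick appears in it twice.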

multiword queries

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }
}

controlling precision

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {
                "query":                "quick brown dog",
                "minimum_should_match": "75%"
            }
        }
    }
}
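
How a percentage turns into a number of required terms can be sketched in Python. The functions `required_matches` and `matches` are hypothetical helpers that cover only positive percentages such as "75%", and they assume the computed value is rounded down.

```python
def required_matches(term_count, minimum_should_match):
    """Number of terms that must match for a positive percentage like '75%'."""
    percent = int(minimum_should_match.rstrip("%"))
    return term_count * percent // 100  # rounded down


def matches(title, query, minimum_should_match):
    """Check whether enough query terms appear in the title."""
    title_terms = set(title.lower().split())
    query_terms = query.lower().split()
    hits = sum(1 for term in query_terms if term in title_terms)
    return hits >= required_matches(len(query_terms), minimum_should_match)


print(required_matches(3, "75%"))
print(matches("The quick brown fox", "quick brown dog", "75%"))
print(matches("The lazy dog", "quick brown dog", "75%"))
```

With three terms, 75% comes to 2.25, which is rounded down to 2, so two of the three words must be present.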

controlling precision

GET /my_index/my_type/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "brown" }},
                { "match": { "title": "fox"   }},
                { "match": { "title": "dog"   }}
            ],
            "minimum_should_match": 2
        }
    }
}
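The query above is equivalent to the following (operator defaults to or). With three terms, 75% is rounded down to two required matches:
{
    "query": {
        "match": {
            "title": {
                "query":                "brown fox dog",
                "minimum_should_match": "75%"
            }
        }
    }
}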

boosting query clauses

This is about scoring: how do you give a document a higher score when a particular clause matches? Use boost.
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2
                    }
                }}
            ]
        }
    }
}
The boost parameter is used to increase the relative weight of a clause (with a boost greater than 1) or decrease the relative weight (with a boost between 0 and 1), but the increase or decrease is not linear. In other words, a boost of 2 does not result in double the _score.

controlling analysis

GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}
The validate-query API checks whether a query is valid, and its explanation shows how the query string was analyzed.
How does Elasticsearch pick the analyzer when a document is indexed? It walks down the following hierarchy and uses the first analyzer it finds:
  • The analyzer defined in the field mapping, else
  • The analyzer defined in the _analyzer field of the document, else
  • The default analyzer for the type, which defaults to
  • The analyzer named default in the index settings, which defaults to
  • The analyzer named default at node level, which defaults to
  • The standard analyzer

At search time, the sequence is slightly different:

  • The analyzer defined in the query itself, else
  • The analyzer defined in the field mapping, else
  • The default analyzer for the type, which defaults to
  • The analyzer named default in the index settings, which defaults to
  • The analyzer named default at node level, which defaults to
  • The standard analyzer
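
The search-time sequence is a plain fallback chain, which can be sketched in Python as follows. The function and its parameter names are invented for illustration; they do not correspond to any Elasticsearch API.

```python
def resolve_search_analyzer(
    query_analyzer=None,
    field_analyzer=None,
    type_default=None,
    index_default=None,
    node_default=None,
):
    """Return the first analyzer that is set, in search-time order."""
    candidates = (
        query_analyzer,   # defined in the query itself
        field_analyzer,   # defined in the field mapping
        type_default,     # default analyzer for the type
        index_default,    # analyzer named default in the index settings
        node_default,     # analyzer named default at node level
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "standard"     # final fallback


print(resolve_search_analyzer())
print(resolve_search_analyzer(field_analyzer="english"))
print(resolve_search_analyzer(query_analyzer="whitespace", field_analyzer="english"))
```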

configuring analyzers in practice

use index settings, not config files

The first thing to remember is that, even though you may start out using Elasticsearch for a single purpose or a single application such as logging, chances are that you will find more use cases and end up running several distinct applications on the same cluster. Each index needs to be independent and independently configurable. You don’t want to set defaults for one use case, only to have to override them for another use case later.

This rules out configuring analyzers at the node level. Additionally, configuring analyzers at the node level requires changing the config file on every node and restarting every node, which becomes a maintenance nightmare. It’s a much better idea to keep Elasticsearch running and to manage settings only via the API.

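In short: use index settings rather than editing the Elasticsearch config files. Once several nodes are running, changing the node-level defaults on each of them is inconvenient, so index-level analyzers are the recommended choice.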

relevance is broken!

However, for performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.

Each shard calculates the scores for its results independently. Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the foo documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term foo is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.

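To go straight to the conclusion: the problem is that you do not have enough data. With a very large number of documents, each shard becomes representative of the document distribution of the whole index (discrete math, probability theory?), so the local IDF values even out. Just make sure there is enough data in ES.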
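
The effect can be put into numbers with a short Python sketch. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard sizes and document counts are hypothetical and chosen only to show the trend.

```python
import math


def idf(num_docs, doc_freq):
    """Classic-style inverse document frequency, computed per shard."""
    return 1 + math.log(num_docs / (doc_freq + 1))


SHARD_SIZE = 100  # hypothetical number of documents on each shard

# Six foo documents spread evenly: three per shard.
balanced = [idf(SHARD_SIZE, 3), idf(SHARD_SIZE, 3)]

# Five foo documents on shard 1, one on shard 2.
skewed = [idf(SHARD_SIZE, 5), idf(SHARD_SIZE, 1)]

# With much more data, a small imbalance barely changes the local IDF.
large = [idf(1_000_000, 50_200), idf(1_000_000, 49_800)]

print(balanced, skewed, large)
```

On the small skewed shards the same term gets noticeably different weights, while on the large shards the two values are almost identical.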
