ES Study Notes 7: Multifield Search

multifield search

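First, a quick review of filters. One important filter is term, and its plural form, terms, can match several values at once. Note that searching for multiple values is not the same thing as searching multiple fields.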

{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "title": ["1", "2", "3"]
        }
      }
    }
  }
}

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }},
        { "bool":  {
          "should": [
            { "match": { "translator": "Constance Garnett" }},
            { "match": { "translator": "Louise Maude"      }}
          ]
        }}
      ]
    }
  }
}
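What changes if the translator clauses are not wrapped in their own bool query and instead sit at the same level as the other should clauses?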

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }},
        { "match": { "translator": "Constance Garnett" }},
        { "match": { "translator": "Louise Maude"      }}
      ]
    }
  }
}

The answer lies in how the score is calculated. The bool query runs each match query, adds their scores together, then multiplies by the number of matching clauses, and divides by the total number of clauses. Each clause at the same level has the same weight. In the first query, the nested bool query containing the translator clauses counts for one-third of the total score. If we put the translator clauses at the same level as title and author, as in the second query, they reduce the contribution of the title and author clauses to one-quarter each.
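
To make the arithmetic concrete, here is a minimal Python sketch that follows the description above literally: sum the matching scores, multiply by the number of matching clauses, divide by the total number of clauses. The function name and the per-clause scores are made up for illustration, and real Elasticsearch scoring involves further factors that are ignored here.

```python
def bool_should_score(clause_scores):
    """Score a bool/should query as described in the text.

    clause_scores has one entry per should clause: a float if the clause
    matched, or None if it did not match.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    # Sum of matching scores, scaled by (matching clauses / total clauses).
    return sum(matched) * len(matched) / len(clause_scores)


# Hypothetical per-clause scores: only one of the two translators matches.
title, author, garnett, maude = 1.0, 1.0, 1.0, None

# Nested form: both translator clauses share one top-level slot.
translator = bool_should_score([garnett, maude])
nested = bool_should_score([title, author, translator])

# Flat form: all four clauses sit at the same level.
flat = bool_should_score([title, author, garnett, maude])

print(translator, nested, flat)
```

In the nested form the missing translator only penalizes the inner bool query; in the flat form it lowers the scaling factor for the whole query, title and author included.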

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": {
            "title":  {
              "query": "War and Peace",
              "boost": 2
        }}},
        { "match": {
            "author":  {
              "query": "Leo Tolstoy",
              "boost": 2
        }}},
        { "bool":  {
            "should": [
              { "match": { "translator": "Constance Garnett" }},
              { "match": { "translator": "Louise Maude"      }}
            ]
        }}
      ]
    }
  }
}

single query string

best fields

dis_max query

Instead of the bool query, we can use the dis_max or Disjunction Max Query. Disjunction means or (while conjunction means and), so the Disjunction Max Query simply means: return documents that match any of these queries, and return the score of the best-matching query:

{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body":  "Brown fox" }}
      ]
    }
  }
}

A regular bool query adds up the scores of all matching clauses and scales the sum by the fraction of clauses that matched, whereas dis_max returns only the score of the single best-matching query. A simple dis_max query like the one above therefore picks the single best-matching field and ignores the others.

tie_breaker

It is possible, however, to also take the _score from the other matching clauses into account, by specifying the tie_breaker parameter:

{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body":  "Quick pets" }}
      ],
      "tie_breaker": 0.3
    }
  }
}

The tie_breaker parameter makes the dis_max query behave more like a halfway house between dis_max and bool. It changes the score calculation as follows:

  1. Take the _score of the best-matching clause.
  2. Multiply the score of each of the other matching clauses by the tie_breaker.
  3. Add them all together and normalize.
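In other words, the best-matching clause keeps its full score, the scores of the other matching clauses are multiplied by tie_breaker, and everything is added together and then normalized (I am not sure whether "normalize" here means taking an average).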

With the tie_breaker, all matching clauses count, but the best-matching clause counts most.

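It is best to keep tie_breaker between 0.1 and 0.4, so that the best-matching clause still dominates, which is the whole point of dis_max.

Put differently, with tie_breaker set to 0 the query behaves as a plain dis_max, and with tie_breaker set to 1 every matching clause counts in full, which makes it behave much like a bool query.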
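
The first two steps of the calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the clause scores are invented, and the final normalization step is left out.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times every other matching score.

    clause_scores has one entry per clause: a float if it matched, else None.
    Normalization (step 3 in the text) is deliberately not modelled.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    others = sum(matched) - best
    return best + tie_breaker * others


# Hypothetical scores for the title and body clauses.
title_score, body_score = 2.0, 1.0

plain = dis_max_score([title_score, body_score])                     # tie_breaker 0
blended = dis_max_score([title_score, body_score], tie_breaker=0.3)
summed = dis_max_score([title_score, body_score], tie_breaker=1.0)

print(plain, blended, summed)
```

With tie_breaker at 0 only the best clause counts, and at 1 the result is simply the sum of all matching clauses; values in between let the weaker clauses break ties without overtaking the best one.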

multi_match query

By default, the multi_match query runs as type best_fields, which means that it generates a match query for each field and wraps them in a dis_max query. For example:

{
  "multi_match": {
    "query":                "Quick brown fox",
    "type":                 "best_fields",
    "fields":               [ "title", "body" ],
    "tie_breaker":          0.3,
    "minimum_should_match": "30%"
  }
}

The score is taken from whichever of title and body matches best, plus the score of the other field multiplied by tie_breaker.

This is equivalent to the following query:

{
  "dis_max": {
    "queries": [
      {
        "match": {
          "title": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      {
        "match": {
          "body": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      }
    ],
    "tie_breaker": 0.3
  }
}
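
The rewrite from multi_match to dis_max is mechanical, so it can be sketched as a small Python function. This is not how Elasticsearch implements it internally; expand_best_fields is a hypothetical helper that only handles a plain list of field names and the parameters used in the example above (no wildcards, no ^ boosts).

```python
import json


def expand_best_fields(multi_match):
    """Rewrite a best_fields multi_match body as an equivalent dis_max body."""
    params = dict(multi_match)
    fields = params.pop("fields")
    tie_breaker = params.pop("tie_breaker", None)
    params.pop("type", None)  # best_fields is the default type
    # The remaining parameters (query, minimum_should_match) are repeated
    # inside one match clause per field.
    queries = [{"match": {field: dict(params)}} for field in fields]
    dis_max = {"queries": queries}
    if tie_breaker is not None:
        dis_max["tie_breaker"] = tie_breaker
    return {"dis_max": dis_max}


body = {
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": ["title", "body"],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%",
}
expanded = expand_best_fields(body)
print(json.dumps(expanded, indent=2))
```

The printed structure has the same shape as the dis_max query shown above, with one match clause per field.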

using wildcards in field names

Field names can be specified with wildcards:

{
  "multi_match": {
    "query":  "Quick brown fox",
    "fields": "*_title"
  }
}
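
To get a feel for which fields such a pattern selects, the matching can be imitated in Python with shell-style wildcards. This is only an approximation of the behavior for a simple * pattern, and the list of field names is hypothetical.

```python
from fnmatch import fnmatchcase


def matching_fields(pattern, field_names):
    """Return the field names that match a simple * wildcard pattern."""
    return [name for name in field_names if fnmatchcase(name, pattern)]


# Hypothetical fields in a mapping.
fields = ["book_title", "chapter_title", "title", "body"]
selected = matching_fields("*_title", fields)
print(selected)
```

Note that a field named just title is not selected, because the pattern requires the _title suffix including the underscore.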

boosting individual fields

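An individual field can be given its own scoring weight with the caret (^) syntax. Here chapter_title gets a boost of 2: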

{
  "multi_match": {
    "query":  "Quick brown fox",
    "fields": [ "*_title", "chapter_title^2" ]
  }
}

most fields

To match broadly and still rank the most precise matches first, we can index the same text in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about word proximity. These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the better.

A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points and is pushed up the results list.

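So what is most_fields about? Different analyzers extract different tokens from the same text; the whitespace analyzer, for example, only splits words on whitespace. jump, jumped, and jumping all share the stem jump, but depending on the analyzer they may end up as three distinct tokens. The more signal fields a document matches, the more extra points it earns and the higher it appears in the results list.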
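
The ranking effect can be illustrated with a simplified Python sketch in which a document's score is the plain sum of the scores of every field it matches. The field name title.std (an unstemmed copy of title), the document names, and the scores are all assumptions made for this example.

```python
def most_fields_score(field_scores):
    """Sum the scores of every field that matched (None means no match)."""
    return sum(s for s in field_scores.values() if s is not None)


# Both documents match the broad, stemmed title field; only one of them
# also matches the more precise signal field.
docs = {
    "doc_stemmed_only": {"title": 1.0, "title.std": None},
    "doc_exact_form": {"title": 1.0, "title.std": 0.5},
}
ranked = sorted(docs, key=lambda d: most_fields_score(docs[d]), reverse=True)
print(ranked)
```

Both documents are returned because both match the main field, but the one that also matches the signal field is pushed to the top.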

cross-fields entity search

GET /books/_search
{
  "query": {
    "multi_match": {
      "query":  "peter smith",
      "type":   "cross_fields",
      "fields": [ "title^2", "description" ]
    }
  }
}

field-centric queries

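Matching an entity whose parts are spread across several fields, such as first_name and last_name, is awkward for field-centric queries. We could store the combined value in an extra field in every document ourselves. While this would work, we don't like having to store redundant data. Instead, Elasticsearch offers us two solutions, one at index time and one at search time.

The index-time solution is to copy the contents of the relevant fields into a single field as documents are indexed: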

PUT /my_index
{
  "mappings": {
    "person": {
      "properties": {
        "first_name": {
          "type":    "string",
          "copy_to": "full_name"
        },
        "last_name": {
          "type":    "string",
          "copy_to": "full_name"
        },
        "full_name": {
          "type":    "string"
        }
      }
    }
  }
}

exact-value fields


Avoid using not_analyzed fields in multi_match queries.

