ES Study Notes 5: Search Relevance

By default, results are returned sorted by relevance, with the most relevant docs first.

First, a look at sorting:

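{
    "query": { "match_all": {} },
    "from": 0,
    "size": 10,
    "sort": "field"
}

Here match_all stands in for any query. The "sort" value can take any of these forms:

"sort": "field"
"sort": ["field1", "field2"]
"sort": { "field": "desc" }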

When a field holds several values, "mode" chooses which one to sort on (min, max, avg, or sum):

"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}
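To make the effect of "mode" concrete, here is a minimal Python sketch of the idea, not Elasticsearch's implementation. The document IDs and the integer "dates" values are made up for illustration: each document's list of values is reduced to a single sort key, and the documents are then ordered by that key.

```python
from statistics import mean

# Hypothetical documents, each with a multi-valued "dates" field
# (plain integers stand in for real dates).
docs = {
    "doc1": [30, 10, 20],
    "doc2": [15, 40],
    "doc3": [5, 50, 25],
}

# Reducers corresponding to the sort modes min, max, avg and sum.
MODES = {"min": min, "max": max, "avg": mean, "sum": sum}

def sort_docs(docs, mode="min", order="asc"):
    """Reduce each document's values to one key, then order the documents."""
    reduce_fn = MODES[mode]
    keyed = [(reduce_fn(values), doc_id) for doc_id, values in docs.items()]
    keyed.sort(reverse=(order == "desc"))
    return [doc_id for _, doc_id in keyed]

print(sort_docs(docs, mode="min", order="asc"))
print(sort_docs(docs, mode="max", order="desc"))
```

With "mode": "min" and "order": "asc", the document holding the smallest single value comes first, regardless of its other values.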

String sorting and multifields

Analyzed string fields are also multivalue fields, but sorting on them seldom gives you the results you want. If you analyze a string like "fine old art", it results in three terms. We probably want to sort alphabetically on the first term, then the second term, and so forth, but Elasticsearch doesn’t have this information at its disposal at sort time.
In other words, sorting on an analyzed string field will most likely not give the expected order.

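The fix is to define a mapping that indexes the field twice: once analyzed for searching, and once as a not_analyzed "raw" subfield for sorting:

"tweet": {
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

The query then searches on tweet and sorts on tweet.raw:

GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}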
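The difference between sorting on the analyzed field and on the raw subfield can be sketched in Python. This is a rough illustration under two assumptions: the `analyze` function is a stand-in (lowercase plus whitespace split, not the real english analyzer), and an ascending sort on a multi-term field is taken to use the smallest term of each document.

```python
def analyze(text):
    """Stand-in analyzer: lowercase and split on whitespace.
    This is NOT the real english analyzer (no stemming, no stopwords)."""
    return text.lower().split()

# Hypothetical field values.
titles = ["fine old art", "brave new world", "zebra crossing"]

# Analyzed field: each value is a bag of terms, so an ascending sort can only
# pick one term per document (assumed here to be the smallest one).
by_analyzed = sorted(titles, key=lambda t: min(analyze(t)))

# not_analyzed copy: the whole original string is the sort key.
by_raw = sorted(titles)

print(by_analyzed)
print(by_raw)
```

On the analyzed field, "fine old art" sorts first because of its term "art"; on the raw copy it sorts by its first letter "f", which is what a reader would normally expect.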
Relevance of search results

The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

Term frequency (the more often the term appears in this document, the higher the relevance)
How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
Inverse document frequency (the more documents the term appears in, the lower the relevance)
How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.
Field-length norm (the longer the field, the lower the relevance)
How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
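The three factors above can be written down as small functions. The sketch below uses the formulas of Lucene's classic TF/IDF similarity (square root of the term count, one plus the natural log of the document ratio, inverse square root of the field length). It is a simplified illustration: the real score also includes query normalization, coordination and boosts, and the name `field_weight` and the sample numbers are assumptions made for this example.

```python
import math

def tf(freq):
    """Term frequency: grows with the number of occurrences in the field."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: shrinks as more documents contain the term."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(num_terms)

def field_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field (boosts etc. omitted)."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)

# Same term, same index statistics: a 4-term title versus a 400-term body.
in_title = field_weight(freq=1, num_docs=1000, doc_freq=9, num_terms=4)
in_body = field_weight(freq=1, num_docs=1000, doc_freq=9, num_terms=400)
print(in_title, in_body)
```

With everything else equal, the single occurrence in the short title weighs ten times more than the one in the long body, since the norm is 1/2 versus 1/20.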
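Adding the explain parameter to a search shows how each score was computed. The output also adds information about the shard and the node that the document came from, which is useful to know because term and document frequencies are calculated per shard, rather than per index. In other words, relevance scores are computed per shard, not per index.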
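Why the shard matters can be shown with a small sketch. It assumes the classic IDF formula, 1 + ln(numDocs / (docFreq + 1)), and two hypothetical shards whose contents are invented for illustration: the same term gets a different IDF on each shard, and neither matches the value computed over the whole index.

```python
import math

def idf(num_docs, doc_freq):
    """Classic inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards; each document is reduced to its list of terms.
shards = {
    "shard0": [["honeymoon", "trip"], ["honeymoon", "hotel"],
               ["honeymoon", "beach"], ["weather"]],
    "shard1": [["honeymoon", "plans"], ["news"], ["sports"], ["weather"]],
}

def idf_for(docs, term):
    """IDF of a term computed only from the given set of documents."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return idf(len(docs), doc_freq)

per_shard = {name: idf_for(docs, "honeymoon") for name, docs in shards.items()}
all_docs = [doc for docs in shards.values() for doc in docs]
index_wide = idf_for(all_docs, "honeymoon")

print(per_shard)
print(index_wide)
```

On small test indices with unevenly distributed documents this difference is noticeable; it tends to even out as each shard holds more documents.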

GET /_search?explain
{
    "query": { "match": { "tweet": "honeymoon" } }
}

Remember to use explain only for debugging. Turn it off in production, because its performance overhead is high.

fielddata

To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata.
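Conceptually, loading fielddata means turning the inverted index (term to documents) back into a per-document view (document to values) held in memory. The Python sketch below illustrates only that idea; the sample index is hypothetical, and the real in-memory structure is far more compact than a dictionary of lists.

```python
from collections import defaultdict

# Hypothetical inverted index for one field: term -> IDs of documents containing it.
inverted_index = {
    "2014-09-13": [1],
    "2014-09-14": [2, 3],
    "2014-09-20": [3],
}

def uninvert(inverted):
    """Build a doc -> values map from a term -> docs map."""
    fielddata = defaultdict(list)
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            fielddata[doc_id].append(term)
    return dict(fielddata)

fielddata = uninvert(inverted_index)

# With every value in memory, sorting is a lookup per document
# (here using each document's smallest value as its key).
ordered = sorted(fielddata, key=lambda doc_id: min(fielddata[doc_id]), reverse=True)
print(fielddata)
print(ordered)
```

Because the whole field is loaded for all documents, not just the ones matching the query, fielddata on a high-cardinality field can take a lot of memory.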
