Elasticsearch Relevance Scoring: The TF/IDF Algorithm Explained

編程界的小學生

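I. Algorithm overview

Elasticsearch uses the TF/IDF algorithm to compute a relevance score, and the score determines the sort order: on every search, documents with a higher score rank higher. (Since version 5.0 the default similarity is BM25, which refines TF/IDF but rests on the same three ideas described in this section.)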

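1. TF

1.1 Concept

Term Frequency (TF) is how many times each term of the search text appears in the field being searched. The more often it appears, the more relevant the document.

1.2 Example

For example:
doc1: hello world, I love you
doc2: hello, I love you, too

Search: hello world. Elasticsearch first analyzes the query text into two terms, hello and world, and looks them up in the inverted index that was built when the documents were indexed.
doc1 matches both terms, while doc2 matches only one (hello), so doc1 gets the higher score and ranks ahead of doc2.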
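The counting in this example can be sketched in a few lines of Python. This is only an illustration: `tokenize` and `matched_terms` are hypothetical helpers, and the regex tokenizer is a crude stand-in for a real Elasticsearch analyzer.

```python
import re


def tokenize(text):
    # Crude stand-in for an analyzer: lowercase, keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def matched_terms(query, doc):
    # For every query term, count how often it occurs in the document.
    doc_tokens = tokenize(doc)
    return {term: doc_tokens.count(term) for term in tokenize(query)}


doc1 = "hello world, I love you"
doc2 = "hello, I love you, too"

tf_doc1 = matched_terms("hello world", doc1)
tf_doc2 = matched_terms("hello world", doc2)

print("doc1:", tf_doc1, "total:", sum(tf_doc1.values()))
print("doc2:", tf_doc2, "total:", sum(tf_doc2.values()))
```

doc1 collects a match for both terms and doc2 only for hello, which is why doc1 ranks first in the example.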

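2. IDF

2.1 Concept

Inverse Document Frequency (IDF) measures how common each term of the search text is across all documents in the index. The more documents a term appears in, the less weight a match on that term carries.

2.2 Example

For example:
doc1: hello, love you
doc2: hi world, I love you

Search: hello world. The Elasticsearch analyzer splits it into two terms, hello and world.
First, hello appears once in doc1 and world appears once in doc2, so each document matches one term once. Now suppose the index holds 10,000 documents, hello appears in 2,000 of them, and world appears in only 100. Then doc2 is more relevant, because the term it matches (world) is much rarer in the index.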
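To put numbers on this example, here is a minimal sketch that uses one common smoothed form of IDF, `1 + ln(N / (df + 1))`. That formula is an assumption chosen for illustration; the exact expression Elasticsearch applies depends on the version and the configured similarity. The function name `idf` is hypothetical.

```python
import math


def idf(total_docs, doc_freq):
    # Assumed textbook-style smoothed IDF: rarer terms get a larger weight.
    return 1.0 + math.log(total_docs / (doc_freq + 1))


TOTAL_DOCS = 10_000

idf_hello = idf(TOTAL_DOCS, 2_000)  # hello is common
idf_world = idf(TOTAL_DOCS, 100)    # world is rare

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
```

Because world is found in far fewer documents than hello, its weight is larger, so the document that matches world (doc2) comes out ahead.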

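3. Supplement

3.1 Description

Field-Length Norm: the longer the content of the field being searched, the weaker the relevance of a match in it.

3.2 Example

For example:
doc1: { "title": "hello java", "content": "xxxxxxxxxx (10,000 words)" }
doc2: { "title": "Hi java", "content": "xxxxxxxxxx (10,000 words), Hi world" }

Search: hello world. The Elasticsearch analyzer splits it into two terms, hello and world.
First, each document matches one term once: hello in the title of doc1 and world in the content of doc2. Assume further that both terms are equally common across the whole index (unlike the lopsided counts in the IDF example). Then doc1 is more relevant, because its match sits in title, which is far shorter than content (by roughly ten thousand words). So doc1 ranks ahead of doc2.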
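The effect of field length can be sketched with the classic normalization `1 / sqrt(number of terms in the field)`. This formula is an assumption taken from the traditional TF/IDF model, not necessarily what a given Elasticsearch version computes, and `field_length_norm` is a hypothetical name.

```python
import math


def field_length_norm(num_terms):
    # Assumed classic normalization: longer fields get a smaller factor.
    return 1.0 / math.sqrt(num_terms)


# doc1 matches "hello" in title ("hello java": 2 terms).
title_norm = field_length_norm(2)

# doc2 matches "world" in content (about 10,000 words plus "Hi world").
content_norm = field_length_norm(10_002)

print(f"title norm   = {title_norm:.4f}")
print(f"content norm = {content_norm:.4f}")
```

With TF and IDF assumed equal, the match in the two-term title keeps a much larger factor than the match buried in a ten-thousand-word content field.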

II. Demo

1. Data preparation

PUT /product/_doc/1
{
    "name": "xiaomi shouji",
    "desc": "niubi quanwangtong",
    "tags": ["niubi", "quanwangtong", "xiaomi", "shouji"]
}

PUT /product/_doc/2
{
    "name": "huawei shouji",
    "desc": "4G 5G",
    "tags": ["shouji"]
}

PUT /product/_doc/3
{
    "name": "xiaomi shouhuan",
    "desc": "quanzidong",
    "tags": ["shengdian", "xiaomi", "shouji"]
}

2. Run the search

GET /product/_search
{
  "query": {
    "match": {
      "tags": "shouji"
    }
  }  
}

The result order by id is 2 -> 3 -> 1:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.8847681,
    "hits" : [
      {
        "_index" : "product",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8847681,
        "_source" : {
          "name" : "huawei shouji",
          "desc" : "4G 5G",
          "tags" : [
            "shouji"
          ]
        }
      },
      {
        "_index" : "product",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.59321976,
        "_source" : {
          "name" : "xiaomi shouhuan",
          "desc" : "quanzidong",
          "tags" : [
            "shengdian",
            "xiaomi",
            "shouji"
          ]
        }
      },
      {
        "_index" : "product",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.50930655,
        "_source" : {
          "name" : "xiaomi shouji",
          "desc" : "niubi quanwangtong",
          "tags" : [
            "niubi",
            "quanwangtong",
            "xiaomi",
            "shouji"
          ]
        }
      }
    ]
  }
}

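3. Result analysis

TF first: shouji appears exactly once in the tags of every document, so TF is the same for all three.
Then IDF: all three documents match the same term in the same index, so IDF is the same as well.
That leaves the field-length norm: id=2 has the shortest tags field (1 tag), so it scores highest; id=3 is next (3 tags); id=1 is last (4 tags). With TF and IDF equal, field length decides the order: 2 -> 3 -> 1, which matches the search result.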
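The ordering can be checked with a rough BM25-style calculation over the three tags arrays. This is a sketch under stated assumptions: it uses the textbook BM25 formula with the usual defaults k1 = 1.2 and b = 0.75, treats each tag as one term, and computes statistics from these three documents only. The absolute values are therefore not expected to match the _score values in the response, which depend on the statistics of the actual index; only the ranking is the point. The function name `bm25` is hypothetical.

```python
import math

K1, B = 1.2, 0.75  # commonly used BM25 defaults (assumed here)


def bm25(tf, doc_len, avg_len, total_docs, doc_freq):
    # Textbook BM25: idf times a saturating, length-normalized tf.
    idf = math.log(1 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_factor = 1 - B + B * doc_len / avg_len
    return idf * tf * (K1 + 1) / (tf + K1 * length_factor)


tags = {
    "1": ["niubi", "quanwangtong", "xiaomi", "shouji"],
    "2": ["shouji"],
    "3": ["shengdian", "xiaomi", "shouji"],
}

term = "shouji"
avg_len = sum(len(t) for t in tags.values()) / len(tags)
doc_freq = sum(term in t for t in tags.values())

scores = {
    doc_id: bm25(t.count(term), len(t), avg_len, len(tags), doc_freq)
    for doc_id, t in tags.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)
```

Since tf and idf are identical for all three documents, the only input that differs is `doc_len`, so the shortest tags field wins.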
