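Elasticsearch Series: Data Modeling in Practice

Overview

Using real-world cases as background, this article compares how different technology components model data, then looks at the pros and cons of the common ways to run joined queries in ES, and finally covers the path_hierarchy tokenizer for file-system paths and the nested object type.

Data Model Comparison

In real projects, an e-commerce platform is commonly built on Java, MySQL, and Elasticsearch. We use the basic department-employee entities as the example.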

JavaBean Type Definition

As JavaBeans, the entities would be defined like this:

public class Department {
	private Long id;
	private String name;
	private String desc;
	private List<Long> userIds;
}

public class Employee {
	private Long id;
	private String name;
	private byte gender;
	private Department dept;
}

Database Model Definition

In a relational database (MySQL), the tables would be created like this:

create table t_department (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	-- desc is a reserved word in MySQL, so it has to be quoted with backticks
	`desc` varchar(80) not null,
	PRIMARY KEY (`id`)
);

create table t_employee (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	gender tinyint(1) not null,
	-- logical reference to t_department.id, enforced by the application
	dept_id bigint(20),
	PRIMARY KEY (`id`)
);

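Following the three normal forms, each entity gets its own table, and the tables are related through primary and foreign keys. Under current table-design conventions, foreign key constraints are no longer declared in the database; the relationship is enforced in the application layer instead.

ES Document Data Model

With the ES document data model, the document would be designed like this: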

{
	"deptId": 1,
	"deptname": "CEO Office",
	"desc": "A CEO with a vision",
	"employee": [
		{
			"userId": 1,
			"name": "Lily",
			"gender": 0
		},
		{
			"userId": 2,
			"name": "Lucy",
			"gender": 0
		},
		{
			"userId": 3,
			"name": "Tom",
			"gender": 1
		}
	]
}

ES is closer to an object-oriented data model: all related data is placed in a single document.
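To make the contrast with the relational layout concrete, here is a minimal Python sketch that folds department and employee rows into one document of the shape shown above. The function name build_department_doc and the sample rows (including the extra employee in department 2) are made up for illustration; they are not part of the original example.

```python
import json


def build_department_doc(dept_row, employee_rows):
    """Fold one department row and its employee rows into a single document."""
    return {
        "deptId": dept_row["id"],
        "deptname": dept_row["name"],
        "desc": dept_row["desc"],
        # Embed only the employees whose dept_id points at this department.
        "employee": [
            {"userId": e["id"], "name": e["name"], "gender": e["gender"]}
            for e in employee_rows
            if e["dept_id"] == dept_row["id"]
        ],
    }


# Hypothetical rows, shaped like t_department and t_employee.
dept = {"id": 1, "name": "CEO Office", "desc": "A CEO with a vision"}
employees = [
    {"id": 1, "name": "Lily", "gender": 0, "dept_id": 1},
    {"id": 2, "name": "Lucy", "gender": 0, "dept_id": 1},
    {"id": 3, "name": "Tom", "gender": 1, "dept_id": 1},
    {"id": 4, "name": "Bob", "gender": 1, "dept_id": 2},
]

doc = build_department_doc(dept, employees)
print(json.dumps(doc, indent=2))
```

The join that a relational query would perform at read time is done once here, at write time, which is the trade-off the rest of the article explores.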

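JOIN Queries

We use a blog site as the case background and model its blog posts and users.

Create separate documents for users and blog posts, splitting the entities much like the three normal forms in a database, and link them through a key field (userId).

First create the two entity documents, with one sample record each: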

PUT /blog/user/1
{
  "id":1,
  "username":"Lily",
  "age":18
}

PUT /website/article/1
{
  "title":"my first blog",
  "content":"this is my first blog, thank you",
  "userId":1
}

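Requirement: find the blog posts published by the user named Lily.
Steps: 1) Query the user documents by the name Lily to obtain the userId;
2) Use the userId returned by step 1 to build a new request and query the blog documents.
Sample requests: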

GET /blog/user/_search
{
  "query": {
    "match": {
      "username.keyword": "Lily"
    }
  }
}

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "userId": [
            "1"
          ]
        }
      }
    }
  }
}

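The steps above are known as an application-side join.

Pros: the structure is clear, the data is not redundant, and maintenance is easy.
Cons: the join happens in the application layer, so query performance drops sharply when many records have to be joined.

Suitable scenario: a two-level join where the first-level query is precise enough to return only a few results, while the second level holds a very large amount of data. In this example, looking up a userId by name returns relatively few records, so the second-level query stays fast even though the second level is business data and is bound to be very large.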
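The two-step application-side join described above can be sketched in Python as follows. This is only an illustration of how the second request is assembled from the first response: the helper names are hypothetical, no client library is used, and the sample response is a hand-written fragment in the usual hits.hits shape rather than real output.

```python
def build_user_query(username):
    """Step 1: look up the user document by exact name."""
    return {"query": {"match": {"username.keyword": username}}}


def build_article_query(user_search_response):
    """Step 2: turn the user hits into a terms filter on userId."""
    user_ids = [
        hit["_source"]["id"]
        for hit in user_search_response["hits"]["hits"]
    ]
    return {
        "query": {
            "constant_score": {
                "filter": {"terms": {"userId": user_ids}}
            }
        }
    }


# Hand-written stand-in for the response of GET /blog/user/_search.
sample_user_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"id": 1, "username": "Lily", "age": 18}}
        ]
    }
}

user_query = build_user_query("Lily")
article_query = build_article_query(sample_user_response)
print(article_query)
```

The size of the user_ids list is what decides whether this approach is viable: a handful of ids keeps the second request cheap, while thousands of ids make it slow.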

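Moderate Redundancy to Reduce Application-Side Joins

Basic Query

Continuing the example above, modify the blog document to carry a redundant copy of username, for example: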

PUT /website/article/2
{
  "title":"my second blog",
  "content":"this is my second blog, thank you",
  "userInfo": {
    "id":1,
    "username":"Lily"
  }
}

The query can then filter on username directly:

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "userInfo.username.keyword": "Lily"
        }
      }
    }
  }
}

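Pros: a single query is enough, so performance is good.
Cons: if a redundant field is updated, maintenance becomes very painful.

Suitable scenario: a moderate amount of redundancy is worthwhile because it reduces join queries; relational database designs often denormalize for the same reason. When choosing which fields to duplicate, prefer fields that are unlikely to change, so that fast queries are not paid for with painful updates.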
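To show what the update cost looks like, here is a hedged Python sketch that builds a request body for the _update_by_query API, which could be used to propagate a changed username to every article that embeds it. The article does not prescribe this approach; the helper name and the new username are assumptions, and the body would be sent as POST /website/_update_by_query.

```python
def build_username_sync(user_id, new_username):
    """Build an _update_by_query body that rewrites the embedded username."""
    return {
        "script": {
            "lang": "painless",
            # Overwrite the redundant copy stored inside each article.
            "source": "ctx._source.userInfo.username = params.username",
            "params": {"username": new_username},
        },
        # Only touch articles written by this user.
        "query": {"term": {"userInfo.id": user_id}},
    }


body = build_username_sync(1, "Lily2")  # "Lily2" is a made-up new name
print(body)
```

Every matching article is rewritten, so a user with many posts turns one rename into many document updates, which is why rarely-changing fields make the best candidates for redundancy.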

Aggregation and Grouping on Denormalized Data

Add some test data:

PUT /website/article/3
{
  "title":"my third blog",
  "content":"this is my third blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

PUT /website/article/4
{
  "title":"my 4th blog",
  "content":"this is my 4th blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

Grouped query: which posts did Lily publish, and which did Lucy publish?

GET website/article/_search
{
  "size": 0,
  "aggs": {
    "group_by_username": {
      "terms": {
        "field": "userInfo.username.keyword"
      },
      "aggs": {
        "top_articles": {
          "top_hits": {
            "size": 10,
            "_source": {
              "includes": "title"
            }
          }
        }
      }
    }
  }
}
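On the client side, the buckets returned by this terms plus top_hits aggregation can be flattened into a simple username-to-titles mapping. The Python sketch below assumes the standard response layout for these two aggregations; the function name is hypothetical and the sample response is a hand-written, abridged fragment based on the test data, not captured output.

```python
def titles_by_user(response):
    """Map each username bucket to the titles of its top articles."""
    result = {}
    buckets = response["aggregations"]["group_by_username"]["buckets"]
    for bucket in buckets:
        hits = bucket["top_articles"]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"]["title"] for hit in hits]
    return result


# Hand-written, abridged stand-in for the aggregation response.
sample_response = {
    "aggregations": {
        "group_by_username": {
            "buckets": [
                {
                    "key": "Lucy",
                    "doc_count": 2,
                    "top_articles": {"hits": {"hits": [
                        {"_source": {"title": "my third blog"}},
                        {"_source": {"title": "my 4th blog"}},
                    ]}},
                },
                {
                    "key": "Lily",
                    "doc_count": 1,
                    "top_articles": {"hits": {"hits": [
                        {"_source": {"title": "my second blog"}},
                    ]}},
                },
            ]
        }
    }
}

grouped = titles_by_user(sample_response)
print(grouped)
```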

File Search

File data has one prominent characteristic: a directory hierarchy. If we need to search files, we can create the index like this:

PUT /files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer":"path_hierarchy"
        }
      }
    }
  }
}

PUT /files/_mapping/file
{
  "properties": {
    "name": {
      "type": "keyword"
    },
    "path": {
      "type": "keyword",
      "fields": {
        "tree": {
          "type": "text",
          "analyzer": "paths"
        }
      }
    }
  }
}

Note the path_hierarchy tokenizer: it splits /opt/data/log into one token per level of the hierarchy, without a trailing slash:

/opt

/opt/data

/opt/data/log
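The tokenization can be emulated in a few lines of Python. This is a simplified stand-in for the tokenizer with its default settings only (delimiter "/", no skip, reverse, or replacement options), written from my understanding that each emitted token is a path prefix without the trailing delimiter; it is not the Lucene implementation.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every prefix of path that ends just before a delimiter,
    followed by the full path itself."""
    tokens = []
    # Start at index 1 so a leading delimiter does not yield an empty token.
    pos = path.find(delimiter, 1)
    while pos != -1:
        tokens.append(path[:pos])
        pos = path.find(delimiter, pos + 1)
    tokens.append(path)
    return tokens


tokens = path_hierarchy_tokens("/opt/data/log")
print(tokens)  # ['/opt', '/opt/data', '/opt/data/log']
```

Because every ancestor directory becomes its own token, a query for one directory level matches every file stored beneath it.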

Insert one test record:

PUT /files/file/1
{
  "name":"hello.txt",
  "path":"/opt/data/txt/"
}

Search Examples

Search by file name and exact path:

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path": "/opt/data/txt/"
          }
        }
      ]
    }
  }
}

Find hello.txt anywhere under /opt (including subdirectories):

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path.tree": "/opt/"
          }
        }
      ]
    }
  }
}

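Difference between path and path.tree:
path.tree is analyzed, using the paths analyzer built on the path_hierarchy tokenizer.
path is a keyword field: it is not analyzed and is matched as a whole.

The nested object Data Type

The Problem

When a plain object is used to denormalize data and the redundant data is an array, queries can return wrong results. For example, the comments under a blog post form a collection: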

PUT /website/article/5
{
  "title": "清茶豆奶發表的一篇技術帖子",
  "content":  "我是清茶豆奶，大家要不要考慮關注一下Java架構社區啊",
  "tags":  [ "IT技術", "Java架構社區" ],
  "comments": [ 
    {
      "name":    "清茶",
      "comment": "有什麼乾貨沒有啊？",
      "age":     29,
      "stars":   4,
      "date":    "2019-10-29"
    },
    {
      "name":    "豆奶",
      "comment": "我最喜歡研究技術，真好",
      "age":     32,
      "stars":   5,
      "date":    "2019-10-30"
    }
  ]
}

Requirement: find the blog posts commented on by the user 豆奶 aged 29.

GET /website/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comments.name.keyword": "豆奶"
          }
        },
        {
          "match": {
            "comments.age": "29"
          }
        }
      ]
    }
  }
}

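Based on the sample data, this query should return nothing (豆奶 is 32, and the 29-year-old commenter is 清茶), yet the document is returned. Why?

Reason: in the underlying data structure of the object type, the JSON is flattened for storage. For the example above, the stored structure becomes: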

{
	"title":["清茶","豆奶","發表","一篇","技術","帖子"],
	"content": ["我","清茶","豆奶","大家","要不要","考慮","關注","一下","Java架構社區"],
	tags:["IT技術", "Java架構社區"],
	comments.name:["清茶","豆奶"],
	comments.comment:["有","什麼","乾貨","沒有啊","我","最喜歡","研究","技術","真好"],
	comments.age:[29,32],
	comments.stars:[4,5],
	comments.date:["2019-10-29","2019-10-30"]
}

So "豆奶" and 29 are both matched, each from a different comment, which is not the expected result.
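The cross-object match can be reproduced with a small Python simulation. It only models the matching logic (one value list per field versus one object at a time), not how Lucene actually stores data, and the function names are made up for illustration.

```python
comments = [
    {"name": "清茶", "age": 29},
    {"name": "豆奶", "age": 32},
]


def flatten(objects, prefix="comments"):
    """Collapse an array of objects into one value list per field,
    the way the plain object type does."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(prefix + "." + key, []).append(value)
    return flat


def matches_flattened(objects, name, age):
    """Each condition is checked against its own list, so the link
    between a name and an age is lost."""
    flat = flatten(objects)
    return name in flat.get("comments.name", []) and age in flat.get("comments.age", [])


def matches_nested(objects, name, age):
    """Both conditions must hold inside the same object."""
    return any(o["name"] == name and o["age"] == age for o in objects)


print(matches_flattened(comments, "豆奶", 29))  # True: false positive
print(matches_nested(comments, "豆奶", 29))     # False: no single comment matches
```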

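The Solution

Introducing the nested object type solves this problem.
Change the mapping so that comments has the type nested.
The type of an existing field cannot be changed in place, so delete the index first and then recreate it: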

PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name": {"type":"text"},
            "comment": {"type":"text"},
            "age":     {"type":"short"},
            "stars":   {"type":"short"},
            "date":  {"type":"date"}
          }
        }
      }
    }
  }
}

The underlying data structure then becomes:

{
	"title":["清茶","豆奶","發表","一篇","技術","帖子"],
	"content": ["我","清茶","豆奶","大家","要不要","考慮","關注","一下","Java架構社區"],
	tags:["IT技術", "Java架構社區"],
	comments:[
		{
			"name":"清茶",
			"comment":["有","什麼","乾貨","沒有啊"],
			"age":29,
			"stars":4,
			"date":"2019-10-29"
		},
		{
			"name":"豆奶",
			"comment":["我","最喜歡","研究","技術","真好"],
			"age":32,
			"stars":5,
			"date":"2019-10-30"
		}
	]
}

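After re-indexing the sample document, running the query again returns an empty result, as expected. Note that once comments is nested, its fields can only be reached through a nested query with path set to comments; a plain bool query at the root level, like the one above, no longer sees them at all, so a genuine match (for example 豆奶 aged 32) also has to be written as a nested query.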
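For reference, a query that targets nested comments would typically wrap its conditions in a nested clause. The Python sketch below builds such a request body for the mapping above; the helper name is hypothetical, and it uses match on comments.name because that mapping defines name as a text field without a keyword sub-field.

```python
def build_nested_comment_query(name, age):
    """Build a request body that requires name and age to match
    within the same nested comment object."""
    return {
        "query": {
            "nested": {
                "path": "comments",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"comments.name": name}},
                            {"term": {"comments.age": age}},
                        ]
                    }
                },
            }
        }
    }


no_hit_query = build_nested_comment_query("豆奶", 29)  # should find nothing
hit_query = build_nested_comment_query("豆奶", 32)     # should find article 5
print(hit_query)
```

Sent to GET /website/article/_search, the first body should return no hits for the sample document and the second should return it, since only the second pair of conditions holds within a single comment.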

Aggregation Example

Compute the average number of stars per day across blog comments:

GET /website/article/_search
{
  "size": 0,
  "aggs": {
    "comments_path": {
      "nested": {
        "path": "comments"
      }, 
      "aggs": {
        "group_by_comments_date": {
          "date_histogram": {
            "field": "comments.date",
            "interval": "day",
            "format": "yyyy-MM-dd"
          }, 
          "aggs": {
            "stars_avg": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}

Response (abridged):

{
  "aggregations": {
    "comments_path": {
      "doc_count": 2,
      "group_by_comments_date": {
        "buckets": [
          {
            "key_as_string": "2019-10-29",
            "key": 1572307200000,
            "doc_count": 1,
            "stars_avg": {
              "value": 4
            }
          },
          {
            "key_as_string": "2019-10-30",
            "key": 1572393600000,
            "doc_count": 1,
            "stars_avg": {
              "value": 5
            }
          }
        ]
      }
    }
  }
}

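Summary

Centered on practical cases, this article gave a quick tour of the joined-query approaches commonly used in real projects and of the nested object type.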
