Building the Youdao Premium Courses Data Middle Platform on Apache Doris | Open Source Case Library

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article shares how the architecture of the Youdao Premium Courses data middle platform evolved, and how Doris, as an MPP analytical database, has supported a growing business volume and made its data more useful. The discussion starts from "},{"type":"text","marks":[{"type":"strong"}],"text":"our experience selecting a real-time data warehouse"},{"type":"text","text":", then focuses on the problems we encountered while using Doris and the adjustments and optimizations the Doris technical team made in response."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"1. Background"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1.1 Business Scenarios"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Driven by business requirements, the data layer of Youdao Premium Courses is currently split into an offline part and a real-time part."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The offline system mainly handles event-tracking data and computes it on a schedule in batch jobs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The real-time streaming data comes mainly from the data streams that business systems produce in real time and from database change logs. Its processing has to account for accuracy, timeliness, and the time-ordered nature of the data, which makes it very complex."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Relying on its real-time computing capability, the Youdao Premium Courses data middle platform team mainly takes on real-time data processing within the overall data architecture, and also provides a real-time data synchronization service to the downstream offline data warehouse."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main user roles served by the data middle platform, and their data needs, are as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/94kKZtOp8zg
9O3Li.png!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Operations\/strategy staff and managers mainly look at the overall picture of students and query real-time aggregates at the course level from the data middle platform"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tutors\/sales staff mainly care about the real-time detail data of the students they serve"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Quality control mainly reviews overall data across the course\/teacher\/tutor dimensions through T+1 offline reports"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data analysts run interactive analysis on the data that the data middle platform syncs to the offline data warehouse on a T+1 basis"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1.2 
Early System Architecture of the Data Middle Platform and Business Pain Points"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/UoOWIWWvQNhqRRXy.png!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure above shows, real-time data storage in the data middle platform 1.0 architecture relied mainly on Elasticsearch, and we ran into the following problems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Aggregation queries were inefficient"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The data compression ratio was low"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Joins across indexes were not supported, so our business design had to fall back on many large wide tables"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Standard SQL was not supported, so queries were costly to write"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2. Real-Time Data Warehouse Selection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given these pain points, we began evaluating real-time data warehouses. At the time we looked at Doris, ClickHouse, TiDB+TiFlash, Druid, and Kylin."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
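OLAP engine
Advantages
Disadvantages
Doris
1. Compatible with the MySQL protocol
2. Supports Online Schema Change
3. Supports updates
4. Automated cluster scale-out and scale-in
5. Supports time-based partitioning and hot/cold data separation
1. Open-sourced relatively late and currently still incubating
ClickHouse
1. Strong single-node performance
2. Vectorized engine
3. High data compression ratio
1. Does not support standard SQL
2. No automatic rebalance when the cluster is scaled out or in
3. Poor support for updates
4. Relatively high operations cost
TiDB+TiFlash
1. Compatible with the MySQL protocol
2. Vectorized engine
3. Easy synchronization between business data and analytical data (internal Raft replication)
1. TiFlash is not open source
2. Few companies have adopted it in production
3. Architecture is oriented mainly toward TP workloads
Druid
1. Time-based partitioning, with fast queries on aggregated data
2. Supports hot/cold data separation
1. Does not support storing detail data
2. Does not support standard SQL
Kylin
1. Supports standard SQL queries
2. Supports pre-aggregation
3. Healthy community growth
1. Many dependencies
2. Weak support for detail queries
3. High resource consumption"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}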