Data Architecture: Concepts and Hot/Cold Data Separation

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cover image from the 51CTO.com article: ","attrs":{}},{"type":"link","attrs":{"href":"http://www.100ec.cn/detail--6069294.html","title":"","type":null},"content":[{"type":"text","text":"Taobao's Big Data Analytics Prepares for Double Eleven","attrs":{}}]},{"type":"text","text":". Please credit the source when quoting.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"1 What Is Data Architecture","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Everyone has some understanding of architecture. When designing the architecture for a business or project, the work typically covers business architecture and technical architecture, and technical architecture is the main focus for those of us in development roles. Another common classification, however, distinguishes four types: business architecture, technical architecture, data architecture, and application architecture.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data architecture management covers managed objects, management processes, and the management organization; the managed objects in turn include data standards, data models, databases, and data quality. In short, data architecture means that a designated management organization manages data objects through a defined set of management processes. The following figure shows the components of data architecture:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f5/f50e16e0f52838740077310c15ac4f95.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2 Why Data Architecture Is Needed","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“Experience comes from practice.” Anyone who has been through the life cycle of one or more medium-to-large projects or products will probably recognize this pattern. Early on, the main goal is to go live as quickly as possible in order to validate the idea, so the architecture design includes a data structure part but does not elaborate it. Once the project grows rapidly, frequent changes to table structures and data types cause a series of problems, especially after database splitting or table sharding, and several typical data problems follow:","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.1 Inconsistent Data Standards","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Same column name, different data types;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Same column name and data type, different lengths;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"No unified standard for column names, which makes columns hard to identify;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inconsistent column names, types, and lengths all at once;","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.2 Disorganized Data Models","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tables and fields lack comments;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tables have no primary key, and columns allow NULL;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Relationships between tables are unclear;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unjustified redundancy in the design;","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.3 Performance Problems","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Table structures and indexes are misunderstood or misused;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SQL quality depends on each developer's skill; poorly written SQL that reaches production without review causes performance problems.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.4 Lack of Data Security Management","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Table structure standards;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Index design rationale and creation checks;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SQL quality;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data security management (inserts, deletes, updates, and bulk queries).","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"3 The Data Architecture Life Cycle","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The focus of data architecture is the standardized handling of data, which runs through the entire life cycle of a system or project, including the data architecture design, development, migration, and testing phases.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/195f606a5440f8940cece2de9370d0f7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"4 Hot/Cold Data Separation","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.1 Storing Large Data Volumes","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For storing large volumes of data, database and table sharding is a common approach, and a range of sharding techniques and middleware can implement it. One question remains, though: at what single-table data volume should sharding be carried out?","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.2 Origin of the 20-Million-Row Single-Table Limit","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A long-standing claim holds that MySQL performance degrades noticeably once a single table exceeds 20 million rows. The article ","attrs":{}},{"type":"link","attrs":{"href":"https://zhuanlan.zhihu.com/p/370031862","title":"","type":null},"content":[{"type":"text","text":"Hot/Cold Separation: OTS Table Store in Practice","attrs":{}}]},{"type":"text","text":" traces the source: the claim is said to have originated at Baidu. “The story goes roughly like this: when a DBA tested MySQL performance back then, SQL operations slowed sharply once a single table reached the 20-million-row range, and the conclusion was drawn from that. Baidu engineers are then said to have moved to other companies in the industry and taken this information with them, so the claim spread across the industry. Later, Alibaba's Java Development Manual stated that sharding is recommended only when a single table exceeds 5 million rows or 2 GB in size. With Alibaba's golden rule behind it, many people designing large-scale data storage take this as the standard for splitting tables.”","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the underlying principle, that article is worth reading closely. In short, the claim stems from the storage and index structure of MySQL's InnoDB engine: with too many records, the B+ tree grows taller, each lookup needs more I/O operations, and performance drops noticeably.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.3 Hot/Cold Separation","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3.1 Classifying Data as Hot or Cold","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, in the vast majority of scenarios, data can be divided into “cold data” and “hot data”. The division can be based on how recent the data is, on hot versus non-hot users, and so on. In past projects, for example, users typically accessed only data from a recent period, such as the last week or month. If the data is not divided, some loss in performance and cost is inevitable.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3.2 Benefits of Hot/Cold Separation","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A well-designed hot/cold separation brings the following benefits:","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Smaller single-table data volume, ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"better single-table performance","attrs":{}},{"type":"text","text":";","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Large amounts of cold business data move to cold storage, ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"which can cut storage costs substantially, by at least 50%","attrs":{}},{"type":"text","text":".","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"5 Hot/Cold Separation Design","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The design has to cover the storage scheme and the data migration scheme. In addition, historical queries need support for aggregate queries and automatic routing between hot and cold data.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5.1 Storage Scheme","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Storage schemes fall into on-premises and cloud options. On premises, the storage medium is usually a hard disk, but mechanical hard disks are constrained by disk space and I/O bottlenecks, which is also the main reason for the single-table limit. For better performance, solid-state drives (SSDs) are therefore commonly used. SSDs cost far more than mechanical hard disks, however, so they are not suited to storing massive amounts of data, and options such as disk arrays then come into consideration. Tape is another option, but it is suitable only for long-term retention of historical data and for data recovery when necessary; it is not suited to queries.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If a cloud solution is acceptable, the options include cloud disks as the database's storage medium, or the hot and cold storage services offered by cloud providers (blob storage, table storage). Alibaba Cloud's OTS is one table storage implementation; its technical architecture is shown in the following figure:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d42ae92995a1cc93ae5886f0932f9e15.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5.2 Migration Scheme","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because data is divided into hot and cold, it has a boundary and a life cycle. Newly written data is “hot”; when it reaches a certain point in time or a preset threshold, it has to be migrated to “cold” storage. This raises several questions:","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Marking data as hot or cold","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Migration method and timeliness guarantees","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data consistency","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, to keep the business system stable during hot/cold data migration, the migration must support ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"gradual (canary) rollout","attrs":{}},{"type":"text","text":" [to limit impact and detect problems early] and ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"one-click rollback","attrs":{}},{"type":"text","text":" [to stop the damage quickly].","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Summary","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article introduced the concept and significance of data architecture and hot/cold data separation, and described hot/cold separation schemes and points to watch. It serves as an overview; later articles in the series will explore data architecture further through real cases.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"References","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://blog.sina.com.cn/s/blog_4d22b9720102xhrr.html","title":"","type":null},"content":[{"type":"text","text":"Data Architecture: What Is Data Architecture","attrs":{}}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}
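The migration questions raised in section 5.2 (marking data as hot or cold, moving it, and keeping it consistent) and the query routing mentioned in section 5 can be illustrated with a small sketch. It is not taken from the article: the 30-day `HOT_WINDOW`, the `Record` type, the in-memory dictionaries standing in for the hot and cold stores, and the function names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # assumed threshold, not from the article


@dataclass(frozen=True)
class Record:
    record_id: int
    created_at: datetime


def is_cold(record: Record, now: datetime) -> bool:
    """A record counts as cold once it is older than the hot window."""
    return now - record.created_at > HOT_WINDOW


def migrate(hot: dict, cold: dict, now: datetime) -> int:
    """Copy cold records into the cold store, then delete them from hot.

    Copying before deleting keeps every record readable from at least
    one store if the job stops midway. Returns the number moved.
    """
    to_move = [r for r in hot.values() if is_cold(r, now)]
    for r in to_move:
        cold[r.record_id] = r   # step 1: write to cold storage
    for r in to_move:
        del hot[r.record_id]    # step 2: remove from hot storage
    return len(to_move)


def lookup(record_id: int, hot: dict, cold: dict):
    """Route a read: try hot storage first, then fall back to cold."""
    if record_id in hot:
        return hot[record_id]
    return cold.get(record_id)
```

In a real system the two dictionaries would be replaced by, for example, a MySQL table and a table store, and the copy and delete steps would run in batches behind a switch so that the migration can be rolled out gradually and rolled back.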