使用Kafka,如何成功遷移SQL數據庫中超過20億條記錄?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的一個客戶遇到了一個MySQL問題,他們有一張大表,這張表有20多億條記錄,而且還在不斷增加。如果不更換基礎設施,就有磁盤空間被耗盡的風險,最終可能會破壞整個應用程序。而且,這麼大的表還存在其他問題:糟糕的查詢性能、糟糕的模式設計,因爲記錄太多而找不到簡單的方法來進行數據分析。我們希望有這麼一個解決方案,既能解決這些問題,又不需要引入高成本的維護時間窗口,導致應用程序無法運行以及客戶無法使用系統。在這篇文章中,我將介紹我們的解決方案,但我還想提醒一下,這並不是一個建議:不同的情況需要不同的解決方案,不過也許有人可以從我們的解決方案中得到一些有價值的見解。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"雲解決方案會是解藥嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在評估了幾個備選解決方案之後,我們決定將數據遷移到雲端,我們選擇了Google Big Query。我們之所以選擇它,是因爲我們的客戶更喜歡谷歌的雲解決方案,他們的數據具有結構化和可分析的特點,而且不要求低延遲,所以BigQuery似乎是一個完美的選擇。經過測試,我們確信Big Query是一個足夠好的解決方案,能夠滿足客戶的需求,讓他們能夠使用分析工具,可以在幾秒鐘內進行數據分析。但是,正如你可能已經知道的那樣,對BigQuery進行大量查詢可能會產生很大的開銷,因此我們希望避免直接通過應用程序進行查詢,我們只將BigQuery作爲分析和備份工具。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/06\/72\/06343yy3b7527837c69ee810f6680672.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"將數據流到雲端"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"說到流式傳輸數據,有很多方法可以實現,我們選擇了非常簡單的方法。我們使用了Kafka,因爲我們已經在項目中廣泛使用它了,所以不需要再引入其他的解決方案。Kafka給了我們另一個優勢——我們可以將所有的數據推到Kafka上,並保留一段時間,然後再將它們傳輸到目的地,不會給MySQL集羣增加很大的負載。如果BigQuery引入失敗(比如執行請求查詢的成本太高或太困難),這個辦法爲我們提供了某種退路。這是一個重要的決定,它給我們帶來了很多好處,而開銷很小。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"將數據從MySQL流到Kafka"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於如何將數據從MySQL流到Kafka,你可能會想到Debezium("},{"type":"link","attrs":{"href":"https:\/\/debezium.io","title":"","type":null},"content":[{"type":"text","text":"https:\/\/debezium.io"}]},{"type":"text","text":")或Kafka Connect。這兩種解決方案都是很好的選擇,但在我們的案例中,我們沒有辦法使用它們。MySQL服務器版本太老了,Debezium不支持,升級MySQL升級也不是辦法。我們也不能使用Kafka Connect,因爲表中缺少自增列,Kafka Connect就沒辦法保證在傳輸數據時不丟失數據。我們知道有可能可以使用時間戳,但這種方法有可能會丟失部分數據,因爲Kafka查詢數據時使用的時間戳精度低於表列中定義的精度。當然,這兩種解決方案都很好,如果在你的項目中使用它們不會導致衝突,我推薦使用它們將數據庫裏的數據流到Kafka。在我們的案例中,我們需要開發一個簡單的Kafka生產者,它負責查詢數據,並保證不丟失數據,然後將數據流到Kafka,以及另一個消費者,它負責將數據發送到BigQuery,如下圖所示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/de\/6d\/dec1f3da6688f23d39665895cc4a0a6d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"將數據流到BigQuery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"通過分區來回收存儲空間"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將所有數據流到Kafka(爲了減少負載,我們使用了數據過濾),然後再將數據流到BigQuery,這幫我們解決了查詢性能問題,讓我們可以在幾秒鐘內分析大量數據,但空間問題仍然存在。我們想設計一個解決方案,既能解決現在的問題,又能在將來方便使用。我們爲數據表準備了新的schema,使用序列ID作爲主鍵,並將數據按月份進行分區。對大表進行分區,我們就能夠備份舊分區,並在不再需要這些分區時將其刪除,回收一些空間。因此,我們用新schema創建了新表,並使用來自Kafka的數據來填充新的分區表。在遷移了所有記錄之後,我們部署了新版本的應用程序,它向新表進行插入,並刪除了舊錶,以便回收空間。當然,爲了將舊數據遷移到新表中,你需要有足夠的空閒可用空間。不過,在我們的案例中,我們在遷移過程中不斷地備份和刪除舊分區,確保有足夠的空間來存儲新數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/4c\/20\/4cff483fc68a675a88975762e98a7720.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"將數據流到分區表中"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"通過整理數據來回收存儲空間"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在將數據流到BigQuery之後,我們就可以輕鬆地對整個數據集進行分析,並驗證一些新的想法,比如減少數據庫中表所佔用的空間。其中一個想法是驗證不同類型的數據是如何在表中分佈的。後來發現,幾乎90%的數據是沒有必要存在的,所以我們決定對數據進行整理。我開發了一個新的Kafka消費者,它將過濾掉不需要的記錄,並將需要留下的記錄插入到另一張表。我們把它叫作整理表,如下所示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/4f\/0a\/4f50d2bf61670331897ba14969e5b70a.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過整理,類型A和B被過濾掉了:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/ae\/27\/ae7264bf033fb163e9c9bcd4865de327.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/57\/9b\/5768deb8a7cc3ccfc8468339239b319b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"將數據流入新表"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整理好數據之後,我們更新了應用程序,讓它從新的整理表讀取數據。我們繼續將數據寫入之前所說的分區表,Kafka不斷地從這個表將數據推到整理表中。正如你所看到的,我們通過上述的解決方案解決了客戶所面臨的問題。因爲使用了分區,存儲空間不再是個問題,數據整理和索引解決了應用程序的一些查詢性能問題。最後,我們將所有數據流到雲端,讓我們的客戶能夠輕鬆對所有數據進行分析。由於我們只對特定的分析查詢使用BigQuery,而來自用戶其他應用程序的相關查詢仍然由MySQL服務器處理,所以開銷並不會很高。另一點很重要的是,所有這些都是在沒有停機的情況下完成的,因此客戶不會受到影響。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總的來說,我們使用Kafka將數據流到BigQuery。因爲將所有的數據都推到了Kafka,我們有了足夠的空間來開發其他的解決方案,這樣我們就可以爲我們的客戶解決重要的問題,而不需要擔心會出錯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/blog.softwaremill.com\/our-way-of-dealing-with-more-than-2-billion-records-in-sql-database-99deaff0d31","title":"","type":null},"content":[{"type":"text","text":"https:\/\/blog.softwaremill.com\/our-way-of-dealing-with-more-than-2-billion-records-in-sql-database-99deaff0d31"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章