Open-Sourcing Querybook: Pinterest's Collaborative Big Data Hub

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An efficient big data solution for an increasingly remote world"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Pinterest has more than 300 billion Pins, and behind that number is an ever-growing, unique dataset that maps the interests, ideas, and intentions of countless people. As a data-driven company, Pinterest uses insights and analysis to make product decisions and evaluations that improve the Pinner experience for more than 450 million monthly active users. To keep making these improvements, especially in today's increasingly remote world, teams need more than ever to compose queries, create analyses, and collaborate efficiently with one another. Querybook is our solution for more efficient and collaborative access to big data, and today we are open-sourcing it for the community."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A common starting point for any analysis at Pinterest is an ad hoc query executed on SparkSQL, Hive, or Presto clusters, or on any SQLAlchemy-compatible engine. We built Querybook to give this kind of analysis a fast, simple web UI in which data scientists, product managers, and engineers can discover the right data, compose their queries, and share their findings. In this post we discuss the motivation for building Querybook, its features and architecture, and our work to open-source the project."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"The Journey"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The proposal to create Querybook dates to 2017, when it started as an internal project. At the time we used a vendor-supplied web application as our query UI. Users frequently complained about that tool's UI, its speed and stability, its lack of visualizations, and how hard it was to share work. Before long we realized there was a strong need for a better query interface."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"While settling the technical details, we began interviewing data scientists and engineers about their workflows. We soon learned that most of them organized their queries outside the official tool, and many used apps such as Evernote. Jupyter offers a notebook user experience of its own, but it requires Python or R, and its lack of table metadata integration put many users off. Based on this finding, our team decided that Querybook's query interface would be a document in which users can compose queries and write up analyses in one place, combining table metadata with a simple note-taking experience."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Querybook was released internally in March 2018 and became the official solution for querying big data at Pinterest. Today it averages 500 daily active users (DAU) and 7k daily query runs. Its internal user rating is 8.1\/10, making it one of the highest-rated internal tools at Pinterest."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Feature Highlights"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c7\/c7abfd6a9918e1ce898cac13655ccd94.gif","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Figure 1: Querybook's Doc UI"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"On their first visit, users quickly notice the distinctive DataDoc interface. This is the main place where users run queries and perform analysis. Each DataDoc consists of a series of cells, and each cell is one of three types: text, query, or chart."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Text cells come with built-in rich-text support so users can jot down their thoughts or insights."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Query cells are used to compose and execute queries."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Chart cells are used to create visualizations from execution results. As in Google Docs, users who have been granted access to a DataDoc can collaborate on it in real time."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the intuitive charting UI, users can easily turn a DataDoc into a dashboard that presents their results. You can choose from several visualization options, such as time series, pie charts, and scatter plots. You can then connect a visualization to the results of any query in the DataDoc and preprocess them with sorting and aggregation as needed. To update these charts automatically, use the scheduling option and pick the cadence you want. The scheduler can notify users of success or failure. Combined with the templating provided by Jinja, a live-updating DataDoc is quick to create."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The scheduling and visualization features are not meant to replace tools such as Airflow or Superset. Instead, they give users a simple, fast way to experiment with and iterate on their queries. Pinterest engineers typically use Querybook as the first step for composing queries, before creating production-grade workflows and dashboards."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Last but not least, Querybook comes with an automated query analysis system. It analyzes every executed query to extract metadata such as the referenced tables and the query runners. Querybook uses this information to automatically update its data schemas and search rankings, and to show each table's frequent users and example queries. The more queries are run, the better documented the tables become."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Architecture"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c9\/c94af196d4c05044ee50291a5e9bb3db.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Figure 2: 
Querybook architecture overview"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"To understand how Querybook works, let's walk through the process of composing and executing a query."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The first step is to create a DataDoc and write the query in a cell. As the user types, the query is streamed to the server over Socket.IO."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The server then pushes these deltas, via Redis, to all users who are reading that DataDoc. At the same time, the server saves the updated DataDoc to the database and creates an asynchronous job for a worker to update the DataDoc's content in Elasticsearch so that it can be searched later."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Once the query is written, the user clicks the run button to execute it. The server then creates a record in the database and inserts a query job into the Redis task queue. The worker picks up the task and sends the query to the query engine (Presto, Hive, SparkSQL, or any SQLAlchemy-compatible engine). While the query is running, the worker pushes live updates to the UI over Socket.IO."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}
],"text":"When execution completes, the worker loads the query results and uploads them in batches to a configurable storage service (for example, S3). Finally, the browser is notified that the query has finished and sends a request to the server to load the results and display them to the user."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For brevity, this section follows only one user flow through Querybook, but that flow touches all of the infrastructure Querybook uses. Querybook lets users customize some of these pieces. For example, you can choose to upload execution results to S3, Google Cloud Storage, or a local file. In addition, MySQL can be swapped for any SQLAlchemy-compatible database, such as Postgres."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The Path to Open Source"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"After seeing Querybook's success internally, we decided to open-source it. One challenge was making it suitable for general use while keeping some Pinterest-specific integrations. To do this, we decided on a two-layer organization built around a plugin system, and we added an Admin UI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the Admin UI, other companies can configure Querybook's query engines, table metadata ingestion, and access permissions through a single friendly interface. Previously, this configuration lived in config files, and changing it required a code change and a deployment to take effect. With the new UI, admins can make live changes to Querybook without touching code or config files."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d7\/d7dc5a64baecbc9263f2ad60de738744.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Figure 3: 
Admin UI"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The plugin system uses Python's importlib to integrate Querybook with Pinterest's internal systems. Developers can use it to configure authentication, customize query engines, and implement exporters to internal sites. The customization that the plugin system provides lets Querybook be optimized for users' workflows at Pinterest while keeping the open-source project suitable for general use."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"You can find more of Querybook's features and documentation at Querybook.org, and you can reach us at [email protected]."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Original article: "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/open-sourcing-querybook-pinterests-collaborative-big-data-hub-ba2605558883","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/medium.com\/pinterest-engineering\/open-sourcing-querybook-pinterests-collaborative-big-data-hub-ba2605558883"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]}
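The article says the plugin system is built on Python's importlib but does not show how. The sketch below is one plausible shape for such a loader, not Querybook's actual plugin API: the module name `my_company_plugins`, the attribute `EXTRA_ENGINES`, the helper `load_plugin_attr`, and the executor names are all hypothetical and exist only for illustration. The idea is that the core application tries to import an optional plugin module and falls back to its built-in defaults when none is installed.

```python
import importlib
import sys
import types


def load_plugin_attr(module_name, attr_name, default):
    """Return `attr_name` from the plugin module, or `default` if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # No plugin installed: fall back to the built-in behaviour.
        return default
    return getattr(module, attr_name, default)


# Built-in engines shipped with the (hypothetical) core application.
DEFAULT_ENGINES = {"presto": "PrestoExecutor", "hive": "HiveExecutor"}

# Simulate a company-specific plugin module so the sketch is self-contained.
# In a real deployment this would be a package on the Python path.
_plugin = types.ModuleType("my_company_plugins")
_plugin.EXTRA_ENGINES = {"internal_db": "InternalDbExecutor"}
sys.modules["my_company_plugins"] = _plugin

extra = load_plugin_attr("my_company_plugins", "EXTRA_ENGINES", {})
missing = load_plugin_attr("no_such_plugin_module", "EXTRA_ENGINES", {})

# Plugin entries extend (and may override) the defaults.
ALL_ENGINES = {**DEFAULT_ENGINES, **extra}
print(sorted(ALL_ENGINES))
```

With this pattern the open-source core never refers to company-specific code directly, which matches the two-layer organization the article describes: the generic layer ships defaults, and the optional plugin layer supplies overrides.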