Open Sourcing Querybook: Pinterest's Collaborative Big Data Hub

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An efficient big data solution for an increasingly remote world"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Pinterest has more than 300 billion Pins, and behind that number is a growing, unique dataset that maps the interests, ideas, and intentions of countless people. As a data-driven company, Pinterest uses data insights and analytics to make product decisions and evaluations that improve the Pinner experience for more than 450 million monthly active users. To keep making these improvements, especially in today's increasingly remote world, teams need to query data, build analyses, and collaborate efficiently with one another more than ever before. Querybook is our solution for more efficient, more collaborative big data access, and today we are open sourcing it to the community."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Whatever the analysis at Pinterest, a common starting point is an ad hoc query that can run on SparkSQL, Hive, or Presto clusters, or on any SQLAlchemy-compatible engine. We built Querybook to give this kind of analysis a fast, simple web UI, so that data scientists, product managers, and engineers can discover the right data, compose their queries, and share their findings. In this post, we discuss the motivation for building Querybook, its features and architecture, and our work to open source the project."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"The Journey"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The proposal to create Querybook dates back to 2017, when it started as an internal project. At the time, we used a vendor-supplied web application as our query UI. Users frequently complained about that tool's UI, its speed and stability, its lack of visualizations, and how hard it was to share work. Before long, we realized that people badly needed a better query interface."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"While working out the technical details, we began interviewing data scientists and engineers about their workflows. We soon realized that most of them organized their queries outside the official tool, and many used apps such as Evernote. Although Jupyter offers its own notebook user experience, it requires Python\/R, and its lack of table metadata integration turned many users away. Based on this finding, our team decided that Querybook's query interface would be a document in which users can compose queries and write up analyses in one place, combining metadata with a simple note-taking app."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Querybook was released internally in March 2018 and became the official solution for querying big data at Pinterest. Today, Querybook averages 500 daily active users (DAU) and 7k daily query runs. Its internal user rating is 8.1\/10, making it one of the highest-rated internal tools at Pinterest."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Feature Highlights"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c7\/c7abfd6a9918e1ce898cac13655ccd94.gif","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Figure 1 Querybook's Doc UI"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When users first visit, they quickly notice the distinctive DataDoc interface. This is the primary place where users run queries and do analysis. Each DataDoc consists of a series of cells, and each cell is one of three types: text, query, or chart."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Text cells come with built-in rich text support so that users can jot down their thoughts or insights."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Query cells are used to compose and execute queries."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Chart cells are used to create visualizations from execution results. As in Google Docs, users who have been granted access to a DataDoc can collaborate on it together in real time."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the intuitive chart UI, users can easily turn a DataDoc into a dashboard for presenting their work. You can choose from a variety of visualization options, such as time series, pie charts, and scatter plots. You can then connect a visualization to the results of any query in the DataDoc and preprocess them with sorting and aggregation as needed. To refresh these charts automatically, you can use the scheduling option and pick the cadence you want. The scheduler can notify users of success or failure. Combined with the templating options provided by Jinja, building a live-updating DataDoc is very fast."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Scheduling and visualization are not meant to replace tools such as Airflow or Superset; rather, they give users a simple, quick way to experiment with and iterate on their queries. Pinterest engineers typically use Querybook as the first step for composing queries before creating production-grade workflows and dashboards."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Last but not least, Querybook comes with an automated query analysis system. It analyzes every executed query to extract metadata such as the referenced tables and the query runner. Querybook uses this information to automatically update its data schema and search ranking, and to show each table's frequent users and example queries. The more queries are run, the better documented the tables become."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Architecture"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c9\/c94af196d4c05044ee50291a5e9bb3db.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Figure 2 
Overview of the Querybook architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"To understand how Querybook works, let's walk through the process of composing and executing a query."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The first step is to create a DataDoc and write a query in a cell. As the user types, the query is streamed to the server through Socket.IO."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The server then pushes these deltas, via Redis, to every user viewing that DataDoc. At the same time, the server saves the updated DataDoc to the database and creates an asynchronous job for a worker to update the DataDoc's content in Elasticsearch so that it can be searched later."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Once the query is written, the user can click the run button to execute it. The server then creates a record in the database and inserts a query job into the Redis task queue. The worker picks up the task and sends the query to the query engine (Presto, Hive, SparkSQL, or any SQLAlchemy-compatible engine). While the query runs, the worker pushes live updates to the UI through Socket.IO."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}
],"text":"When execution completes, the worker loads the query results and uploads them in batches to a configurable storage service (for example S3). Finally, the browser receives a notification that the query has completed and sends a request to the server to load the query results and display them to the user."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For brevity, this section covers only one Querybook user flow, but that flow touches all of the infrastructure Querybook uses. Querybook lets users customize some of these pieces. For example, you can choose to upload execution results to S3, Google Cloud Storage, or a local file. In addition, MySQL can be swapped for any SQLAlchemy-compatible database (for example Postgres)."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The Road to Open Source"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"After seeing Querybook's success internally, we decided to open source it. One challenge we faced was how to make it suitable for general use while retaining some Pinterest-specific integrations. To address this, we decided on a two-layer organization built on a plugin system, and we added an Admin UI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the Admin UI, other companies can configure Querybook's query engines, table metadata ingestion, and access permissions through a single friendly interface. Previously, this configuration lived in config files, and changes required a code change and a deployment to take effect. With the new UI, administrators can make live changes to Querybook without touching code or config files."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d7\/d7dc5a64baecbc9263f2ad60de738744.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Figure 3 
Admin UI"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The plugin system uses Python's importlib to integrate Querybook with Pinterest's internal systems. Developers can use the plugin system to configure authentication, customize query engines, and implement exporters to internal sites. The custom behavior that the plugin system provides lets Querybook be optimized for users' workflows at Pinterest while keeping the open source project suitable for general use."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"You can find more Querybook features and documentation at Querybook.org, and you can reach us at [email protected]."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Original article: "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/open-sourcing-querybook-pinterests-collaborative-big-data-hub-ba2605558883","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/medium.com\/pinterest-engineering\/open-sourcing-querybook-pinterests-collaborative-big-data-hub-ba2605558883"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]}
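
The article says the plugin system is built on Python's importlib but does not show how. The sketch below is one plausible shape for that idea, not Querybook's actual plugin API: the module name `my_company_plugins`, the `EXPORTERS` attribute, the `DEFAULT_EXPORTERS` list, and the helper `load_plugin_attr` are all hypothetical names chosen for illustration. It tries to import an optional plugin module and falls back to built-in defaults when the module or attribute is missing, which is roughly how a generic open source core could coexist with company-specific integrations.

```python
import importlib
import sys
import types


def load_plugin_attr(module_name, attr_name, default):
    """Return `attr_name` from the plugin module, or `default` if the
    module cannot be imported or does not define the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # No plugin installed: keep the open-source default behavior.
        return default
    return getattr(module, attr_name, default)


# Hypothetical built-in exporters that ship with the generic build.
DEFAULT_EXPORTERS = ["csv"]

# Simulate a company-specific plugin package by registering a module in
# sys.modules; a real deployment would place a package on PYTHONPATH.
fake_plugin = types.ModuleType("my_company_plugins")
fake_plugin.EXPORTERS = ["csv", "internal_wiki"]
sys.modules["my_company_plugins"] = fake_plugin

# The plugin overrides the default list when it is present...
with_plugin = load_plugin_attr("my_company_plugins", "EXPORTERS", DEFAULT_EXPORTERS)
# ...and the default is used when no such package exists.
without_plugin = load_plugin_attr("no_such_plugin_package", "EXPORTERS", DEFAULT_EXPORTERS)

print(with_plugin)
print(without_plugin)
```

Keeping the lookup behind one helper means the core code never needs to know whether a plugin exists, so the same code path can serve both an internal deployment and the public release.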