键 | 说明 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ws:user:clients:${uid} | 存储用户和 WebSocket 连接的关系,采用有序集合方式存储 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ws:guid:clients:${guid} | 存储文件和 WebSocket 连接的关系,采用有序结合方式存储 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ws:client:${socket.id} | 存储当前 WebSocket 连接下的全部用户和文件关系数据,采用 Redis Hash 方式进行存储,对应 key 为 user 和 guid"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由客户端触发或组件服务触发的消息推送,通过 Redis 存储的数据结构,在 WS-API 服务查询到返回消息体的目标客户端的 Socket ID,再有 WS-Gateway 服务进行集群消费,如果 Socket ID 不在当前节点,则需要进行节点与会话关系的查询,找到客端户 Socket ID 实际对应的 WS-Gateway 节点,通常有以下两种方案:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
|
石墨文档Websocket百万长连接技术实践
{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Web 服务端推送技术经过了长轮询、短轮询的发展,最终到 HTML5 标准带来的 WebSocket 规范逐步成为了目前业内主流技术方案。它使得消息推送、消息通知等功能的实现变得异常简单,那么百万级别连接下的 Websocket 网关该如何实践呢?本文整理自石墨文档资深工程师杜旻翔在重构石墨websocket网关的技术实践。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"1 引言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在石墨文档的部分业务中,例如文档分享、评论、幻灯片演示和文档表格跟随等场景,涉及到多客户端数据同步和服务端批量数据推送的需求,一般的 HTTP 协议无法满足服务端主动 Push 数据的场景,因此选择采用 WebSocket 方案进行业务开发。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着石墨文档业务发展,目前日连接峰值已达百万量级,日益增长的用户连接数和不符合目前量级的架构设计导致了内存和 CPU 使用量急剧增长,因此我们考虑对网关进行重构。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2 网关 1.0"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网关 1.0 是使用 Node.js 基于 Socket.IO 进行修改开发的版本,很好的满足了当时用户量级下的业务场景需求。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.1 架构"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网关 1.0 版本架构设计图:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/29\/29ecb547e8148d8b00938726ffa28b81.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网关 1.0 客户端连接流程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"用户通过 NGINX 连接网关,该操作被业务服务感知;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"业务服务感知到用户连接后,会进行相关用户数据查询,再将消息 Pub 到 Redis;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"网关服务通过 Redis Sub 收到消息;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"查询网关集群中的用户会话数据,向客户端进行消息推送。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.2 痛点"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"虽然 1.0 版本的网关在线上运行良好,但是不能很好的支持后续业务的扩展,并且有以下几个问题需要解决:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"资源消耗:Nginx 仅使用证书,大部分请求被透传,产生了一定的资源浪费,同时之前的 Node 网关性能不好,消耗大量的 CPU、内存。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"维护与观测:未接入石墨的监控体系,无法和现有监控告警联通,维护上存在一定的困难;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"业务耦合问题:业务服务与网关功能被集成到了同一个服务中,无法针对业务部分性能损耗进行针对性水平扩容,为了解决性能问题,以及后续的模块扩展能力,都需要进行服务解耦。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"3 网关 2.0"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网关 2.0 需要解决很多问题:石墨文档内部有很多组件:文档、表格、幻灯片和表单等等。在 1.0 版本中组件对网关的业务调用可以通过:Redis、Kafka 和 HTTP 接口,来源不可查,管控困难。此外,从性能优化的角度考虑也需要对原有服务进行解耦合,将 1.0 版本网关拆分为网关功能部分和业务处理部分,网关功能部分为 WS-Gateway:集成用户鉴权、TLS 证书验证和 WebSocket 连接管理等;业务处理部分为 WS-API:组件服务直接与该服务进行 gRPC 通信。可针对具体的模块进行针对性扩容;服务重构加上 Nginx 移除,整体硬件消耗显著降低;服务整合到石墨监控体系。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.1 整体架构"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网关 2.0 版本架构设计图:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4d\/4d9f02fdb5d3f82329a2208e3e823400.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网关 2.0 客户端连接流程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"客户端与 WS-Gateway 服务通过握手流程建立 WebSocket 连接;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"连接建立成功后,WS-Gateway 服务将会话进行节点存储,将连接信息映射关系缓存到 Redis 中,并通过 Kafka 向 WS-API 推送客户端上线消息;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"WS-API 通过 Kafka 接收客户端上线消息及客户端上行消息;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"WS-API 服务预处理及组装消息,包括从 Redis 获取消息推送的必要数据,并进行完成消息推送的过滤逻辑,然后 Pub 消息到 Kafka;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"WS-Gateway 通过 Sub Kafka 来获取服务端需要返回的消息,逐个推送消息至客户端。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.2 握手流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"网络状态良好的情况下,完成如下图所示步骤 1 到步骤 6 之后,直接进入 WebSocket 流程;网络环境较差的情况下,WebSocket 的通信模式会退化成 HTTP 方式,客户端通过 POST 方式推送消息到服务端,再通过 GET 长轮询的方式从读取服务端返回数据。客户端初次请求服务端连接建立的握手流程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b7\/b7f2855975aa21eb91030197925db441.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Client 发送 GET 请求尝试建立连接;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Server 返回相关连接数据,sid 为本次连接产生的唯一 Socket ID,后续交互作为凭证;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"{\"sid\":\"xxx\",\"upgrades\":[\"websocket\"],\"pingInterval\":xxx,\"pingTimeout\":xxx}"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Client 携带步骤 2 中的 sid 参数再次请求;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Server 返回 40,表示请求接收成功;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Client 发送 POST 请求确认后期降级通路情况;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Server 返回 ok,此时第一阶段握手流程完成;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"尝试发起 WebSocket 连接,首先进行 2probe 和 3probe 的请求响应,确认通信通道畅通后,即可进行正常的 WebSocket 通信。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.3 TLS 内存消耗优化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"客户端与服务端连接建立采用的 wss 协议,在 1.0 版本中 TLS 证书挂载在 Nginx 上,HTTPS 握手过程由 Nginx 完成,为了降低 Nginx 的机器成本,在 2.0 版本中我们将证书挂载到服务上,通过分析服务内存,如下图所示,TLS 握手过程中消耗的内存占了总内存消耗的大概 30% 左右。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a4\/a425ea15769e24313301af1757daf747.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这个部分的内存消耗无法避免,我们有两个选择:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"采用七层负载均衡,在七层负载上进行 TLS 证书挂载,将 TLS 握手过程移交给性能更好的工具完成;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"优化 Go 对 TLS 握手过程性能,在与业内大佬曹春晖(曹大)的交流中了解到,他最近在 Go 官方库提交的 PR "},{"type":"link","attrs":{"href":"https:\/\/github.com\/golang\/go\/issues\/43563","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/golang\/go\/issues\/43563"}]},{"type":"text","text":" ,以及相关的性能测试数据 "},{"type":"link","attrs":{"href":"https:\/\/github.com\/golang\/go\/pull\/48229","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/golang\/go\/pull\/48229"}]},{"type":"text","text":" 。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.4 Socket ID 设计"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对每次连接必须产生一个唯一码,如果出现重复会导致串号,消息混乱推送的问题。选择 SnowFlake 算法作为唯一码生成算法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"物理机场景中,对副本所在物理机进行固定编号,即可保证每个副本上的服务产生的 Socket ID 是唯一值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"K8S 场景中,这种方案不可行,于是采用注册下发的方式返回编号,WS-Gateway 所有副本启动后向数据库写入服务的启动信息,获取副本编号,以此作为参数作为 SnowFlake 算法的副本编号进行 Socket ID 生产,服务重启会继承之前已有的副本编号,有新版本下发时会根据自增 ID 下发新的副本编号。于此同时,Ws-Gateway 副本会向数据库写入心跳信息,以此作为网关服务本身的健康检查依据。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.5 集群会话管理方案:事件广播"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"客户端完成握手流程后,会话数据在当前网关节点内存存储,部分可序列化数据存储到 Redis,存储结构说明如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.