作业帮的云原生历程与实践

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文源自作业帮基础架构负责人董晓聪的分享。讲述作业帮的云原生历程,并围绕云原生架构和多云架构两大解决方案进行深入延展。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"云原生改造重塑技术体系"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“之前在传统的互联网公司,大家没法接触到用户,对用户的感知更多的是一个个UV、PV数字,但在线教育不一样,我们通过直播等形式面对的是一个个学生,每一次稳定性的事故都可能会影响他们的学业,所以作业帮对稳定性的要求只能更高。”据董晓聪介绍,作业帮在稳定性层面,主要面对以下三大挑战:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"当出现单机、单机群、单云故障的时候,架构能否很好的应对这些冲击?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"当代码变更导致业务中断的时候,能不能快速止损?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"除了稳定性外,如何控制成本以及提升效率?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"作业帮选择通过云原生来解决上述问题。用基础设施接管业务当中大量非功能的逻辑,以此来实现弹性、可观测性、韧性、自动化、可持续等一些相关特性,通过云原生的架构解决了部署层面的问题,然后在此之上实现了一套多云间自由迁移的能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“即使从今天来看作业帮当时做的这个决定,选择云原生架构,也是很有魄力的,因为它毕竟是一个技术体系重塑。”董晓聪表示,截至目前,作业帮已经完成了70%左右业务的云原生改造,处于业内领先水平。同时作业帮在弹性扩缩、Serverless、在离线混部等方面都有广泛的应用,在CPU调度、GPU调度、多云管控等方面也有创新型专利产出,解决了开源社区的诸多问题。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在CPU调度方面,2020年上半年,作业帮在完成了一块核心业务的容器化之后,突然发现运维成本增加了。原来在虚机模式下,运维在晚高峰的时候,只需要去做一些稳定性的巡检,运维动作并不多。但容器化后,在晚高峰下需要不断地对一些资源负载比较高的进行封锁,然后把上面的一些比较重的Pod进行驱逐,经分析"},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/#:~:text=Kubernetes%20%28K8s%29%20is%20an%20open-source%20system%20for%20automating,into%20logical%20units%20for%20easy%20management%20and%20discovery.","title":"xxx","type":null},"content":[{"type":"text","text":"Kubernetes"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"的原生调度器还是以request进行调度,会存在一些问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"互联网业务都会有一个明显的"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/cc65U1HxA8Pzql8E6BQm","title":"xxx","type":null},"content":[{"type":"text","text":"波峰波谷"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",在线教育的波峰波谷会更加剧烈,可能会有两个数量级的差异。当研发在波谷的时候进行一次发布,这时候就会触发容器的一次重新调度,比如当服务有几十个Pod,可能会有十多个Pod调度到一台机器,因为这时候的机器的使用率很低,服务怎么调度其实都可以。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"但是到了晚高峰的时候,每一个Pod资源的使用率就上来了,CPU使用高了,它的吞吐也高了,这十个Pod都在同一个机器上,这台机器就会出现一些资源的瓶颈。原生的调度器只考虑了一些简单的指标,同时也没有考虑未来的变化。基于此,作业帮做了自定义的调度器,对晚高峰进行了预测,将CPU、内存、各种IO等指标都作为因子,同时也会定期的把历史数据进行大数据回归更新。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"GPU是一个相对比较贵的资源,通过调研一些方案并和云厂商进行沟通,了解到目前主要推荐的方案是GPU虚拟化,但是这会至少带来15%的性能损耗,这个是没法接受的。大多数的GPU服务使用的各种资源相对比较固定。鉴于此,作业帮基于算力和显存去进行了一些策略的调度,也就是比较经典的揹包问题,同时夜间也会进行一下预测再重新调度,如果中间出现一些故障,也会执行转移相关的策略。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"当Web业务完成容器化改造之后,团队把一些定时任务迁移到容器平台。这时候又出现了新的问题,很多任务会涉及到密集的计算,容器本身其实并不是一个隔离的机制,还是在做CPU时间片的分配。这些计算密集的任务多多少少还是会对Web任务造成一定的影响。同时它也会占用主机的IP资源,node上的IP资源是有限的,定时任务调度上来之后就会分配IP,任务销毁时IP资源也不会立刻销毁。如果频繁地把定时任务的Pod调度到主机群的节点上,就会导致主机群的Web服务没有足够的IP资源。此外,大规模的创建跟回收定时任务,也会触发一些内核的问题,比如有些定时任务的内存使用比较大,大规模回收会导致陷入内核态,hang住的时间比较长。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"这方面作业帮做了一些改造:建立了三个池子,Serverless、任务集群、主机群,优先会把定时任务去调度到Serverless上,如果调入失败的话,再依次到任务集群、主集群,Serverless并不是一种完全可靠的计算模式,而是引入了一种资源预占的方式,比较类似于金融领域为保证事务的两阶段提交,预先去申请相关的资源,当完成预占之后,再把真正的把任务调度过去。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多云架构实现秒级别自动切换"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"作业帮解决多云架构主要面临两大挑战。首先在云间互通的专线选型上,作业帮没有选择裸纤的方案,而选择了供应商的组网方案。董晓聪表示,选择组网方案,一方面因为有一层供应商的保护能力,另一方面是组网有一定弹性扩缩的能力。而在此之外,公司自身也做了双链路。每条链路选择不同的供应商,从不同地域进行接入。在这两条链路上,通过"},{"type":"link","attrs":{"href":"https:\/\/networklessons.com\/cisco\/ccie-routing-switching-written\/bgp-multipath-load-sharing-ibgp-and-ebgp#:~:text=Unlike%20most%20routing%20protocols%2C%20BGP%20only%20selects%20a,second%20path%2C%20the%20following%20attributes%20have%20to%20match%3A","title":"xxx","type":null},"content":[{"type":"text","text":"BGP"}]},{"type":"text","text":"+ECMP"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"实现了链路的负载均衡,以及当单条线路出现故障的时候,可以实现秒级别的自动切换。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“多云还会面临着一个很大的挑战,就是计算资源的管理。”董晓聪说,单个云下就有十几种、几十种机型,多云会直接导致double、trible的工作量。作业帮对一些场景进行了建模,标准的负载型机器、专门的大内存、大存储机型,然后再结合网络的安全域,制定具体的业务套餐。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“完成了上面的网络、计算的问题之后,作业帮构建出自己的多云架构”。董晓聪说,用户通过DNS\/DoH分流,落到不同的机房。常态下的业务应用之间的请求是单云闭环,不会去跨云通信。当从机房或者专线出现故障的时候,可以通过DNS\/DoH把流量切到主机房上。当主机房出现故障的时候,还是同样的流量调度,除此之外,还要将从机房的数据存储,DB、Redis等进行提主,以此来实现了多云的稳定。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“完成云原生、多云改造之后,稳定性从之前的99.95%提升到了99.99%,机器故障时间的影响也从分钟级别缩短到秒级。部署的质量也得到大幅度提升。”董晓聪透露,接下来,作业帮的发力重点会在实时音视频的云原生改造,推进无边界云计算,促成云边端应用一体协调。"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章