基于Kubernetes实现的大数据采集与存储实践总结

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"一、前言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    近来我司部门内部搭建的电商大数据平台一期工程进入了尾声工作,不仅在技术上短期内从零到一搭建起属于团队的大数据平台,而且在业务上可以满足多方诉求。笔者很有幸参与到其中的建设,在给优秀的团队成员点赞的同时,也抽空整理了一下文档,那么今天就和大家来聊一下我们是如何结合Kubernetes实现数据采集与存储的,谈谈里面实现方案、原理和过程。这里笔者放一张我们前期设计时借鉴阿里的大数据架构图:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/89/891e47430533feac72d4cdd78aa0c476.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    本文重点讲述的是上图中『数据采集』部分,暂不涉及『数据计算』和『数据服务』的过程。在数据采集中,我们通过运行在Kubernetes中的清洗服务,不断地消费Kafka中由定时任务爬取的业务数据,并通过Fluentbit、Fluentd等日志采集工具对容器中打印到标准输出的数据压缩存储至AWS S3中。如果你对这块有兴趣的话,那就一起开始今天的内容吧。"}]},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"二、基础篇"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2.1 Docker日志管理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    我们的应用服务都运行在Docker容器中,Docker的日志有两种:dockerd运行时的引擎日志和容器内服务产生的容器日志。在这里我们不用关心引擎日志,容器日志是指到达标准输出(stdout)和标准错误输出(stderr)的日志,其他来源的日志不归Docker管理,而Docker将所有容器打到 stdout 和 stderr 的日志通过日志驱动统一重定向到某个地方。Docker支持的日志驱动有很多,比如 local、json-file、syslog、journald 等等,不同的日志驱动可以将日志重定向到不同的地方。Docker以热插拔的方式实现日志不同目的地的输出,体现了管理的灵活性。其中默认的日志驱动是json-file,该驱动将日志以json 的形式重定向存储到本地磁盘,其存储格式为:/var/lib/docker/containers//-json.log。笔者画了一张简易的流转"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/16/16884353fdc3139b5a4999ba80f810b3.jpeg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    官方支持的日志驱动很多,详情看可以自行查阅Docker Containers Logging(https://docs.docker.com/config/containers/logging/configure)。我们可以通过docker info | grep Loggin命令查看Docker的日志驱动配置,也可以通过--log-driver或者编写/etc/docker/daemon.json 文件配置Docker容器的驱动:"}]},{"type":"codeblock","attrs":{"lang":"javascript"},"content":[{"type":"text","text":"{\n \"log-driver\": \"syslog\"\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    本实践使用的是Docker默认驱动,即json file,这里大家对Docker的日志流转有基本的认识即可。需要关注的是每种Docker日志驱动都有相应的配置项日志轮转,比如根据单个文件大小和日志文件数量配置轮转。json-file 日志驱动支持的配置选项如下:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"max-size:切割之前日志的最大大小,可取值单位为(k,m,g), 默认为-1(表示无限制);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"max-file:可以存在的最大日志文件数,如果切割日志会创建超过阈值的文件数,则会删除最旧的文件,仅在max-size设置时有效,默认为1;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"labels:适用于启动Docker守护程序时,此守护程序接受的以逗号分隔的与日志记录相关的标签列表;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"env:适用于启动Docker守护程序时,此守护程序接受的以逗号分隔的与日志记录相关的环境变量列表;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"compress:切割的日志是否进行压缩,默认是disabled;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"详见:https://docs.docker.com/config/containers/logging/json-file"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2.2 Kubernetes日志管理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    在Kubernetes中日志种类也分为两种:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在容器中运行kube-scheduler和kube-proxy;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不运行在容器中运行时的kubelet和容器运行时(如Docker);"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    在使用 systemd 机制的服务器上,kubelet 和容器运行时将日志写入到 journald;如果没有 systemd,他们将日志写到 /var/log 目录的 .log 文件中。容器中的系统组件通常将日志写到 /var/log 目录,在 kubeadm 安装的集群中它们以静态 Pod 的形式运行在集群中,因此日志一般在 /var/log/pods 目录下。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    需要强调的一点是,对于应用POD日志"},{"type":"text","marks":[{"type":"strong"}],"text":"Kuberntes并不管理日志的轮转策略,且日志的存储都是基于Docker的日志管理策略进行"},{"type":"text","text":"。在默认的日志驱动中,kubelet 会为每个容器的日志创建一个软链接,软链接存储路径为:/var/log/containers/,软链接会链接到 /var/log/pods/ 目录下相应 pod 目录的容器日志,被链接的日志文件也是软链接,最终链接到 Docker 容器引擎的日志存储目录:即/var/lib/docker/container 下相应容器的日志。这些软链接文件名称含有 k8s 相关信息,比如:Pod id,名字空间,容器 ID 等信息,这就为日志收集提供了很大的便利。简图如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d94aa38d054f686bda834a9b0b9e544.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"三、进阶篇"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    在Kubernetes中比较流行的数据收集方案是Elasticsearch、Fluentd和Kibana技术栈,也是官方现在比较推荐的一种方案。我们在这里只使用到EFK技术栈中的F,即Fluentd以及它的衍生品Fluent Bit。Fluentd和Fluent Bit都致力于收集、处理和交付日志数据。但是两个项目之间存在一些主要差异,使它们适合于不同的任务:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fluentd:旨在汇总来自多个输入的日志,对数据进行处理然后路由至不同的输出。它的引擎具有性能卓越的队列处理线程,可快速使用和路由大批日志。同时具有丰富的输入和输出插件生态系统(超过650个);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fluent Bit:被设计为在高度受限的计算能力和减少的开销(内存和CPU)成为高度关注的分布式计算环境中运行,因此它非常的轻巧(KB级别)和高性能,适合做日志收集,处理和转发,但不适用于日志聚合;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    我们用官方的一张图来对比它们的差异:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c4/c4199e2ccdb47a2c08f7748c4b5ee97f.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Fluend和FluenBit都有着相似的数据流处理方式,包括Input、Parse、Filter、Buffer、Routing和Ouput等组件,官方图如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c1/c18892ff02bf7e96f7783ce8516c3eab.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Input: 提供了多种的输入插件用于收集不同来源的信息,如日志文件或者操作系统信息等;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Parser: 解析器用来解析原始的字符串成结构化的信息(如json格式);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Filter: 过滤器用来在分发事件之前修改或者过滤事件;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Buffer: 提供数据的缓存机制,优先使用内存,其次是文件系统;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Router: 将不同分类的数据发送到不同的输出中,一般使用Tag和Match来实现;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"Output: 同样提供了不同的输出插件,包括远程服务,本地文件或者标准输出等;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    在本次实践的过程中,我们使用Fluent Bit作为日志转发器负责数据的采集和转发,Fluentd作为日志聚合器负责数据的聚合与存储,在系统中相互协作,充分发挥它们的特长,更多地文档可以参考Fluentd(https://docs.fluentd.org)和FluentBit(https://docs.fluentbit.io/manual)的官网。"}]},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"四、架构篇"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    由容器引擎或runtime提供的原生功能通常不足以满足完整的日志记录需求,当发生容器崩溃、pod 被逐出或节点宕机等情况,如果仍然想访问到应用日志,那么日志就应该具有独立的存储和生命周期。我们利用容器化后的应用写入 stdout 和 stderr 的任何数据,都会被容器引擎捕获并被重定向到某个位置的特性,使用日志转发器Fluentbit负责采集日志并将数据推送到日志聚合器Fluentd后,再由Fluentd负责聚合和存储数据至AWS S3中。由于日志转发器必须在每个节点上运行,因此它可以用DaemonSet副本,而日志聚合器则可以按需扩容缩容,因此我们使用Deployment来部署。笔者简单画的架构图如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b8/b8c53884b64f626939c56e6a4e446934.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"五、实践篇"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    前面讲了一大推的基础理论和架构,终于到了实践的时候了,这里我们需要准备基本的实践环境,包括:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拥有DockerHub账号,用于存放Docker镜像;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拥有Kubernetes集群,用于编排容器和部署应用;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    接下来笔者准备了三个服务的代码示例,包括负责接收和清洗业务数据的服务、采集日志并转发的FluentBit还有聚合数据并压缩存储的Fluentd。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"5.1 清洗服务"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    我们使用zap作为日志库,不断进行打印数据以模拟清洗服务处理业务逻辑,编写的Sample代码如下:"}]},{"type":"codeblock","attrs":{"lang":"go"},"content":[{"type":"text","text":"\npackage main\n\nimport (\n \"time\"\n\n \"go.uber.org/zap\"\n)\n\nfunc main() {\n logger, _ := zap.NewProduction()\n defer logger.Sync() // flushes buffer, if any\n sugar := logger.Sugar()\n\n for {\n sugar.Infow(\"just a example\",\n \"author\", \"tony\",\n \"today\", time.Now().Format(\"2006-01-02 15:04:05\"),\n \"yesterday\", time.Now().AddDate(0, 0, -1).Format(\"2006-01-02 15:04:05\"),\n )\n time.Sleep(time.Duration(5) * time.Second)\n }\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    接着我们编写构建脚本,把它打包成镜像以供后续Kubernetes集群部署deployment使用,Dockerfile如下:"}]},{"type":"codeblock","attrs":{"lang":"shell"},"content":[{"type":"text","text":"# build stage\nFROM golang:latest as builder\nLABEL stage=gobuilder\nWORKDIR /build\nCOPY . .\nRUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags=\"-w -s\" -o example\n\n# final stage\nFROM scratch\nCOPY --from=builder /build/example /\nEXPOSE 8080\nENTRYPOINT [\"/example\"]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 执行以下命令完成构建并推上docker hub上:"}]},{"type":"codeblock","attrs":{"lang":"shell"},"content":[{"type":"text","text":"# build\ndocker build -t /logging:latest . && docker image prune -f .\n# push\ndocker push /logging:latest"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 最后我们的代码部署在Kubernetes集群中,模拟运行我们的清洗服务:"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: example\n namespace: logging\n labels:\n app: example\nspec:\n replicas: 1\n selector:\n matchLabels:\n app: example\n template:\n metadata:\n labels:\n app: example\n spec:\n containers:\n - name: example\n image: /logging:latest\n resources:\n limits:\n cpu: 100m\n memory: 200Mi\n requests:\n cpu: 10m\n memory: 20Mi\n ports:\n - containerPort: 24224\n terminationGracePeriodSeconds: 30\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"5.2 日志转发器FluentBit"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    Fluent Bit作为日志转发器需要负责数据的采集和转发,它需要的准备基本授权文件、项目配置以及daemon部署文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"授权文件"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"\n# fluentbit_rbac.yaml\napiVersion: rbac.authorization.k8s.io/v1beta1\nkind: ClusterRole\nmetadata:\n name: fluentbit-read\nrules:\n- apiGroups: [\"\"]\n resources:\n - namespaces\n - pods\n verbs: [\"get\", \"list\", \"watch\"]\n---\napiVersion: rbac.authorization.k8s.io/v1beta1\nkind: ClusterRoleBinding\nmetadata:\n name: fluentbit-read\nroleRef:\n apiGroup: rbac.authorization.k8s.io\n kind: ClusterRole\n name: fluentbit-read\nsubjects:\n- kind: ServiceAccount\n name: fluentbit\n namespace: logging\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"项目配置"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"kind: ConfigMap\nmetadata:\n name: fluentbit-config\n namespace: logging\napiVersion: v1\ndata:\n fluent-bit.conf: |-\n [SERVICE]\n Flush 1\n Daemon Off\n Log_Level info\n Parsers_File parsers.conf\n HTTP_Server On\n HTTP_Listen 0.0.0.0\n HTTP_Port 2020\n [INPUT]\n Name tail\n Tag kube.*\n # Path /var/log/containers/*.log\n Path /var/log/containers/*logging_example*.log\n Parser docker\n DB /var/log/flb_kube.db\n Mem_Buf_Limit 5MB\n Skip_Long_Lines On\n Refresh_Interval 10\n Ignore_Older 24h\n [FILTER]\n Name kubernetes\n Match kube.*\n Kube_URL https://kubernetes.default.svc:443\n Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token\n Kube_Tag_Prefix kube.var.log.containers.\n Merge_Log On\n Merge_Log_Key log_processed\n K8S-Logging.Parser On\n K8S-Logging.Exclude Off\n [OUTPUT]\n Name forward\n Match *\n Host ${FLUENTD_HOST}\n Port ${FLUENTD_PORT}\n Time_as_Integer True\n parsers.conf: |-\n [PARSER]\n Name apache\n Format regex\n Regex ^(?[^ ]*) [^ ]* (?[^ ]*) \\[(?
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章