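Chaos Mesh was originally created as the testing platform for the open-source distributed database TiDB. It is a versatile chaos engineering platform that verifies the stability of distributed systems through chaos testing. Using the distributed deployment mode of GreatDB, the Wanli secure database software, as an example, this article walks through the full process of chaos testing with Chaos Mesh.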
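Background and Introduction to GreatDB
Background
Chaos testing is a very effective way to expose uncertainty in a distributed system and to build confidence in its resilience, so we use the open-source tool Chaos Mesh to run chaos tests on GreatDB distributed clusters.
GreatDB Distributed Deployment Mode
GreatDB, the Wanli secure database software, is a relational database that supports both centralized and distributed deployment. This article covers the distributed deployment.
The distributed deployment mode uses a shared-nothing architecture. Data redundancy and replica management ensure that the database has no single point of failure; data sharding and distributed parallel computing deliver high performance; and data nodes can be added dynamically, without a fixed limit, to meet business needs.
The overall architecture is shown in the figure below: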
Environment Preparation
Installing Chaos Mesh
Before installing Chaos Mesh, make sure that Helm and Docker are already installed and that a Kubernetes environment is ready.
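The prerequisite check can be scripted. The following is a minimal sketch, assuming the command-line tools are available under the names helm, docker, and kubectl; the function name missing_tools is a hypothetical helper and not part of any of these tools.

```shell
#!/bin/sh
# missing_tools: print every given command name that is not found on PATH.
# (Hypothetical helper; the name is not part of Helm, Docker, or Kubernetes.)
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the command cannot be found
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Usage: missing_tools helm docker kubectl
```

An empty result only means the binaries exist. Running kubectl cluster-info afterwards can additionally confirm that a cluster is reachable.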
- Install with Helm
1) Add the Chaos Mesh repository to Helm:
helm repo add chaos-mesh https://charts.chaos-mesh.org
2) List the Chaos Mesh versions available for installation:
helm search repo chaos-mesh
3) Create the namespace in which Chaos Mesh will be installed:
kubectl create ns chaos-testing
4) Install Chaos Mesh in a Docker environment:
helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing
- Verify the installation
Run the following command to check the status of Chaos Mesh:
kubectl get pod -n chaos-testing
The expected output is:
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-d7bc9ccb5-dbccq 1/1 Running 0 26d
chaos-daemon-pzxc7 1/1 Running 0 26d
chaos-dashboard-5887f7559b-kgz46 1/1 Running 1 26d
If all 3 pods are in the Running state, Chaos Mesh has been installed successfully.
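Checking the pod list by eye becomes error-prone once the list grows. As a minimal sketch, the helper below reads kubectl get pod output on standard input and prints the pods that are not fully up. The name pods_not_ready is hypothetical, and the parsing assumes the default column order NAME, READY, STATUS shown in the expected output.

```shell
#!/bin/sh
# pods_not_ready: read `kubectl get pod` output on stdin and print the name of
# every pod that is not Running with all of its containers ready.
# (Hypothetical helper; it assumes the default NAME READY STATUS column order.)
pods_not_ready() {
    awk 'NR > 1 && NF >= 3 {
        split($2, ready, "/")            # READY looks like "1/1"
        if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Usage: kubectl get pod -n chaos-testing | pods_not_ready
```

No output means every listed pod is Running and ready.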
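Preparing the Images Required for Testing
Preparing the MySQL image
In general, MySQL uses the official 5.7 image, and the MySQL metrics collector is mysqld-exporter. Both can be pulled directly from Docker Hub: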
docker pull mysql:5.7
docker pull prom/mysqld-exporter
Preparing the ZooKeeper image
ZooKeeper uses the official 3.5.5 image. Monitoring for the ZooKeeper component uses jmx-prometheus-exporter and zookeeper-exporter, both pulled from Docker Hub:
docker pull zookeeper:3.5.5
docker pull sscaling/jmx-prometheus-exporter
docker pull josdotso/zookeeper-exporter
Preparing the GreatDB image
Choose a GreatDB tar package and extract it to get a ./greatdb directory, then copy greatdb-service-docker.sh into that directory:
cp greatdb-service-docker.sh ./greatdb/
Place the GreatDB Dockerfile in the directory that contains ./greatdb (alongside it, not inside it), then run the following command to build the GreatDB image:
docker build -t greatdb/greatdb:tag202110 .
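Preparing the image for deploying/cleaning up a GreatDB distributed cluster
Download the cluster deployment script cluster-setup, the cluster initialization script init-zk, and the cluster helm charts package (available from the 4.0 development/testing team).
Put these materials in the same directory and write the following Dockerfile: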
FROM debian:buster-slim as init-zk
COPY ./init-zk /root/init-zk
RUN chmod +x /root/init-zk
FROM debian:buster-slim as cluster-setup
# Set aliyun repo for speed
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list && \
    sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list
RUN apt-get -y update && \
    apt-get -y install \
    curl \
    wget
RUN curl -L https://storage.googleapis.com/kubernetes-release/release/v1.20.1/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && \
    chmod +x /usr/local/bin/kubectl && \
    mkdir /root/.kube && \
    wget https://get.helm.sh/helm-v3.5.3-linux-amd64.tar.gz && \
    tar -zxvf helm-v3.5.3-linux-amd64.tar.gz && \
    mv linux-amd64/helm /usr/local/bin/helm
COPY ./config /root/.kube/
COPY ./helm /helm
COPY ./cluster-setup /
Run the following commands to build the required images:
docker build --target init-zk -t greatdb/initzk:latest .
docker build --target cluster-setup -t greatdb/cluster-setup:v1 .
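Preparing the test case images
The test cases currently supported include bank, bank2, pbank, tpcc, and flashback. Each test case is an executable file.
Taking the flashback case as an example, build the test case image as follows: download the test case locally, then write a Dockerfile with the following content in the same directory: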
FROM debian:buster-slim
COPY ./flashback /
RUN cd / && chmod +x ./flashback
Run the following command to build the test case image:
docker build -t greatdb/testsuite-flashback:v1 .
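Because every test case is a single executable, the same Dockerfile pattern can be generated for the other cases. The sketch below assumes each executable is named after its test case and sits in the current directory; the helper names and the greatdb/testsuite-NAME:v1 naming scheme for cases other than flashback are assumptions made for illustration.

```shell
#!/bin/sh
# testsuite_dockerfile: print a Dockerfile for the test case executable NAME,
# following the same three-line pattern as the flashback example.
testsuite_dockerfile() {
    name=$1
    printf 'FROM debian:buster-slim\n'
    printf 'COPY ./%s /\n' "$name"
    printf 'RUN cd / && chmod +x ./%s\n' "$name"
}

# testsuite_image: print the image reference for test case NAME.
# The naming scheme mirrors the flashback image and is an assumption
# for the other test cases.
testsuite_image() {
    printf 'greatdb/testsuite-%s:v1\n' "$1"
}

# Usage (run in the directory that holds the executables):
#   for c in bank bank2 pbank tpcc flashback; do
#       testsuite_dockerfile "$c" | docker build -t "$(testsuite_image "$c")" -f- .
#   done
```

Passing -f- makes docker build read the Dockerfile from standard input while still using the current directory as the build context.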
Uploading the prepared images to a private registry
For how to create a private registry and upload images, see: https://zhuanlan.zhihu.com/p/78543733
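For orientation, uploading an image usually comes down to docker tag followed by docker push. The following is a minimal sketch under that assumption; the registry address registry.example.com:5000 is a placeholder, and push_image is a hypothetical helper name.

```shell
#!/bin/sh
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
DOCKER=${DOCKER:-docker}

# push_image: tag a local image for a private registry and push it.
#   $1  registry host[:port]  (placeholder, e.g. registry.example.com:5000)
#   $2  local image reference (e.g. greatdb/testsuite-flashback:v1)
push_image() {
    registry=$1
    image=$2
    target="$registry/$image"
    # Push only if tagging succeeded
    $DOCKER tag "$image" "$target" && $DOCKER push "$target"
}

# Usage: push_image registry.example.com:5000 greatdb/testsuite-flashback:v1
```

Setting DOCKER=echo before the call prints the two commands instead of running them, which is a cheap way to review the target names first.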
Using Chaos Mesh
Setting up a GreatDB distributed cluster
In the cluster-setup directory from the previous section, run the following command to set up the test cluster:
./cluster-setup \
-clustername=c0 \
-namespace=test \
-enable-monitor=true \
-mysql-image=mysql:5.7 \
-mysql-replica=3 \
-mysql-auth=1 \
-mysql-normal=1 \
-mysql-global=1 \
-mysql-partition=1 \
-zookeeper-repository=zookeeper \
-zookeeper-tag=3.5.5 \
-zookeeper-replica=3 \
-greatdb-repository=greatdb/greatdb \
-greatdb-tag=tag202110 \
-greatdb-replica=3 \
-greatdb-serviceHost=172.16.70.249
Output:
liuxinle@liuxinle-OptiPlex-5060:~/k8s/cluster-setup$ ./cluster-setup \
> -clustername=c0 \
> -namespace=test \
> -enable-monitor=true \
> -mysql-image=mysql:5.7 \
> -mysql-replica=3 \
> -mysql-auth=1 \
> -mysql-normal=1 \
> -mysql-global=1 \
> -mysql-partition=1 \
> -zookeeper-repository=zookeeper \
> -zookeeper-tag=3.5.5 \
> -zookeeper-replica=3 \
> -greatdb-repository=greatdb/greatdb \
> -greatdb-tag=tag202110 \
> -greatdb-replica=3 \
> -greatdb-serviceHost=172.16.70.249
INFO[2021-10-14T10:41:52+08:00] SetUp the cluster ... NameSpace=test
INFO[2021-10-14T10:41:52+08:00] create namespace ...
INFO[2021-10-14T10:41:57+08:00] copy helm chart templates ...
INFO[2021-10-14T10:41:57+08:00] setup ... Component=MySQL
INFO[2021-10-14T10:41:57+08:00] exec helm install and update greatdb-cfg.yaml ...
INFO[2021-10-14T10:42:00+08:00] waiting mysql pods running ...
INFO[2021-10-14T10:44:27+08:00] setup ... Component=Zookeeper
INFO[2021-10-14T10:44:28+08:00] waiting zookeeper pods running ...
INFO[2021-10-14T10:46:59+08:00] update greatdb-cfg.yaml
INFO[2021-10-14T10:46:59+08:00] setup ... Component=greatdb
INFO[2021-10-14T10:47:00+08:00] waiting greatdb pods running ...
INFO[2021-10-14T10:47:21+08:00] waiting cluster running ...
INFO[2021-10-14T10:47:27+08:00] waiting prometheus server running...
INFO[2021-10-14T10:47:27+08:00] Dump Cluster Info
INFO[2021-10-14T10:47:27+08:00] SetUp success. ClusterName=c0 NameSpace=test
Run the following command to check the status of the cluster pods:
kubectl get pod -n test -o wide
Output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
c0-auth0-mysql-0 2/2 Running 0 10m 10.244.87.18 liuxinle-optiplex-5060 <none> <none>
c0-auth0-mysql-1 2/2 Running 0 9m23s 10.244.87.54 liuxinle-optiplex-5060 <none> <none>
c0-auth0-mysql-2 2/2 Running 0 8m39s 10.244.87.57 liuxinle-optiplex-5060 <none> <none>
c0-greatdb-0 2/2 Running 1 5m3s 10.244.87.58 liuxinle-optiplex-5060 <none> <none>
c0-greatdb-1 2/2 Running 0 4m57s 10.244.87.20 liuxinle-optiplex-5060 <none> <none>
c0-greatdb-2 2/2 Running 0 4m50s 10.244.87.47 liuxinle-optiplex-5060 <none> <none>
c0-glob0-mysql-0 2/2 Running 0 10m 10.244.87.51 liuxinle-optiplex-5060 <none> <none>
c0-glob0-mysql-1 2/2 Running 0 9m23s 10.244.87.41 liuxinle-optiplex-5060 <none> <none>
c0-glob0-mysql-2 2/2 Running 0 8m38s 10.244.87.60 liuxinle-optiplex-5060 <none> <none>
c0-nor0-mysql-0 2/2 Running 0 10m 10.244.87.29 liuxinle-optiplex-5060 <none> <none>
c0-nor0-mysql-1 2/2 Running 0 9m29s 10.244.87.4 liuxinle-optiplex-5060 <none> <none>
c0-nor0-mysql-2 2/2 Running 0 8m45s 10.244.87.25 liuxinle-optiplex-5060 <none> <none>
c0-par0-mysql-0 2/2 Running 0 10m 10.244.87.55 liuxinle-optiplex-5060 <none> <none>
c0-par0-mysql-1 2/2 Running 0 9m26s 10.244.87.13 liuxinle-optiplex-5060 <none> <none>
c0-par0-mysql-2 2/2 Running 0 8m42s 10.244.87.21 liuxinle-optiplex-5060 <none> <none>
c0-prometheus-server-6697649b76-fkvh9 2/2 Running 0 4m36s 10.244.87.37 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-0 1/1 Running 1 7m35s 10.244.87.44 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-1 1/1 Running 0 6m41s 10.244.87.30 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-2 1/1 Running 0 6m10s 10.244.87.49 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-initzk-7hbfs 0/1 Completed 0 7m35s 10.244.87.17 liuxinle-optiplex-5060 <none> <none>
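When c0-zookeeper-initzk-7hbfs is in the Completed state and all other pods are Running, the cluster has been set up successfully.
Running chaos tests on the GreatDB distributed cluster with Chaos Mesh
The fault types that Chaos Mesh can inject in a Kubernetes environment include simulated Pod faults, simulated network faults, and simulated stress scenarios. Here we use pod-kill, one of the simulated Pod faults, as the example.
Write the experiment configuration to a file named pod-kill.yaml. An example follows: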
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos  # the kind of fault to inject
metadata:
  name: pod-failure-example
  namespace: test  # namespace of the test cluster's pods
spec:
  action: pod-kill  # the specific fault action to inject
  mode: all  # how the experiment selects targets; all selects every matching Pod
  duration: '30s'  # duration of the experiment
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "greatdb"  # label of the target pods, taken from the Labels field in the output of kubectl describe pod c0-greatdb-1 -n test
Create the fault experiment with the following command:
kubectl create -n test -f pod-kill.yaml
After creating the experiment, running kubectl get pod -n test -o wide gives the following result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
c0-auth0-mysql-0 2/2 Running 0 14m 10.244.87.18 liuxinle-optiplex-5060 <none> <none>
c0-auth0-mysql-1 2/2 Running 0 14m 10.244.87.54 liuxinle-optiplex-5060 <none> <none>
c0-auth0-mysql-2 2/2 Running 0 13m 10.244.87.57 liuxinle-optiplex-5060 <none> <none>
c0-greatdb-0 0/2 ContainerCreating 0 2s <none> liuxinle-optiplex-5060 <none> <none>
c0-greatdb-1 0/2 ContainerCreating 0 2s <none> liuxinle-optiplex-5060 <none> <none>
c0-glob0-mysql-0 2/2 Running 0 14m 10.244.87.51 liuxinle-optiplex-5060 <none> <none>
c0-glob0-mysql-1 2/2 Running 0 14m 10.244.87.41 liuxinle-optiplex-5060 <none> <none>
c0-glob0-mysql-2 2/2 Running 0 13m 10.244.87.60 liuxinle-optiplex-5060 <none> <none>
c0-nor0-mysql-0 2/2 Running 0 14m 10.244.87.29 liuxinle-optiplex-5060 <none> <none>
c0-nor0-mysql-1 2/2 Running 0 14m 10.244.87.4 liuxinle-optiplex-5060 <none> <none>
c0-nor0-mysql-2 2/2 Running 0 13m 10.244.87.25 liuxinle-optiplex-5060 <none> <none>
c0-par0-mysql-0 2/2 Running 0 14m 10.244.87.55 liuxinle-optiplex-5060 <none> <none>
c0-par0-mysql-1 2/2 Running 0 14m 10.244.87.13 liuxinle-optiplex-5060 <none> <none>
c0-par0-mysql-2 2/2 Running 0 13m 10.244.87.21 liuxinle-optiplex-5060 <none> <none>
c0-prometheus-server-6697649b76-fkvh9 2/2 Running 0 9m24s 10.244.87.37 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-0 1/1 Running 1 12m 10.244.87.44 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-1 1/1 Running 0 11m 10.244.87.30 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-2 1/1 Running 0 10m 10.244.87.49 liuxinle-optiplex-5060 <none> <none>
c0-zookeeper-initzk-7hbfs 0/1 Completed 0 12m 10.244.87.17 liuxinle-optiplex-5060 <none> <none>
You can see that pods with greatdb in their names are being recreated, which shows that the fault was injected successfully.
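Once the effect has been observed, the experiment object can be inspected and removed like any other Kubernetes resource. The sketch below assumes the PodChaos resource is addressable as podchaos in kubectl; the two helper names are hypothetical.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# show_experiment: describe a PodChaos object to see its status and events.
#   $1 experiment name   $2 namespace
show_experiment() {
    $KUBECTL describe podchaos "$1" -n "$2"
}

# stop_experiment: delete the PodChaos object, which ends the experiment.
#   $1 experiment name   $2 namespace
stop_experiment() {
    $KUBECTL delete podchaos "$1" -n "$2"
}

# Usage:
#   show_experiment pod-failure-example test
#   stop_experiment pod-failure-example test
```

Deleting by file with kubectl delete -n test -f pod-kill.yaml should have the same effect as stop_experiment.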
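Orchestrating the Test Flow in Argo
Argo is an open-source, container-native workflow engine for getting work done on Kubernetes. It models a multi-step workflow as a series of tasks, which is how we orchestrate the test flow.
We use Argo to define a test task. The basic test flow is fixed, as shown below:
Step 1 of the test flow deploys the test cluster. Two parallel tasks then start: step 2 runs the test case to simulate a business workload, while step 3 uses Chaos Mesh to inject faults at the same time. After the step 2 test case finishes, step 4 stops the fault injection, and finally step 5 cleans up the cluster environment.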
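The ordering of the five steps can be illustrated outside Argo with plain shell job control. This is only a sketch of the control flow, not how Argo executes it; run_chaos_flow and the five commands passed to it are placeholders supplied by the caller.

```shell
#!/bin/sh
# run_chaos_flow: run the five steps in the order described above.
# Each argument names a command or shell function supplied by the caller
# (all five are placeholders, not real tools):
#   $1 deploy the cluster      $2 run the test case   $3 inject faults
#   $4 stop fault injection    $5 clean up the cluster
run_chaos_flow() {
    setup=$1 testcase=$2 inject=$3 stop=$4 cleanup=$5

    "$setup" || return 1                    # step 1: deploy the test cluster

    "$testcase" &                           # step 2: test case in the background
    test_pid=$!
    "$inject" &                             # step 3: fault injection in parallel
    inject_pid=$!

    test_status=0
    wait "$test_pid" || test_status=$?      # block until the test case finishes

    "$stop"                                 # step 4: stop fault injection
    kill "$inject_pid" 2>/dev/null || true  # make sure the injector is gone
    wait "$inject_pid" 2>/dev/null || true

    "$cleanup"                              # step 5: clean up the cluster
    return "$test_status"
}
```

The function returns the exit status of the test case, so a failing test is still reported after cleanup has run.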
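Orchestrating a chaos test workflow with Argo (using the flashback test case as an example)
1) Edit the image information in cluster-setup.yaml, changing it to the name and tag of the cluster deployment/cleanup image that you uploaded in the step "Preparing the Images Required for Testing".
2) Edit the image information in testsuite-flashback.yaml, changing it to the name and tag of the test case image that you uploaded in the step "Preparing the Images Required for Testing".
3) Create the resources from the cluster deployment, test case, and tool template yaml files by running kubectl apply -n argo -f xxx.yaml on each one (these files define Argo templates that make workflows easier to write):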
kubectl apply -n argo -f cluster-setup.yaml
kubectl apply -n argo -f testsuite-flashback.yaml
kubectl apply -n argo -f tools-template.yaml
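4) Make a copy of the workflow template file workflow-template.yaml, change the parts flagged by comments to your own settings, and then run the following command to create the chaos test workflow: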
kubectl apply -n argo -f workflow-template.yaml
The following is a workflow template file:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: chaostest-c0-0-
  name: chaostest-c0-0
  namespace: argo
spec:
  entrypoint: test-entry  # test entry point; pass the test parameters here: clustername, namespace, host, the greatdb image name and tag, and other basic information
  serviceAccountName: argo
  arguments:
    parameters:
      - name: clustername
        value: c0
      - name: namespace
        value: test
      - name: host
        value: 172.16.70.249
      - name: port
        value: 30901
      - name: password
        value: Bgview@2020
      - name: user
        value: root
      - name: run-time
        value: 10m
      - name: greatdb-repository
        value: greatdb/greatdb
      - name: greatdb-tag
        value: tag202110
      - name: nemesis
        value: kill_mysql_normal_master,kill_mysql_normal_slave,kill_mysql_partition_master,kill_mysql_partition_slave,kill_mysql_auth_master,kill_mysql_auth_slave,kill_mysql_global_master,kill_mysql_global_slave,kill_mysql_master,kill_mysql_slave,net_partition_mysql_normal,net_partition_mysql_partition,net_partition_mysql_auth,net_partition_mysql_global
      - name: mysql-partition
        value: 1
      - name: mysql-global
        value: 1
      - name: mysql-auth
        value: 1
      - name: mysql-normal
        value: 2
  templates:
    - name: test-entry
      steps:
        - - name: setup-greatdb-cluster  # step.1 deploy the cluster. Specify the correct parameters, mainly the image names and tags for mysql and zookeeper
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7.34
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
        - - name: run-flashbacktest  # step.2 run the test case. Replace this with the template of the test case you want to run and specify the correct parameters, mainly the number and size of the tables used in the test
            templateRef:
              name: flashback-test-template
              template: flashback
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: concurrency
                  value: 16
                - name: size
                  value: 10000
                - name: tables
                  value: 10
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: single-statement
                  value: true
                - name: manage-statement
                  value: true
          - name: invoke-chaos-for-flashabck-test  # step.3 inject faults. Specify the correct parameters; run-time and interval define the duration and frequency of fault injection, so the separate step that stops fault injection is omitted
            templateRef:
              name: chaos-rto-template
              template: chaos-rto
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: k8s-config
                  value: /root/.kube/config
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: prometheus
                  value: ''
                - name: greatdb-job
                  value: greatdb-monitor-greatdb
                - name: nemesis
                  value: "{{workflow.parameters.nemesis}}"
                - name: nemesis-duration
                  value: 1m
                - name: nemesis-mode
                  value: default
                - name: wait-time
                  value: 5m
                - name: check-time
                  value: 5m
                - name: nemesis-scope
                  value: 1
                - name: nemesis-log
                  value: true
                - name: enable-monitor
                  value: false
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: interval
                  value: 1m
                - name: monitor-log
                  value: false
                - name: enable-rto
                  value: false
                - name: rto-qps
                  value: 0.1
                - name: rto-warm
                  value: 5m
                - name: rto-time
                  value: 1m
                - name: log-level
                  value: debug
        - - name: flashbacktest-output  # output whether the test case passed
            templateRef:
              name: tools-template
              template: output-result
            arguments:
              parameters:
                - name: info
                  value: "flashback test pass, with nemesis: {{workflow.parameters.nemesis}}"
        - - name: clean-greatdb-cluster  # step.4 clean up the test cluster; the parameters here are the same as in step.1
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
                - name: clean
                  value: true
        - - name: echo-result
            templateRef:
              name: tools-template
              template: echo
            arguments:
              parameters:
                - name: info
                  value: "{{item}}"
            withItems:
              - "{{steps.flashbacktest-output.outputs.parameters.result}}"
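After the workflow has been created, its progress can be followed by polling the phase that Argo records in the Workflow's status. The sketch below assumes the workflow is named chaostest-c0-0 in the argo namespace, as in the template; the helper names and the 30-second default poll interval are illustrative choices.

```shell
#!/bin/sh
# KUBECTL can be overridden with a stub for testing.
KUBECTL=${KUBECTL:-kubectl}

# workflow_phase: print the status.phase field of an Argo Workflow.
#   $1 workflow name   $2 namespace
workflow_phase() {
    $KUBECTL get workflow "$1" -n "$2" -o jsonpath='{.status.phase}'
}

# wait_for_workflow: poll until the workflow reaches a final phase.
# Prints the final phase; returns 0 for Succeeded, 1 for Failed or Error.
#   $1 workflow name   $2 namespace   $3 seconds between polls (default 30)
wait_for_workflow() {
    while :; do
        phase=$(workflow_phase "$1" "$2")
        case $phase in
            Succeeded)    echo "$phase"; return 0 ;;
            Failed|Error) echo "$phase"; return 1 ;;
        esac
        sleep "${3:-30}"
    done
}

# Usage: wait_for_workflow chaostest-c0-0 argo
```

The loop has no timeout, so it keeps polling for as long as the workflow stays Pending or Running; wrap it in your own deadline if that matters.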
At this point, you have successfully run a chaos test with Chaos Mesh and verified the stability of the distributed system.
Now enjoy GreatSQL, and enjoy Chaos Mesh :)