Chaos Mesh in Practice: Verifying the Stability of GreatDB's Distributed Deployment Mode with Chaos Engineering

Chaos Mesh was originally created as the testing platform for the open-source distributed database TiDB. It has since grown into a versatile chaos engineering platform for verifying the stability of distributed systems through chaos testing. Using the distributed deployment mode of GreatDB, the secure database software from Wanli, as an example, this article walks through the full process of running chaos tests with Chaos Mesh.

Background and an Introduction to GreatDB

Background

Chaos testing is an excellent way to probe the uncertainties of a distributed system and build confidence in its resilience, so we adopted the open-source tool Chaos Mesh to run chaos tests against the GreatDB distributed cluster.

Introduction to the GreatDB Distributed Deployment Mode

GreatDB, Wanli's secure database software, is a relational database that supports both centralized and distributed deployment modes; this article covers the distributed deployment mode.

The distributed deployment mode uses a shared-nothing architecture: data redundancy and replica management ensure that the database has no single point of failure; data sharding and distributed parallel computing deliver high performance; and data nodes can be scaled out dynamically without limit to meet business needs.

The overall architecture is shown in the figure below:

[Figure 1: overall architecture of the GreatDB distributed deployment mode]

Environment Preparation

Installing Chaos Mesh

Before installing Chaos Mesh, make sure that Helm and Docker are already installed and that a Kubernetes environment is available.
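A quick sanity check before continuing; these standard commands only confirm that the tools are on the PATH and that kubectl can reach the cluster:

helm version

docker version

kubectl cluster-info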

  • Install with Helm

1) Add the Chaos Mesh repository to Helm:

helm repo add chaos-mesh https://charts.chaos-mesh.org

2) List the Chaos Mesh versions available for installation:

helm search repo chaos-mesh

3) Create the namespace for installing Chaos Mesh:

kubectl create ns chaos-testing

4) Install Chaos Mesh in a Docker environment:

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing
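If the Kubernetes nodes use containerd rather than Docker as the container runtime, the chaos daemon must be pointed at the containerd socket instead. A hedged example, assuming a recent Chaos Mesh chart (the exact value names can differ between chart versions, so check the chart's values first):

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock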
  • Verify the installation

Run the following command to check that Chaos Mesh is running:

kubectl get pod -n chaos-testing

The expected output looks like this:

NAME                                       READY   STATUS    RESTARTS   AGE

chaos-controller-manager-d7bc9ccb5-dbccq   1/1     Running   0          26d

chaos-daemon-pzxc7                         1/1     Running   0          26d

chaos-dashboard-5887f7559b-kgz46           1/1     Running   1          26d

If all three pods are in the Running state, Chaos Mesh has been installed successfully.
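The Chaos Mesh Dashboard can also be used to create and observe experiments from a browser. A minimal way to reach it locally, assuming the default service name and port installed by the Helm chart:

kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333

Then open http://localhost:2333 in a browser.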

Preparing the Images Needed for Testing

Preparing the MySQL image

In most cases MySQL uses the official 5.7 image, and mysqld-exporter is used as the MySQL metrics collector; both can be pulled directly from Docker Hub:

docker pull mysql:5.7

docker pull prom/mysqld-exporter

Preparing the ZooKeeper image

ZooKeeper uses the official 3.5.5 image. The monitoring components for ZooKeeper are jmx-prometheus-exporter and zookeeper-exporter, both pulled from Docker Hub:

docker pull zookeeper:3.5.5

docker pull sscaling/jmx-prometheus-exporter

docker pull josdotso/zookeeper-exporter

Preparing the GreatDB image

Pick a GreatDB tar package and extract it to get a ./greatdb directory, then copy the greatdb-service-docker.sh file into that extracted ./greatdb directory:

cp greatdb-service-docker.sh ./greatdb/

Place the GreatDB Dockerfile in the directory at the same level as the ./greatdb folder, then run the following command to build the GreatDB image (the tag should match the one used later when deploying the cluster):

docker build -t greatdb/greatdb:tag202110 .
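The GreatDB Dockerfile itself ships with the GreatDB package and is not reproduced here. For reference only, a hypothetical minimal sketch of what such a Dockerfile might look like; the base image, paths, and entrypoint are assumptions, not the actual file:

# Hypothetical sketch only -- the real Dockerfile is provided with the GreatDB package
FROM debian:buster-slim
# copy the extracted GreatDB directory (including greatdb-service-docker.sh) into the image
COPY ./greatdb /greatdb
RUN chmod +x /greatdb/greatdb-service-docker.sh
ENTRYPOINT ["/greatdb/greatdb-service-docker.sh"]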

Preparing the GreatDB distributed cluster setup/cleanup images

Download the cluster deployment script cluster-setup, the cluster initialization script init-zk, and the cluster Helm charts package (these can be obtained from the 4.0 development/test team).

Place the materials above in the same directory and write the following Dockerfile:

FROM debian:buster-slim as init-zk

COPY ./init-zk /root/init-zk
RUN chmod +x /root/init-zk

FROM debian:buster-slim as cluster-setup
# Set aliyun repo for speed
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list && \
  sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list

RUN apt-get -y update && \
  apt-get -y install \
  curl \
  wget

RUN curl -L https://storage.googleapis.com/kubernetes-release/release/v1.20.1/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && \
  chmod +x /usr/local/bin/kubectl && \
  mkdir /root/.kube && \
  wget https://get.helm.sh/helm-v3.5.3-linux-amd64.tar.gz && \
  tar -zxvf helm-v3.5.3-linux-amd64.tar.gz && \
  mv linux-amd64/helm /usr/local/bin/helm

COPY ./config /root/.kube/
COPY ./helm /helm
COPY ./cluster-setup /

Run the following commands to build the required images:

docker build --target init-zk -t greatdb/initzk:latest .


docker build --target cluster-setup -t greatdb/cluster-setup:v1 .

Preparing the test-case images

The test cases currently supported are bank, bank2, pbank, tpcc, flashback, and so on; each test case is a single executable file.

Taking the flashback test case as an example: download the test case locally, then write a Dockerfile with the following content in the same directory as the test case:

FROM debian:buster-slim

COPY ./flashback /

RUN cd / && chmod +x ./flashback

Run the following command to build the test-case image:

docker build -t greatdb/testsuite-flashback:v1 .

Uploading the prepared images to a private registry

For creating a private registry and uploading images, see: https://zhuanlan.zhihu.com/p/78543733
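As a quick example, tagging and pushing one of the images built above to a private registry (the registry address registry.example.com:5000 is a placeholder; replace it with your own):

docker tag greatdb/greatdb:tag202110 registry.example.com:5000/greatdb/greatdb:tag202110

docker push registry.example.com:5000/greatdb/greatdb:tag202110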

Using Chaos Mesh

Deploying the GreatDB distributed cluster

In the cluster-setup directory from the previous chapter, run the following command to set up the test cluster:

./cluster-setup \
-clustername=c0 \
-namespace=test \
-enable-monitor=true \
-mysql-image=mysql:5.7 \
-mysql-replica=3 \
-mysql-auth=1 \
-mysql-normal=1 \
-mysql-global=1 \
-mysql-partition=1 \
-zookeeper-repository=zookeeper \
-zookeeper-tag=3.5.5 \
-zookeeper-replica=3 \
-greatdb-repository=greatdb/greatdb \
-greatdb-tag=tag202110 \
-greatdb-replica=3 \
-greatdb-serviceHost=172.16.70.249

Output:

liuxinle@liuxinle-OptiPlex-5060:~/k8s/cluster-setup$ ./cluster-setup \
> -clustername=c0 \
> -namespace=test \
> -enable-monitor=true \
> -mysql-image=mysql:5.7 \
> -mysql-replica=3 \
> -mysql-auth=1 \
> -mysql-normal=1 \
> -mysql-global=1 \
> -mysql-partition=1 \
> -zookeeper-repository=zookeeper \
> -zookeeper-tag=3.5.5 \
> -zookeeper-replica=3 \
> -greatdb-repository=greatdb/greatdb \
> -greatdb-tag=tag202110 \
> -greatdb-replica=3 \
> -greatdb-serviceHost=172.16.70.249

INFO[2021-10-14T10:41:52+08:00] SetUp the cluster ...                         NameSpace=test

INFO[2021-10-14T10:41:52+08:00] create namespace ...                         

INFO[2021-10-14T10:41:57+08:00] copy helm chart templates ...                

INFO[2021-10-14T10:41:57+08:00] setup ...                                     Component=MySQL

INFO[2021-10-14T10:41:57+08:00] exec helm install and update greatdb-cfg.yaml ... 

INFO[2021-10-14T10:42:00+08:00] waiting mysql pods running ...               

INFO[2021-10-14T10:44:27+08:00] setup ...                                     Component=Zookeeper

INFO[2021-10-14T10:44:28+08:00] waiting zookeeper pods running ...           

INFO[2021-10-14T10:46:59+08:00] update greatdb-cfg.yaml                      

INFO[2021-10-14T10:46:59+08:00] setup ...                                     Component=greatdb

INFO[2021-10-14T10:47:00+08:00] waiting greatdb pods running ...             

INFO[2021-10-14T10:47:21+08:00] waiting cluster running ...                  

INFO[2021-10-14T10:47:27+08:00] waiting prometheus server running...         

INFO[2021-10-14T10:47:27+08:00] Dump Cluster Info                            

INFO[2021-10-14T10:47:27+08:00] SetUp success.                                ClusterName=c0 NameSpace=test

Run the following command to check the status of the cluster pods:

kubectl get pod -n test -o wide

Output:

NAME                                    READY   STATUS      RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES

c0-auth0-mysql-0                        2/2     Running     0          10m     10.244.87.18   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-1                        2/2     Running     0          9m23s   10.244.87.54   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-2                        2/2     Running     0          8m39s   10.244.87.57   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-0                            2/2     Running     1          5m3s    10.244.87.58   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-1                            2/2     Running     0          4m57s   10.244.87.20   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-2                            2/2     Running     0          4m50s   10.244.87.47   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-0                        2/2     Running     0          10m     10.244.87.51   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-1                        2/2     Running     0          9m23s   10.244.87.41   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-2                        2/2     Running     0          8m38s   10.244.87.60   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-0                         2/2     Running     0          10m     10.244.87.29   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-1                         2/2     Running     0          9m29s   10.244.87.4    liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-2                         2/2     Running     0          8m45s   10.244.87.25   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-0                         2/2     Running     0          10m     10.244.87.55   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-1                         2/2     Running     0          9m26s   10.244.87.13   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-2                         2/2     Running     0          8m42s   10.244.87.21   liuxinle-optiplex-5060   <none>           <none>

c0-prometheus-server-6697649b76-fkvh9   2/2     Running     0          4m36s   10.244.87.37   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-0                          1/1     Running     1          7m35s   10.244.87.44   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-1                          1/1     Running     0          6m41s   10.244.87.30   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-2                          1/1     Running     0          6m10s   10.244.87.49   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-initzk-7hbfs               0/1     Completed   0          7m35s   10.244.87.17   liuxinle-optiplex-5060   <none>           <none>

When c0-zookeeper-initzk-7hbfs is in the Completed state and all other pods are Running, the cluster has been set up successfully.
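Optionally, you can verify that the cluster accepts connections with a MySQL client. This assumes the GreatDB service is exposed at the host and port used later in the workflow parameters (172.16.70.249:30901, user root):

mysql -h 172.16.70.249 -P 30901 -u root -p -e "SELECT 1;"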

Running chaos tests against the GreatDB distributed cluster with Chaos Mesh

The fault types Chaos Mesh can inject in a Kubernetes environment include simulated pod faults, simulated network faults, simulated stress scenarios, and more. Here we take pod-kill, one of the pod fault types, as an example.

Write the experiment configuration into a file named pod-kill.yaml, for example:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos   # the kind of fault to inject
metadata:
  name: pod-failure-example
  namespace: test   # the namespace the test cluster pods run in
spec:
  action: pod-kill   # the specific fault action to inject
  mode: all    # how targets are selected; all selects every pod that matches the selector
  duration: '30s'    # how long the experiment lasts
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "greatdb"    # label of the target pods, taken from the Labels field in the output of kubectl describe pod c0-greatdb-1 -n test

Create the chaos experiment with the following command:

kubectl create -n test -f pod-kill.yaml

After creating the chaos experiment, run kubectl get pod -n test -o wide; the result is as follows:

NAME                                    READY   STATUS              RESTARTS   AGE     IP             NODE                     NOMINATED NODE   READINESS GATES

c0-auth0-mysql-0                        2/2     Running             0          14m     10.244.87.18   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-1                        2/2     Running             0          14m     10.244.87.54   liuxinle-optiplex-5060   <none>           <none>

c0-auth0-mysql-2                        2/2     Running             0          13m     10.244.87.57   liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-0                            0/2     ContainerCreating   0          2s      <none>         liuxinle-optiplex-5060   <none>           <none>

c0-greatdb-1                            0/2     ContainerCreating   0          2s      <none>         liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-0                        2/2     Running             0          14m     10.244.87.51   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-1                        2/2     Running             0          14m     10.244.87.41   liuxinle-optiplex-5060   <none>           <none>

c0-glob0-mysql-2                        2/2     Running             0          13m     10.244.87.60   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-0                         2/2     Running             0          14m     10.244.87.29   liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-1                         2/2     Running             0          14m     10.244.87.4    liuxinle-optiplex-5060   <none>           <none>

c0-nor0-mysql-2                         2/2     Running             0          13m     10.244.87.25   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-0                         2/2     Running             0          14m     10.244.87.55   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-1                         2/2     Running             0          14m     10.244.87.13   liuxinle-optiplex-5060   <none>           <none>

c0-par0-mysql-2                         2/2     Running             0          13m     10.244.87.21   liuxinle-optiplex-5060   <none>           <none>

c0-prometheus-server-6697649b76-fkvh9   2/2     Running             0          9m24s   10.244.87.37   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-0                          1/1     Running             1          12m     10.244.87.44   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-1                          1/1     Running             0          11m     10.244.87.30   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-2                          1/1     Running             0          10m     10.244.87.49   liuxinle-optiplex-5060   <none>           <none>

c0-zookeeper-initzk-7hbfs               0/1     Completed           0          12m     10.244.87.17   liuxinle-optiplex-5060   <none>           <none>

You can see that the pods whose names contain greatdb are being restarted, which shows that the fault was injected successfully.
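To inspect the experiment or remove it once you are done, the usual kubectl commands against the PodChaos custom resource apply, for example:

kubectl describe podchaos pod-failure-example -n test

kubectl delete -n test -f pod-kill.yaml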

Orchestrating the Test Flow with Argo

Argo is an open-source container-native workflow engine for getting work done on Kubernetes. It can model a multi-step workflow as a sequence of tasks, which is what we use to orchestrate the test flow.

We use Argo to define a test job. The basic test flow is fixed, as shown below:

[Figure 2: test workflow orchestrated in Argo]

Step 1 of the test flow deploys the test cluster. Two parallel tasks then start: step 2 runs the test case to simulate a business workload, while step 3 injects faults with Chaos Mesh at the same time. Once the test case in step 2 finishes, step 4 stops the fault injection, and finally step 5 cleans up the cluster environment.

Orchestrating a chaos-test workflow with Argo (using the flashback test case as an example)

1) Modify the image information in cluster-setup.yaml to the name and tag of the cluster setup/cleanup image you pushed in the "Preparing the Images Needed for Testing" step.

2) Modify the image information in testsuite-flashback.yaml to the name and tag of the test-case image you pushed in the "Preparing the Images Needed for Testing" step.

3) Create resources from the cluster-deployment, test-case, and tool-template YAML files with kubectl apply -n argo -f xxx.yaml (these files define Argo templates that users can reference when writing workflows):

kubectl apply -n argo -f cluster-setup.yaml

kubectl apply -n argo -f testsuite-flashback.yaml

kubectl apply -n argo -f tools-template.yaml

4) Make a copy of the workflow template file workflow-template.yaml, change the parts marked by the comments in the template to your own settings, and then run the following command to create the chaos-test workflow:

kubectl apply -n argo -f workflow-template.yaml
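To follow the workflow's progress, for example (the pod label below is the standard one added by Argo Workflows; if the argo CLI is installed, argo get and argo watch work as well):

kubectl get workflows -n argo

kubectl get pods -n argo -l workflows.argoproj.io/workflow=chaostest-c0-0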

Below is a workflow template file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: chaostest-c0-0-
  name: chaostest-c0-0
  namespace: argo
spec:
  entrypoint: test-entry # test entry point; the test parameters are passed in here: clustername, namespace, host, the GreatDB image name and tag, and other basic information
  serviceAccountName: argo
  arguments:
    parameters:
      - name: clustername
        value: c0
      - name: namespace
        value: test
      - name: host
        value: 172.16.70.249
      - name: port
        value: 30901
      - name: password
        value: Bgview@2020
      - name: user
        value: root
      - name: run-time
        value: 10m
      - name: greatdb-repository
        value: greatdb/greatdb
      - name: greatdb-tag
        value: tag202110
      - name: nemesis
        value: kill_mysql_normal_master,kill_mysql_normal_slave,kill_mysql_partition_master,kill_mysql_partition_slave,kill_mysql_auth_master,kill_mysql_auth_slave,kill_mysql_global_master,kill_mysql_global_slave,kill_mysql_master,kill_mysql_slave,net_partition_mysql_normal,net_partition_mysql_partition,net_partition_mysql_auth,net_partition_mysql_global
      - name: mysql-partition
        value: 1
      - name: mysql-global
        value: 1
      - name: mysql-auth
        value: 1
      - name: mysql-normal
        value: 2
  templates:
    - name: test-entry
      steps:
        - - name: setup-greatdb-cluster  # step.1 cluster deployment. Specify the correct parameters, mainly the MySQL and ZooKeeper image names and tags
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7.34
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
        - - name: run-flashbacktest    # step.2 run the test case. Replace with the template of the test case you want to run and specify the correct parameters, mainly the number and size of the tables used by the test
            templateRef:
              name: flashback-test-template
              template: flashback
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: concurrency
                  value: 16
                - name: size
                  value: 10000
                - name: tables
                  value: 10
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: single-statement
                  value: true
                - name: manage-statement
                  value: true
          - name: invoke-chaos-for-flashabck-test    # step.3 inject faults. Specify the correct parameters; here run-time and interval define the duration and frequency of fault injection, so a separate step to stop fault injection is omitted
            templateRef:
              name: chaos-rto-template
              template: chaos-rto
            arguments:
              parameters:
                - name: user
                  value: "{{workflow.parameters.user}}"
                - name: host
                  value: "{{workflow.parameters.host}}"
                - name: password
                  value: "{{workflow.parameters.password}}"
                - name: port
                  value: "{{workflow.parameters.port}}"
                - name: k8s-config
                  value: /root/.kube/config
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: prometheus
                  value: ''
                - name: greatdb-job
                  value: greatdb-monitor-greatdb
                - name: nemesis
                  value: "{{workflow.parameters.nemesis}}"
                - name: nemesis-duration
                  value: 1m
                - name: nemesis-mode
                  value: default
                - name: wait-time
                  value: 5m
                - name: check-time
                  value: 5m
                - name: nemesis-scope
                  value: 1
                - name: nemesis-log
                  value: true
                - name: enable-monitor
                  value: false
                - name: run-time
                  value: "{{workflow.parameters.run-time}}"
                - name: interval
                  value: 1m
                - name: monitor-log
                  value: false
                - name: enable-rto
                  value: false
                - name: rto-qps
                  value: 0.1
                - name: rto-warm
                  value: 5m
                - name: rto-time
                  value: 1m
                - name: log-level
                  value: debug
        - - name: flashbacktest-output         # output whether the test case passed
            templateRef:
              name: tools-template
              template: output-result
            arguments:
              parameters:
                - name: info
                  value: "flashback test pass, with nemesis: {{workflow.parameters.nemesis}}"
        - - name: clean-greatdb-cluster           # step.4 clean up the test cluster; the parameters here are identical to those of step.1
            templateRef:
              name: cluster-setup-template
              template: cluster-setup
            arguments:
              parameters:
                - name: namespace
                  value: "{{workflow.parameters.namespace}}"
                - name: clustername
                  value: "{{workflow.parameters.clustername}}"
                - name: mysql-image
                  value: mysql:5.7
                - name: mysql-replica
                  value: 3
                - name: mysql-auth
                  value: "{{workflow.parameters.mysql-auth}}"
                - name: mysql-normal
                  value: "{{workflow.parameters.mysql-normal}}"
                - name: mysql-partition
                  value: "{{workflow.parameters.mysql-partition}}"
                - name: mysql-global
                  value: "{{workflow.parameters.mysql-global}}"
                - name: enable-monitor
                  value: false
                - name: zookeeper-repository
                  value: zookeeper
                - name: zookeeper-tag
                  value: 3.5.5
                - name: zookeeper-replica
                  value: 3
                - name: greatdb-repository
                  value: "{{workflow.parameters.greatdb-repository}}"
                - name: greatdb-tag
                  value: "{{workflow.parameters.greatdb-tag}}"
                - name: greatdb-replica
                  value: 3
                - name: greatdb-serviceHost
                  value: "{{workflow.parameters.host}}"
                - name: greatdb-servicePort
                  value: "{{workflow.parameters.port}}"
                - name: clean
                  value: true
        - - name: echo-result
            templateRef:
              name: tools-template
              template: echo
            arguments:
              parameters:
                - name: info
                  value: "{{item}}"
            withItems:
              - "{{steps.flashbacktest-output.outputs.parameters.result}}"

At this point, you have successfully run a chaos test with Chaos Mesh and verified the stability of a distributed system.

Now enjoy GreatSQL, and enjoy Chaos Mesh :)
