Service Monitoring: Distributed Tracing Explained in One Article

Original link: https://www.linuxprobe.com/tracking-service-monitoring.html


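More and more applications are migrating to cloud-native architectures built on microservices. Microservice architectures are powerful, but they also bring many challenges, especially how to debug an application and how to monitor the call relationships and state across multiple services. Monitoring a microservice architecture effectively is key to operating it successfully. In software-architecture terms, the goal is to improve the observability of the microservice architecture.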

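Monitoring microservices mainly covers the following three areas:

Collecting logs to monitor the running state of the system and of each service
Collecting metrics to monitor the performance of the system and of each service
Distributed tracing, which follows the details of how a service request is processed across the distributed components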

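Most readers are already familiar with collecting and monitoring logs and metrics. A common log-collection architecture uses Fluentd to collect system logs and then ELK or Splunk to analyze them. For performance monitoring, Prometheus is a popular choice.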

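Distributed tracing is being adopted by more and more applications. By following the microservice call chain, it builds a view of the entire call path, from the initial service request through every interaction between microservices. From this view, users can learn things such as the latency of application calls, the life cycle of network calls (HTTP, RPC), and the system's performance bottlenecks. So how is distributed tracing implemented?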

1. Concepts of distributed tracing

In April 2010, Google published the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (http://1t.click/6EB), which introduced the concepts of distributed tracing.

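Distributed tracing rests on the following concepts:

Trace: a transaction carried out by cooperating distributed microservices. A trace contains every service request that serves that transaction.
Span: a unit of work within the transaction. A span contains timestamps, logs, and tags. Spans are related to each other by parent-child relationships or by follows-from relationships (FollowsFrom in OpenTracing).
Span Context: the key to making distributed tracing work. It is passed between the calling services and contains information such as the time of the hand-off from one service to the next, the trace ID, the span ID, and any other information that has to travel from the upstream service to the downstream one.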
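
These three concepts can be sketched as plain data structures. The following is an illustrative model only, not the Dapper or OpenTracing implementation; the names `SpanContext`, `Span`, `parent_id`, `start_span` and `group_by_trace` are assumptions chosen for this sketch.

```python
import time
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Return a short random hex identifier (illustrative only)."""
    return uuid.uuid4().hex[:16]


@dataclass(frozen=True)
class SpanContext:
    """The part of a span that travels between services."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    """One unit of work inside a trace."""
    operation: str
    context: SpanContext
    parent_id: Optional[str] = None      # None marks the root span
    start: float = field(default_factory=time.time)
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)


def start_span(operation: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child that shares the parent's trace ID."""
    if parent is None:
        return Span(operation, SpanContext(new_id(), new_id()))
    ctx = SpanContext(parent.context.trace_id, new_id())
    return Span(operation, ctx, parent_id=parent.context.span_id)


def group_by_trace(spans):
    """A trace is every span that carries the same trace ID."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.context.trace_id].append(span)
    return dict(traces)


root = start_span("checkout")
child = start_span("charge-card", parent=root)
other = start_span("unrelated-request")
traces = group_by_trace([root, child, other])
```

In this model a trace is never stored as an object of its own: it is recovered by grouping spans on their shared trace ID, which is why that ID has to be propagated on every call.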

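2. OpenTracing standard concepts

Building on the concepts proposed by Google, OpenTracing (http://1t.click/6tC) defines an open standard for distributed tracing.

The span is the basic building block of distributed tracing and represents an individual unit of work in a distributed system. Each span can hold references to other spans. Multiple spans together form a trace.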

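The OpenTracing specification defines every span as containing the following:

Operation name, which identifies what the operation is
Tags: a tag is a key-value pair; users can add any information that is meaningful for tracing
Logs: logs are also defined as key-value pairs and are used to capture debugging information or other information relevant to the span
SpanContext, which carries data across microservice boundaries. It has two main parts:

Implementation-dependent state needed to identify the span across process boundaries, for example the trace ID and span ID
Baggage items. If a microservice call is compared to a flight from one city to another, the SpanContext is what the plane carries. The trace ID and span ID are like the flight number, and baggage items are like the luggage being transported. For each service call, users can decide to send different baggage.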
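
The flight analogy can be made concrete with a small sketch. It assumes a simplified immutable context whose names (`Context`, `with_baggage`, `child`, `get_baggage`) are invented for illustration and are not the OpenTracing API. It shows only that baggage follows the trace to every downstream span, while the span ID changes at each hop.

```python
from dataclasses import dataclass
from itertools import count

_next_span_id = count(1)  # toy ID generator, hypothetical


@dataclass(frozen=True)
class Context:
    trace_id: str
    span_id: int
    baggage: tuple = ()   # immutable sequence of (key, value) pairs

    def with_baggage(self, key: str, value: str) -> "Context":
        """Return a copy of this context carrying one more baggage item."""
        return Context(self.trace_id, self.span_id, self.baggage + ((key, value),))

    def child(self) -> "Context":
        """Context for a downstream span: same trace, new span ID, same baggage."""
        return Context(self.trace_id, next(_next_span_id), self.baggage)

    def get_baggage(self, key: str):
        """Look up a baggage item, or None if it was never set."""
        return dict(self.baggage).get(key)


# Service A starts the trace and checks in one piece of baggage.
ctx_a = Context("abc123", next(_next_span_id)).with_baggage("special_id", "vsid1738")
# Service B receives the context and continues the trace.
ctx_b = ctx_a.child()
# Service C is two hops away and still sees the baggage.
ctx_c = ctx_b.child()
```

Because baggage is copied onto every downstream call, each item adds to the size of every request in the trace, so it is usually kept small.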

Here is an example of a span:

    t=0            operation name: db_query               t=x

     +-----------------------------------------------------+
     | · · · · · · · · · ·    Span     · · · · · · · · · · |
     +-----------------------------------------------------+

Tags:
- db.instance:"jdbc:mysql://127.0.0.1:3306/customers"
- db.statement:"SELECT * FROM mytable WHERE foo='bar';"

Logs:
- message:"Can't connect to mysql server on '127.0.0.1'(10061)"

SpanContext:
- trace_id:"abc123"
- span_id:"xyz789"
- Baggage Items:
  - special_id:"vsid1738"

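To implement distributed tracing, the key question is how to pass the SpanContext along. OpenTracing defines two methods, Inject and Extract, for injecting and extracting a SpanContext.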

Inject pseudocode

span_context = ...
outbound_request = ...

# We'll use the (builtin) HTTP_HEADERS carrier format. We
# start by using an empty map as the carrier prior to the
# call to `tracer.inject`.
carrier = {}
tracer.inject(span_context, opentracing.Format.HTTP_HEADERS, carrier)

# `carrier` now contains (opaque) key:value pairs which we pass
# along over whatever wire protocol we already use.
for key, value in carrier.items():
    outbound_request.headers[key] = escape(value)

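Injection writes all the information in the context into a dictionary called the carrier, and then writes every key-value pair in that dictionary into the HTTP headers.

Extract pseudocode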

inbound_request = ...  
# We'll again use the (builtin) HTTP_HEADERS carrier format. Per the 
# HTTP_HEADERS documentation, we can use a map that has extraneous data 
# in it and let the OpenTracing implementation look for the subset 
# of key:value pairs it needs. 
# 
# As such, we directly use the key:value `inbound_request.headers` 
# map as the carrier. 
carrier = inbound_request.headers 
span_context = tracer.extract(opentracing.Format.HTTP_HEADERS, carrier) 
# Continue the trace given span_context. E.g., 
span = tracer.start_span("...", child_of=span_context)  
# (If `carrier` held trace data, `span` will now be ready to use.)

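Extraction is the inverse of injection: it builds a SpanContext from the carrier, that is, from the HTTP headers.

The whole process resembles the serialization and deserialization of data passed between a client and a server. The carrier dictionary here supports keys of type string and values that are either strings or binary (bytes).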
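
To illustrate the serialization analogy, here is a minimal, library-free sketch of a text-map inject/extract pair. The header names (`x-trace-id`, `x-span-id`, the `x-baggage-` prefix) and the function names are assumptions for this example; real tracers define their own wire formats.

```python
from dataclasses import dataclass, field
from urllib.parse import quote, unquote

TRACE_HEADER = "x-trace-id"       # hypothetical header names
SPAN_HEADER = "x-span-id"
BAGGAGE_PREFIX = "x-baggage-"


@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)


def inject(ctx: SpanContext, carrier: dict) -> None:
    """Write the context into a string-to-string carrier (e.g. HTTP headers)."""
    carrier[TRACE_HEADER] = ctx.trace_id
    carrier[SPAN_HEADER] = ctx.span_id
    for key, value in ctx.baggage.items():
        # Percent-encode values so they are safe to send as header text.
        carrier[BAGGAGE_PREFIX + key] = quote(value)


def extract(carrier: dict):
    """Rebuild a context from a carrier; return None if no trace data is present."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    # As a side effect, baggage keys come back lower-cased.
    headers = {k.lower(): v for k, v in carrier.items()}
    if TRACE_HEADER not in headers or SPAN_HEADER not in headers:
        return None
    baggage = {
        k[len(BAGGAGE_PREFIX):]: unquote(v)
        for k, v in headers.items()
        if k.startswith(BAGGAGE_PREFIX)
    }
    return SpanContext(headers[TRACE_HEADER], headers[SPAN_HEADER], baggage)


outgoing = SpanContext("abc123", "xyz789", {"special_id": "vsid 1738"})
headers = {"Accept": "text/plain"}       # unrelated headers are left alone
inject(outgoing, headers)
incoming = extract(headers)
```

The carrier may hold unrelated entries, as in the Extract pseudocode: `extract` only looks at the keys it recognizes and returns `None` when the request carries no trace data, in which case the receiver would start a new trace.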

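3. How do you use it?

That was a lot of concepts, and as a programmer you probably ran out of patience a while ago: enough talk, show the code. No rush, let's look at how tracing is used in practice.

We'll use an application every programmer knows and loves, a Python program that prints 'hello world', to show how OpenTracing works.

Client code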

import sys
import time

import requests
from lib.tracing import init_tracer  # project-local helper, not shown in this article
from opentracing.ext import tags
from opentracing.propagation import Format


def say_hello(hello_to):
    with tracer.start_active_span('say-hello') as scope:
        scope.span.set_tag('hello-to', hello_to)
        hello_str = format_string(hello_to)
        print_hello(hello_str)


def format_string(hello_to):
    with tracer.start_active_span('format') as scope:
        hello_str = http_get(8081, 'format', 'helloTo', hello_to)
        scope.span.log_kv({'event': 'string-format', 'value': hello_str})
        return hello_str


def print_hello(hello_str):
    with tracer.start_active_span('println') as scope:
        http_get(8082, 'publish', 'helloStr', hello_str)
        scope.span.log_kv({'event': 'println'})


def http_get(port, path, param, value):
    url = 'http://localhost:%s/%s' % (port, path)

    # Tag the currently active span as the client side of an RPC.
    span = tracer.active_span
    span.set_tag(tags.HTTP_METHOD, 'GET')
    span.set_tag(tags.HTTP_URL, url)
    span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT)

    # Inject the span's context into the outgoing HTTP headers.
    headers = {}
    tracer.inject(span.context, Format.HTTP_HEADERS, headers)

    r = requests.get(url, params={param: value}, headers=headers)
    assert r.status_code == 200
    return r.text


# main
assert len(sys.argv) == 2

tracer = init_tracer('hello-world')

hello_to = sys.argv[1]
say_hello(hello_to)

# yield to IOLoop to flush the spans
time.sleep(2)
tracer.close()

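The client does the following:

Initializes the tracer with the service name 'hello-world'
Creates a client operation, say_hello, which is associated with a span named 'say-hello', and calls span.set_tag to add a tag
Within say_hello, calls format_string, which sends a request to the first HTTP service, service A; this operation is associated with another span named 'format' and calls span.log_kv to add a log
Then calls print_hello, which sends a request to the second HTTP service, service B; this operation is associated with another span named 'println' and calls span.log_kv to add a log
For every HTTP request, adds tags to the span that record the HTTP method, the HTTP URL, and the span kind, and calls tracer.inject to inject the SpanContext into the HTTP headers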
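
The nesting of 'say-hello', 'format' and 'println' relies on the tracer keeping track of the currently active span. The sketch below imitates that behaviour with a plain stack and a context manager. `ToyTracer` and its fields are hypothetical and far simpler than a real tracer: it yields the span dictionary directly instead of a scope object, and it ignores threads, async code, and reporting.

```python
from contextlib import contextmanager
from itertools import count


class ToyTracer:
    """Records finished spans in memory and tracks the active span with a stack."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.finished = []        # spans in the order they finish
        self._stack = []          # currently open spans, innermost last
        self._ids = count(1)

    @property
    def active_span(self):
        return self._stack[-1] if self._stack else None

    @contextmanager
    def start_active_span(self, operation):
        parent = self.active_span
        span = {
            "id": next(self._ids),
            "parent": parent["id"] if parent else None,
            "operation": operation,
            "tags": {},
        }
        self._stack.append(span)
        try:
            yield span
        finally:
            # Leaving the `with` block finishes the span and reactivates its parent.
            self._stack.pop()
            self.finished.append(span)


tracer = ToyTracer("hello-world")

with tracer.start_active_span("say-hello") as root:
    root["tags"]["hello-to"] = "Bryan"
    with tracer.start_active_span("format"):
        pass                      # the HTTP call to service A would go here
    with tracer.start_active_span("println"):
        pass                      # the HTTP call to service B would go here

summary = [(s["operation"], s["parent"]) for s in tracer.finished]
```

Both inner spans record the ID of 'say-hello' as their parent without it being passed explicitly, which is the convenience that an active-span mechanism provides.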

Service A code

from flask import Flask
from flask import request
from lib.tracing import init_tracer  # project-local helper, not shown in this article
from opentracing.ext import tags
from opentracing.propagation import Format

app = Flask(__name__)
tracer = init_tracer('formatter')


@app.route("/format")
def format():
    # Rebuild the caller's SpanContext from the incoming HTTP headers.
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)
    span_tags = {tags.SPAN_KIND: tags.SPAN_KIND_RPC_SERVER}
    with tracer.start_active_span('format', child_of=span_ctx, tags=span_tags):
        hello_to = request.args.get('helloTo')
        return 'Hello, %s!' % hello_to


if __name__ == "__main__":
    app.run(port=8081)

Service A handles the format request. It calls tracer.extract to pull the tracing information out of the HTTP headers and build a SpanContext.

Service B code

from flask import Flask
from flask import request
from lib.tracing import init_tracer  # project-local helper, not shown in this article
from opentracing.ext import tags
from opentracing.propagation import Format

app = Flask(__name__)
tracer = init_tracer('publisher')


@app.route("/publish")
def publish():
    # Rebuild the caller's SpanContext from the incoming HTTP headers.
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)
    span_tags = {tags.SPAN_KIND: tags.SPAN_KIND_RPC_SERVER}
    with tracer.start_active_span('publish', child_of=span_ctx, tags=span_tags):
        hello_str = request.args.get('helloStr')
        print(hello_str)
        return 'published'


if __name__ == "__main__":
    app.run(port=8082)

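Service B works the same way as service A.

Afterwards, in the UI of software that supports distributed tracing (the original article shows Jaeger UI), you can see the trace. It shows detailed tracing information for the service hello-world and its three operations say-hello, format, and println.

[Figure: Jaeger UI showing the hello-world trace]

Many distributed tracing products currently support OpenTracing, including Jaeger, LightStep, Instana, Apache SkyWalking, inspectIT, stagemonitor, Datadog, Wavefront, and Elastic APM. Among the open-source options, Zipkin (http://1t.click/6Ec) and Jaeger (http://1t.click/6DY) are the most popular.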

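Zipkin

Zipkin (http://1t.click/6Ec) is a distributed tracing system developed by Twitter and based on Dapper. Its architecture is shown in the following figure:

[Figure: Zipkin architecture diagram]

The blue entities are the components Zipkin traces. A non-instrumented server is a microservice that does not call the tracing API directly; an instrumented client collects information about its calls to the non-instrumented server and sends it to Zipkin's collector. An instrumented server calls the tracing API directly and sends its data to the Zipkin collector.
Transport is the transmission channel: data can be sent to Zipkin directly over HTTP or through a message/event queue.
Zipkin itself is a Java application made up of a collector, which gathers the data and exposes the ingestion interface, plus storage, an API, and a UI.

Zipkin's user interface looks like this:

[Figure: Zipkin UI screenshot]

Zipkin officially supports clients for the following languages: C#, Go, Java, JavaScript, Ruby, Scala, and PHP. The open-source community provides support for other languages.

At the time of writing, Zipkin has been in development for almost four years and is a relatively mature project.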

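Jaeger

Jaeger (http://1t.click/6DY) is a distributed tracing system originally developed by Uber, likewise based on Dapper's design ideas. Jaeger is now a project of the CNCF (Cloud Native Computing Foundation). If you know the CNCF, you can guess that the project integrates closely with Kubernetes.

Jaeger has a distributed architecture made up of the following components:

Jaeger Client, which collects tracing information on the client side
Jaeger Agent, which communicates with the client and reports the collected tracing information to the Jaeger Collector
Jaeger Collector, which stores the collected data in a database or other storage
Jaeger Query, which handles queries against the tracing data
Jaeger UI, which handles user interaction

This architecture closely resembles ELK: the Collector is like Logstash and gathers the data, Query is like Elasticsearch and handles search, and the UI is like Kibana and provides the user interface and interaction. This distributed architecture makes Jaeger more scalable and allows different deployments to be built as needed.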
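
The division of labour between these components can be sketched as a toy in-process pipeline. This is not Jaeger code: the classes `Agent`, `Collector` and `Query`, the batch size, and the dictionary used as storage are all assumptions made to show the data flow from client to query.

```python
from collections import defaultdict


class Collector:
    """Receives batches of spans and writes them to storage."""

    def __init__(self, storage):
        self.storage = storage

    def submit(self, batch):
        for span in batch:
            self.storage[span["trace_id"]].append(span)


class Agent:
    """Sits next to the application, buffers spans and forwards them in batches."""

    def __init__(self, collector, batch_size=2):
        self.collector = collector
        self.batch_size = batch_size
        self._buffer = []

    def report(self, span):
        self._buffer.append(span)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self.collector.submit(self._buffer)
            self._buffer = []


class Query:
    """Read-only view over storage, as a UI would use it."""

    def __init__(self, storage):
        self.storage = storage

    def find_trace(self, trace_id):
        return sorted(self.storage.get(trace_id, []), key=lambda s: s["start"])


storage = defaultdict(list)              # stands in for a real database
agent = Agent(Collector(storage))
query = Query(storage)

# The client library would report spans like these as they finish.
agent.report({"trace_id": "t1", "operation": "format", "start": 2})
agent.report({"trace_id": "t1", "operation": "say-hello", "start": 1})
agent.report({"trace_id": "t1", "operation": "println", "start": 3})
agent.flush()                            # push the partially filled last batch

operations = [s["operation"] for s in query.find_trace("t1")]
```

Because each stage only talks to the next one through a narrow interface, the stages can in principle be scaled or replaced independently, which is the scalability benefit described above.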

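As a rising star in distributed tracing, Jaeger is becoming more and more popular with the wide adoption of cloud native and Kubernetes. Using the official Kubernetes deployment templates (http://1t.click/6DU), users can quickly deploy Jaeger on their own Kubernetes cluster.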

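4. Distributed tracing systems: product comparison

Of course, besides the products that support the OpenTracing standard, there are other distributed tracing products. Here are some analyses by other bloggers for reference:

Choosing a call-chain tracer: Zipkin, Pinpoint, SkyWalking, CAT (http://1t.click/6tY)
A survey of distributed call-chain tracing (comparison of Pinpoint, SkyWalking, Jaeger, Zipkin, and others) (http://1t.click/6DK)
Distributed tracing systems: product comparison (http://1t.click/6ug)

5. Summary

With microservices everywhere and cloud native becoming the mainstream of architecture design, monitoring microservice systems, which covers logs, metrics, and traces, has become a top priority of systems engineering. OpenTracing builds on Dapper's design ideas for distributed tracing and defines a standard for implementing it. Among open-source projects, Zipkin and Jaeger are relatively strong choices. Jaeger in particular, thanks to its good integration with cloud-native frameworks, is an excellent tool for building a microservice tracing system.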

