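A Case Study in Optimizing dubbo and dubbo-go-hessian2 Performance in mosn

Background

Ant has high stability and performance requirements for Service Mesh, and mosn is widely used in its production environment. On the cloud and in the open source community, dubbo and spring cloud are equally widely used in production for RPC, so we added dubbo and spring cloud traffic proxying on top of mosn. While adding dubbo protocol support, we found that performance dropped sharply once traffic passed through the Mesh proxy, and large merchants adopting Mesh also have demanding performance requirements. This article therefore focuses on the performance optimizations made to mosn in its dubbo protocol handling and in the Go library dubbo-go-hessian2.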

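Performance optimization overview

To match real business deployments, we did not pick high-performance machines and used ordinary Linux machines instead. The configuration and load-test parameters were:

Intel® Xeon® Platinum 8163 CPU @ 2.50GHz, 4 cores, 16 GB of memory.
Pod configuration: 2 cores and 1 GB of memory, with JVM flags -server -Xms1024m -Xmx1024m.
Network latency of 0.23 ms between 2 Linux machines, which host server + mosn and the load-test program rpc-perfomance respectively.

After 3 rounds of optimization, the optimized mosn delivers the following gains (load tests with random 512-byte and 1k-byte payloads):

512-byte payload: TPS of mosn + dubbo service calls improves by 55-82.8% overall, RT drops by about 45%, and memory usage is 40M.
1k payload: TPS of mosn + dubbo service calls improves by 51.1-69.3% overall, RT drops by about 41%, and memory usage is 41M.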

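The pprof profiling tool

Sharpening the axe does not delay the woodcutting: before optimizing, we first have to locate the bottlenecks. Once they are located, the other hard part is replacing slow code with efficient code. Because Ant's Service Mesh is implemented in Go, our first choice is pprof, the profiler that ships with Go, and we briefly describe how to use it here. If the program already runs an http.Server and imports _ "net/http/pprof" at the top of main, Go registers the profiling handlers on http.DefaultServeMux for us; see the godoc for details.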

Because mosn exposes an HTTP service on port 34902 by default, the following command fetches a profile from mosn:

go tool pprof -seconds 60 http://benchmark-server-ip:34902/debug/pprof/profile
# Samples the CPU for 60 seconds and produces a file like the following:
# pprof.mosn.samples.cpu.001.pb.gz

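Then open the profile with pprof again so it can be viewed in a browser. Figure 1-1 shows the profiler flame graph after the load test:

# -http=:8000 makes pprof serve its web UI on port 8000 for analysis in a browser
# mosnd is the mosn binary, used to resolve code symbols
# pprof.mosn.samples.cpu.001.pb.gz is the CPU profile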
go tool pprof -http=:8000 mosnd pprof.mosn.samples.cpu.001.pb.gz


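Figure 1-1 Flame graph of the mosn load test

With the profile loaded, switch to the Flame Graph view in the browser (built in since Go 1.11). The width of each bar along the x axis represents CPU consumption, and the y axis represents the call stack. Before any optimization, pprof pointed to bottlenecks in roughly the following areas (with load applied directly to the server-side mosn):

When mosn receives a dubbo request, the CPU bottleneck is in streamConnection.Dispatch.
When mosn forwards a dubbo request, the CPU bottleneck is in downStream.Receive.

Clicking any bar in the flame graph drills into its time cost and stack details (see Figures 1-2 and 1-3):

Figure 1-2 Flame graph details for Dispatch

Figure 1-3 Flame graph details for Receive

Optimization approach

This article records which cases had to be optimized to raise throughput by more than 50% and lower RT, so the following sections go straight to those cases. Before that, take Dispatch as an example and see why it is so expensive. In a terminal, the following commands show the CPU cost per line of code (the code is abridged):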

go tool pprof mosnd pprof.mosn.samples.cpu.001.pb.gz
(pprof) list Dispatch
Total: 1.75mins
     370ms     37.15s (flat, cum) 35.46% of Total
      10ms       10ms    123:func (conn *streamConnection) Dispatch(buffer types.IoBuffer) {
      40ms      630ms    125:    log.DefaultLogger.Tracef("stream connection dispatch data string = %v", buffer.String())
         .          .    126:
         .          .    127:    // get sub protocol codec
         .      250ms    128:    requestList := conn.codec.SplitFrame(buffer.Bytes())
      20ms       20ms    129:    for _, request := range requestList {
      10ms      160ms    134:        headers := make(map[string]string)
         .          .    135:        // support dynamic route
      50ms      920ms    136:        headers[strings.ToLower(protocol.MosnHeaderHostKey)] = conn.connection.RemoteAddr().String()
         .          .    149:
         .          .    150:        // get stream id
      10ms      440ms    151:        streamID := conn.codec.GetStreamID(request)
         .          .    156:        // request route
         .       50ms    157:        requestRouteCodec, ok := conn.codec.(xprotocol.RequestRouting)
         .          .    158:        if ok {
         .     20.11s    159:            routeHeaders := requestRouteCodec.GetMetas(request)
         .          .    165:        }
         .          .    166:
         .          .    167:        // tracing
      10ms       80ms    168:        tracingCodec, ok := conn.codec.(xprotocol.Tracing)
         .          .    169:        var span types.Span
         .          .    170:        if ok {
      10ms      1.91s    171:            serviceName := tracingCodec.GetServiceName(request)
         .      2.17s    172:            methodName := tracingCodec.GetMethodName(request)
         .          .    176:
         .          .    177:            if trace.IsEnabled() {
         .       50ms    179:                tracer := trace.Tracer(protocol.Xprotocol)
         .          .    180:                if tracer != nil {
      20ms      1.66s    181:                    span = tracer.Start(conn.context, headers, time.Now())
         .          .    182:                }
         .          .    183:            }
         .          .    184:        }
         .          .    185:
         .      110ms    186:        reqBuf := networkbuffer.NewIoBufferBytes(request)
         .          .    188:        // append sub protocol header
      10ms      950ms    189:        headers[types.HeaderXprotocolSubProtocol] = string(conn.subProtocol)
      10ms      4.96s    190:        conn.OnReceive(ctx, streamID, protocol.CommonHeader(headers), reqBuf, span, isHearbeat)
      30ms       60ms    191:        buffer.Drain(requestLen)
         .          .    192:    }
         .          .    193:}

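According to the list Dispatch output above, the hotspots are mainly on lines 159, 171, 172, 181 and 190. The main costs are decoding dubbo arguments, decoding the same fields repeatedly, the tracer, deserialization and logging.

1. Optimize dubbo decoding in GetMetas

Decoding the dubbo body yields the target interface being called and the service group of the invoked method, but doing so requires skipping over all of the business method arguments. mosn currently uses the open source dubbo-go-hessian2 library, which is slow at parsing string and map values. Improving the decoding performance of the hessian library is covered later in this article.

Optimization approach:

On the ingress side of mosn (where mosn forwards requests directly to the local java server process), we use the path and version of the request to infer the interface and group the user is calling. With the correct dataID built from them, the request can be forwarded without decoding the body at all, which yields a performance gain.

At service registration time, we build a mapping from the published path, version and group to the interface and group. When mosn forwards a dubbo request, it looks up this cache under a read lock and skips decoding the body, which speeds up mosn.

We therefore built the following cache (a hash map whose values are linked lists); see the optimization diff: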

// metadata.go
// DubboPubMetadata dubbo pub cache metadata
var DubboPubMetadata = &Metadata{}

// DubboSubMetadata dubbo sub cache metadata
var DubboSubMetadata = &Metadata{}

// Metadata caches service pub or sub metadata
// to speed up dubbo decoding and encoding.
// Please do not use it outside of the dubbo framework.
type Metadata struct {
    data map[string]*Node
    mu   sync.RWMutex // protect data internal
}

// Find looks up cached pub or sub metadata.
// Callers must check that matched is true before using node.
func (m *Metadata) Find(path, version string) (node *Node, matched bool) {
    // For performance we avoid defer, so every return path
    // below must call m.mu.RUnlock() explicitly.
    m.mu.RLock()

    // nothing registered yet; checked under the lock because
    // Register may be initializing m.data concurrently
    if m.data == nil {
        m.mu.RUnlock()
        return nil, false
    }

    // check the head node first
    head := m.data[path]
    if head == nil || head.count <= 0 {
        m.mu.RUnlock()
        return nil, false
    }

    node = head.Next
    // only one entry for this path: return it directly.
    // For the dubbo framework, that is the expected case.
    if head.count == 1 {
        m.mu.RUnlock()
        return node, true
    }

    var count int
    var found *Node

    for ; node != nil; node = node.Next {
        if node.Version == version {
            if found == nil {
                found = node
            }
            count++
        }
    }

    m.mu.RUnlock()
    return found, count == 1
}

// Register pub or sub metadata
func (m *Metadata) Register(path string, node *Node) {
    m.mu.Lock()
    // for performance
    // m.mu.Unlock() should be called.

    if m.data == nil {
        m.data = make(map[string]*Node, 4)
    }

    // we check head node first
    head := m.data[path]
    if head == nil {
        head = &Node{
            count: 1,
        }
        // update head
        m.data[path] = head
    }

    insert := &Node{
        Service: node.Service,
        Version: node.Version,
        Group:   node.Group,
    }

    next := head.Next
    if next == nil {
        // first insert, just insert to head
        head.Next = insert
        // record last element
        head.last = insert
        m.mu.Unlock()
        return
    }

    // we check already exist first
    for ; next != nil; next = next.Next {
        // we found it
        if next.Version == node.Version && next.Group == node.Group {
            // release the lock and do nothing
            m.mu.Unlock()
            return
        }
    }

    head.count++
    // append node to the end of the list
    head.last.Next = insert
    // update last element
    head.last = insert
    m.mu.Unlock()
}
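To show how Register and Find are meant to be used together, here is a self-contained sketch of the same idea. It is a simplified stand-in rather than the article's code: the metadataCache type keeps a slice per path instead of a linked list and uses defer for brevity, and the service names are made up.

```go
package main

import (
	"fmt"
	"sync"
)

// Node mirrors the fields the article's cache stores for each service.
type Node struct {
	Service, Version, Group string
}

// metadataCache is a simplified stand-in for Metadata: it keeps a slice per
// path instead of a linked list, but follows the same lookup rules.
type metadataCache struct {
	mu   sync.RWMutex
	data map[string][]Node
}

// Register adds a node for path unless the same version and group exist.
func (c *metadataCache) Register(path string, n Node) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.data == nil {
		c.data = make(map[string][]Node, 4)
	}
	for _, old := range c.data[path] {
		if old.Version == n.Version && old.Group == n.Group {
			return // already registered
		}
	}
	c.data[path] = append(c.data[path], n)
}

// Find returns a node only when the match is unambiguous.
func (c *metadataCache) Find(path, version string) (Node, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	nodes := c.data[path]
	if len(nodes) == 1 {
		return nodes[0], true // single entry: version is not checked
	}
	var found Node
	count := 0
	for _, n := range nodes {
		if n.Version == version {
			if count == 0 {
				found = n
			}
			count++
		}
	}
	return found, count == 1
}

func main() {
	var pub metadataCache
	pub.Register("com.demo.Hello", Node{"com.demo.Hello", "1.0.0", "g1"})
	n, ok := pub.Find("com.demo.Hello", "1.0.0")
	fmt.Println(n.Group, ok)

	// Same path and version under two groups: the lookup is ambiguous,
	// so the caller has to fall back to decoding the body.
	pub.Register("com.demo.Hello", Node{"com.demo.Hello", "1.0.0", "g2"})
	_, ok = pub.Find("com.demo.Hello", "1.0.0")
	fmt.Println(ok)
}
```

The second lookup reports no match because path and version alone cannot distinguish the two groups, which is why callers must always check the boolean.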

With the cache built at service registration, mosn's stream decoder can hit the cache and obtain the interface and group without decoding the arguments; see the optimization diff:

// decoder.go
// for better performance.
// If the ingress scenario is not using group,
// we can skip parsing attachment to improve performance
if listener == IngressDubbo {
    if node, matched = DubboPubMetadata.Find(path, version); matched {
        meta[ServiceNameHeader] = node.Service
        meta[GroupNameHeader] = node.Group
    }
} else if listener == EgressDubbo {
    // for better performance.
    // If the egress scenario is not using group,
    // we can skip parsing attachment to improve performance
    if node, matched = DubboSubMetadata.Find(path, version); matched {
        meta[ServiceNameHeader] = node.Service
        meta[GroupNameHeader] = node.Group
    }
}

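On the egress side of mosn (where mosn forwards requests sent by the local java client process), we take a similar approach: we use the path and version of the request to infer the interface and group the user is calling. With the correct dataID built from them, the request can be forwarded without decoding the body, which yields a performance gain.

2. Optimize dubbo argument decoding

When decoding dubbo argument values, mosn relied on hessian's regular expression matching, which is very expensive. The benchmark below compares the code before and after the optimization; the optimized version is about 50 times faster.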

go test -bench=BenchmarkCountArgCount -run=^$ -benchmem
BenchmarkCountArgCountByRegex-12    200000    6236 ns/op    1472 B/op    24 allocs/op
BenchmarkCountArgCountOptimized-12    10000000    124 ns/op    0 B/op    0 allocs/op

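Optimization approach:

The regular expression can be eliminated in favor of simple string parsing that identifies the argument types and counts them. The way dubbo encodes the argument descriptor string is not complicated: object types get an L prefix, arrays get a [ prefix, and each primitive type is represented by a single character. The same parsing can be implemented in Go; see the optimization diff: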

func getArgumentCount(desc string) int {
    if len(desc) == 0 {
        return 0
    }

    var args, next = 0, false
    for _, ch := range desc {

        // is array ?
        if ch == '[' {
            continue
        }

        // is object ?
        if next && ch != ';' {
            continue
        }

        switch ch {
        case 'V', // void
            'Z', // boolean
            'B', // byte
            'C', // char
            'D', // double
            'F', // float
            'I', // int
            'J', // long
            'S': // short
            args++
        default:
            // we found object
            if ch == 'L' {
                args++
                next = true
                // end of object ?
            } else if ch == ';' {
                next = false
            }
        }

    }
    return args
}
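The following sketch exercises the same scanning technique on a few descriptor strings. countArgs is a compact rewrite written for this example, not the function from the diff, and the sample descriptors are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// countArgs counts the arguments in a JVM-style descriptor string by
// scanning it once, without regular expressions.
func countArgs(desc string) int {
	args, inObject := 0, false
	for i := 0; i < len(desc); i++ {
		ch := desc[i]
		switch {
		case inObject:
			// inside "Lpkg/Name;": skip until the terminating ';'
			if ch == ';' {
				inObject = false
			}
		case ch == '[':
			// array marker: the element type that follows is counted
		case ch == 'L':
			args++
			inObject = true
		case strings.IndexByte("VZBCDFIJS", ch) >= 0:
			args++
		}
	}
	return args
}

func main() {
	for _, desc := range []string{
		"",
		"I",
		"Ljava/lang/String;I[J",
		"[Ljava/lang/Object;Ljava/util/Map;",
	} {
		fmt.Printf("%q -> %d\n", desc, countArgs(desc))
	}
}
```

The four inputs yield 0, 1, 3 and 2 arguments. An array prefix does not add to the count, and letters inside a class name such as the L in Long are ignored because the scanner is still inside the object type.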

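3. Optimize string decoding in dubbo hessian go

Figure 1-2 shows that string decoding in dubbo hessian go accounts for a large share of the CPU samples. When decoding a dubbo request we parse the dubbo framework version, the call path, the interface version and the method name. All of these are strings, so the way dubbo hessian go parses strings affects RPC performance.

First, a benchmark comparing string decoding before and after the optimization: time per operation drops by 56.11%, which translates to roughly a 5% improvement in RPC.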

BenchmarkDecodeStringOriginal-12     1967202     613 ns/op     272 B/op     6 allocs/op
BenchmarkDecodeStringOptimized-12     4477216     269 ns/op     224 B/op     5 allocs/op

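Optimization approach:

Decoding directly from the UTF-8 bytes is the fastest option. The previous code first decoded the bytes into runes and then converted the runes into a string, which is extremely expensive. The optimization also copies string chunks in bulk to reduce the number of read calls, and uses unsafe to convert to string (avoiding some checks). Because the diff is large, only the optimization PR is referenced here.

In the Go SDK, slicerunetostring in runtime/string.go (which converts runes to a string) likewise turns runes into a byte array, and that inspired this optimization.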
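To illustrate the difference between the two paths, here is a rough sketch. It is not the code from the PR. It assumes a length prefix counted in characters, handles only plain UTF-8 input, and the function names viaRunes and direct are made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
	"unsafe"
)

// viaRunes mimics the slow path: bytes -> []rune -> string.
func viaRunes(buf []byte, n int) string {
	runes := make([]rune, 0, n)
	for len(runes) < n && len(buf) > 0 {
		r, size := utf8.DecodeRune(buf)
		runes = append(runes, r)
		buf = buf[size:]
	}
	return string(runes)
}

// direct only finds where the first n characters end and then reuses
// the underlying bytes instead of building an intermediate []rune.
func direct(buf []byte, n int) string {
	end := 0
	for i := 0; i < n && end < len(buf); i++ {
		_, size := utf8.DecodeRune(buf[end:])
		end += size
	}
	if end == 0 {
		return ""
	}
	b := buf[:end]
	// Zero-copy conversion: the string shares memory with buf,
	// so buf must not be modified while the string is in use.
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	buf := []byte("你好, dubbo")
	fmt.Println(viaRunes(buf, 2), direct(buf, 2), viaRunes(buf, 2) == direct(buf, 2))
}
```

Because the unsafe conversion aliases the input buffer, a real decoder would typically copy the bytes into a buffer it owns before converting, as long as that buffer is never written to again.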

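4. Optimize hessian library codec objects

Although decoding of the dubbo body has been eliminated, mosn still has to use hessian to decode the framework version, request path and interface version from the request header. Every decode creates a new decoder object, and that is very costly, because each time hessian creates a reader it allocates a 4k buffer and resets it.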

      10ms       10ms     75:func unSerialize(serializeId int, data []byte, parseCtl unserializeCtl) *dubboAttr {
      10ms      140ms     82:    attr := &dubboAttr{}
      80ms      2.56s     83:    decoder := hessian.NewDecoderWithSkip(data[:])
ROUTINE ======================== bufio.NewReaderSize in /usr/local/go/src/bufio/bufio.go
      50ms      2.44s (flat, cum)  2.33% of Total
         .      220ms     55:    r := new(Reader)
      50ms      2.22s     56:    r.reset(make([]byte, size), rd)
         .          .     57:    return r
         .          .     58:}

A benchmark comparing the code before and after pooling the memory shows time per operation dropping by 85.4%:

BenchmarkNewDecoder-12    1487685    803 ns/op    4528 B/op    9 allocs/op
BenchmarkNewDecoderOptimized-12    10564024    117 ns/op    128 B/op    3 allocs/op

Optimization approach:

Pool hessian's decoder objects across encode and decode calls: add NewCheapDecoderWithSkip and support Reset so that a decoder can be reused.

var decodePool = &sync.Pool{
    New: func() interface{} {
        return hessian.NewCheapDecoderWithSkip([]byte{})
    },
}

// when decoding, use the pool as follows
decoder := decodePool.Get().(*hessian.Decoder)
// fill decode data
decoder.Reset(data[:])
// ... decode with decoder, then return it to the pool
decodePool.Put(decoder)
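The same pooling pattern can be tried with the standard library alone. The sketch below pools bufio.Reader values, whose 4k buffer is the allocation seen in the profile above. It does not use the hessian library, and the helper firstByte is hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"sync"
)

// readerPool reuses bufio.Reader values so that the 4k internal buffer
// is allocated once per pooled reader rather than once per request.
var readerPool = sync.Pool{
	New: func() interface{} {
		return bufio.NewReaderSize(nil, 4096)
	},
}

// firstByte (hypothetical helper) reads one byte of data through a
// pooled reader and returns the reader to the pool afterwards.
func firstByte(data []byte) (byte, error) {
	r := readerPool.Get().(*bufio.Reader)
	r.Reset(bytes.NewReader(data)) // point the reused reader at the new input
	defer func() {
		r.Reset(nil) // drop the reference to data before pooling
		readerPool.Put(r)
	}()
	return r.ReadByte()
}

func main() {
	b, err := firstByte([]byte{0xda, 0xbb})
	fmt.Printf("%#x %v\n", b, err)
}
```

Resetting the reader before putting it back matters: otherwise the pooled object would keep the previous request's data alive.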

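5. Avoid decoding the service and methodName values twice

When xprotocol implements xprotocol.Tracing to obtain the service name and method name, it triggers two separate calls that each parse the request, which is expensive.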

      10ms      1.91s    171:            serviceName := tracingCodec.GetServiceName(request)
         .      2.17s    172:            methodName := tracingCodec.GetMethodName(request)

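Optimization approach:

Because GetMetas has already parsed the request once, the parsed headers can be passed in; if the values are already in the headers, there is no need to parse again. The two interface methods are also merged into one that returns a pair, which eliminates one call.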
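A minimal sketch of that refactoring might look like the following. The header keys, the function names and the stubbed parser are all hypothetical; only the shape (one call, two return values, headers consulted first) reflects the description above.

```go
package main

import "fmt"

const (
	serviceKey = "service" // hypothetical header keys
	methodKey  = "method"
)

var parseCalls int // counts how often the expensive parse runs

// parseRequest stands in for the expensive hessian decode of a request.
func parseRequest(request []byte) (service, method string) {
	parseCalls++
	return "com.demo.Hello", "sayHello"
}

// serviceAndMethod returns both values with one call and reuses headers
// that an earlier step (GetMetas in the article) has already filled in.
func serviceAndMethod(headers map[string]string, request []byte) (string, string) {
	service, okS := headers[serviceKey]
	method, okM := headers[methodKey]
	if okS && okM {
		return service, method
	}
	service, method = parseRequest(request)
	headers[serviceKey], headers[methodKey] = service, method
	return service, method
}

func main() {
	headers := map[string]string{}
	fmt.Println(serviceAndMethod(headers, nil))
	fmt.Println(serviceAndMethod(headers, nil))
	fmt.Println("parse calls:", parseCalls)
}
```

The second call is served from the headers, so the request is parsed only once.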

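6. Optimize streamID type conversion

Converting between a byte array and the streamID in Go is relatively expensive.

Optimization approach:

In production code, avoid fmt.Sprintf and fmt.Printf for type conversion and printing whenever possible. Use strconv for conversions instead.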

   .      430ms    147: reqIDStr := fmt.Sprintf("%d", reqID)
60ms      4.10s    168: fmt.Printf("src=%s, len=%d, reqid:%v\n", streamID, reqIDStrLen, reqIDStr)
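For reference, the strconv equivalents of the Sprintf call above could be written as follows. The helper names are made up for this sketch, and the request ID is assumed to be a uint64.

```go
package main

import (
	"fmt"
	"strconv"
)

// streamIDString converts a request ID to its decimal string form
// without going through fmt's reflection-based formatting.
func streamIDString(reqID uint64) string {
	return strconv.FormatUint(reqID, 10)
}

// appendStreamID appends the decimal form to dst, so a caller that
// reuses dst can avoid allocating a new string for every request.
func appendStreamID(dst []byte, reqID uint64) []byte {
	return strconv.AppendUint(dst, reqID, 10)
}

// parseStreamID converts the decimal string back to a request ID.
func parseStreamID(s string) (uint64, error) {
	return strconv.ParseUint(s, 10, 64)
}

func main() {
	id := uint64(1234567)
	s := streamIDString(id)
	buf := appendStreamID(make([]byte, 0, 20), id)
	back, err := parseStreamID(s)
	fmt.Println(s, string(buf), back, err, s == fmt.Sprintf("%d", id))
}
```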

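7. Optimize expensive system calls

When decoding a dubbo request, mosn puts the remote host address into the headers and fetches the remote IP inside the for loop, where the system call overhead is relatively high.

Optimization approach: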

     50ms      920ms    136:        headers[strings.ToLower(protocol.MosnHeaderHostKey)] = conn.connection.RemoteAddr().String()

When obtaining the remote address, cache the remote IP value in streamConnection wherever possible instead of calling RemoteAddr every time.

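8. Avoid slice growth and map rehashing

When handling a dubbo request, mosn builds a dataID from the interface, version and group and then matches a cluster, creating slice and map objects with default capacity along the way. Profiling showed that repeatedly allocating slices and growing maps is relatively expensive.

Optimization approach:

When using slices and maps, estimate the capacity up front where possible and pass it to make, for example make([]T, 0, capacity) or make(map[K]V, capacity).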
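A small sketch of this advice follows. The "interface:version:group" layout of the dataID and both helper names are assumptions for illustration; mosn's real format may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// buildDataID joins interface, version and group. The layout is
// hypothetical; Grow reserves the final size before any write.
func buildDataID(iface, version, group string) string {
	var b strings.Builder
	b.Grow(len(iface) + len(version) + len(group) + 2)
	b.WriteString(iface)
	b.WriteByte(':')
	b.WriteString(version)
	b.WriteByte(':')
	b.WriteString(group)
	return b.String()
}

// collectHeaders copies key/value pairs into a map and a key slice,
// sizing both up front so neither has to grow while being filled.
func collectHeaders(pairs [][2]string) (map[string]string, []string) {
	headers := make(map[string]string, len(pairs))
	keys := make([]string, 0, len(pairs))
	for _, kv := range pairs {
		headers[kv[0]] = kv[1]
		keys = append(keys, kv[0])
	}
	return headers, keys
}

func main() {
	fmt.Println(buildDataID("com.demo.Hello", "1.0.0", "g1"))
	headers, keys := collectHeaders([][2]string{{"service", "com.demo.Hello"}, {"group", "g1"}})
	fmt.Println(len(headers), len(keys), cap(keys))
}
```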

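9. Optimize trace-level log output

Much of the code in mosn emits many trace-level log statements while processing requests, and passes quite a few argument values to them.

Optimization approach:

Check the log level before calling trace output. If there are several trace calls, write all the strings into a buf first and then write the buf to the log, calling the trace method as few times as possible.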

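10. Optimize tracer, log and metrics

During major sales promotions, machine performance requirements are high. Profiling showed that tracer, mosn log and cloud metrics are very expensive because they write logs (IO operations).

Optimization approach:

Push configuration from the configuration center, or add a promotion switch, so that these features can be turned on and off through an API.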

/api/v1/downgrade/on
/api/v1/downgrade/off

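11. Optimize route header parsing

Before routing, mosn performs a large number of map lookups on the headers for checks such as IDC and antvip. The commercial edition and open source mosn do not need this logic, yet it still adds overhead.

Optimization approach:

For cloud deployments, skip the logic that only applies to the main site entirely.

12. Optimize featuregate calls

When handling requests, mosn uses featuregate to decide which branch to take in order to separate main-site routing logic from the commercial edition. Calling featuregate is expensive because it requires frequent type conversions and lookups through several layers of maps.

Optimization approach:

Record the featuregate switch in a bool variable, and call featuregate only if the variable has not been initialized yet.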
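A minimal sketch of this caching, using sync.Once so that the first lookup is safe under concurrent requests, could look like this. slowFeatureEnabled is a made-up stand-in for the featuregate query, and the gate name is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

var lookups int // counts calls to the slow lookup

// slowFeatureEnabled stands in for a featuregate query that walks nested
// maps and converts types; the name and behavior are hypothetical.
func slowFeatureEnabled(name string) bool {
	lookups++
	gates := map[string]map[string]interface{}{
		"routing": {"cloud": true},
	}
	v, _ := gates["routing"][name].(bool)
	return v
}

var (
	cloudOnce    sync.Once
	cloudEnabled bool
)

// isCloud queries the feature gate on first use and then serves the
// cached bool, so the request path avoids the repeated lookup.
func isCloud() bool {
	cloudOnce.Do(func() { cloudEnabled = slowFeatureEnabled("cloud") })
	return cloudEnabled
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(isCloud())
	}
	fmt.Println("slow lookups:", lookups)
}
```

The trade-off is that the cached value no longer follows changes to the gate at runtime, which is acceptable only for switches that are fixed at startup.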

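Thoughts on future optimization

After several rounds of optimization, the flame graph now shows the hotspots in the read and write of connection, so little room for optimization remains. Gains may still come from the following:

Reducing the number of read and write calls (syscalls) on connection.
Optimizing the IO threading model to reduce goroutines and context switches.

To conclude, the final flame graph after optimization is shown in Figure 1-4; most of the hotspots are now in system calls and network reads and writes.

Figure 1-4 Flame graph of the optimized mosn + dubbo

Miscellaneous

pprof is an extremely powerful tool that can diagnose CPU, memory, goroutines, tracer and deadlocks. See the godoc for the tool. Performance optimization references:

About the author

詣極 (GitHub ID zonghaishang) is an Apache Dubbo PMC member who currently works on the middleware team at Ant Financial, focusing on RPC and Service Mesh. He is the author of the book 《深入理解Apache Dubbo與實戰》.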
