Golang | Flame Graphs, a Powerful Tool for Tuning Go Code

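Preface

As DevOps engineers, we take our everyday projects from development through testing to release, yet our testing is mostly limited to functional unit tests. Many people, myself included, tend to turn a blind eye to performance details, and the result is often an unpredictable disaster for the tool or application (it's true). Some say that low-level or code-level performance tuning goes too deep and that hardware can make up the difference, but I think that is self-deception. Throwing money at hardware cannot last forever. Besides, nearly all of our tools are now distributed and run on clusters. Whether for redundancy or for easy horizontal scaling, we cannot guarantee that every server is a high-performance one, and we should not let code with performance defects turn the lower-spec servers into the bottleneck.

I remember working in 2016 on some general-purpose service agents. Because they had to run on nearly every server in the company's network, the production environments turned out to be more complex than we had imagined.

Once you dig deep enough into a problem, it turns out to be a problem everyone shares.

Moreover, Go ships with many built-in tools and methods for performance tuning and monitoring, which greatly improves the efficiency of profiling. Beyond coding skills, continually sharpening our ability to analyze performance problems on real projects helps a great deal with keeping projects under control and with planning features later on.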

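Performance tuning techniques in Go

Go's built-in CPU and heap profilers

One of Go's strengths is that profile sampling is built into its runtime and toolchain, and the profilers can be used while the program is running.

With Go's profilers we can collect the following kinds of samples:

CPU profiles
Heap profiles
Block profiles, traces, and so on

Common profiling scenarios in Go

Benchmark tests: for example, run go test . -bench . -cpuprofile prof.cpu to generate a sample file, then analyze it with go tool pprof [binary] prof.cpu.
import _ "net/http/pprof": if the application is a web service, add import _ "net/http/pprof" to the file that starts the HTTP server (e.g. main.go). The service then exposes the profiling endpoints automatically, so we can analyze samples directly.
Calling pprof.StartCPUProfile or pprof.WriteHeapProfile from the runtime/pprof package in our own code.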
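
The third scenario can be sketched as follows. This is a minimal illustration, not code from the demo project: busyWork, profileToFiles, and the output file names are hypothetical, and only the runtime/pprof calls (StartCPUProfile, StopCPUProfile, WriteHeapProfile) come from the standard library.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for the code being profiled.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileToFiles records a CPU profile while work runs, then writes a
// heap profile. Both paths are chosen by the caller.
func profileToFiles(cpuPath, heapPath string, work func()) error {
	cpuFile, err := os.Create(cpuPath)
	if err != nil {
		return err
	}
	defer cpuFile.Close()

	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		return err
	}
	work()
	pprof.StopCPUProfile()

	heapFile, err := os.Create(heapPath)
	if err != nil {
		return err
	}
	defer heapFile.Close()

	runtime.GC() // refresh heap statistics before taking the snapshot
	return pprof.WriteHeapProfile(heapFile)
}

func main() {
	err := profileToFiles("prof.cpu", "prof.mem", func() { busyWork(50000000) })
	if err != nil {
		fmt.Fprintln(os.Stderr, "profiling failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote prof.cpu and prof.mem")
}
```

The resulting files can be opened with go tool pprof in the same way as the ones produced by go test.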

For more on using these tools, I recommend reading Profiling Go Programs on The Go Blog.

go-torch

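Before go-torch, analyzing a profile was manageable when the structure was simple, but with complicated call relationships I believe most programmers would not know where to start, as in the figure below:

A structure like this is obscure and hard to read. We need a simpler, more intuitive analysis tool.

go-torch is a flame graph generator for Go programs, open-sourced by Uber. It collects stack traces and organizes them into a flame graph that presents the program intuitively to developers.

go-torch builds on the flame graph tools created by Brendan Gregg to produce an intuitive image, which makes it easy to see how much CPU time each Go function takes. Flame graphs are a relatively new way to visualize CPU usage, and in this article I will show how to use them to help troubleshoot problems.

go-torch project home page

Below is an example of a flame graph:

Compared with the earlier tree-style view, this presentation is more intuitive.

OK, we know enough now to install go-torch and start using it.

Installation

1. First, set up the FlameGraph scripts

FlameGraph is the visualization layer for profile data and is already widely used for Python and Node.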

git clone https://github.com/brendangregg/FlameGraph.git

2. After the clone completes, copy flamegraph.pl to a directory on your $PATH, for example:

cp flamegraph.pl /usr/local/bin

3. Run flamegraph.pl -h in a terminal to check that FlameGraph was installed successfully

$ flamegraph.pl -h
Option h is ambiguous (hash, height, help)
USAGE: /usr/local/bin/flamegraph.pl [options] infile > outfile.svg

    --title       # change title text
    --width       # width of image (default 1200)
    --height      # height of each frame (default 16)
    --minwidth    # omit smaller functions (default 0.1 pixels)
    --fonttype    # font type (default "Verdana")
    --fontsize    # font size (default 12)
    --countname   # count type label (default "samples")
    --nametype    # name type label (default "Function:")
    --colors      # set color palette. choices are: hot (default), mem, io,
                  # wakeup, chain, java, js, perl, red, green, blue, aqua,
                  # yellow, purple, orange
    --hash        # colors are keyed by function name hash
    --cp          # use consistent palette (palette.map)
    --reverse     # generate stack-reversed flame graph
    --inverted    # icicle graph
    --negate      # switch differential hues (blue<->red)
    --help        # this message

    eg,
    /usr/local/bin/flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg

4. Install go-torch

With FlameGraph in place, we next use go-torch to display the profile output. Installing go-torch is simple, and the following command does it:

go get -v github.com/uber/go-torch

5. Use the go-torch command

$ go-torch -h
Usage:
  go-torch [options] [binary] <profile source>

pprof Options:
  -u, --url=         Base URL of your Go program (default: http://localhost:8080)
  -s, --suffix=      URL path of pprof profile (default: /debug/pprof/profile)
  -b, --binaryinput= File path of previously saved binary profile. (binary profile is anything accepted by https://golang.org/cmd/pprof)
      --binaryname=  File path of the binary that the binaryinput is for, used for pprof inputs
  -t, --seconds=     Number of seconds to profile for (default: 30)
      --pprofArgs=   Extra arguments for pprof

Output Options:
  -f, --file=        Output file name (must be .svg) (default: torch.svg)
  -p, --print        Print the generated svg to stdout instead of writing to file
  -r, --raw          Print the raw call graph output to stdout instead of creating a flame graph; use with Brendan Gregg's flame graph perl
                     script (see https://github.com/brendangregg/FlameGraph)
      --title=       Graph title to display in the output file (default: Flame Graph)
      --width=       Generated graph width (default: 1200)
      --hash         Colors are keyed by function name hash
      --colors=      set color palette. choices are: hot (default), mem, io, wakeup, chain, java, js, perl, red, green, blue, aqua, yellow,
                     purple, orange
      --cp           Use consistent palette (palette.map)
      --reverse      Generate stack-reversed flame graph
      --inverted     icicle graph

Help Options:
  -h, --help         Show this help message

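With the steps above, we have the prerequisites for generating flame graphs. But generating a flame graph is not the point of this article. Remember, our goal is to find the problem, analyze it, and solve it!

Next, let's work through a case to see how flame graphs help with performance tuning.

Tuning example

Demo code

The demo is a web server program that exposes two HTTP endpoints for our demonstration.

Let's read main.go first: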

func main() {
    flag.Parse()

    // advanced endpoint
    http.HandleFunc("/advance", handler.WithAdvanced(handler.Simple))

    // simple endpoint
    http.HandleFunc("/simple", handler.Simple)
    http.HandleFunc("/", index)

    fmt.Println("Starting Server on", hostPort)
    if err := http.ListenAndServe(hostPort, nil); err != nil {
        log.Fatalf("HTTP Server Failed: %v", err)
    }
}
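
The excerpt above does not show the file's imports. The /debug/pprof/profile URL fetched later in this article only exists if net/http/pprof is imported for its side effects, so the demo presumably does that. A minimal sketch of the mechanism follows; pprofStatus and serve are hypothetical helper names, and the check uses an in-process request rather than a real socket.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// pprofStatus sends an in-process GET to http.DefaultServeMux and returns
// the HTTP status code, so a route can be checked without opening a port.
func pprofStatus(path string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

// serve starts a server on DefaultServeMux (nil handler), the same way the
// demo's main function does. It is not called here because it would block.
func serve(addr string) error {
	return http.ListenAndServe(addr, nil)
}

func main() {
	fmt.Println("GET /debug/pprof/cmdline ->", pprofStatus("/debug/pprof/cmdline"))
	fmt.Println("GET /no-such-route       ->", pprofStatus("/no-such-route"))
}
```

Because the demo passes nil to http.ListenAndServe, it serves DefaultServeMux, which is exactly where the blank import registers its handlers.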

After starting the service, visit http://localhost:9090/simple and http://localhost:9090/advance in a browser.

Both normally output:

Hello VIP!

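Although the output is the same, the /advance endpoint adds some statistics. When starting the web service from the terminal, we can pass the additional printStats flag: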

$ go run main.go -printStats

Each time we refresh the endpoint, the terminal prints the request information, like this:

IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 418.07µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 71.084µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 93.233µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 88.246µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 99.305µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 82.383µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 86.55µs
IncCounter: handler.received.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 1
RecordTimer: handler.latency.lihaoquantekiMacBook-Pro.advance.Mac-OS.Chrome = 109.914µs

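OK, the example is simple and on the surface the web service looks fine. But is everything really calm underneath? After all, we have not yet raised the concurrency, so CPU and memory have not been put to the test.

Keep the web service running, then enter the following command: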

kapok -d=35 -c=1000  http://localhost:9090/advance

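kapok is a load-testing tool I wrote myself. HTTP load-testing tools such as go-wrk or vegeta can be used instead.

While the load test is running, open another terminal window and enter the following command to generate a profile: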

$ go tool pprof --seconds 25 http://localhost:9090/debug/pprof/profile

Here we set a 25-second sampling period. When the (pprof) prompt appears, type web to open the result in a browser.

Fetching profile from http://localhost:9090/debug/pprof/profile?seconds=25
Please wait... (25s)
Saved profile in /Users/lihaoquan/pprof/pprof.localhost:9090.samples.cpu.014.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) web

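This gives us the complete sampled call profile of the program, as shown below:

Like a score report, the graph shows the time spent in calls between modules. It has a drawback, though: when the call hierarchy is deep, this kind of sprawling layout is not very friendly. We may need a different presentation to tell us whether the application has a problem.

OK, back in the terminal, we run the load-testing tool again: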

kapok -d=35 -c=1000  http://localhost:9090/advance

This time, however, we use go-torch to generate the sampling report:

go-torch -u http://localhost:9090 -t 30

After about thirty seconds, once go-torch has finished sampling, it prints the following:

Writing svg to torch.svg

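torch.svg is the flame graph that go-torch generates automatically when sampling ends. As before, we open it in a browser:

Much better. Next we can use this flame graph to diagnose whether our web service is "healthy"!

In a flame graph, the y-axis shows call stack depth (each frame sits on top of its caller), and the x-axis shows the proportion of samples in which a function appears. The wider a frame, the more CPU time it accounts for.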
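
To make the axes concrete: flamegraph.pl reads "folded" stacks, one line per distinct stack, with frames joined by semicolons and followed by a sample count. The sketch below builds such lines from a handful of samples. The fold function and the sample data are hypothetical and only loosely modelled on the demo's call path.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fold aggregates sampled call stacks (outermost frame first) into
// "frame1;frame2;frame3 count" lines, the input format of flamegraph.pl.
func fold(samples [][]string) []string {
	counts := make(map[string]int)
	for _, stack := range samples {
		counts[strings.Join(stack, ";")]++
	}
	lines := make([]string, 0, len(counts))
	for stack, n := range counts {
		lines = append(lines, fmt.Sprintf("%s %d", stack, n))
	}
	sort.Strings(lines) // map iteration is random, so sort for stable output
	return lines
}

func main() {
	// Hypothetical samples: two land in os.Hostname, one in addTagsToName.
	samples := [][]string{
		{"main", "handler", "getStatsTags", "os.Hostname"},
		{"main", "handler", "getStatsTags", "os.Hostname"},
		{"main", "handler", "addTagsToName"},
	}
	for _, line := range fold(samples) {
		fmt.Println(line)
	}
}
```

Each frame's width in the rendered graph is proportional to the summed counts of the stacks that pass through it, which is why a function sampled often shows up as a wide bar.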

We notice

os.Hostname

This spot is clearly suspicious: a function that just fetches the hostname should not take up this much of the resources. Let's look at the code first:

func getStatsTags(r *http.Request) map[string]string {
    userBrowser, userOS := parseUserAgent(r.UserAgent())
    stats := map[string]string{
        "browser":  userBrowser,
        "os":       userOS,
        "endpoint": filepath.Base(r.URL.Path),
    }
    host, err := os.Hostname()
    if err == nil {
        if idx := strings.IndexByte(host, '.'); idx > 0 {
            host = host[:idx]
        }
        stats["host"] = host
    }
    return stats
}

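getStatsTags is called on every request to the /advance endpoint, and the code clearly calls os.Hostname(). A machine's hostname does not normally change often, so we should pull the hostname lookup out into a package-level variable that is computed once, so that each request no longer repeats the call:

The improved code: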

var _hostName = getHost()

func getHost() string {
    host, err := os.Hostname()
    if err != nil {
        return ""
    }

    if idx := strings.IndexByte(host, '.'); idx > 0 {
        host = host[:idx]
    }
    return host
}

func getStatsTags(r *http.Request) map[string]string {
    userBrowser, userOS := parseUserAgent(r.UserAgent())
    stats := map[string]string{
        "browser":  userBrowser,
        "os":       userOS,
        "endpoint": filepath.Base(r.URL.Path),
    }
    if _hostName != "" {
        stats["host"] = _hostName
    }
    return stats
}
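
The package-level variable above is initialized eagerly at program start. A lazy variant using sync.Once is another common way to cache such a value. The sketch below is an alternative technique, not what the demo uses, and the names shortHost, hostName, hostOnce, and cachedName are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"sync"
)

var (
	hostOnce   sync.Once
	cachedName string
)

// shortHost strips the domain part, mirroring the getHost logic above.
func shortHost(host string) string {
	if idx := strings.IndexByte(host, '.'); idx > 0 {
		return host[:idx]
	}
	return host
}

// hostName looks the hostname up on first use and returns the cached value
// afterwards. An empty string means the lookup failed.
func hostName() string {
	hostOnce.Do(func() {
		if h, err := os.Hostname(); err == nil {
			cachedName = shortHost(h)
		}
	})
	return cachedName
}

func main() {
	fmt.Println(shortHost("web01.example.com"))
	fmt.Println(hostName() == hostName())
}
```

Either way, the system call happens once per process instead of once per request.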

To check whether our diagnosis is right, we restart the web service and test again, running the following commands at the same time:

$ kapok -d=35 -c=1000  http://localhost:9090/advance

As before, we sample in parallel while the load test runs:

$ go-torch -u http://localhost:9090 -t 30

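Once the new flame graph is generated, open it in a browser:

As you can see, os.Hostname no longer appears in the flame graph. We have fixed one bug.

At this point we might think we can rest easy, but as the saying goes, misfortunes never come singly, and bugs do not usually reveal themselves easily. We had better dig deeper.

We notice one place in the figure below (inside the green box):

The statistics show that the area marked by the green box has only 140 samples, yet this function should also be called once for every /advance request, so something is wrong here.

Clicking further into the flame graph, we find something suspicious:

As the green marks show, why does the call to addTagsToName appear twice?

We know roughly where the problem might be, but we are still puzzled. How can we pinpoint it?

At this point we should focus on addTagsToName and treat the problem at its source.

So we target addTagsToName with a benchmark.

The test file is as follows: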

reporter_test.go

package stats

import "testing"

func BenchmarkAddTagsToName(b *testing.B) {
    tags := map[string]string{
        "host":     "myhost",
        "endpoint": "hello",
        "os":       "OS X",
        "browser":  "Chrome",
    }
    for i := 0; i < b.N; i++ {
        addTagsToName("recv.calls", tags)
    }
}

func TestAddTagsToName(t *testing.T) {
    tests := []struct {
        name     string
        tags     map[string]string
        expected string
    }{
        {
            name:     "recvd",
            tags:     nil,
            expected: "recvd.no-endpoint.no-os.no-browser",
        },
        {
            name: "recvd",
            tags: map[string]string{
                "endpoint": "hello",
                "os":       "OS X",
                "browser":  "Chrome",
            },
            expected: "recvd.hello.OS-X.Chrome",
        },
        {
            name: "r.call",
            tags: map[string]string{
                "host":     "my-host-name",
                "endpoint": "hello",
                "os":       "OS{}/\tX",
                "browser":  "Chro\\:me",
            },
            expected: "r.call.my-host-name.hello.OS----X.Chro--me",
        },
    }

    for _, tt := range tests {
        got := addTagsToName(tt.name, tt.tags)
        if got != tt.expected {
            t.Errorf("addTagsToName(%v, %v) got %v, expected %v",
                tt.name, tt.tags, got, tt.expected)
        }
    }
}

Run the benchmark.

First, the CPU profile:

$ go test -bench . -benchmem -cpuprofile prof.cpu
BenchmarkAddTagsToName-4         500000          3172 ns/op         480 B/op          16 allocs/op
PASS
ok      github.com/domac/playflame/stats    1.633s

Analyze it with go tool pprof:

$ go tool pprof stats.test  prof.cpu
Entering interactive mode (type "help" for commands)
(pprof) top10
930ms of 1420ms total (65.49%)
Showing top 10 nodes out of 85 (cum >= 60ms)
      flat  flat%   sum%        cum   cum%
     130ms  9.15%  9.15%      420ms 29.58%  regexp.(*machine).tryBacktrack
     120ms  8.45% 17.61%      120ms  8.45%  regexp/syntax.(*Inst).MatchRunePos
     120ms  8.45% 26.06%      300ms 21.13%  runtime.mallocgc
     100ms  7.04% 33.10%      100ms  7.04%  regexp.(*bitState).push
      90ms  6.34% 39.44%      300ms 21.13%  runtime.growslice
      90ms  6.34% 45.77%       90ms  6.34%  runtime.memmove
      80ms  5.63% 51.41%      530ms 37.32%  regexp.(*machine).backtrack
      80ms  5.63% 57.04%       80ms  5.63%  runtime.heapBitsSetType
      60ms  4.23% 61.27%      850ms 59.86%  regexp.(*Regexp).replaceAll
      60ms  4.23% 65.49%       60ms  4.23%  sync/atomic.CompareAndSwapUint32
(pprof)

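The ranking suggests that regexp is heavily involved, but it is hard to see the real problem from this, so we need another approach.

At the (pprof) prompt, enter list addTagsToName to examine the benchmarked function line by line: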

(pprof) list addTagsToName
Total: 1.42s
ROUTINE ======================== github.com/domac/playflame/stats.addTagsToName in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
      20ms      1.37s (flat, cum) 96.48% of Total
         .          .     31:    }
         .          .     32:}
         .          .     33:
         .          .     34:func addTagsToName(name string, tags map[string]string) string {
         .          .     35:    var keyOrder []string
         .       10ms     36:    if _, ok := tags["host"]; ok {
         .       20ms     37:        keyOrder = append(keyOrder, "host")
         .          .     38:    }
         .       30ms     39:    keyOrder = append(keyOrder, "endpoint", "os", "browser")
         .          .     40:
         .          .     41:    parts := []string{name}
         .          .     42:    for _, k := range keyOrder {
      20ms       40ms     43:        v, ok := tags[k]
         .          .     44:        if !ok || v == "" {
         .          .     45:            parts = append(parts, "no-"+k)
         .          .     46:            continue
         .          .     47:        }
         .      1.12s     48:        parts = append(parts, clean(v))
         .          .     49:    }
         .          .     50:
         .      150ms     51:    return strings.Join(parts, ".")
         .          .     52:}
         .          .     53:
         .          .     54:var specialChars = regexp.MustCompile(`[{}/\\:\s.]`)
         .          .     55:
         .          .     56:func clean(value string) string {
(pprof)

OK, we have found a call that takes a lot of time:

1.12s     48:        parts = append(parts, clean(v))

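This is where most of the time goes, so it is the code we should tune next. But not so fast: this line contains a call to the clean function.

Continue by entering list clean at the (pprof) prompt to see whether the problem is in clean: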

(pprof) list clean
Total: 1.42s
ROUTINE ======================== github.com/domac/playflame/stats.clean in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
         0      950ms (flat, cum) 66.90% of Total
         .          .     52:}
         .          .     53:
         .          .     54:var specialChars = regexp.MustCompile(`[{}/\\:\s.]`)
         .          .     55:
         .          .     56:func clean(value string) string {
         .      950ms     57:    return specialChars.ReplaceAllString(value, "-")
         .          .     58:}

Unless something unexpected is going on, the cost comes from how clean is implemented, specifically this expression:

specialChars.ReplaceAllString(value, "-")

This code is causing the performance problem! Let's start tuning.

Before the fix

var specialChars = regexp.MustCompile(`[{}/\\:\s.]`)

func clean(value string) string {
    return specialChars.ReplaceAllString(value, "-")
}

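This code replaces the specified special characters with '-'. The regexp package is flexible, but regular expressions are slower than plain text matching. For a simple text substitution, we might as well write our own replacement function.

After the improvement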

func clean(value string) string {
    newStr := make([]byte, len(value))
    for i := 0; i < len(value); i++ {
        switch c := value[i]; c {
        case '{', '}', '/', '\\', ':', ' ', '\t', '.':
            newStr[i] = '-'
        default:
            newStr[i] = c
        }
    }
    return string(newStr)
}
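
When swapping a regular expression for a hand-written loop, it is worth checking that the two agree on the inputs you care about. The sketch below places both versions from this article side by side under the hypothetical names cleanRegexp and cleanLoop. One difference to be aware of: \s in Go's regexp syntax also matches \n, \f, and \r, which the loop does not replace.

```go
package main

import (
	"fmt"
	"regexp"
)

var specialChars = regexp.MustCompile(`[{}/\\:\s.]`)

// cleanRegexp is the original regexp-based version.
func cleanRegexp(value string) string {
	return specialChars.ReplaceAllString(value, "-")
}

// cleanLoop is the hand-written byte loop that replaces it.
func cleanLoop(value string) string {
	newStr := make([]byte, len(value))
	for i := 0; i < len(value); i++ {
		switch c := value[i]; c {
		case '{', '}', '/', '\\', ':', ' ', '\t', '.':
			newStr[i] = '-'
		default:
			newStr[i] = c
		}
	}
	return string(newStr)
}

func main() {
	// The last input contains a newline, where the two versions differ.
	inputs := []string{"OS X", "OS{}/\tX", "Chro\\:me", "a\nb"}
	for _, in := range inputs {
		r, l := cleanRegexp(in), cleanLoop(in)
		fmt.Printf("%q regexp=%q loop=%q same=%v\n", in, r, l, r == l)
	}
}
```

For tag values such as browser and OS names, the difference is unlikely to matter, but adding '\n', '\r', and '\f' to the case list would make the loop match the regexp exactly.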

Let's look at the benchmark's CPU analysis again:

$ go test -bench . -benchmem -cpuprofile prof.cpu
BenchmarkAddTagsToName-4        1000000          1063 ns/op         448 B/op          15 allocs/op
PASS
ok      github.com/domac/playflame/stats    1.087s

Compared with the previous run, performance has improved a lot:

(pprof) list clean
Total: 1.02s
ROUTINE ======================== github.com/domac/playflame/stats.clean in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
      10ms      110ms (flat, cum) 10.78% of Total
         .          .     48:    }
         .          .     49:
         .          .     50:    return strings.Join(parts, ".")
         .          .     51:}
         .          .     52:
      10ms       10ms     53:func clean(value string) string {
         .       60ms     54:    newStr := make([]byte, len(value))
         .          .     55:    for i := 0; i < len(value); i++ {
         .          .     56:        switch c := value[i]; c {
         .          .     57:        case '{', '}', '/', '\\', ':', ' ', '\t', '.':
         .          .     58:            newStr[i] = '-'
         .          .     59:        default:
         .          .     60:            newStr[i] = c
         .          .     61:        }
         .          .     62:    }
         .       40ms     63:    return string(newStr)
         .          .     64:}
(pprof)

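But we cannot relax yet. Look at one of the metrics: 15 allocs/op

Our calls are faster, but the number of allocations has barely improved. What now?

Let's dig deeper. Since the source-level listing is not enough, let's try the assembly: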

(pprof)disasm

...
...
...

   .          .      a4cfb: MOVQ $0x0, 0(SP)
         .          .      a4d03: MOVQ 0x70(SP), AX
         .          .      a4d08: MOVQ AX, 0x8(SP)
         .          .      a4d0d: MOVQ 0x40(SP), AX
         .          .      a4d12: MOVQ AX, 0x10(SP)
         .          .      a4d17: MOVQ 0x48(SP), AX
         .          .      a4d1c: MOVQ AX, 0x18(SP)
         .       60ms      a4d21: CALL runtime.slicebytetostring(SB)
         .          .      a4d26: MOVQ 0x20(SP), AX
         .          .      a4d2b: MOVQ 0x28(SP), CX
         .          .      a4d30: MOVQ AX, 0xb8(SP)
         .          .      a4d38: MOVQ CX, 0xc0(SP)
         .          .      a4d40: MOVQ 0x80(SP), BP
         .          .      a4d48: ADDQ $0x88, SP
         .          .      a4d4f: RET

...
...
...

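Here we identify runtime.slicebytetostring(SB) as a likely source of the allocations.

runtime.slicebytetostring is the runtime function that converts a byte slice to a string. It is what bytes.(*Buffer).String calls, and a conversion such as string(newStr) in clean compiles to the same call.

Let's look more closely at where the code performs string conversions: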

(pprof) list addTagsToName
Total: 1.02s
ROUTINE ======================== github.com/domac/playflame/stats.addTagsToName in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
      40ms      770ms (flat, cum) 75.49% of Total
         .          .     30:    }
         .          .     31:}
         .          .     32:
         .          .     33:func addTagsToName(name string, tags map[string]string) string {
         .          .     34:    var keyOrder []string
         .       10ms     35:    if _, ok := tags["host"]; ok {
         .       10ms     36:        keyOrder = append(keyOrder, "host")
         .          .     37:    }
         .       30ms     38:    keyOrder = append(keyOrder, "endpoint", "os", "browser")
         .          .     39:
         .          .     40:    parts := []string{name}
      10ms       10ms     41:    for _, k := range keyOrder {
      10ms       40ms     42:        v, ok := tags[k]
         .          .     43:        if !ok || v == "" {
         .          .     44:            parts = append(parts, "no-"+k)
         .          .     45:            continue
         .          .     46:        }
      10ms      520ms     47:        parts = append(parts, clean(v))
         .          .     48:    }
         .          .     49:
      10ms      150ms     50:    return strings.Join(parts, ".")
         .          .     51:}
         .          .     52:
         .          .     53:func clean(value string) string {
         .          .     54:    newStr := make([]byte, len(value))
         .          .     55:    for i := 0; i < len(value); i++ {
(pprof)

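Look at the code above. To build the string, our original approach stores the pieces in a slice and joins them at the end with strings.Join(). We call append many times, and in Go a slice triggers a new allocation whenever its capacity is insufficient. So we should preallocate the slices' capacity to reduce dynamic allocation: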

func addTagsToName(name string, tags map[string]string) string {
    keyOrder := make([]string, 0, 4)
    if _, ok := tags["host"]; ok {
        keyOrder = append(keyOrder, "host")
    }
    keyOrder = append(keyOrder, "endpoint", "os", "browser")

    parts := make([]string, 1, 5)
    parts[0] = name
    for _, k := range keyOrder {
        v, ok := tags[k]
        if !ok || v == "" {
            parts = append(parts, "no-"+k)
            continue
        }
        parts = append(parts, clean(v))
    }

    return strings.Join(parts, ".")
}
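
The effect of preallocation can be observed directly by watching cap() change across appends. The sketch below uses a hypothetical helper, countGrowths, to count how often append has to move to a larger backing array.

```go
package main

import "fmt"

// countGrowths appends n strings to s one at a time and returns how many
// times the capacity changed, i.e. how often append had to reallocate.
func countGrowths(s []string, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, "x")
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	fmt.Println("nil slice:   ", countGrowths(nil, 5))
	fmt.Println("preallocated:", countGrowths(make([]string, 0, 5), 5))
}
```

The exact growth sequence for the nil slice depends on the Go version, but the preallocated slice never grows as long as the capacity estimate holds, which is why parts is created with capacity 5 (name plus at most four tags).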

Run the benchmark once more:

$ go test -bench . -benchmem -cpuprofile prof.cpu
BenchmarkAddTagsToName-4        3000000           527 ns/op         144 B/op          10 allocs/op
PASS
ok      github.com/domac/playflame/stats    2.142s

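Allocations have improved (from 15 to 10 allocs/op), though not dramatically, and per-operation time has dropped to 527 ns/op. The total run time looks longer than last time only because the benchmark ran 3,000,000 iterations. Sigh. The problem is still not completely solved.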

Analyze the profile again:

$ go tool pprof stats.test  prof.cpu
Entering interactive mode (type "help" for commands)
(pprof) list addTagsToName
Total: 1.86s
ROUTINE ======================== github.com/domac/playflame/stats.addTagsToName in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
     140ms      1.76s (flat, cum) 94.62% of Total
         .          .     34:}
         .          .     35:
         .          .     36:func addTagsToName(name string, tags map[string]string) string {
         .          .     37:    // The format we want is: host.endpoint.os.browser
         .          .     38:    // if there's no host tag, then we don't use it.
         .       30ms     39:    keyOrder := make([]string, 0, 4)
      10ms       30ms     40:    if _, ok := tags["host"]; ok {
         .          .     41:        keyOrder = append(keyOrder, "host")
         .          .     42:    }
      10ms       10ms     43:    keyOrder = append(keyOrder, "endpoint", "os", "browser")
         .          .     44:
         .          .     45:    parts := make([]string, 1, 5)
         .          .     46:    parts[0] = name
         .          .     47:    for _, k := range keyOrder {
      40ms      240ms     48:        v, ok := tags[k]
         .          .     49:        if !ok || v == "" {
         .          .     50:            parts = append(parts, "no-"+k)
         .          .     51:            continue
         .          .     52:        }
      50ms      820ms     53:        parts = append(parts, clean(v))
         .          .     54:    }
         .          .     55:
      30ms      630ms     56:    return strings.Join(parts, ".")
         .          .     57:}
         .          .     58:
         .          .     59:// clean takes a string that may contain special characters, and replaces these
         .          .     60:// characters with a '-'.
         .          .     61:func clean(value string) string {
(pprof)

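We can see that return strings.Join(parts, ".") now accounts for 630ms of cumulative time, a larger share of the total than before. This is one of the problems.

parts = append(parts, clean(v)) is also expensive, which is the other problem.

Let's take them one at a time:

Collecting strings in a slice and then calling Join is a convenient way to concatenate them, but at a high call volume it can lead to inefficient allocation. Here we decide to use a buffer to optimize the string concatenation: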

func addTagsToName(name string, tags map[string]string) string {
    keyOrder := make([]string, 0, 4)
    if _, ok := tags["host"]; ok {
        keyOrder = append(keyOrder, "host")
    }
    keyOrder = append(keyOrder, "endpoint", "os", "browser")

    buf := &bytes.Buffer{}
    buf.WriteString(name)
    for _, k := range keyOrder {
        buf.WriteByte('.')

        v, ok := tags[k]
        if !ok || v == "" {
            buf.WriteString("no-")
            buf.WriteString(k)
            continue
        }

        writeClean(buf, v)
    }

    return buf.String()
}

func writeClean(buf *bytes.Buffer, value string) {
    for i := 0; i < len(value); i++ {
        switch c := value[i]; c {
        case '{', '}', '/', '\\', ':', ' ', '\t', '.':
            buf.WriteByte('-')
        default:
            buf.WriteByte(c)
        }
    }
}
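
On Go 1.10 and later, strings.Builder is an alternative to bytes.Buffer for this pattern, because its String method returns the accumulated bytes without copying them. The sketch below is a swapped-in variant of the code above, not the author's version, and the Grow(64) size is an assumed typical length.

```go
package main

import (
	"fmt"
	"strings"
)

// writeClean mirrors the helper above, but writes to a strings.Builder.
func writeClean(sb *strings.Builder, value string) {
	for i := 0; i < len(value); i++ {
		switch c := value[i]; c {
		case '{', '}', '/', '\\', ':', ' ', '\t', '.':
			sb.WriteByte('-')
		default:
			sb.WriteByte(c)
		}
	}
}

func addTagsToName(name string, tags map[string]string) string {
	keyOrder := make([]string, 0, 4)
	if _, ok := tags["host"]; ok {
		keyOrder = append(keyOrder, "host")
	}
	keyOrder = append(keyOrder, "endpoint", "os", "browser")

	var sb strings.Builder
	sb.Grow(64) // assumed typical size, so short names need one allocation
	sb.WriteString(name)
	for _, k := range keyOrder {
		sb.WriteByte('.')
		v, ok := tags[k]
		if !ok || v == "" {
			sb.WriteString("no-")
			sb.WriteString(k)
			continue
		}
		writeClean(&sb, v)
	}
	return sb.String() // hands over the bytes without an extra copy
}

func main() {
	fmt.Println(addTagsToName("recvd", nil))
	fmt.Println(addTagsToName("recvd", map[string]string{
		"endpoint": "hello", "os": "OS X", "browser": "Chrome",
	}))
}
```

Whether this beats the pooled buffer introduced later depends on the workload, so it is worth measuring with the same benchmark.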

With the buffer in place, let's see the effect of the optimization:

$ go test -bench . -benchmem -cpuprofile prof.cpu
BenchmarkAddTagsToName-4        3000000           488 ns/op         160 B/op           2 allocs/op
PASS
ok      github.com/domac/playflame/stats    1.981s

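Not bad. The metrics keep improving and the execution time has dropped, so the CPU problem can be considered solved.

Let's be a little more careful. So far we have focused on CPU performance, so it is well worth checking memory too: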

$ go test -bench . -benchmem -memprofile prof.mem
BenchmarkAddTagsToName-4        3000000           479 ns/op         160 B/op           2 allocs/op
PASS
ok      github.com/domac/playflame/stats    1.939s

After generating prof.mem, view the top 10 allocation ranking:

$ go tool pprof --alloc_objects  stats.test prof.mem
Entering interactive mode (type "help" for commands)
(pprof) top10
7594956 of 7594956 total (  100%)
      flat  flat%   sum%        cum   cum%
   7594956   100%   100%    7594956   100%  github.com/domac/playflame/stats.addTagsToName
         0     0%   100%    7594956   100%  github.com/domac/playflame/stats.BenchmarkAddTagsToName
         0     0%   100%    7594956   100%  runtime.goexit
         0     0%   100%    7594956   100%  testing.(*B).launch
         0     0%   100%    7594956   100%  testing.(*B).runN
(pprof)

addTagsToName is again the source of the allocations, so we list it to see where they come from:

(pprof) list addTagsToName
Total: 7594956
ROUTINE ======================== github.com/domac/playflame/stats.addTagsToName in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
   7594956    7594956 (flat, cum)   100% of Total
         .          .     40:    if _, ok := tags["host"]; ok {
         .          .     41:        keyOrder = append(keyOrder, "host")
         .          .     42:    }
         .          .     43:    keyOrder = append(keyOrder, "endpoint", "os", "browser")
         .          .     44:
   3848310    3848310     45:    buf := &bytes.Buffer{}
         .          .     46:    buf.WriteString(name)
         .          .     47:    for _, k := range keyOrder {
         .          .     48:        buf.WriteByte('.')
         .          .     49:
         .          .     50:        v, ok := tags[k]
         .          .     51:        if !ok || v == "" {
         .          .     52:            buf.WriteString("no-")
         .          .     53:            buf.WriteString(k)
         .          .     54:            continue
         .          .     55:        }
         .          .     56:
         .          .     57:        writeClean(buf, v)
         .          .     58:    }
         .          .     59:
   3746646    3746646     60:    return buf.String()
         .          .     61:}
         .          .     62:
         .          .     63:// writeClean cleans value (e.g. replaces special characters with '-') and
         .          .     64:// writes out the cleaned value to buf.
         .          .     65:func writeClean(buf *bytes.Buffer, value string) {
(pprof)

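The problem is located at buf := &bytes.Buffer{}. We used it earlier to optimize string concatenation, and CPU usage improved, but creating a new buffer on every call does not help memory. What else can we do?

We try an object pool and pool the buffer objects: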

var bufPool = sync.Pool{
    New: func() interface{} {
        return &bytes.Buffer{}
    },
}

func addTagsToName(name string, tags map[string]string) string {
    keyOrder := make([]string, 0, 4)
    if _, ok := tags["host"]; ok {
        keyOrder = append(keyOrder, "host")
    }
    keyOrder = append(keyOrder, "endpoint", "os", "browser")

    buf := bufPool.Get().(*bytes.Buffer)
    defer bufPool.Put(buf)
    buf.Reset()
    buf.WriteString(name)
    for _, k := range keyOrder {
        buf.WriteByte('.')

        v, ok := tags[k]
        if !ok || v == "" {
            buf.WriteString("no-")
            buf.WriteString(k)
            continue
        }

        writeClean(buf, v)
    }

    return buf.String()
}
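
Outside of go test, testing.AllocsPerRun offers a quick way to compare allocation counts for two variants. The sketch below strips the idea down to joining a few strings, with hypothetical helpers joinFresh and joinPooled. Exact counts depend on the Go version and on escape analysis, so no specific numbers are claimed here.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"testing"
)

var bufPool = sync.Pool{
	New: func() interface{} { return &bytes.Buffer{} },
}

// joinFresh creates a new buffer on every call.
func joinFresh(parts []string) string {
	buf := &bytes.Buffer{}
	for i, p := range parts {
		if i > 0 {
			buf.WriteByte('.')
		}
		buf.WriteString(p)
	}
	return buf.String()
}

// joinPooled reuses buffers from bufPool, as in the version above.
func joinPooled(parts []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset()
	for i, p := range parts {
		if i > 0 {
			buf.WriteByte('.')
		}
		buf.WriteString(p)
	}
	// String copies the bytes, so handing buf back to the pool is safe.
	return buf.String()
}

func main() {
	parts := []string{"recv.calls", "myhost", "hello", "OS-X", "Chrome"}
	fresh := testing.AllocsPerRun(1000, func() { _ = joinFresh(parts) })
	pooled := testing.AllocsPerRun(1000, func() { _ = joinPooled(parts) })
	fmt.Printf("fresh: %.0f allocs/op, pooled: %.0f allocs/op\n", fresh, pooled)
}
```

The pooled variant still pays for the string returned by buf.String(), which matches the single remaining allocation in the benchmark output that follows.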

Test it:

$ go test -bench . -benchmem -memprofile prof.mem
BenchmarkAddTagsToName-4        3000000           564 ns/op          48 B/op           1 allocs/op
PASS
ok      github.com/domac/playflame/stats    2.272s

The calls now look normal too:

(pprof) list addTagsToName
Total: 4008802
ROUTINE ======================== github.com/domac/playflame/stats.addTagsToName in /Users/lihaoquan/GoProjects/Playground/src/github.com/domac/playflame/stats/reporter.go
   4008802    4008802 (flat, cum)   100% of Total
         .          .     67:        }
         .          .     68:
         .          .     69:        writeClean(buf, v)
         .          .     70:    }
         .          .     71:
   4008802    4008802     72:    return buf.String()
         .          .     73:}
         .          .     74:
         .          .     75:// writeClean cleans value (e.g. replaces special characters with '-') and
         .          .     76:// writes out the cleaned value to buf.
         .          .     77:func writeClean(buf *bytes.Buffer, value string) {
(pprof)

We generate a new flame graph:

The flame graph shows that our performance sampling report is now within a reasonable, normal range!

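Summary

The lesson from the series of analyses above is that once we have developed an application, we must test it properly: a thousand-mile dike can collapse because of a single ant hole.

A perfectly ordinary-looking spot in the code may turn out to be our performance bottleneck.

Everyday development principles

Avoid premature optimization

Develop in fast iterations wherever possible. After all, Go makes it easy to profile code both in benchmarks and in production, and go-torch helps us locate problematic code quickly. Premature optimization tends to be one-sided, so get the feature working first and then keep improving it.

Avoid heavy allocation in hot paths

Write benchmarks for hot paths, and use -benchmem and memory profiles to see whether we are allocating frequently. Allocation implies GC, and GC carries a significant risk of added service latency.

Do not be afraid of assembly

In general, the details of allocation and call cost show up in the generated assembly. There is no need to fear it: with basic knowledge of instructions and operands, we can root out many hidden problems.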
