Gracefully Restarting Go Services: gracehttp Explained

Preface

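grace is an open-source library from Facebook that provides graceful restarts and zero-downtime deploys for Go services. During a restart, existing connections are not dropped: the old process keeps serving them, while new connections go to the newly started process, so clients notice nothing.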

Usage

(1) Installation

go get github.com/facebookgo/grace/gracehttp

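With Go modules, it can be required as follows. go.mod must ultimately record a concrete version, so latest here stands for whatever version the go command resolves (running go get github.com/facebookgo/grace/gracehttp@latest records one directly):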

require github.com/facebookgo/grace latest

(2) Usage

gracehttp.Serve(
    &http.Server{Addr: *address0, Handler: newHandler("Zero  ")},
    &http.Server{Addr: *address1, Handler: newHandler("First ")},
    &http.Server{Addr: *address2, Handler: newHandler("Second")},
)

(3) Restart command

sudo kill -USR2 $(pidof yourservername)
$(pidof yourservername) expands to the process ID of the service to restart.

Source Code Walkthrough

Starting the Service

// Serve will serve the given http.Servers and will monitor for signals
// allowing for graceful termination (SIGTERM) or restart (SIGUSR2).
func Serve(servers ...*http.Server) error {
    a := newApp(servers)
    return a.run()
}

func newApp(servers []*http.Server) *app {
    return &app{
        servers:   servers,
        http:      &httpdown.HTTP{},
        net:       &gracenet.Net{},
        listeners: make([]net.Listener, 0, len(servers)),
        sds:       make([]httpdown.Server, 0, len(servers)),

        preStartProcess: func() error { return nil },
        // 2x num servers for possible Close or Stop errors + 1 for possible
        // StartProcess error.
        errors: make(chan error, 1+(len(servers)*2)),
    }
}

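Serve constructs an app; all of the actual handling happens inside app.

Note:

The comment says that Serve starts the given http.Servers and monitors system signals to allow graceful termination (SIGTERM) or restart (SIGUSR2). The actual code also accepts SIGINT for termination, so the signal we send decides whether the service terminates or restarts.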

func (a *app) run() error {
    // Acquire Listeners
    // Obtain a listener for the address of every http.Server.
    if err := a.listen(); err != nil {
        return err
    }

    // Some useful logging.
    // Log how this process came to be listening (skipped when no logger is set).
    if logger != nil {
        if didInherit {
            if ppid == 1 {
                logger.Printf("Listening on init activated %s", pprintAddr(a.listeners))
            } else {
                const msg = "Graceful handoff of %s with new pid %d and old pid %d"
                logger.Printf(msg, pprintAddr(a.listeners), os.Getpid(), ppid)
            }
        } else {
            const msg = "Serving %s with pid %d"
            logger.Printf(msg, pprintAddr(a.listeners), os.Getpid())
        }
    }

    // Start serving.
    // Start every HTTP server.
    a.serve()

    // Close the parent if we inherited and it wasn't init that started us.
    // If we inherited listeners and our parent is not init, this is a restart,
    // so the original (parent) process has to be shut down.
    if didInherit && ppid != 1 {
        if err := syscall.Kill(ppid, syscall.SIGTERM); err != nil {
            return fmt.Errorf("failed to close parent: %s", err)
        }
    }

    // This is the core of run.
    waitdone := make(chan struct{})
    go func() {
        defer close(waitdone)
        a.wait()
    }()

    select {
    case err := <-a.errors: // a server failed while serving or shutting down
        if err == nil {
            panic("unexpected nil error")
        }
        return err
    case <-waitdone: // closed once wait returns, i.e. every server has finished and been stopped
        if logger != nil {
            logger.Printf("Exiting pid %d.", os.Getpid())
        }
        return nil
    }
}

// serve starts each of the given http.Servers in turn.
func (a *app) serve() {
    for i, s := range a.servers {
        a.sds = append(a.sds, a.http.Serve(s, a.listeners[i]))
    }
}

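How Each Server Is Started

The process is as follows:

Construct the server.
Set ConnState to observe every connection's state changes.
Start a goroutine (manage) that handles the server's channel signals.
Start the actual HTTP service in another goroutine.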

Serve: the Entry Point

// Serve starts and manages one server.
func (h HTTP) Serve(s *http.Server, l net.Listener) Server {
    stopTimeout := h.StopTimeout
    if stopTimeout == 0 {
        stopTimeout = defaultStopTimeout
    }
    killTimeout := h.KillTimeout
    if killTimeout == 0 {
        killTimeout = defaultKillTimeout
    }
    klock := h.Clock
    if klock == nil {
        klock = clock.New()
    }

    ss := &server{
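        stopTimeout:  stopTimeout, // timeout for a graceful stop
        killTimeout:  killTimeout, // timeout for a forced kill
        stats:        h.Stats, // stats collector (may be nil)
        clock:        klock, // clock behind the timeouts and the stats ticker
        oldConnState: s.ConnState, // the server's original ConnState hook
        listener:     l, // the server's listener
        server:       s, // the http.Server itself
        serveDone:    make(chan struct{}), // closed after Serve has returned
        serveErr:     make(chan error, 1), // receives the error returned by Serve
        new:          make(chan net.Conn), // connections entering StateNew
        active:       make(chan net.Conn), // connections entering StateActive
        idle:         make(chan net.Conn), // connections entering StateIdle
        closed:       make(chan net.Conn), // connections that were closed or hijacked
        stop:         make(chan chan struct{}), // stop requests
        kill:         make(chan chan struct{}), // kill requests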
    }
    s.ConnState = ss.connState
    // Track connection states in a dedicated goroutine.
    go ss.manage()
    // Run the HTTP server in its own goroutine.
    go ss.serve()
    return ss
}

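Injecting the ConnState Hook

Serve sets ConnState to a hook that observes each connection's state changes and forwards the connection to the matching channel on server.

// connState forwards connection state changes to the manage goroutine.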
func (s *server) connState(c net.Conn, cs http.ConnState) {
    if s.oldConnState != nil {
        s.oldConnState(c, cs)
    }

    switch cs {
    case http.StateNew:
        s.new <- c
    case http.StateActive:
        s.active <- c
    case http.StateIdle:
        s.idle <- c
    case http.StateHijacked, http.StateClosed:
        s.closed <- c
    }
}

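manage: Handling the Channel Signals

manage processes every connection change reported by connState. All connections are stored in the conns map, which is added to, updated, and deleted from as connections change state.

A stop signal checks whether all connections are already closed. If any remain, manage closes the idle ones first and then waits for the connections in other states to finish.

A kill signal forcibly closes all connections.

The stop and kill channels, together with stopTimeout and killTimeout, are all driven by Stop (shutting the server down), which is analyzed in detail later.

stopTimeout and killTimeout are taken from the configured values when set; otherwise each falls back to its default (one minute). stopTimeout bounds how long a stop may take; when it expires, a kill is triggered, and killTimeout bounds how long that kill may take. When killTimeout also expires, Stop gives up waiting and returns.

Note: newApp builds httpdown.HTTP{} without a Stats implementation, so s.stats is always nil here and the stats-related code does no work; it can be skipped when reading.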

// manage handles all of the channel signals.
// Note: s.stats is always nil under gracehttp, so the stats-related code can be skipped.
func (s *server) manage() {
    defer func() {
        close(s.new)
        close(s.active)
        close(s.idle)
        close(s.closed)
        close(s.stop)
        close(s.kill)
    }()
    var stopDone chan struct{}

    // State of every connection the server currently holds.
    conns := map[net.Conn]http.ConnState{}
    var countNew, countActive, countIdle float64

    // decConn decrements the count associated with the current state of the
    // given connection.
    // It keeps the per-state counters in sync.
    decConn := func(c net.Conn) {
        switch conns[c] {
        default:
            panic(fmt.Errorf("unknown existing connection: %s", c))
        case http.StateNew:
            countNew--
        case http.StateActive:
            countActive--
        case http.StateIdle:
            countIdle--
        }
    }

    // setup a ticker to report various values every minute. if we don't have a
    // Stats implementation provided, we Stop it so it never ticks.
    // The ticker only reports when a Stats implementation is provided. gracehttp
    // sets none, so the ticker is stopped right after it is created.
    statsTicker := s.clock.Ticker(time.Minute)
    if s.stats == nil {
        statsTicker.Stop()
    }
    // Handle connection events until a stop was requested and every connection is gone.
    for {
        select {
        case <-statsTicker.C: // report per-state connection counts every minute (only with Stats set)
            // we'll only get here when s.stats is not nil
            s.stats.BumpAvg("http-state.new", countNew)
            s.stats.BumpAvg("http-state.active", countActive)
            s.stats.BumpAvg("http-state.idle", countIdle)
            s.stats.BumpAvg("http-state.total", countNew+countActive+countIdle)
        case c := <-s.new:
            conns[c] = http.StateNew // record the connection as New
            countNew++
        case c := <-s.active:
            decConn(c)
            countActive++

            conns[c] = http.StateActive // record the connection as Active
        case c := <-s.idle:
            decConn(c)
            countIdle++

            conns[c] = http.StateIdle // record the connection as Idle

            // if we're already stopping, close it
            if stopDone != nil { // a stop has been requested
                c.Close()
            }
        case c := <-s.closed:
            stats.BumpSum(s.stats, "conn.closed", 1)
            decConn(c)
            delete(conns, c) // forget the closed connection

            // if we're waiting to stop and are all empty, we just closed the last
            // connection and we're done.
            if stopDone != nil && len(conns) == 0 {
                close(stopDone)
                return
            }
        case stopDone = <-s.stop:
            // if we're already all empty, we're already done
            // No connections left: close stopDone to signal completion.
            if len(conns) == 0 {
                close(stopDone)
                return
            }

            // close current idle connections right away
            // Idle connections go first; the rest are left to finish.
            for c, cs := range conns {
                if cs == http.StateIdle {
                    c.Close()
                }
            }

            // continue the loop and wait for all the ConnState updates which will
            // eventually close(stopDone) and return from this goroutine.

        case killDone := <-s.kill: // forced kill, sent by Stop after stopTimeout expires
            // force close all connections
            stats.BumpSum(s.stats, "kill.conn.count", float64(len(conns)))
            for c := range conns { // forcibly close every connection
                c.Close()
            }

            // don't block the kill.
            close(killDone)

            // continue the loop and we wait for all the ConnState updates and will
            // return from this goroutine when we're all done. otherwise we'll try to
            // send those ConnState updates on closed channels.

        }
    }
}

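The HTTP service itself runs in another goroutine, which calls http.Server's Serve. Serve blocks for as long as the server is accepting connections, and the error it eventually returns is sent to serveErr for the goroutine that started the server to collect. Once Serve has returned, serveDone is closed. Stop waits on serveDone after closing the listener, so the rest of the shutdown only proceeds once the accept loop has actually exited.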

// serve runs the HTTP server until its listener is closed.
func (s *server) serve() {
    stats.BumpSum(s.stats, "serve", 1)
    s.serveErr <- s.server.Serve(s.listener)
    close(s.serveDone)
    close(s.serveErr)
}

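Listening for System Signals

This is the core part:

Start a goroutine that listens for signals.
Watch each server's serving and stopping state; if serving or stopping fails, the error is passed through a.errors to the goroutine that started the app.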

wait

// wait blocks until every server has finished serving (Wait) and has been stopped (Stop).
func (a *app) wait() {
    var wg sync.WaitGroup
    wg.Add(len(a.sds) * 2) // Wait & Stop
    // Listen for signals.
    go a.signalHandler(&wg)
    for _, s := range a.sds {
        go func(s httpdown.Server) {
            defer wg.Done()
            if err := s.Wait(); err != nil {
                a.errors <- err
            }
        }(s)
    }
    wg.Wait()
}

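The WaitGroup in wait covers both Wait and Stop. Each Wait blocks until its server has finished serving and reports any error it returns; each Stop, run from term, tracks the shutdown of that server (a restart ends the old process the same way). wait returns only after every Wait and every Stop has completed, and, per the earlier code, only then is the goroutine that started the app done.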
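
The doubled WaitGroup count can be reproduced without a real server. The following is a minimal sketch: fakeServer is a hypothetical stand-in for httpdown.Server, and the trigger channel plays the role of the TERM signal that makes term run.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeServer is a hypothetical stand-in for httpdown.Server: Wait blocks
// until Stop has been called.
type fakeServer struct {
	stopped chan struct{}
	once    sync.Once
}

func newFakeServer() *fakeServer { return &fakeServer{stopped: make(chan struct{})} }

func (s *fakeServer) Wait() error { <-s.stopped; return nil }
func (s *fakeServer) Stop() error { s.once.Do(func() { close(s.stopped) }); return nil }

// waitAll returns only after every server has completed both Wait and Stop.
func waitAll(servers []*fakeServer, trigger <-chan struct{}) {
	var wg sync.WaitGroup
	wg.Add(len(servers) * 2) // one Wait and one Stop per server

	for _, s := range servers {
		go func(s *fakeServer) { // Wait half, started immediately
			defer wg.Done()
			s.Wait()
		}(s)
	}
	go func() { // Stop half, started only when the trigger fires
		<-trigger
		for _, s := range servers {
			go func(s *fakeServer) {
				defer wg.Done()
				s.Stop()
			}(s)
		}
	}()
	wg.Wait()
}

func main() {
	servers := []*fakeServer{newFakeServer(), newFakeServer()}
	trigger := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		waitAll(servers, trigger)
	}()

	close(trigger) // simulate SIGTERM
	<-finished
	fmt.Println("all servers waited on and stopped")
}
```

Counting the Stop calls up front means wait cannot return merely because serving ended; the shutdown path has to complete as well.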

signalHandler

// signalHandler listens for system signals.
func (a *app) signalHandler(wg *sync.WaitGroup) {
    ch := make(chan os.Signal, 10)
    signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM, syscall.SIGUSR2)
    for {
        sig := <-ch
        switch sig {
        case syscall.SIGINT, syscall.SIGTERM:
            // this ensures a subsequent INT/TERM will trigger standard go behaviour of
            // terminating.
            // Stop receiving signals.
            signal.Stop(ch)
            // Shut down the servers in this process.
            a.term(wg)
            return
        case syscall.SIGUSR2:
            err := a.preStartProcess()
            if err != nil {
                a.errors <- err
            }
            // we only return here if there's an error, otherwise the new process
            // will send us a TERM when it's ready to trigger the actual shutdown.
            if _, err := a.net.StartProcess(); err != nil {
                a.errors <- err
            }
        }
    }
}

By default gracehttp listens for SIGINT, SIGTERM, and SIGUSR2. SIGINT and SIGTERM terminate the service gracefully; SIGUSR2 restarts it gracefully.

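Shutting Down

A shutdown is currently triggered in one of two ways:

We send the signal ourselves with the kill command.
A restarted process finds that it has a parent process and calls syscall.Kill itself.

Both end in the same result; only the caller differs.

// term shares its WaitGroup with wait: it is the same wg.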
func (a *app) term(wg *sync.WaitGroup) {
    for _, s := range a.sds {
        go func(s httpdown.Server) {
            defer wg.Done()
            if err := s.Stop(); err != nil {
                a.errors <- err
            }
        }(s)
    }
}

func (s *server) Stop() error {
    s.stopOnce.Do(func() {
        defer stats.BumpTime(s.stats, "stop.time").End()
        stats.BumpSum(s.stats, "stop", 1)

        // first disable keep-alive for new connections
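        // With keep-alive off, existing connections close after their current request.
        s.server.SetKeepAlivesEnabled(false)

        // then close the listener so new connections can't connect come thru
        // Once the listener is closed, no new connection can come in.
        closeErr := s.listener.Close()
        <-s.serveDone // wait until Serve has returned, i.e. the accept loop has exited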

        // then trigger the background goroutine to stop and wait for it
        stopDone := make(chan struct{})
        s.stop <- stopDone // ask manage to drain the connections

        // wait for stop
        select {
        case <-stopDone: // every connection has finished
        case <-s.clock.After(s.stopTimeout): // the graceful stop timed out
            defer stats.BumpTime(s.stats, "kill.time").End()
            stats.BumpSum(s.stats, "kill", 1)

            // stop timed out, wait for kill
            killDone := make(chan struct{})
            s.kill <- killDone
            select {
            case <-killDone:
            case <-s.clock.After(s.killTimeout):
                // kill timed out, give up
                stats.BumpSum(s.stats, "kill.timeout", 1)
            }
        }

        if closeErr != nil && !isUseOfClosedError(closeErr) {
            stats.BumpSum(s.stats, "listener.close.error", 1)
            s.stopErr = closeErr
        }
    })
    return s.stopErr
}

term stops every server and sends any error from stopping to a.errors for the goroutine that started the app to handle.

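Stop proceeds as follows:

Disable keep-alive, so connections are no longer kept open between requests.
Close the listener, so no new connections are accepted.
Wait for Serve to return (serveDone).
Send the stop signal to check whether all connections are done; if not, close the idle connections first.
If all connections finish, handle closeErr and return.
If stopTimeout expires, send the kill signal to forcibly close all connections.
Once all connections have been forcibly closed, handle closeErr and return.
If killTimeout expires, stop waiting, handle closeErr, and return.

If stopping fails, term reports the error to the goroutine that started the app. Once every server has exited normally, wait returns and the original process ends.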

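Starting the New Process

Overall, the restart logic is: once the new service process is up, it sends TERM to the original process, which exits after it has finished its existing connections or timed out.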

func (n *Net) StartProcess() (int, error) {
    // Get the listeners of the http.Servers.
    listeners, err := n.activeListeners()
    if err != nil {
        return 0, err
    }

    // Extract the fds from the listeners.
    // Each listener exposes its file descriptor as an *os.File.
    files := make([]*os.File, len(listeners))
    for i, l := range listeners {
        files[i], err = l.(filer).File()
        if err != nil {
            return 0, err
        }
        defer files[i].Close()
    }

    // Use the original binary location. This works with symlinks such that if
    // the file it points to has been changed we will use the updated symlink.
    // Reuse the path of the original executable.
    argv0, err := exec.LookPath(os.Args[0])
    if err != nil {
        return 0, err
    }

    // Pass on the environment and replace the old count key with the new one.
    // The new count entry tells the child how many listeners it inherits.
    var env []string
    for _, v := range os.Environ() {
        if !strings.HasPrefix(v, envCountKeyPrefix) {
            env = append(env, v)
        }
    }
    env = append(env, fmt.Sprintf("%s%d", envCountKeyPrefix, len(listeners)))

    allFiles := append([]*os.File{os.Stdin, os.Stdout, os.Stderr}, files...)
    // Start the new process with the original executable path, arguments,
    // working directory, and open files.
    process, err := os.StartProcess(argv0, os.Args, &os.ProcAttr{
        Dir:   originalWD,
        Env:   env,
        Files: allFiles,
    })
    if err != nil {
        return 0, err
    }
    return process.Pid, nil
}
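
The environment handling in StartProcess is easy to check in isolation. This is a minimal sketch; childEnv is a hypothetical helper, and the prefix value "LISTEN_FDS=" is assumed from the variable name mentioned in the summary rather than taken from the code shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// envCountKeyPrefix mirrors the prefix used in StartProcess. The value is an
// assumption based on the LISTEN_FDS variable named in the summary.
const envCountKeyPrefix = "LISTEN_FDS="

// childEnv copies env, drops any existing listener-count entry, and appends
// a fresh one for n listeners.
func childEnv(env []string, n int) []string {
	out := make([]string, 0, len(env)+1)
	for _, v := range env {
		if !strings.HasPrefix(v, envCountKeyPrefix) {
			out = append(out, v)
		}
	}
	return append(out, fmt.Sprintf("%s%d", envCountKeyPrefix, n))
}

func main() {
	parent := []string{"PATH=/usr/bin", "LISTEN_FDS=1", "HOME=/root"}
	fmt.Println(childEnv(parent, 3)) // [PATH=/usr/bin HOME=/root LISTEN_FDS=3]
}
```

Dropping the old entry matters on a second restart: the process being replaced was itself started with a count in its environment, and the child must see only the current one. Since os.ProcAttr.Files maps entry i to descriptor i in the child, the listener files placed after stdin, stdout, and stderr arrive as descriptors 3 and up.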

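Summary

Finally, let's summarize how gracehttp works and what happens step by step.

Principle

How gracehttp achieves a graceful restart:

When the current process receives the restart signal, it starts a new process that accepts and handles new connections. The original process stops accepting connections and only finishes the ones still in progress; once they are done (or time out), it exits, leaving only the new process. That is the graceful restart.

Process

From a normal (non-restart) start through to a restart:

All servers start, the process listens for system signals, and the starting goroutine uses wait to watch each server's serving and stop state.
On USR2, the process collects the executable path, the arguments, the open listener files, and a freshly set LISTEN_FDS environment variable, then calls StartProcess to launch the new process.
The new process starts and handles new connections. Having detected LISTEN_FDS and its parent process ID, it calls syscall.Kill to end the original process, and carries on serving while the parent (the original process) shuts down.
The parent receives TERM, stops receiving system signals, and begins shutting down. It disables keep-alive and closes the listener to keep new connections out, then waits for its connections to finish; once they have all closed, or have been forcibly closed after the timeout, every wg.Done in wait has been called.
wait returns, the goroutine ends, and the parent process exits, leaving only the newly started child process.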
