Linux thundering herd

Definition of the thundering herd problem (http://en.wikipedia.org/wiki/Thundering_herd_problem):
The thundering herd problem occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time. After the processes wake up, they all demand the resource and a decision must be made as to which process can continue. After the decision is made, the remaining processes are put back to sleep, only to all wake up again to request access to the resource.

Another widely quoted definition, which says the same thing (http://www.catb.org/jargon/html/T/thundering-herd-problem.html):
Scheduler thrashing. This can happen under Unix when you have a number of processes that are waiting on a single event. When that event (a connection to the web server, say) happens, every process which could possibly handle the event is awakened. In the end, only one of those processes will actually be able to do the work, but, in the meantime, all the others wake up and contend for CPU time before being put back to sleep. Thus the system thrashes briefly while a herd of processes thunders through. If this starts to happen many times per second, the performance impact can be significant.

UNIX Network Programming, Volume 1 also covers this in its advanced sockets part.

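Two points are worth stressing:
1) The "event" in these definitions is an event in the broad sense, not only the kind of event meant in a typical async event-driven model such as epoll. An incoming connection, for example, is an event.
2) These definitions describe the classical thundering herd problem, which is different from the thundering-herd-like problem that nginx solves (the name is my own, meaning "similar to the classical thundering herd"; when we discuss nginx in Chinese we usually just say 驚羣, "thundering herd", for both, which can cause confusion). In most cases nginx does not need to deal with the classical problem itself. I say "most cases" because it depends on the Linux kernel nginx runs on.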

================================================================
== Part 1: The classical thundering herd problem ==
================================================================

Let's start with the classical thundering herd and the system call it involves, accept.
After that, we turn to nginx's thundering-herd-like problem.

First, some pseudocode:
// parent process:
// create socket
listen_fd = socket(...);
bind(listen_fd, ...);
listen(listen_fd, ...);
fork();

// child processes:
// listen_fd is inherited across fork()
conn_fd = accept(listen_fd, ...);

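fork() duplicates the parent's file descriptor table, so the listen_fd in every child refers to the same open socket as the parent's listen_fd, that is, the same TCP socket. (Copy-on-write applies to memory pages, not to the socket itself.) When several processes call accept on the same TCP socket, the kernel puts them (struct task_struct) on the same wait queue (wait_queue_t).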
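The pseudocode above can be turned into a small runnable sketch. It is only an illustration under my own assumptions: the helper names (open_listener, prefork_accept_demo), the loopback address, the kernel-chosen port, and the 5-second alarm are not from any real server. It shows that all children block in accept on one shared listening socket and that each connection ends up in exactly one child. It cannot show the herd itself on a modern kernel.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: listen on 127.0.0.1 with a kernel-chosen port.
 * On success *addr holds the address clients should connect to. */
static int open_listener(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;                       /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *) addr, sizeof(*addr)) == -1
        || listen(fd, 16) == -1
        || getsockname(fd, (struct sockaddr *) addr, &len) == -1)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Fork `workers` children that each block in accept() on the same socket,
 * then open one connection per child. Returns how many children accepted
 * a connection, or -1 if the listener could not be created. */
int prefork_accept_demo(int workers)
{
    struct sockaddr_in addr;
    int forked = 0, accepted = 0;
    int listen_fd;

    assert(workers > 0 && workers <= 16);     /* stay within the backlog */
    listen_fd = open_listener(&addr);
    if (listen_fd == -1)
        return -1;

    for (int i = 0; i < workers; i++) {
        pid_t pid = fork();
        if (pid == -1)
            break;
        if (pid == 0) {
            /* child: the inherited listen_fd names the parent's socket */
            alarm(5);                         /* do not block forever */
            int conn_fd = accept(listen_fd, NULL, NULL);
            _exit(conn_fd == -1 ? 1 : 0);
        }
        forked++;
    }

    for (int i = 0; i < forked; i++) {        /* one connection per child */
        int c = socket(AF_INET, SOCK_STREAM, 0);
        if (c == -1)
            continue;
        (void) connect(c, (struct sockaddr *) &addr, sizeof(addr));
        close(c);
    }

    for (int i = 0; i < forked; i++) {        /* count successful accepts */
        int status;
        if (wait(&status) > 0 && WIFEXITED(status)
            && WEXITSTATUS(status) == 0)
            accepted++;
    }
    close(listen_fd);
    return accepted;
}
```

Calling prefork_accept_demo(4) forks four children. Because each child exits after a single accept, four connections are enough to release all of them.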

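What happens next depends on the kernel version.

Start with kernel v2.2.10, which had not yet solved the classical thundering herd.

When a connection arrives and is ready to be accepted, the kernel invokes callbacks stored in struct sock: write_space, data_ready, and state_change, all function pointers. sock_init_data sets them to sock_def_write_space, sock_def_readable, and sock_def_wakeup by default (TCP sockets use tcp_write_space for write_space). All of these call wake_up_interruptible, which is a macro wrapping __wake_up. __wake_up walks the socket's wait queue and wakes every waiting process. In the end only one process actually accepts the connection and leaves the wait queue; the other awakened processes go back to sleep on the wait queue. This is the classical thundering herd.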

include/net/sock.h:
struct sock {
    // ...
    // all kinds of fields
    // ...

    /* Callbacks */
    void (*state_change)(struct sock *sk);
    void (*data_ready)(struct sock *sk, int bytes);
    void (*write_space)(struct sock *sk);
    void (*error_report)(struct sock *sk);

    int (*backlog_rcv) (struct sock *sk, struct sk_buff *skb);
    void (*destruct)(struct sock *sk);
};

net/core/sock.c:
void sock_init_data(struct socket *sock, struct sock *sk)
{
    // ...
    // some init operations
    // ...

    sk->state_change        =       sock_def_wakeup;
    sk->data_ready          =       sock_def_readable;
    sk->write_space         =       sock_def_write_space;
    sk->error_report        =       sock_def_error_report;
    sk->destruct            =       sock_def_destruct;

    // ...
    // more init operations
    // ...
}

void sock_def_wakeup(struct sock *sk)
{
    if(!sk->dead)
        wake_up_interruptible(sk->sleep);
}

include/linux/sched.h:
extern void FASTCALL(__wake_up(struct wait_queue ** p, unsigned int mode));
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE)

__wake_up is implemented in kernel/sched.c. The relevant fragment clearly wakes every waiter:
while (next != head) {
    p = next->task;
    next = next->next;
    if (p->state & mode) {
        /*
         * We can drop the read-lock early if this
         * is the only/last process.
         */
        if (next == head) {
            read_unlock(&waitqueue_lock);
            wake_up_process(p);
            goto out;
        }
        wake_up_process(p);
    }
}

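We also need to look at the TASK_INTERRUPTIBLE macro and a few related macros, because the final fix for the thundering herd did not arrive in one step. There was a transitional solution, and it is tied to this group of macros.
They are also defined in include/linux/sched.h, for example: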
#define TASK_INTERRUPTIBLE      1

Next, a transitional version: kernel v2.2.26.
In include/linux/sched.h, one macro has been added compared with v2.2.10:
#define TASK_EXCLUSIVE          32

Then the __wake_up fragment in kernel/sched.c:
best_exclusive = 0;
do_exclusive = mode & TASK_EXCLUSIVE;
while (next != head) {
    p = next->task;
    next = next->next;
    if (p->state & mode) {
        if (do_exclusive && p->task_exclusive) {
            if (best_exclusive == NULL)
                best_exclusive = p;
        }
        else {
            wake_up_process(p);
        }
    }
}
if (best_exclusive)
    wake_up_process(best_exclusive);

To solve the thundering herd, struct task_struct gained a task_exclusive field in include/linux/sched.h:
unsigned int task_exclusive;    /* task wants wake-one semantics in __wake_up() */

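Together with the new macro, this version of __wake_up guarantees two things: every normal process (one without task_exclusive set) is woken, and only the first exclusive process is woken while the others stay on the wait queue. That resolves the thundering herd.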
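To make the two guarantees concrete, here is a user-space model of that loop. It is a sketch, not kernel code: struct task and wake_up_model are hypothetical stand-ins, the wait queue is a plain array, and "waking" just clears the state field. Only the two macro values come from the headers quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_INTERRUPTIBLE 1
#define TASK_EXCLUSIVE     32

/* Hypothetical, simplified stand-in for struct task_struct. */
struct task {
    int state;            /* TASK_INTERRUPTIBLE while asleep, 0 once awake */
    int task_exclusive;   /* nonzero: wants wake-one semantics */
};

/* Model of the v2.2.26 loop: wake every matching normal waiter, but at
 * most one exclusive waiter. Returns the number of tasks woken. */
int wake_up_model(struct task *queue, size_t n, int mode)
{
    struct task *best_exclusive = NULL;
    int do_exclusive = mode & TASK_EXCLUSIVE;
    int woken = 0;

    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct task *p = &queue[i];

        if (!(p->state & mode))
            continue;                     /* not sleeping in this mode */
        if (do_exclusive && p->task_exclusive) {
            if (best_exclusive == NULL)
                best_exclusive = p;       /* remember only the first one */
        } else {
            p->state = 0;                 /* stands in for wake_up_process */
            woken++;
        }
    }
    if (best_exclusive) {
        best_exclusive->state = 0;
        woken++;
    }
    return woken;
}
```

With two normal and three exclusive sleepers, a call with TASK_INTERRUPTIBLE | TASK_EXCLUSIVE wakes three tasks and leaves two exclusive ones asleep. A call with plain TASK_INTERRUPTIBLE behaves like v2.2.10 and wakes everyone still sleeping.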

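Finally, kernel v2.6.34 and the final solution. I picked this version because Baidu runs 2.6.x kernels everywhere.

The TASK_EXCLUSIVE macro is gone from include/linux/sched.h.
Instead, a similar macro appears inside the definition of wait_queue_t in include/linux/wait.h: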
struct __wait_queue {
    unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
    void *private;
    wait_queue_func_t func;
    struct list_head task_list;
};

The prototype of __wake_up in include/linux/wait.h has changed as well:
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);

kernel/sched.c:
void __wake_up(wait_queue_head_t *q, unsigned int mode,
               int nr_exclusive, void *key)
{
    // ...
    // some operations
    // ...

    __wake_up_common(q, mode, nr_exclusive, 0, key);

    // ...
}

/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
* number) then we wake all the non-exclusive tasks and one exclusive task.
*
* ...
* ...
*/
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
    wait_queue_t *curr, *next;

    list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
        unsigned flags = curr->flags;

        if (curr->func(curr, mode, wake_flags, key) &&
            (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
}

The logic resembles v2.2.26: walk the wait queue, wake the normal (non-exclusive) waiters, wake the requested number of exclusive waiters (nr_exclusive), then stop.
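The break condition is easy to misread, so here is a user-space model of __wake_up_common. It is a sketch under my own assumptions: struct waiter and default_wake are hypothetical stand-ins for wait_queue_t and curr->func, and the queue is an array in which the non-exclusive waiters come first. That ordering matters because the loop stops at the nr_exclusive-th exclusive waiter it wakes, which is why the kernel queues exclusive waiters at the tail.

```c
#include <assert.h>
#include <stddef.h>

#define WQ_FLAG_EXCLUSIVE 0x01

/* Hypothetical, simplified stand-in for wait_queue_t. */
struct waiter {
    unsigned int flags;
    int awake;
};

/* Stand-in for curr->func(): returns 1 only if it really woke the waiter. */
static int default_wake(struct waiter *w)
{
    if (w->awake)
        return 0;
    w->awake = 1;
    return 1;
}

/* Model of __wake_up_common: returns the number of waiters woken.
 * nr_exclusive == 0 never triggers the break, so everything is woken. */
int wake_up_common_model(struct waiter *q, size_t n, int nr_exclusive)
{
    int woken = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned int flags = q[i].flags;

        if (default_wake(&q[i])) {
            woken++;
            if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                break;                /* enough exclusive waiters woken */
        }
    }
    return woken;
}
```

With two normal and three exclusive waiters, wake_up_common_model(q, 5, 1) wakes the two normal ones plus the first exclusive one and then stops.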

I think that covers the classical thundering herd, if only roughly. Going deeper means tracing the relevant kernel code closely, and it's a big job.

================================================================
== Part 2: The new thundering herd problem and how nginx handles it ==
================================================================

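Now for nginx's new thundering herd (classical-thundering-herd-like), on kernel 2.6 with epoll. Here every worker watches the same listening fd through its own epoll instance, so a single incoming connection can make epoll_wait return in all of them even though only one accept will succeed. The new thundering herd caused by select and poll is essentially the same as with epoll. As an aside, I believe the kernel authors did this on purpose and consider it the right behavior; whether this new thundering herd should be "solved", and how, is left to user code.

In principle there are two ways to handle it: remove the conditions under which it can occur, or serialize the workers so they take turns instead of racing, which usually involves a lock.

lighttpd is the typical example of the first approach. Its officially recommended process model is one master plus one worker, so there is no "herd" to thunder in the first place.
nginx is one example of the second approach, and it is the one we focus on below.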

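Start with usage. nginx lets the user choose whether to handle the new thundering herd. Handling it requires extra computation and lock operations, which have a cost, so a user who does not care about the problem in their scenario can disable it. In the nginx version discussed here it is enabled by default (newer releases, from 1.11.3 on, default to off).
The accept_mutex directive in nginx's event core module turns this mechanism on or off. It may only appear inside events {}. It is defined as follows,
in event/ngx_event.c: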
static ngx_command_t ngx_event_core_commands[] = {
    // ...
    // definitions of some event core directives
    // ...

    { ngx_string("accept_mutex"),
      NGX_EVENT_CONF|NGX_CONF_FLAG,
      ngx_conf_set_flag_slot,
      0,
      offsetof(ngx_event_conf_t, accept_mutex),
      NULL },

    // ...
};
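
For reference, this is roughly how the directive would be set in nginx.conf. Treat it as a sketch: the values are only illustrative, and accept_mutex_delay is the companion directive behind the ngx_accept_mutex_delay variable that shows up in the code further down.

```
events {
    accept_mutex        on;
    accept_mutex_delay  500ms;
}
```

Setting accept_mutex off makes every worker register the listening sockets right away in ngx_event_process_init instead of waiting until it wins the lock.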

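A small complaint about the name "accept_mutex". At first glance it looks like a mutex around accept, meant to solve the classical thundering herd. In nginx's implementation, though, it does not guard accept directly; it guards epoll_ctl. Roughly: epoll_wait --> epoll_ctl --> epoll_wait --> accept. It takes quite a detour to reach accept.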

For convenience, here are the variables related to accept_mutex. They are declared in event/ngx_event.h, defined in event/ngx_event.c, and all global:
extern ngx_atomic_t          *ngx_accept_mutex_ptr;
extern ngx_shmtx_t            ngx_accept_mutex;
extern ngx_uint_t             ngx_use_accept_mutex;

extern ngx_uint_t             ngx_accept_mutex_held;

extern ngx_int_t              ngx_accept_disabled;

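The discussion below assumes accept_mutex is on.

nginx's startup and work flow can be summarized by a few fields of ngx_module_t: init_master, init_module, and init_process.

At the init_master stage, nginx creates, binds, and listens on the listening sockets in its master process. init_master can be seen as protocol-level (TCP/IP) handling, while init_module, mentioned below, handles nginx's internal "business logic".
Concretely, ngx_cycle_t is the central data structure of an nginx process and lives for the whole life of the process. At startup nginx initializes the corresponding ngx_cycle, or reinitializes it (on restart or reload, for example). While working on ngx_cycle_t it performs the usual TCP server setup. The simplified flow: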
main(): core/nginx.c
        |
       \|/
ngx_init_cycle(): core/ngx_cycle.c
        |
       \|/
ngx_open_listening_sockets(): core/ngx_connection.c

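Inside ngx_open_listening_sockets the sequence is socket --> setsockopt --> bind --> listen. Note that the listening socket fds already exist at this point.

At the init_module stage, nginx walks the ngx_modules array generated at build time and processes each module. This has little to do with our topic, so I skip the details.

init_process is the important one. Since the master has already created the listening socket fds, a freshly forked worker could, if it wanted, start handling network events and even call accept right away. Whether the worker does so at this point depends on the value of the accept_mutex directive.
Now look at ngx_event_process_init in event/ngx_event.c, which initializes a worker's event handling.

The fragment of this function related to accept_mutex: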

// Set the accept_mutex-related variables according to the accept_mutex directive
    if (ccf->master && ccf->worker_processes > 1 && ecf->accept_mutex) {
        ngx_use_accept_mutex = 1;
        ngx_accept_mutex_held = 0;
        ngx_accept_mutex_delay = ecf->accept_mutex_delay;

    } else {
        ngx_use_accept_mutex = 0;
    }

// Iterate over all listening sockets
    for (i = 0; i < cycle->listening.nelts; i++) {
        // ...

        rev->handler = ngx_event_accept; // note this handler

        if (ngx_use_accept_mutex) {
            continue;
        }

        if (ngx_event_flags & NGX_USE_RTSIG_EVENT) {
            if (ngx_add_conn(c) == NGX_ERROR) {
                return NGX_ERROR;
            }

        } else {
            if (ngx_add_event(rev, NGX_READ_EVENT, 0) == NGX_ERROR) {
                return NGX_ERROR;
            }
        }

        // ...
    }

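With accept_mutex on, the loop simply continues; with it off, ngx_add_event is called right here. In other words, with accept_mutex on, ngx_add_event is deferred to some later moment. Either way it is called sooner or later, so let's see what ngx_add_event does.

In event/ngx_event.h, ngx_add_event is defined as a macro: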
#define ngx_add_event        ngx_event_actions.add

ngx_event_actions is a global variable of type ngx_event_actions_t, which is defined in event/ngx_event.h and evidently exists to support the different event mechanisms of different platforms.

Go straight to epoll, in event/modules/ngx_epoll_module.c.

static ngx_int_t
ngx_epoll_init(ngx_cycle_t *cycle, ngx_msec_t timer)
{
    // ...
    ngx_event_actions = ngx_epoll_module_ctx.actions;
    // ...
}

ngx_event_module_t ngx_epoll_module_ctx = {
    &epoll_name,
    ngx_epoll_create_conf,               /* create configuration */
    ngx_epoll_init_conf,                 /* init configuration */

    {
        ngx_epoll_add_event,             /* add an event */
        ngx_epoll_del_event,             /* delete an event */
        ngx_epoll_add_event,             /* enable an event */
        ngx_epoll_del_event,             /* disable an event */
        ngx_epoll_add_connection,        /* add an connection */
        ngx_epoll_del_connection,        /* delete an connection */
        NULL,                            /* process the changes */
        ngx_epoll_process_events,        /* process the events */
        ngx_epoll_init,                  /* init the events */
        ngx_epoll_done,                  /* done the events */
    }
};

static ngx_int_t
ngx_epoll_add_event(ngx_event_t *ev, ngx_int_t event, ngx_uint_t flags)
{
    // ...
    // epoll_ctl
    // ...
}

By now it should be clear: ngx_add_event is essentially epoll_ctl with the operations EPOLL_CTL_ADD and EPOLL_CTL_MOD.
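The indirection can be sketched in a few lines of C. This is a simplified model, not nginx's code: event_actions_t, epoll_add_event, and the add_event/del_event macros are hypothetical names, and an explicit `registered` flag replaces nginx's own check of whether the connection's other event is already active. The point is the shape: a macro resolves to a function pointer in a global table, and the epoll implementation behind it ends in epoll_ctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical, cut-down version of an actions table. */
typedef struct {
    int (*add)(int epfd, int fd, uint32_t events, int registered);
    int (*del)(int epfd, int fd);
} event_actions_t;

/* epoll backend: ADD for a new fd, MOD for one that is already registered. */
static int epoll_add_event(int epfd, int fd, uint32_t events, int registered)
{
    struct epoll_event ee = { .events = events, .data = { .fd = fd } };

    assert(epfd >= 0 && fd >= 0);
    return epoll_ctl(epfd, registered ? EPOLL_CTL_MOD : EPOLL_CTL_ADD,
                     fd, &ee);
}

static int epoll_del_event(int epfd, int fd)
{
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
}

/* A real server would fill this in at init time for the chosen backend. */
event_actions_t event_actions = { epoll_add_event, epoll_del_event };

#define add_event  event_actions.add
#define del_event  event_actions.del
```

A caller only ever writes add_event(epfd, fd, EPOLLIN, 0). Swapping in a different backend means assigning a different table, which is the portability trick ngx_event_actions relies on.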

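Next question: where is ngx_add_event called after init_process?
Common sense says it should be in the worker's event-processing loop.
That is indeed the case: the loop looks like "while (1) { epoll_wait(); // ... }", only far more complicated.

Look at ngx_process_events_and_timers in event/ngx_event.c: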
    if (ngx_use_accept_mutex) {
        if (ngx_accept_disabled > 0) {
            ngx_accept_disabled--;

        } else {
            if (ngx_trylock_accept_mutex(cycle) == NGX_ERROR) {
                return;
            }

            if (ngx_accept_mutex_held) {
                flags |= NGX_POST_EVENTS;

            } else {
                if (timer == NGX_TIMER_INFINITE
                    || timer > ngx_accept_mutex_delay)
                {
                    timer = ngx_accept_mutex_delay;
                }
            }
        }
    }

    delta = ngx_current_msec;

    (void) ngx_process_events(cycle, timer, flags);

    delta = ngx_current_msec - delta;

    ngx_log_debug1(NGX_LOG_DEBUG_EVENT, cycle->log, 0,
                   "timer delta: %M", delta);

    if (ngx_posted_accept_events) {
        ngx_event_process_posted(cycle, &ngx_posted_accept_events);
    }

    if (ngx_accept_mutex_held) {
        ngx_shmtx_unlock(&ngx_accept_mutex);
    }

    if (delta) {
        ngx_event_expire_timers();
    }

    ngx_log_debug1(NGX_LOG_DEBUG_EVENT, cycle->log, 0,
                   "posted events %p", ngx_posted_events);

    if (ngx_posted_events) {
        if (ngx_threaded) {
            ngx_wakeup_worker_thread(cycle);

        } else {
            ngx_event_process_posted(cycle, &ngx_posted_events);
        }
    }

The key piece is the call to ngx_trylock_accept_mutex.

In event/ngx_event_accept.c:
ngx_int_t
ngx_trylock_accept_mutex(ngx_cycle_t *cycle)
{
    // ...
    if (ngx_shmtx_trylock(&ngx_accept_mutex)) {

        ngx_log_debug0(NGX_LOG_DEBUG_EVENT, cycle->log, 0,
                       "accept mutex locked");

        if (ngx_accept_mutex_held
            && ngx_accept_events == 0
            && !(ngx_event_flags & NGX_USE_RTSIG_EVENT))
        {
            return NGX_OK;
        }

        if (ngx_enable_accept_events(cycle) == NGX_ERROR) {
            ngx_shmtx_unlock(&ngx_accept_mutex);
            return NGX_ERROR;
        }

        ngx_accept_events = 0;
        ngx_accept_mutex_held = 1;

        return NGX_OK;
    }

    // ...
}

static ngx_int_t
ngx_enable_accept_events(ngx_cycle_t *cycle)
{
    // ...
            if (ngx_add_event(c->read, NGX_READ_EVENT, 0) == NGX_ERROR) {
                return NGX_ERROR;
            }
    // ...
}
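
The same idea can be tried outside nginx with a self-contained sketch. Everything here is an assumption for illustration: shm_mutex_t, shm_mutex_create, shm_mutex_trylock, shm_mutex_unlock, and try_enable_accept are made-up names, and the lock is a single atomic word in anonymous shared memory that holds the owner's pid. The worker that wins the lock registers the listening fd with epoll_ctl. A worker that loses it removes the fd if it had registered it earlier.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical lock word shared by all workers forked after creation. */
typedef struct {
    atomic_int owner;      /* 0 when free, otherwise the holder's pid */
} shm_mutex_t;

shm_mutex_t *shm_mutex_create(void)
{
    void *p = mmap(NULL, sizeof(shm_mutex_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    atomic_init(&((shm_mutex_t *) p)->owner, 0);
    return p;
}

/* Returns 1 if the lock was taken, 0 if somebody already holds it. */
int shm_mutex_trylock(shm_mutex_t *m)
{
    int expected = 0;

    assert(m != NULL);
    return atomic_compare_exchange_strong(&m->owner, &expected,
                                          (int) getpid());
}

/* Releases the lock only if this process is the holder. */
void shm_mutex_unlock(shm_mutex_t *m)
{
    int me = (int) getpid();

    atomic_compare_exchange_strong(&m->owner, &me, 0);
}

/* One round of "try the lock, then (un)register the listening fd".
 * *registered tracks whether listen_fd is currently in this worker's
 * epoll set. Returns 1 if the lock is held, 0 if not, -1 on error. */
int try_enable_accept(shm_mutex_t *m, int epfd, int listen_fd,
                      int *registered)
{
    if (shm_mutex_trylock(m)) {
        if (!*registered) {
            struct epoll_event ev = { .events = EPOLLIN,
                                      .data = { .fd = listen_fd } };
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) == -1) {
                shm_mutex_unlock(m);
                return -1;
            }
            *registered = 1;
        }
        return 1;
    }

    if (*registered) {     /* lost the lock: stop watching the listener */
        if (epoll_ctl(epfd, EPOLL_CTL_DEL, listen_fd, NULL) == -1)
            return -1;
        *registered = 0;
    }
    return 0;
}
```

A worker loop would call try_enable_accept before each epoll_wait and call shm_mutex_unlock after it has drained the accept events, so at any moment at most one worker has the listening fd in its epoll set.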

At this point everything seems clear. There is one more thing, though: note how ngx_process_events_and_timers in event/ngx_event.c uses the ngx_accept_disabled variable.

Recall that ngx_event_process_init set the handler of the listening fd's read event to ngx_event_accept.

Here is ngx_event_accept in event/ngx_event_accept.c:
void
ngx_event_accept(ngx_event_t *ev)
{
    // ...

#if (NGX_HAVE_ACCEPT4)
        if (use_accept4) {
            s = accept4(lc->fd, (struct sockaddr *) sa, &socklen,
                        SOCK_NONBLOCK);
        } else {
            s = accept(lc->fd, (struct sockaddr *) sa, &socklen);
        }
#else
        s = accept(lc->fd, (struct sockaddr *) sa, &socklen);
#endif

        // ...

            if (err == NGX_EMFILE || err == NGX_ENFILE) {
                // ...
                ngx_accept_disabled = 1;
                // ...
            }

            // ...

        ngx_accept_disabled = ngx_cycle->connection_n / 8
                              - ngx_cycle->free_connection_n;

        // ...
}
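
The assignment at the end of ngx_event_accept is plain arithmetic, so a worked example may help. The two functions below are hypothetical helpers of my own, not nginx functions. The first repeats the formula, and the second repeats the check at the top of ngx_process_events_and_timers, where a positive counter makes the worker sit out one round.

```c
#include <assert.h>

/* Same formula as in ngx_event_accept:
 * connection_n / 8 - free_connection_n. */
long accept_disabled_value(long connection_n, long free_connection_n)
{
    assert(connection_n >= free_connection_n && free_connection_n >= 0);
    return connection_n / 8 - free_connection_n;
}

/* One loop iteration: returns 1 if the worker should try the accept mutex,
 * 0 if it skips this round (and counts the skip down). */
int should_try_lock(long *accept_disabled)
{
    if (*accept_disabled > 0) {
        (*accept_disabled)--;
        return 0;
    }
    return 1;
}
```

With 1024 connections per worker, the threshold is 1024 / 8 = 128 free connections. A worker with 512 free slots gets 128 - 512 = -384 and keeps competing. A worker with only 100 free slots gets 128 - 100 = 28 and skips the lock until the counter drops back to zero or the next accept recomputes it.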

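To summarize:
First, nginx balances load across workers: only a worker that is idle enough takes part in the competition for accept_mutex. ngx_accept_disabled tracks this: it turns positive once fewer than one eighth of the worker's connections are free, and a worker with a positive value sits out;
then, the worker that wins accept_mutex may add the listening fds to its event model.

This is how nginx avoids the new thundering herd.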
