If the congestion window has not been fully used within one RTO, the TCP sender reduces it, because the window may no longer reflect actual network conditions. Per RFC 2861, ssthresh should be set to the maximum of its current value and 3/4 of the congestion window, and the congestion window should be set to half the sum of the amount actually used and the current window value.
In the send function tcp_write_xmit below, if any packets were actually transmitted (sent_pkts is non-zero), tcp_cwnd_validate is called afterwards to validate the congestion window. Its parameter is_cwnd_limited indicates whether transmission was limited by the congestion window. It is determined by two sources OR-ed together: the assignment inside tcp_tso_should_defer, and the check of whether the number of packets in flight has reached the congestion window.
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
			   int push_one, gfp_t gfp)
{
	...
	max_segs = tcp_tso_segs(sk, mss_now);
	while ((skb = tcp_send_head(sk))) {
		...
		tso_segs = tcp_init_tso_segs(skb, mss_now);
		if (tso_segs == 1) {
			...
		} else {
			if (!push_one &&
			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
						 &is_rwnd_limited, max_segs))
				break;
		}
		...
	}
	...
	if (likely(sent_pkts)) {
		...
		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
		tcp_cwnd_validate(sk, is_cwnd_limited);
		return false;
	}
	...
}
The tcp_tso_should_defer function below is skipped when a single packet is being pushed (push_one). If the congestion window is smaller than the send window, and the congestion window is no larger than the packet length, the packet cannot be sent now, and the is_cwnd_limited flag is set.
static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
				 bool *is_cwnd_limited, bool *is_rwnd_limited,
				 u32 max_segs)
{
	...
	send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;

	/* From in_flight test above, we know that cwnd > in_flight. */
	cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache;
	...
	/* Ok, it looks like it is advisable to defer.
	 * Three cases are tracked :
	 * 1) We are cwnd-limited
	 * 2) We are rwnd-limited
	 * 3) We are application limited.
	 */
	if (cong_win < send_win) {
		if (cong_win <= skb->len) {
			*is_cwnd_limited = true;
			return true;
		}
	} else {
		...
	}
	...
}
The congestion window validation function follows. On the first invocation, max_packets_out and max_packets_seq are both unset and are assigned packets_out and SND.NXT respectively. On subsequent invocations, they are updated only when a new send-window period begins or when more packets are outstanding than the recorded value. Thus max_packets_out records the maximum number of packets outstanding during the previous window, and max_packets_seq records the highest sequence number sent.
static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
{
	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
	struct tcp_sock *tp = tcp_sk(sk);

	/* Track the maximum number of outstanding packets in each
	 * window, and remember whether we were cwnd-limited then.
	 */
	if (!before(tp->snd_una, tp->max_packets_seq) ||
	    tp->packets_out > tp->max_packets_out) {
		tp->max_packets_out = tp->packets_out;
		tp->max_packets_seq = tp->snd_nxt;
		tp->is_cwnd_limited = is_cwnd_limited;
	}
The parameter is_cwnd_limited records whether the previous send-window period was constrained by the congestion window. The function tcp_is_cwnd_limited determines whether the connection's transmission is limited by the congestion window: true means all available network capacity was used; false means some capacity was left idle.
In the latter case, the current number of packets in the network is recorded in snd_cwnd_used. If the kernel is configured to reset the congestion window after an idle period exceeding one RTO (tcp_slow_start_after_idle is true), the idle time is at least one RTO, and the congestion control algorithm does not define its own cong_control handler, then tcp_cwnd_application_limited is called to handle the application-limited case.
	if (tcp_is_cwnd_limited(sk)) {
		/* Network is feed fully. */
		tp->snd_cwnd_used = 0;
		tp->snd_cwnd_stamp = tcp_jiffies32;
	} else {
		/* Network starves. */
		if (tp->packets_out > tp->snd_cwnd_used)
			tp->snd_cwnd_used = tp->packets_out;

		if (sock_net(sk)->ipv4.sysctl_tcp_slow_start_after_idle &&
		    (s32)(tcp_jiffies32 - tp->snd_cwnd_stamp) >= inet_csk(sk)->icsk_rto &&
		    !ca_ops->cong_control)
			tcp_cwnd_application_limited(sk);
The following checks whether the starvation was caused by an insufficient send buffer: the congestion window has already been ruled out (this is the else branch), the send queue is empty, and the application has hit the buffer limit; in that case, the TCP_CHRONO_SNDBUF_LIMITED flag is recorded.
		/* The following conditions together indicate the starvation
		 * is caused by insufficient sender buffer:
		 * 1) just sent some data (see tcp_write_xmit)
		 * 2) not cwnd limited (this else condition)
		 * 3) no more data to send (tcp_write_queue_empty())
		 * 4) application is hitting buffer limit (SOCK_NOSPACE)
		 */
		if (tcp_write_queue_empty(sk) && sk->sk_socket &&
		    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) &&
		    (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
			tcp_chrono_start(sk, TCP_CHRONO_SNDBUF_LIMITED);
	}
}
The kernel's notion of being congestion-window-limited differs slightly from RFC 2861. The RFC suggests that cwnd should not be increased if it was not fully used, which is exactly what the kernel does in congestion avoidance. In slow start, however, the kernel allows the congestion window to grow to twice the amount actually used. See the comments around tcp_is_cwnd_limited: with an initial window of 10, if 9 packets are sent and all of them are acknowledged, the window may grow to 18. This helps rate-limited applications probe the available bandwidth more effectively.
/* We follow the spirit of RFC2861 to validate cwnd but implement a more
* flexible approach. The RFC suggests cwnd should not be raised unless
* it was fully used previously. And that's exactly what we do in
* congestion avoidance mode. But in slow start we allow cwnd to grow
* as long as the application has used half the cwnd.
* Example :
* cwnd is 10 (IW10), but application sends 9 frames.
* We allow cwnd to reach 18 when all frames are ACKed.
* This check is safe because it's as aggressive as slow start which already
* risks 100% overshoot. The advantage is that we discourage application to
* either send more filler packets or data to artificially blow up the cwnd
* usage, and allow application-limited process to probe bw more aggressively.
*/
static inline bool tcp_is_cwnd_limited(const struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* If in slow start, ensure cwnd grows to twice what was ACKed. */
	if (tcp_in_slow_start(tp))
		return tp->snd_cwnd < 2 * tp->max_packets_out;
	return tp->is_cwnd_limited;
}
The function tcp_cwnd_application_limited below adjusts the congestion window after the network has been idle (underused) for an RTO. The adjustment is skipped during retransmission phases and when the application has recently hit its send-buffer limit. The window usage is taken as the larger of the initial window and the snd_cwnd_used value recorded in tcp_cwnd_validate; the congestion window is then set to half the sum of its old value and that usage.
/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
* As additional protections, we do not touch cwnd in retransmission phases,
* and if application hit its sndbuf limit recently.
*/
static void tcp_cwnd_application_limited(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
	    sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
		/* Limited by application or receiver window. */
		u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
		u32 win_used = max(tp->snd_cwnd_used, init_win);

		if (win_used < tp->snd_cwnd) {
			tp->snd_ssthresh = tcp_current_ssthresh(sk);
			tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
		}
		tp->snd_cwnd_used = 0;
	}
	tp->snd_cwnd_stamp = tcp_jiffies32;
}
Kernel version: 5.0