As the formulas below show, the bandwidth is taken as the smaller of the computed data send rate and the ACK receive rate. Normally the send rate (send_rate, the rate at which data was sent and subsequently acknowledged) is greater than the ACK receive rate (ack_rate). Under ACK compression, however, ACKs arrive bunched together and the ACK receive rate becomes spuriously inflated; in that case the bandwidth should fall back to the send rate (send_rate).
send_rate = #pkts_delivered/(last_snd_time - first_snd_time)
ack_rate = #pkts_delivered/(last_ack_time - first_ack_time)
bw = min(send_rate, ack_rate)
Recording packet send information
The function tcp_rate_skb_sent below records information about each transmitted skb; later, when an ACK (or SACK) for it arrives, this information is used to generate a rate sample. First, consider the sampling interval: when packets_out is zero, there are no packets in the network — everything sent has been acknowledged. That instant is a suitable starting point: record the send time of subsequent packets, and when the corresponding ACKs arrive, compute how long the packets spent in the network, i.e. the interval to be sampled.
packets_out is used here rather than the result of tcp_packets_in_flight(), because the latter is an estimate of the in-flight count based on RTO and loss-detection heuristics. A spurious RTO or overly aggressive loss marking would shorten the sampling interval and thereby inflate the bandwidth estimate.
void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
/* In general we need to start delivery rate samples from the
* time we received the most recent ACK, to ensure we include
* the full time the network needs to deliver all in-flight
* packets. If there are no packets in flight yet, then we
* know that any ACKs after now indicate that the network was
* able to deliver those packets completely in the sampling
* interval between now and the next ACK.
*
* Note that we use packets_out instead of tcp_packets_in_flight(tp)
* because the latter is a guess based on RTO and loss-marking
* heuristics. We don't want spurious RTOs or loss markings to cause
* a spuriously small time interval, causing a spuriously high
* bandwidth estimate.
*/
if (!tp->packets_out) {
u64 tstamp_us = tcp_skb_timestamp_us(skb);
tp->first_tx_mstamp = tstamp_us;
tp->delivered_mstamp = tstamp_us;
}
Otherwise, when packets_out is nonzero, the values previously recorded in the socket are copied into the control block of the outgoing skb. The function tcp_rate_check_app_limited, covered later, checks whether sending is limited because the application is not supplying enough data. Note the variables first_tx_mstamp and delivered_mstamp: they record the starting point of the current rate sample, and every packet sent within the same sampling window carries the same starting timestamps.
The tx.first_tx_mstamp field in the skb's TCP control block saves the start timestamp of the send-rate sampling window, used later to compute the send rate.
The tx.delivered field in the control block saves the number of packets this socket had successfully delivered at the moment this packet was sent, and tx.delivered_mstamp records the timestamp at which those tx.delivered packets were confirmed, i.e. the arrival time of the ACK that confirmed them; it is later used to compute the ACK rate.
TCP_SKB_CB(skb)->tx.first_tx_mstamp = tp->first_tx_mstamp;
TCP_SKB_CB(skb)->tx.delivered_mstamp = tp->delivered_mstamp;
TCP_SKB_CB(skb)->tx.delivered = tp->delivered;
TCP_SKB_CB(skb)->tx.is_app_limited = tp->app_limited ? 1 : 0;
}
tcp_rate_skb_sent has two call sites in the kernel: the transmit path and the retransmit path. In the transmit function __tcp_transmit_skb below, tcp_rate_skb_sent is called after the packet has been sent successfully to update the rate information. For pure ACK packets and the like, oskb is NULL and nothing is done.
static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
int clone_it, gfp_t gfp_mask, u32 rcv_nxt)
{
err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
if (unlikely(err > 0)) {
tcp_enter_cwr(sk);
err = net_xmit_eval(err);
}
if (!err && oskb) {
tcp_update_skb_after_send(sk, oskb, prior_wstamp);
tcp_rate_skb_sent(sk, oskb);
}
In most cases the retransmit function sends packets through tcp_transmit_skb, which wraps the transmit function above. However, if the skb's data buffer has an alignment problem, or the checksum start offset lies too deep in the headroom, the packet is copied and still sent via tcp_transmit_skb, but the rate information is not updated inside it; instead, tcp_rate_skb_sent is called here.
int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
{
...
if (unlikely((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
skb_headroom(skb) >= 0xFFFF)) {
struct sk_buff *nskb;
tcp_skb_tsorted_save(skb) {
nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) : -ENOBUFS;
} tcp_skb_tsorted_restore(skb);
if (!err) {
tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
tcp_rate_skb_sent(sk, skb);
}
} else {
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
Packet delivery timing
The function tcp_rate_skb_delivered computes how long a packet took to be delivered; it is called when an ACK is processed. If delivered_mstamp is zero, no timestamp was recorded when this packet was sent, and the packet is skipped.
void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb, struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
if (!scb->tx.delivered_mstamp)
return;
For an ACK that acknowledges multiple skbs (a stretched ACK), this function is called once per acknowledged skb. The rate sample rate_sample is built from the most recently sent of those skbs, i.e. the one with the largest tx.delivered value.
That skb's send timestamp is also used to update the socket's first_tx_mstamp, starting a new send-rate sampling window. The length of the send phase that just ended is then computed: the send timestamp of the most recently acknowledged packet minus the send timestamp of the earliest packet in the window (the start of the sampling period).
if (!rs->prior_delivered ||
after(scb->tx.delivered, rs->prior_delivered)) {
rs->prior_delivered = scb->tx.delivered;
rs->prior_mstamp = scb->tx.delivered_mstamp;
rs->is_app_limited = scb->tx.is_app_limited;
rs->is_retrans = scb->sacked & TCPCB_RETRANS;
/* Record send time of most recently ACKed packet: */
tp->first_tx_mstamp = tcp_skb_timestamp_us(skb);
/* Find the duration of the "send phase" of this window: */
rs->interval_us = tcp_stamp_us_delta(tp->first_tx_mstamp, scb->tx.first_tx_mstamp);
}
Finally, if the packet was acknowledged by a SACK, its tx.delivered_mstamp is cleared; otherwise the same skb would be used again to compute a rate sample when it is later cumulatively ACKed. As seen at the top of tcp_rate_skb_delivered, packets whose delivered_mstamp is zero are skipped.
/* Mark off the skb delivered once it's sacked to avoid being
* used again when it's cumulatively acked. For acked packets
* we don't need to reset since it'll be freed soon.
*/
if (scb->sacked & TCPCB_SACKED_ACKED)
scb->tx.delivered_mstamp = 0;
tcp_rate_skb_delivered is called from both SACK processing and cumulative-ACK processing. First, its use in SACK processing: tcp_sacktag_walk walks the retransmit queue starting from skb, and if an skb's data falls inside a SACK block (tcp_match_skb_to_sack), i.e. the SACK acknowledges that packet, tcp_rate_skb_delivered is called.
If in_sack is less than or equal to zero, the SACK does not fully cover the skb's data, and the partially overlapping case is handled by tcp_shift_skb_data.
static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
struct tcp_sack_block *next_dup, struct tcp_sacktag_state *state,
u32 start_seq, u32 end_seq, bool dup_sack_in)
{
skb_rbtree_walk_from(skb) {
...
if (in_sack <= 0) {
tmp = tcp_shift_skb_data(sk, skb, state, start_seq, end_seq, dup_sack);
...
}
if (unlikely(in_sack < 0)) break;
if (in_sack) {
TCP_SKB_CB(skb)->sacked = tcp_sacktag_one(sk, state,
TCP_SKB_CB(skb)->sacked, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
dup_sack,
tcp_skb_pcount(skb), tcp_skb_timestamp_us(skb));
tcp_rate_skb_delivered(sk, skb, state->rate);
For an skb whose data is only partially acknowledged, tcp_shifted_skb splits off the acknowledged portion and tries to merge it into the preceding, already-SACKed skb. Even though only part of the data is acknowledged, that part has completed its delivery, so it is used to update the rate information (see tcp_rate_skb_delivered).
static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
struct sk_buff *skb, struct tcp_sacktag_state *state,
unsigned int pcount, int shifted, int mss, bool dup_sack)
{
struct tcp_sock *tp = tcp_sk(sk);
u32 start_seq = TCP_SKB_CB(skb)->seq; /* start of newly-SACKed */
u32 end_seq = start_seq + shifted; /* end of newly-SACKed */
BUG_ON(!pcount);
/* Adjust counters and hints for the newly sacked sequence
* range but discard the return value since prev is already
* marked. We must tag the range first because the seq
* advancement below implicitly advances
* tcp_highest_sack_seq() when skb is highest_sack.
*/
tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
start_seq, end_seq, dup_sack, pcount,
tcp_skb_timestamp_us(skb));
tcp_rate_skb_delivered(sk, skb, state->rate);
Finally, the ACK-side rate processing: in tcp_clean_rtx_queue below, tcp_rate_skb_delivered updates the rate information whether an skb is fully or only partially acknowledged; for a partially acknowledged skb, fully_acked is false and the loop exits afterwards.
static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
u32 prior_snd_una, struct tcp_sacktag_state *sack)
{
for (skb = skb_rb_first(&sk->tcp_rtx_queue); skb; skb = next) {
...
tp->packets_out -= acked_pcount;
pkts_acked += acked_pcount;
tcp_rate_skb_delivered(sk, skb, sack->rate);
...
if (!fully_acked)
break;
Generating rate samples
The SACK and ACK processing functions above are both invoked from tcp_ack, and at the end of tcp_ack, tcp_rate_gen generates the rate sample. Before that, tcp_newly_delivered computes the number of packets this ACK confirmed. The third argument of tcp_rate_gen, lost, is the number of packets newly inferred (and marked) as lost.
At the very end, tcp_ack calls the congestion-control hook tcp_cong_control; currently only the BBR congestion algorithm consumes rate samples.
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
struct rate_sample rs = { .prior_delivered = 0 };
bool is_sack_reneg = tp->is_sack_reneg;
u32 lost = tp->lost;
u32 delivered = tp->delivered;
sack_state.rate = &rs;
rs.prior_in_flight = tcp_packets_in_flight(tp);
if ((flag & (FLAG_SLOWPATH | FLAG_SND_UNA_ADVANCED)) ==
FLAG_SND_UNA_ADVANCED) {
...
} else {
...
if (TCP_SKB_CB(skb)->sacked)
flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una, &sack_state);
...
}
...
/* See if we can take anything off of the retransmit queue. */
flag |= tcp_clean_rtx_queue(sk, prior_fack, prior_snd_una, &sack_state);
...
delivered = tcp_newly_delivered(sk, delivered, flag);
lost = tp->lost - lost; /* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);
In tcp_newly_delivered below, the socket's updated delivered counter minus the previous value prior_delivered yields the number of packets confirmed by this ACK.
/* Returns the number of packets newly acked or sacked by the current ACK */
static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
{
delivered = tp->delivered - prior_delivered;
NET_ADD_STATS(net, LINUX_MIB_TCPDELIVERED, delivered);
...
return delivered;
The rate-sample generator tcp_rate_gen clears app_limited once the delivered count passes the recorded application-limit point. It then stores the number of packets confirmed by this ACK (ACK and SACK) and the number of newly marked losses into the rate_sample structure.
If delivered is nonzero, the ACK confirmed new data, and the delivery timestamp is updated to the current time.
void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost, bool is_sack_reneg, struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk);
/* Clear app limited if bubble is acked and gone. */
if (tp->app_limited && after(tp->delivered, tp->app_limited))
tp->app_limited = 0;
/* TODO: there are multiple places throughout tcp_ack() to get
* current time. Refactor the code using a new "tcp_acktag_state"
* to carry current time, flags, stats like "tcp_sacktag_state".
*/
if (delivered)
tp->delivered_mstamp = tp->tcp_mstamp;
rs->acked_sacked = delivered; /* freshly ACKed or SACKed */
rs->losses = lost; /* freshly marked lost */
If no delivery timestamp was recorded, or the receiver has discarded previously SACKed out-of-order data (SACK reneging), an invalid rate sample is returned. In the reneging case, the bandwidth computation would count packets the receiver has since dropped, overestimating the bandwidth, so returning an invalid sample is the safe choice.
Note the distinction between two rate_sample fields: acked_sacked holds the packets newly confirmed (ACK/SACK) by this ACK, while rs->delivered holds the packets confirmed over the whole sampling interval.
/* Return an invalid sample if no timing information is available or
* in recovery from loss with SACK reneging. Rate samples taken during
* a SACK reneging event may overestimate bw by including packets that
* were SACKed before the reneg.
*/
if (!rs->prior_mstamp || is_sack_reneg) {
rs->delivered = -1;
rs->interval_us = -1;
return;
}
rs->delivered = tp->delivered - rs->prior_delivered;
Normally, for one send window, receiving the ACKs takes longer than sending the data, so (as described at the beginning) the computed ACK rate is lower than the send rate. To be robust against ACK compression, however, interval_us is set to the larger of the two phase durations.
/* Model sending data and receiving ACKs as separate pipeline phases
* for a window. Usually the ACK phase is longer, but with ACK
* compression the send phase can be longer. To be safe we use the
* longer phase.
*/
snd_us = rs->interval_us; /* send phase */
ack_us = tcp_stamp_us_delta(tp->tcp_mstamp, rs->prior_mstamp); /* ack phase */
rs->interval_us = max(snd_us, ack_us);
/* Record both segment send and ack receive intervals */
rs->snd_interval_us = snd_us;
rs->rcv_interval_us = ack_us;
If interval_us is smaller than the minimum RTT, the bandwidth is likely to be overestimated, so the sample is marked invalid.
/* Normally we expect interval_us >= min-rtt.
* Note that rate may still be over-estimated when a spuriously
* retransmistted skb was first (s)acked because "interval_us"
* is under-estimated (up to an RTT). However continuously
* measuring the delivery rate during loss recovery is crucial
* for connections suffer heavy or prolonged losses.
*/
if (unlikely(rs->interval_us < tcp_min_rtt(tp))) {
if (!rs->is_retrans)
pr_debug("tcp rate: %ld %d %u %u %u\n",
rs->interval_us, rs->delivered, inet_csk(sk)->icsk_ca_state,
tp->rx_opt.sack_ok, tcp_min_rtt(tp));
rs->interval_us = -1;
return;
}
If is_app_limited is not set, the sample represents a rate not limited by the application and is always recorded. Otherwise, the stored rate (rate_delivered/rate_interval_us) is updated only if the new sample's rate is at least as high.
/* Record the last non-app-limited or the highest app-limited bw */
if (!rs->is_app_limited ||
((u64)rs->delivered * tp->rate_interval_us >=
(u64)tp->rate_delivered * rs->interval_us)) {
tp->rate_delivered = rs->delivered;
tp->rate_interval_us = rs->interval_us;
tp->rate_app_limited = rs->is_app_limited;
}
Application-limited sending
In tcp_rate_check_app_limited below, the socket is considered limited by the application when all of the following hold: the unsent data in the socket send buffer is less than one MSS; nothing is queued in the local qdisc or NIC transmit queue (less than the space occupied by a one-byte skb); the number of packets in flight is below the congestion window (so cwnd is not what limits sending); and all lost packets have been retransmitted.
If the sum of delivered packets and packets in flight is greater than zero, that sum is stored in app_limited; otherwise app_limited is set to one. So app_limited serves two purposes: a nonzero value indicates that sending is application-limited, and the value itself records the delivery count at which the limit took effect.
/* If a gap is detected between sends, mark the socket application-limited. */
void tcp_rate_check_app_limited(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
if (/* We have less than one packet to send. */
tp->write_seq - tp->snd_nxt < tp->mss_cache &&
/* Nothing in sending host's qdisc queues or NIC tx queue. */
sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
/* We are not limited by CWND. */
tcp_packets_in_flight(tp) < tp->snd_cwnd &&
/* All lost packets have been retransmitted. */
tp->lost_out <= tp->retrans_out)
tp->app_limited = (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
In tcp_rate_gen, shown earlier, app_limited is cleared once the delivered count has passed the recorded app_limited value.
void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
bool is_sack_reneg, struct rate_sample *rs)
{
struct tcp_sock *tp = tcp_sk(sk);
u32 snd_us, ack_us;
/* Clear app limited if bubble is acked and gone. */
if (tp->app_limited && after(tp->delivered, tp->app_limited))
tp->app_limited = 0;
The check is invoked from the packet-sending functions at the boundary between TCP and the application, namely tcp_sendpage_locked and tcp_sendmsg_locked:
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
...
tcp_rate_check_app_limited(sk); /* is sending application-limited? */
...
}
int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
{
...
tcp_rate_check_app_limited(sk); /* is sending application-limited? */
Congestion control
As noted above, at the end of tcp_ack the congestion-control function tcp_cong_control invokes the cong_control callback registered by the congestion algorithm; currently only BBR registers this callback.
static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
int flag, const struct rate_sample *rs)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
if (icsk->icsk_ca_ops->cong_control) {
icsk->icsk_ca_ops->cong_control(sk, rs);
return;
}
BBR's cong_control implementation is bbr_main, which uses the rate sample to update its bandwidth model, set the pacing rate, and adjust the congestion window.
static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
struct bbr *bbr = inet_csk_ca(sk);
u32 bw;
bbr_update_model(sk, rs);
bw = bbr_bw(sk);
bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
Kernel version: 5.0