Linux TCP: Nagle's algorithm, TCP_NODELAY and TCP_CORK (repost)
Reposted from:
Silly Window Syndrome
Sender side: the application produces data very slowly
Sending a single byte of payload still costs 40 bytes of headers (TCP plus IP); a flood of such tiny packets causes network congestion, send-window jitter, and poor network utilization.
In the days before 3G/4G was widely deployed, OTT (over-the-top) applications relied on heartbeat mechanisms that constantly sent tiny keep-alive packets, causing signaling storms that hurt the stability of carrier networks.
Solution: the Nagle and cork algorithms, which delay sending and accumulate data into larger packets first. Interactive applications that need low latency, however, cannot afford to delay sending.
Receiver side: the application consumes data very slowly
The receive window fills up and the receiver advertises rwnd=0; the application then consumes one byte, the receiver advertises rwnd=1, and this consume-and-advertise cycle repeats. Because the sender's Nagle algorithm delays transmission, it may effectively ignore these tiny window advertisements.
Solutions:
Clark's method: ACK incoming data immediately, but keep advertising rwnd=0 until the buffer has enough free space for a maximum-length segment.
Delayed ACK. Advantage: fewer ACKs. Drawback: may trigger retransmissions.
Nagle and Cork
Purpose of Nagle's algorithm: avoid flooding the network with small packets. Only one small packet may be on the network at a time; until it is acknowledged, subsequent data is accumulated into a larger packet. A segment may be sent as soon as it reaches MSS size or carries a FIN, and when the timeout (typically 200 ms) fires it is sent immediately. Enabling TCP_NODELAY disables Nagle's algorithm.
Purpose of the cork algorithm: CORK means a stopper. Picture plugging the connection with a cork so that data is held back, then pulling the cork to let it all out.
Cork avoids small packets outright: it sends only MSS-sized packets, plus the small packets it has no choice but to send.
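Both behaviors are controlled with plain socket options. A minimal userspace sketch, assuming a Linux TCP socket (TCP_CORK is Linux-specific; the wrapper names are mine, not a standard API):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Thin wrappers over the two socket options discussed above.
 * Return 0 on success, -1 on error, like setsockopt() itself. */
int tcp_set_nodelay(int fd, int enable)
{
    /* enable != 0 disables Nagle (sets TCP_NAGLE_OFF in the kernel) */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &enable, sizeof(enable));
}

int tcp_set_cork(int fd, int enable)
{
    /* enable != 0 corks the socket (sets TCP_NAGLE_CORK) */
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &enable, sizeof(enable));
}
```

Both options can be set on the same socket; while it is held, TCP_CORK is the stronger of the two.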
setsockopt
Toggling TCP_CORK affects only the TCP_NAGLE_CORK flag; TCP_NAGLE_PUSH is set on uncork only when Nagle is disabled (TCP_NAGLE_OFF set via TCP_NODELAY).
TCP_NODELAY, in turn, toggles Nagle's algorithm by setting or clearing TCP_NAGLE_OFF.
TCP_NAGLE_PUSH is a one-shot flag: whenever a new skb is created and appended to the send queue, TCP_NAGLE_PUSH is cleared (see skb_entail()).
#define TCP_NAGLE_OFF        1    /* Nagle's algo is disabled */
#define TCP_NAGLE_CORK        2    /* Socket is corked        */
#define TCP_NAGLE_PUSH        4    /* Cork is overridden for already queued data */
case TCP_CORK:
    /* When set indicates to always queue non-full frames.
     * Later the user clears this option and we transmit
     * any pending partial frames in the queue.  This is
     * meant to be used alongside sendfile() to get properly
     * filled frames when the user (for example) must write
     * out headers with a write() call first and then use
     * sendfile to send out the data parts.
     *
     * TCP_CORK can be set together with TCP_NODELAY and it is
     * stronger than TCP_NODELAY.
     */
    if (val) {
        tp->nonagle |= TCP_NAGLE_CORK;
    } else {
        tp->nonagle &= ~TCP_NAGLE_CORK;
        if (tp->nonagle & TCP_NAGLE_OFF)
            tp->nonagle |= TCP_NAGLE_PUSH;
        tcp_push_pending_frames(sk);
    }
    break;
case TCP_NODELAY:
    if (val) {
        /* TCP_NODELAY is weaker than TCP_CORK, so that
         * this option on corked socket is remembered, but
         * it is not activated until cork is cleared.
         *
         * However, when TCP_NODELAY is set we make
         * an explicit push, which overrides even TCP_CORK
         * for currently queued segments.
         */
        tp->nonagle |= TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
        tcp_push_pending_frames(sk);
    } else {
        tp->nonagle &= ~TCP_NAGLE_OFF;
    }
    break;
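The switch above amounts to a small state machine over the nonagle bit-field. Here is a userspace sketch of just that logic, with the flag values copied from the kernel; set_cork()/set_nodelay() are illustrative names, and the call to tcp_push_pending_frames() is reduced to a boolean return:

```c
#include <stdbool.h>

#define TCP_NAGLE_OFF  1 /* Nagle's algo is disabled */
#define TCP_NAGLE_CORK 2 /* socket is corked */
#define TCP_NAGLE_PUSH 4 /* cork is overridden for already queued data */

/* Mimics the TCP_CORK case; returns true when the kernel would
 * call tcp_push_pending_frames(). */
bool set_cork(int *nonagle, int val)
{
    if (val) {
        *nonagle |= TCP_NAGLE_CORK;
        return false;
    }
    *nonagle &= ~TCP_NAGLE_CORK;
    if (*nonagle & TCP_NAGLE_OFF) /* push only set when Nagle is off */
        *nonagle |= TCP_NAGLE_PUSH;
    return true;
}

/* Mimics the TCP_NODELAY case: enabling it always forces a push. */
bool set_nodelay(int *nonagle, int val)
{
    if (val) {
        *nonagle |= TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
        return true;
    }
    *nonagle &= ~TCP_NAGLE_OFF;
    return false;
}
```

This makes the asymmetry visible: uncorking sets the one-shot TCP_NAGLE_PUSH only when Nagle is already off, while enabling TCP_NODELAY always pushes, even on a corked socket.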
Sending data
tcp_sendmsg: we skip many details here. It suffices to know that user data is copied into skbs sized according to GSO, each skb is pushed at an appropriate moment, and once all data has been copied (or memory runs short), tcp_push() is called to transmit.
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
    // size_goal is the GSO-supported size, an integer multiple of mss_now;
    // without GSO the two are equal
    mss_now = tcp_send_mss(sk, &size_goal, flags);
    // copy the user data in msg into one skb, up to the maximum GSO size
    // skb_entail(sk, skb) appends it to the send queue
    // data remains to be copied, but the current skb is full and may be sent
    if (forced_push(tp)) {    // more than half the maximum window sent since the last push
        tcp_mark_push(tp, skb);    // set the PSH flag and update pushed_seq
        __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);    // send immediately via tcp_write_xmit
    } else if (skb == tcp_send_head(sk)) {    // first packet in the queue: send directly
        tcp_push_one(sk, mss_now);
    } else {
        // other skbs are already waiting ahead in the send queue, and not much
        // time has passed since the last push: leave the skb queued and start
        // copying into the next one
        continue;
    }
out:
    // the final packet is sent via tcp_push
    tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
    ...
}
static void skb_entail(struct sock *sk, struct sk_buff *skb)
{
    ...
    tcp_add_write_queue_tail(sk, skb);
    if (tp->nonagle & TCP_NAGLE_PUSH)
        tp->nonagle &= ~TCP_NAGLE_PUSH;    // a new skb entering the send queue immediately clears the one-shot PUSH flag
}
static void tcp_push(struct sock *sk, int flags, int mss_now,
                     int nonagle, int size_goal)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    if (!tcp_send_head(sk))
        return;

    skb = tcp_write_queue_tail(sk);
    if (!(flags & MSG_MORE) || forced_push(tp))
        tcp_mark_push(tp, skb);

    tcp_mark_urg(tp, flags);
    if (tcp_should_autocork(sk, skb, size_goal)) {
        // defer the send via the TSQ mechanism
        /* avoid atomic op if TSQ_THROTTLED bit is already set */
        if (!test_bit(TSQ_THROTTLED, &tp->tsq_flags)) {
            NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
            set_bit(TSQ_THROTTLED, &tp->tsq_flags);
        }
        /* It is possible TX completion already happened
         * before we set TSQ_THROTTLED.
         */
        if (atomic_read(&sk->sk_wmem_alloc) > skb->truesize)
            return;
    }

    if (flags & MSG_MORE)    // the application hints that more data will follow shortly: cork, don't send small packets
        nonagle = TCP_NAGLE_CORK;

    __tcp_push_pending_frames(sk, mss_now, nonagle);    // eventually calls tcp_write_xmit
}
tcp_should_autocork
tcp_autocorking = 1 (enabled by default)
With tcp_autocorking enabled, if the current skb has not reached the maximum GSO size and other data is already queued ahead of it, there is no rush to send, so tcp_should_autocork() returns true.
The TSQ mechanism then takes over: when the NIC finishes transmitting a packet and frees its skb, a tasklet is scheduled and the send is retried in the next softirq.
/* If a not yet filled skb is pushed, do not send it if
 * we have data packets in Qdisc or NIC queues :
 * Because TX completion will happen shortly, it gives a chance
 * to coalesce future sendmsg() payload into this skb, without
 * need for a timer, and with no latency trade off.
 * As packets containing data payload have a bigger truesize
 * than pure acks (dataless) packets, the last checks prevent
 * autocorking if we only have an ACK in Qdisc/NIC queues,
 * or if TX completion was delayed after we processed ACK packet.
 */
static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
                                int size_goal)
{
    return skb->len < size_goal &&    // below the maximum GSO size
           sysctl_tcp_autocorking &&    // enabled by default
           skb != tcp_write_queue_head(sk) &&    // other skbs are queued ahead of this one
           atomic_read(&sk->sk_wmem_alloc) > skb->truesize;    // data sits in the qdisc/NIC queues: the TX-completion interrupt that frees it will soon bring another chance to send
}
tcp_write_xmit
tcp_push(), tcp_push_one() and __tcp_push_pending_frames() all end up calling tcp_write_xmit().
By the time tcp_write_xmit() runs, the current send() system call has already built skbs as close to the GSO size as it could.
tcp_write_xmit() then uses the Nagle test to decide whether to wait for the application to deliver more data before sending.
If it decides to send, tcp_transmit_skb() performs the actual transmission.
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
                           int push_one, gfp_t gfp)
{
    max_segs = tcp_tso_segs(sk, mss_now);    // maximum number of segs TSO currently supports
    while ((skb = tcp_send_head(sk))) {    // walk the send queue
        tso_segs = tcp_init_tso_segs(skb, mss_now);    // skb->len/mss; recompute tcp_gso_segs, which was zeroed in tcp_sendmsg
        ...
        if (tso_segs == 1) {    // tso_segs == 1: no TSO segmentation needed
            /* use the Nagle test to decide whether to defer sending */
            if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
                                         (tcp_skb_is_last(sk, skb) ?
                                          nonagle : TCP_NAGLE_PUSH))))    // a non-last skb is always pushed; only the last skb is subject to nonagle
                break;    // defer sending
        } else {    // TSO segmentation
            if (!push_one &&    // not restricted to a single skb
                tcp_tso_should_defer(sk, skb, &is_cwnd_limited,    // if little send window remains and the next ACK (which will open the window) is expected soon, defer
                                     max_segs))
                break;    // deferral allowed
        }
        // no deferral: send now
        limit = mss_now;
        ...
        if (tcp_small_queue_check(sk, skb, 0))    // TSQ check: has the qdisc hit its limit?
            break;
        if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))    // transmit; if the qdisc dropped the packet, stop sending
            break;
        tcp_event_new_data_sent(sk, skb);    // update sk_send_head and packets_out
        /* update snd_sml in struct tcp_sock: the end sequence number of the
         * last sub-MSS packet sent, used mainly by the Nagle test
         */
        tcp_minshall_update(tp, mss_now, skb);
        sent_pkts += tcp_skb_pcount(skb);
        if (push_one)    // asked to send only one skb: stop
            break;
    }
    ...
    // no packets in flight but data still queued: prepare zero-window probing
    return !tp->packets_out && tcp_send_head(sk);
}
tcp_nagle_test
When GSO is disabled, or the data in the current send() amounts to less than one MSS, tcp_nagle_test() is called to decide whether to defer sending.
The packet is sent immediately in any of the following cases:
TCP_NAGLE_PUSH is set: for example the application enabled TCP_NODELAY; or the current skb is not the last one in the send queue (frames in the middle of the queue cannot receive more data); or the skb reached the maximum GSO size and more than half the maximum window was sent since the last push.
The packet carries urgent data or a FIN.
The packet has reached MSS size.
TCP_NAGLE_CORK is not set and the previously sent small packet has already been acknowledged.
In other words: with CORK set, small packets are never sent; without CORK, a small packet is still deferred while the previously sent small packet remains unacknowledged.
/* Return true if the Nagle test allows this packet to be
* sent now.
*/
static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
                                  unsigned int cur_mss, int nonagle)
{
    /* Nagle rule does not apply to frames, which sit in the middle of the
     * write_queue (they have no chances to get new data).
     *
     * This is implemented in the callers, where they modify the 'nonagle'
     * argument based upon the location of SKB in the send queue.
     */
    if (nonagle & TCP_NAGLE_PUSH)
        return true;

    /* Don't use the nagle rule for urgent data (or for the final FIN). */
    if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
        return true;

    if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
        return true;

    // skb->len < cur_mss and either TCP_NAGLE_CORK is set or the last
    // small packet sent is still unacknowledged: defer sending
    return false;
}
static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
                            int nonagle)
{
    return partial &&    // skb->len < mss; anything >= mss is sent directly
           ((nonagle & TCP_NAGLE_CORK) ||    // corked: apply Nagle
            (!nonagle && tp->packets_out && tcp_minshall_check(tp)));    // data in flight and the last small packet sent is unacked: apply Nagle
}
/* Minshall's variant of the Nagle send check. */
static bool tcp_minshall_check(const struct tcp_sock *tp)
{
    return after(tp->snd_sml, tp->snd_una) &&    // the last small packet sent is not yet acknowledged
           !after(tp->snd_sml, tp->snd_nxt);    // and the sequence number has not wrapped
}
static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
                                const struct sk_buff *skb)
{
    if (skb->len < tcp_skb_pcount(skb) * mss_now)
        tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
}
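after() in the snippets above is the kernel's wrap-safe sequence-number comparison. A self-contained sketch of the Minshall check, with a simplified stand-in for struct tcp_sock (the field meanings follow the kernel; the struct and function names are mine, for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is after b" for 32-bit TCP sequence numbers; equivalent
 * to the kernel's after() macro. */
bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Simplified stand-in for struct tcp_sock: just the three fields the
 * Minshall check reads. */
struct tcp_state {
    uint32_t snd_sml; /* end seq of the last sub-MSS packet sent */
    uint32_t snd_una; /* oldest unacknowledged sequence number */
    uint32_t snd_nxt; /* next sequence number to send */
};

/* True while the last small packet sent is still unacknowledged. */
bool minshall_check(const struct tcp_state *tp)
{
    return seq_after(tp->snd_sml, tp->snd_una) &&
           !seq_after(tp->snd_sml, tp->snd_nxt);
}
```

With snd_una = 1000, snd_nxt = 1500 and snd_sml = 1200 the check is true, so Nagle defers further small packets; once an ACK advances snd_una past 1200 it turns false. The signed-difference cast keeps the comparison correct across sequence-number wraparound.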
tcp_tso_should_defer
When GSO is enabled and the current skb consists of more than one segment, tcp_tso_should_defer() decides whether to delay sending.
Sending is deferred when little send window remains and the next ACK is likely to arrive soon (which would open the window again).
static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
                                 bool *is_cwnd_limited, u32 max_segs)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    u32 age, send_win, cong_win, limit, in_flight;
    struct tcp_sock *tp = tcp_sk(sk);
    struct skb_mstamp now;
    struct sk_buff *head;
    int win_divisor;

    if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
        goto send_now;

    if (icsk->icsk_ca_state >= TCP_CA_Recovery)
        goto send_now;

    /* Avoid bursty behavior by allowing defer
     * only if the last write was recent.
     */
    if ((s32)(tcp_time_stamp - tp->lsndtime) > 0)
        goto send_now;

    in_flight = tcp_packets_in_flight(tp);
    BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));

    send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;    // remaining send window

    /* From in_flight test above, we know that cwnd > in_flight.  */
    cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache;    // remaining congestion window

    limit = min(send_win, cong_win);    // remaining sendable window

    /* If a full-sized TSO skb can be sent, do it. */
    if (limit >= max_segs * tp->mss_cache)    // a maximum-size TSO send fits
        goto send_now;

    /* Middle in queue won't get any more data, full sendable already? */
    if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))    // not the last skb in the queue, and the window fits it
        goto send_now;    // send directly: no more data will be appended to this skb

    win_divisor = ACCESS_ONCE(sysctl_tcp_tso_win_divisor);
    if (win_divisor) {
        u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);

        /* If at least some fraction of a window is available,
         * just use it.
         */
        chunk /= win_divisor;
        if (limit >= chunk)    // the remaining window exceeds the configured fraction of the total window (1/3 by default)
            goto send_now;
    } else {
        /* Different approach, try not to defer past a single
         * ACK.  Receiver should ACK every other full sized
         * frame, so if we have space for more than 3 frames
         * then send now.
         */
        if (limit > tcp_max_tso_deferred_mss(tp) * tp->mss_cache)
            goto send_now;
    }

    head = tcp_write_queue_head(sk);
    skb_mstamp_get(&now);
    age = skb_mstamp_us_delta(&now, &head->skb_mstamp);    // how long ago the oldest unacked packet was sent
    /* If next ACK is likely to come too late (half srtt), do not defer */
    if (age < (tp->srtt_us >> 4))    // i.e., the next ACK is likely more than half an srtt away: send directly
        goto send_now;

    /* Ok, it looks like it is advisable to defer. */

    // record whether this skb is limited by the congestion window
    if (cong_win < send_win && cong_win <= skb->len)
        *is_cwnd_limited = true;

    // deferral allowed
    return true;

send_now:
    return false;
}
Application tips
For an HTTP server response that sends the HTTP headers followed by a file via sendfile():
set TCP_CORK first, then write() the HTTP headers; the headers are held back instead of being sent.
Then call sendfile(); as long as the GSO size has not been reached, the data is still held back.
Finally set TCP_NODELAY, which sets TCP_NAGLE_PUSH and flushes everything immediately. If you merely clear TCP_CORK, the kernel still runs the Nagle test on what remains.
Passing MSG_MORE in send()'s flags hints to the kernel that more data is about to follow; the kernel applies the CORK behavior automatically, saving you an extra setsockopt() system call. Setting MSG_EOR, however, does not push the data immediately.
Enabling TCP_NODELAY disables Nagle's algorithm; HTTP servers generally disable it.
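The header-plus-sendfile() pattern from the first tip can be sketched as follows (error handling omitted; send_response() and its arguments are illustrative, assuming sock is a connected TCP socket and file_fd an open regular file):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/* Cork, write the header, sendfile() the body, then force an immediate
 * flush with TCP_NODELAY (which sets TCP_NAGLE_PUSH in the kernel). */
void send_response(int sock, const char *header, int file_fd, size_t body_len)
{
    int on = 1;

    /* 1. Cork the socket: partial frames are queued, not sent. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    /* 2. The header stays in the socket buffer. */
    write(sock, header, strlen(header));

    /* 3. The body is coalesced with the header into full-sized frames. */
    sendfile(sock, file_fd, NULL, body_len);

    /* 4. Flush everything now; merely clearing TCP_CORK would leave the
     *    tail segment to the Nagle test. */
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

In a real server you would check every return value and loop on short writes; TCP_CORK and sendfile() are Linux-specific, so this pattern does not port as-is.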
Source: www.cnblogs.com/wanpengcoder/p/5366156.html
1. Nagle's algorithm:
Its goal is to reduce the number of small segments on wide-area networks and thereby reduce congestion.
The algorithm allows at most one unacknowledged small segment per TCP connection; until that segment's ACK arrives, no other small segments may be sent. TCP collects the small amounts of data accumulated in the meantime and sends them as a single segment when the ACK arrives. A small segment is any segment smaller than the MSS.
The beauty of the algorithm is that it is self-adapting: the faster ACKs arrive, the faster data is sent; on slow wide-area networks, where reducing tiny segments matters most, fewer segments are sent.
2. Delayed ACK:
If TCP sent a separate ACK for every incoming data packet, a standalone packet carrying nothing but an ACK would be rather expensive. Instead, TCP waits a while: if data is sent to the peer within that interval, the ACK piggybacks on it; if the delayed-ACK timer fires with the ACK still unsent, the ACK is sent on its own immediately.
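As a toy model of that trade-off (all names and the timer value here are invented for illustration): the receiver holds a pending ACK until it can piggyback on outbound data or the delayed-ACK timer expires.

```c
#include <stdbool.h>

/* Possible outcomes for a pending ACK in this toy model. */
enum ack_action {
    ACK_WAIT,       /* keep holding the ACK */
    ACK_PIGGYBACK,  /* attach it to outgoing data */
    ACK_STANDALONE  /* timer fired: send a bare ACK now */
};

/* Decide what to do with a pending ACK, given whether there is data to
 * send and how long the ACK has been pending (ms) vs. the timer. */
enum ack_action delayed_ack_decision(bool have_outbound_data,
                                     int pending_ms, int timer_ms)
{
    if (have_outbound_data)
        return ACK_PIGGYBACK;  /* free ride on the data segment */
    if (pending_ms >= timer_ms)
        return ACK_STANDALONE; /* cannot wait any longer */
    return ACK_WAIT;
}
```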
Benefits of delayed ACK:
