Nginx error: connect() failed (110: Connection timed out) ...
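Background
While load testing an application service, Nginx began reporting errors after roughly 1 min of sustained requests. It took some time to track down the cause; this post summarizes the investigation and the root cause.
Load-testing tool
The tool used here is siege. It makes it easy to set the number of concurrent users and the test duration, and it reports results clearly: successful transactions, failed transactions, throughput, and other performance figures.
Test parameters
A single endpoint, 100 concurrent users, sustained for 1 min.
Errors from the load-testing tool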
The server is now
[error] socket: unable to connect sock.c:249: Connection timed out
[error] socket: unable to connect sock.c:249: Connection timed out
Errors in the Nginx error.log
2018/11/21 17:31:23 [error] 15622#0: *24993920 connect() failed (110: Connection timed out) while connecting to upstream, client: , server: , request: "GET /guide/v1/activities/1107 HTTP/1.1", ups
2018/11/21 18:21:09 [error] 4469#0: *25079420 connect() failed (110: Connection timed out) while connecting to upstream, client: , server: , request: "GET /guide/v1/activities/1107 HTTP/1.1", upstr
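When the log is long, it helps to tally the failures by error code rather than read entries one by one. The following is a minimal Python sketch, assuming the entries follow the prefix layout seen above (timestamp, level, pid#tid, connection id, message). The function name count_connect_failures and the shortened sample lines are illustrative only.

```python
import re
from collections import Counter

# Prefix of an Nginx error.log entry as seen above:
# "YYYY/MM/DD HH:MM:SS [level] pid#tid: *connection_id message"
LOG_RE = re.compile(
    r"^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<pid>\d+)#\d+: \*(?P<conn>\d+) (?P<msg>.*)$"
)
# Picks "110" and "Connection timed out" out of the message part.
ERRNO_RE = re.compile(r"connect\(\) failed \((?P<errno>\d+): (?P<text>[^)]+)\)")


def count_connect_failures(lines):
    """Count upstream connect() failures, keyed by (errno, description)."""
    counts = Counter()
    for line in lines:
        entry = LOG_RE.match(line.strip())
        if entry is None:
            continue  # not a log entry in the expected layout
        failure = ERRNO_RE.search(entry.group("msg"))
        if failure:
            key = (int(failure.group("errno")), failure.group("text"))
            counts[key] += 1
    return counts


# Shortened copies of the two entries above, used only as sample input.
sample = [
    "2018/11/21 17:31:23 [error] 15622#0: *24993920 connect() failed "
    "(110: Connection timed out) while connecting to upstream",
    "2018/11/21 18:21:09 [error] 4469#0: *25079420 connect() failed "
    "(110: Connection timed out) while connecting to upstream",
]
failures = count_connect_failures(sample)
print(failures)
```

On Linux, error number 110 is ETIMEDOUT, so every entry counted under that key is a connection attempt from Nginx to the upstream that never completed.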
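Troubleshooting
On seeing "timed out", my first guess was a performance problem in the application service that left it unable to answer concurrent requests. The application logs, however, showed no errors at all.
I then watched the CPU load of the application service (it runs in a Docker container, so via docker stats <container id>). CPU usage rose under concurrent load, which is expected, and nothing else looked abnormal. Continued observation showed that once the errors began, the CPU load of the application service dropped and request entries stopped appearing in its log. That suggested the failure to respond came from the node in front of the application service, namely Nginx.
Next, check the TCP connections on the Nginx server while the test is running:
# Count current connections on port 80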
netstat -nat|grep -i "80"|wc -l
5407
# Show the current TCP connections grouped by state
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 12
SYN_RECV 1
ESTABLISHED 454
FIN_WAIT1 1
TIME_WAIT 5000
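The same tally can be done in Python on captured netstat output, which also makes it easy to match the port exactly (the grep "80" pattern above matches any line containing "80", such as port 8080). This is a rough sketch, assuming the usual netstat -nat column layout with the protocol first, the local address fourth, and the state last. The addresses in the sample are made up.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP sockets per state, like the awk one-liner above."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Header lines do not start with "tcp" and are skipped.
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states


def count_local_port(netstat_output, port):
    """Count tcp lines whose local address ends in exactly :<port>."""
    suffix = ":%d" % port
    total = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].startswith("tcp"):
            if fields[3].endswith(suffix):
                total += 1
    return total


# Hypothetical capture; the addresses are invented for illustration.
sample = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.9:51235          TIME_WAIT
tcp        0      0 10.0.0.5:8080           10.0.0.9:51236          TIME_WAIT
"""
states = count_tcp_states(sample)
on_port_80 = count_local_port(sample, 80)
print(dict(states), on_port_80)
```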
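Two things stand out in the TCP connections:
There are more than 5k connections
The number of TIME_WAIT sockets stops growing once it reaches 5000
Taking these two points in turn:
In theory, a test with 100 concurrent users should need only about 100 connections, so the extra ones were most likely created by siege during the test.
# Inspect the siege configuration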
vim ~/.f
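# Mystery solved: siege's connection setting defaults to close. During a sustained test, the connection is closed as soon as each request finishes and a new one is created for the next request, which explains why the Nginx server saw more than 5000 TCP connections rather than 100.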
# Connection directive. Options "close" and "keep-alive" Starting with
# version 2.57, siege implements persistent connections in accordance
# to RFC 2068 using both chunked encoding and content-length directives
# to determine the page size.
#
# To run siege with persistent connections set this to keep-alive.
#
# CAUTION: Use the keep-alive directive with care.
# DOUBLE CAUTION: This directive does not work well on HPUX
# TRIPLE CAUTION: We don't recommend you set this to keep-alive
# ex: connection = close
# connection = keep-alive
#
connection = close
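To see what the two settings mean in terms of TCP connections, here is a small self-contained Python sketch. It does not use siege; it starts a throwaway local HTTP server from the standard library, counts the TCP connections that server accepts, and sends the same number of requests once with a new connection per request and once over a single persistent connection. The names CountingHandler and run_requests are invented for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections open
    connections = 0  # TCP connections accepted so far

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def run_requests(port, count, keep_alive):
    """Send `count` GETs; return how many TCP connections the server saw."""
    before = CountingHandler.connections
    if keep_alive:
        # One connection reused for every request.
        conn = http.client.HTTPConnection("127.0.0.1", port)
        for _ in range(count):
            conn.request("GET", "/")
            conn.getresponse().read()
        conn.close()
    else:
        # A fresh connection per request, closed right after the response.
        for _ in range(count):
            conn = http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", "/", headers={"Connection": "close"})
            conn.getresponse().read()
            conn.close()
    return CountingHandler.connections - before


server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

close_count = run_requests(port, 5, keep_alive=False)
keepalive_count = run_requests(port, 5, keep_alive=True)
print(close_count, keepalive_count)

server.shutdown()
server.server_close()
```

With connection = close, the number of connections grows with the number of requests rather than with the number of concurrent users, which is the pattern seen on the Nginx server.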
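Now for TIME_WAIT stopping at 5000. First we need to be clear about what the TCP TIME_WAIT state means.
TIME-WAIT: waiting long enough to be sure the remote TCP received the acknowledgment of its connection termination request. TCP has to ensure that all data is delivered correctly in every possible situation. When a socket is closed, the side that closes actively enters TIME_WAIT, while the side that closes passively moves through CLOSE_WAIT and LAST_ACK to CLOSED; this ensures that all data in flight gets delivered.
From this definition, once the load-testing tool is done with a connection, the connection on the Nginx machine is not CLOSED right away but enters TIME-WAIT. Many write-ups online describe dropped packets caused by too many TIME-WAIT sockets, which matches what I saw during the test.
# Inspect the kernel settings on the Nginx server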
cat /f
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
vm.swappiness = 0
net.ipv4.neigh.default.gc_stale_time=120
# see details in help.aliyun/knowledge_detail/39428.html
net.ipv4.conf.all.rp_filter=0
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce=2
net.ipv4.conf.all.arp_announce=2
# see details in help.aliyun/knowledge_detail/41334.html
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
kernel.sysrq = 1
fs.file-max = 65535
net.ipv4.ip_forward = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syn_retries = 3
net.ipv4.tcp_max_orphans = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.icmp_echo_ignore_all = 0
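The listing above assigns some keys more than once, so it is worth checking which value actually takes effect. Below is a small Python sketch that parses sysctl-style text and reports the last value per key, assuming the settings are applied in file order so that a later assignment overrides an earlier one. The function name parse_sysctl is made up, and the sample is a short excerpt written with full kernel parameter names such as net.ipv4.tcp_max_syn_backlog.

```python
def parse_sysctl(text):
    """Return (effective, overridden) for sysctl.conf-style text.

    effective  maps each key to its last assigned value.
    overridden maps keys assigned more than once to all their values, in order.
    """
    effective = {}
    history = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments ("#" or ";" at the start of a line).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        key, value = key.strip(), value.strip()
        history.setdefault(key, []).append(value)
        effective[key] = value
    overridden = {k: v for k, v in history.items() if len(v) > 1}
    return effective, overridden


# Excerpt in which two keys are assigned twice.
sample = """\
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
# later block
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_max_tw_buckets = 5000
"""
effective, overridden = parse_sysctl(sample)
print(effective["net.ipv4.tcp_max_syn_backlog"])
print(overridden)
```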
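net.ipv4.tcp_max_tw_buckets = 5000 sets the maximum number of TIME_WAIT sockets the system keeps at any one time. Beyond that number, TIME_WAIT sockets are destroyed immediately and a warning is printed. This is why the TIME_WAIT count seen during the test stops at exactly 5000.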
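A quick back-of-the-envelope calculation shows how easily a limit of 5000 is reached. The sketch below assumes that Linux holds a socket in TIME_WAIT for a fixed 60 seconds and that nothing reclaims sockets early. The rate of 500 closes per second is a hypothetical figure, not a measurement from this test.

```python
TIME_WAIT_SECONDS = 60  # assumed hold time of a TIME_WAIT socket on Linux


def time_wait_backlog(closes_per_second, hold_seconds=TIME_WAIT_SECONDS):
    """Steady-state number of TIME_WAIT sockets at a given close rate."""
    return closes_per_second * hold_seconds


def max_close_rate(bucket_limit, hold_seconds=TIME_WAIT_SECONDS):
    """Highest close rate per second that stays within the bucket limit."""
    return bucket_limit / hold_seconds


# With the limit of 5000 seen in the configuration:
rate_limit = max_close_rate(5000)
# With a hypothetical 500 connections closed per second:
backlog = time_wait_backlog(500)
print(round(rate_limit, 1), backlog)
```

Under these assumptions, anything above roughly 83 closed connections per second fills the 5000 buckets, and a non-persistent load test can exceed that rate by a wide margin.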
Optimization
Based on information found online, tune the following Linux kernel parameters:
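net.ipv4.tcp_syncookies = 1 enables SYN cookies. When the SYN queue overflows, cookies are used to handle the connection, which protects against small-scale SYN floods. The default is given as 0 (disabled).
net.ipv4.tcp_tw_reuse = 1 enables reuse, allowing TIME-WAIT sockets to be reused for new TCP connections. The default is 0 (disabled). It applies only to outgoing connections and relies on TCP timestamps, so it has no effect while net.ipv4.tcp_timestamps = 0.
net.ipv4.tcp_tw_recycle = 1 enables fast recycling of TIME-WAIT sockets. The default is 0 (disabled). It also relies on TCP timestamps, is known to break clients behind NAT, and was removed in Linux 4.12.
net.ipv4.tcp_fin_timeout = 30 sets how long a socket stays in FIN-WAIT-2 when the local end initiated the close.
net.ipv4.tcp_keepalive_time = 1200 sets how long a connection must be idle before TCP starts sending keepalive probes, when keepalive is enabled. The default is 2 hours; this lowers it to 20 minutes.
net.ipv4.ip_local_port_range = 1024 65000 sets the port range used for outgoing connections. The default is fairly small, 32768 to 61000; this widens it to 1024 to 65000.
net.ipv4.tcp_max_syn_backlog = 8192 sets the length of the SYN queue. The default is 1024; raising it to 8192 lets more connections wait for the handshake to complete.
net.ipv4.tcp_max_tw_buckets = 5000 sets the maximum number of TIME_WAIT sockets the system keeps at any one time. Beyond that number, TIME_WAIT sockets are destroyed immediately and a warning is printed. The default is 180000; here it is set to 5000.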
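After editing the configuration it is worth confirming that the running kernel really holds the intended values. On Linux each sysctl key is exposed as a file under /proc/sys, with the dots in the key replaced by directory separators. The sketch below compares wanted values with what is found there. The helper names are invented, and the demo builds a temporary directory as a stand-in for /proc/sys so that it runs anywhere; on a real server the root argument would be left at its default.

```python
import os
import tempfile


def sysctl_path(key, root="/proc/sys"):
    """Map a key such as net.ipv4.tcp_tw_reuse to its file under root."""
    return os.path.join(root, *key.split("."))


def read_sysctl(key, root="/proc/sys"):
    """Return the current value with whitespace normalized, or None if absent."""
    try:
        with open(sysctl_path(key, root)) as handle:
            return " ".join(handle.read().split())
    except OSError:
        return None


def diff_settings(wanted, root="/proc/sys"):
    """Return {key: (wanted, actual)} for every key whose value differs."""
    mismatches = {}
    for key, value in wanted.items():
        actual = read_sysctl(key, root)
        if actual != value:
            mismatches[key] = (value, actual)
    return mismatches


wanted = {
    "net.ipv4.tcp_tw_reuse": "1",
    "net.ipv4.tcp_fin_timeout": "30",
    "net.ipv4.tcp_tw_recycle": "1",
}

# Stand-in for /proc/sys with two of the three keys present.
with tempfile.TemporaryDirectory() as fake_root:
    os.makedirs(os.path.join(fake_root, "net", "ipv4"))
    with open(sysctl_path("net.ipv4.tcp_tw_reuse", fake_root), "w") as f:
        f.write("0\n")
    with open(sysctl_path("net.ipv4.tcp_fin_timeout", fake_root), "w") as f:
        f.write("30\n")
    mismatches = diff_settings(wanted, fake_root)

print(mismatches)
```

A value of None means the key does not exist on that system, which is what a kernel without tcp_tw_recycle would report for that setting.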