[Twisted-Python] intermittent problem: not accepting new connections
Alec Matusis
matusis at yahoo.com
Thu Sep 11 13:57:51 MDT 2008
> How does ulimit -a compare between old and new machine?
The ulimit is, if anything, higher on the new machine; I raised it. Also, in
my experience, running into ulimits produces log messages.
alecm at serv2 ~> cat /proc/23678/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            ms
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          1073741824           1073741824           bytes
Max processes             126976               126976               processes
Max open files            25000                25000                files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       126976               126976               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
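To compare a limit like "Max open files" against the live connection counts, the limits file can be parsed programmatically. This is only a minimal Python sketch: the helper names parse_limits and open_files_headroom are made up for illustration, and it assumes the columns are separated by runs of two or more spaces, as in the kernel's output.

```python
import re

def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard, units)}."""
    limits = {}
    for line in text.splitlines()[1:]:          # skip the header row
        # Assumption: columns are separated by two or more spaces,
        # while limit names contain only single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        name, soft, hard = fields[:3]
        units = fields[3] if len(fields) > 3 else ""
        limits[name] = (soft, hard, units)
    return limits

def open_files_headroom(limits, open_fds):
    """Return how many more descriptors fit under the soft 'Max open files' limit."""
    soft = limits["Max open files"][0]
    return None if soft == "unlimited" else int(soft) - open_fds
```

With the output above, a process holding a few thousand sockets would still be far below the 25000 soft limit, which fits the observation that no ulimit messages were logged.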
> If you need to accept more than about
> 64k connections (not necessarily concurrent) in less than TIME-WAIT
> seconds, you might run out of ports.
> How does netstat look like?
I have to wait for another outage tonight, to get more definite results on
this.
So far, I have had only one very brief outage, at 7:46am. I was asleep, but
I have some new data logged.
I have two servers, one on port 5222 (the problematic one, let's call it
serv1), and one on port 5228 (which has no problems, call it serv2).
I made a script that counts the established connections to each server
every few seconds, like this:
while true
do
    sleep 5
    # Count ESTABLISHED connections per port (5228 = serv2, 5222 = serv1).
    netstat -ant | grep ESTABLISHED | awk '
        /:5228 /{a += 1} /:5222 /{b += 1}
        END {print strftime(" %d %b %Y %H:%M:%S", systime()) " serv2: " a " serv1: " b " total: " a+b}'
done
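The script above only counts ESTABLISHED sockets. To also see states such as TIME_WAIT or SYN_RECV, which matter for the port exhaustion question, the same netstat output could be tallied per state. A minimal Python sketch follows; the function name tcp_state_counts is hypothetical, and it assumes the usual six-column layout of `netstat -ant` on Linux.

```python
from collections import Counter

def tcp_state_counts(netstat_output, port=None):
    """Count TCP sockets per state in `netstat -ant` output.

    If `port` is given, only sockets whose local address ends in that
    port are counted.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: tcp 0 0 <local> <foreign> <STATE>
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        if port is not None and not local.endswith(":%d" % port):
            continue
        counts[state] += 1
    return counts
```

Matching on the local address column, rather than anywhere in the line, avoids counting outgoing connections whose remote port happens to be 5222 or 5228.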
This is what happened around 7:46am (lines with a "+" in front mark the
anomaly):
11 Sep 2008 07:41:29 serv2: 2756 serv1: 1064 total: 3820
11 Sep 2008 07:41:36 serv2: 2764 serv1: 1062 total: 3826
11 Sep 2008 07:41:44 serv2: 2784 serv1: 1049 total: 3833
11 Sep 2008 07:41:51 serv2: 2756 serv1: 1055 total: 3811
11 Sep 2008 07:41:59 serv2: 2760 serv1: 1066 total: 3826
11 Sep 2008 07:42:06 serv2: 2777 serv1: 1050 total: 3827
11 Sep 2008 07:42:14 serv2: 2769 serv1: 1054 total: 3823
11 Sep 2008 07:42:21 serv2: 2751 serv1: 1055 total: 3806
11 Sep 2008 07:42:29 serv2: 2760 serv1: 1050 total: 3810
11 Sep 2008 07:42:36 serv2: 2747 serv1: 1040 total: 3787
11 Sep 2008 07:42:44 serv2: 2737 serv1: 1046 total: 3783
11 Sep 2008 07:42:51 serv2: 2739 serv1: 1046 total: 3785
11 Sep 2008 07:42:59 serv2: 2743 serv1: 1037 total: 3780
11 Sep 2008 07:43:07 serv2: 2720 serv1: 1041 total: 3761
11 Sep 2008 07:43:14 serv2: 2714 serv1: 1047 total: 3761
11 Sep 2008 07:43:22 serv2: 2721 serv1: 1045 total: 3766
11 Sep 2008 07:43:29 serv2: 2697 serv1: 1056 total: 3753
11 Sep 2008 07:43:37 serv2: 2710 serv1: 1059 total: 3769
+11 Sep 2008 07:43:44 serv2: 2765 serv1: 1304 total: 4069
+11 Sep 2008 07:43:53 serv2: 2854 serv1: 1904 total: 4758
+11 Sep 2008 07:44:01 serv2: 2714 serv1: 2190 total: 4904
+11 Sep 2008 07:44:09 serv2: 2715 serv1: 2202 total: 4917
+11 Sep 2008 07:44:17 serv2: 2779 serv1: 1891 total: 4670
+11 Sep 2008 07:44:26 serv2: 2812 serv1: 2284 total: 5096
+11 Sep 2008 07:44:36 serv2: 2828 serv1: 3496 total: 6324
+11 Sep 2008 07:44:46 serv2: 2715 serv1: 4327 total: 7042
+11 Sep 2008 07:44:56 serv2: 2638 serv1: 3499 total: 6137
+11 Sep 2008 07:45:05 serv2: 2714 serv1: 2396 total: 5110
+11 Sep 2008 07:45:15 serv2: 2776 serv1: 1464 total: 4240
+11 Sep 2008 07:45:25 serv2: 2728 serv1: 1604 total: 4332
+11 Sep 2008 07:45:35 serv2: 2708 serv1: 1566 total: 4274
+11 Sep 2008 07:45:45 serv2: 2750 serv1: 1680 total: 4430
+11 Sep 2008 07:45:54 serv2: 2755 serv1: 1311 total: 4066
+11 Sep 2008 07:46:04 serv2: 2704 serv1: 1178 total: 3882
11 Sep 2008 07:46:13 serv2: 2644 serv1: 1024 total: 3668
11 Sep 2008 07:46:22 serv2: 2573 serv1: 981 total: 3554
11 Sep 2008 07:46:30 serv2: 2739 serv1: 1051 total: 3790
11 Sep 2008 07:46:39 serv2: 2773 serv1: 1043 total: 3816
11 Sep 2008 07:46:46 serv2: 2548 serv1: 959 total: 3507
11 Sep 2008 07:46:54 serv2: 2773 serv1: 1044 total: 3817
11 Sep 2008 07:47:02 serv2: 2745 serv1: 1027 total: 3772
11 Sep 2008 07:47:10 serv2: 2591 serv1: 960 total: 3551
11 Sep 2008 07:47:17 serv2: 2721 serv1: 1002 total: 3723
11 Sep 2008 07:47:25 serv2: 2753 serv1: 1030 total: 3783
11 Sep 2008 07:47:32 serv2: 2773 serv1: 1029 total: 3802
11 Sep 2008 07:47:40 serv2: 2768 serv1: 1025 total: 3793
11 Sep 2008 07:47:47 serv2: 2764 serv1: 1021 total: 3785
11 Sep 2008 07:47:55 serv2: 2750 serv1: 1021 total: 3771
Each server also reports its maximum number of authenticated, connected
clients to the database.
Even though serv1 peaked at 4327 "ESTABLISHED" connections at 07:44:46, the
number of normally authenticated clients in the database for this time
period is only 1100, which does not reflect this odd connection spike.
Since I was asleep at 7:46am, I did not have a chance to investigate what
those connections actually were.
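Since the spike happened unattended, the logged lines could be scanned automatically. Below is a minimal Python sketch that parses the script's output and flags samples where the serv1 count exceeds a multiple of its median. The names parse_samples and flag_spikes and the factor of 1.2 are arbitrary choices for illustration, not something measured.

```python
import re
from statistics import median

# Matches the tail of one log line: HH:MM:SS serv2: N serv1: N total: N
LINE_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) serv2: (\d+) serv1: (\d+) total: (\d+)")

def parse_samples(log_text):
    """Return a list of (time, serv2, serv1) tuples from the script's output."""
    samples = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            samples.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return samples

def flag_spikes(samples, factor=1.2):
    """Return the times at which serv1 exceeds `factor` times its median count."""
    baseline = median(s[2] for s in samples)
    return [t for t, _serv2, serv1 in samples if serv1 > factor * baseline]
```

The median is used as the baseline because a short spike barely moves it; if the log consisted mostly of spike samples, a fixed baseline would be the safer choice.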
> Anyone know what happens to new
> connection attempts to a server in this condition?
One thing I noticed yesterday is that when the server was in this condition,
telnet simply hung, as if no ACK packet was received back:
alecm at serv2 ~> telnet localhost 5222
Trying 127.0.0.1...
and nothing
This behavior of telnet seems to be unique to the case where no ACK is
received during the handshake; I verified that with tcpdump. By contrast,
if nothing listens on the destination port, for example, telnet gives:
alecm at serv2 ~> telnet localhost 50007
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host: Connection refused
which corresponds to the RST packet that the kernel returns in that case.
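The two telnet outcomes above can be told apart from a monitoring script as well. A minimal Python sketch, assuming a plain TCP connect is enough to observe the difference; the function name probe and the 3 second timeout are arbitrary:

```python
import socket

def probe(host, port, timeout=3.0):
    """Classify one TCP connection attempt as 'connected', 'refused' or 'timeout'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"      # the peer answered the SYN with an RST
    except socket.timeout:
        return "timeout"      # no answer to the SYN within `timeout` seconds
    sock.close()
    return "connected"
```

Running probe("127.0.0.1", 5222) every few seconds and logging the result next to the connection counts would show whether the hangs line up with the spikes.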
I tried to write a 5-line mucked-up server whose accept() does not return
any packets, and failed.
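One possible way to get a listening socket that stops answering, offered only as a sketch: listen with a small backlog and never call accept(). On Linux, once the accept queue is full, further handshakes are typically left unanswered, so clients may hang much like the telnet session above. The exact behavior depends on the platform and kernel settings, and nothing here establishes that this is what happens to the Twisted server. The names listen_without_accepting and fill_backlog are hypothetical.

```python
import socket

def listen_without_accepting(port=0, backlog=1):
    """Open a listening TCP socket on localhost that never calls accept()."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(backlog)       # small queue so it fills after a few clients
    return srv

def fill_backlog(port, attempts=5, timeout=1.0):
    """Try `attempts` connections; return the open client sockets and outcomes."""
    clients, outcomes = [], []
    for _ in range(attempts):
        try:
            clients.append(
                socket.create_connection(("127.0.0.1", port), timeout=timeout))
            outcomes.append("connected")
        except socket.timeout:
            outcomes.append("timeout")    # handshake was not answered in time
        except OSError:
            outcomes.append("error")      # e.g. connection refused
    return clients, outcomes
```

The client sockets are returned so that they stay open while the queue fills; closing them early would free slots again.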
> -----Original Message-----
> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> bounces at twistedmatrix.com] On Behalf Of Jean-Paul Calderone
> Sent: Thursday, September 11, 2008 6:47 AM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] intermittent problem: not accepting new
> connections
>
> On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick"
> <paul.nendick at gmail.com> wrote:
> >Not necessarily related to what you've described, but I'll share
> >something that's helped a good deal on my most-heavily hit twisted
> >servers. Presuming you're using Linux:
> >
> > echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
> >
> >
> >from http://lartc.org/howto/lartc.kernel.obscure.html :
> >
> >"Enable fast recycling TIME-WAIT sockets. Default value is 1. It
> >should not be changed without advice/request of technical experts"
> >
> >My expert advice: only use this on machines connected on a low-latency
> >LAN. It *will* break internet-facing interfaces. It halves the
> >constant used by the Nagle algorithm:
> >
> >http://en.wikipedia.org/wiki/Nagle's_algorithm
> >
>
> This is somewhat interesting. It suggests a potential problem which
> I hadn't thought about before. If you need to accept more than about
> 64k connections (not necessarily concurrent) in less than TIME-WAIT
> seconds, you might run out of ports. Anyone know what happens to new
> connection attempts to a server in this condition?
>
> Alec, any idea if your server could be getting into this state every
> once in a while? This is an appealing hypothesis, since it wouldn't
> necessarily happen at peak connection time (but potentially shortly
> after a peak), would resolve itself given a short period of time,
> wouldn't necessarily prevent all new connection attempts, since old
> TIME-WAIT sockets would be gradually timing out (so your other low-
> volume servers might still appear to be working normally), wouldn't
> interfere with already established connections, and might not change
> the userspace-visible syscall behavior (depending on what Linux does
> in this case, but I wouldn't be surprised if connection failures due
> to this never showed up in an accept(2) result).
>
> Jean-Paul
>
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python at twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python