[Twisted-Python] intermittent problem: not accepting new connections

Alec Matusis matusis at yahoo.com
Mon Sep 15 07:43:23 EDT 2008


I wanted to post that I solved the problem, and it's unrelated to Twisted.

I noticed that this server had sudden increases in the number of
ESTABLISHED connections in netstat, but not in the number of authenticated
connections reported by Twisted to the database. These increases correlated
with the connection outages.
A shell script that monitors the connection rate and, when it spikes, invokes
tcpdump and sorts/counts the source IPs showed that this server was being
pounded twice a day, for 3-5 minutes at a time, by a small set of IPs opening
new connections at about 100 conns/sec and sending junk data (those
connections timed out in Twisted, because the junk never contained my line
separator and I am using the LineOnlyReceiver protocol). This was hardly a
targeted attack against my server, because this new server was assigned an IP
that I had never used before - I guess this is just an internet reality ;)
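
The counting part was roughly the following sketch (not my exact script; the
interface eth0, port 5222 and the 1000-packet sample size are assumptions,
adjust for your setup):

#!/bin/sh
# Sample SYN packets to the service port and count distinct source IPs.
# tcpdump -n prints "time IP src.port > dst.port: ...", so $3 is the source
# addr.port; the sed strips the trailing port so we count per IP.
tcpdump -c 1000 -n -i eth0 'tcp[tcpflags] & tcp-syn != 0 and dst port 5222' 2>/dev/null \
  | awk '{print $3}' \
  | sed 's/\.[0-9]*$//' \
  | sort | uniq -c | sort -rn | head -20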

Even though this did not look like a SYN flood (few connections were in the
SYN_RCVD state), for some reason enabling net.ipv4.tcp_syncookies=1 made
everything normal. When these bizarre connections began to pile up, I
started seeing
"possible SYN flooding on port 5222. Sending cookies."
in /var/log/messages, and the connectivity outages stopped.
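
For reference, turning it on amounts to something like this (run as root; the
sysctl.conf line is just one way to make it survive a reboot):

# enable SYN cookies immediately
sysctl -w net.ipv4.tcp_syncookies=1
# and keep the setting across reboots
echo "net.ipv4.tcp_syncookies = 1" >> /etc/sysctl.conf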

I am sorry for the confusion; I was emboldened by finding the epoll bug back
in April, but this time Twisted continues to rock, and it was just a stupid
Ubuntu TCP setting.



From: twisted-python-bounces at twistedmatrix.com
[mailto:twisted-python-bounces at twistedmatrix.com] On Behalf Of Alvin Delagon
Sent: Thursday, September 11, 2008 6:17 PM
To: Twisted general discussion
Subject: Re: [Twisted-Python] intermittent problem: not accepting new
connections

We had a similar problem, so I might as well share it with you guys:

Our Jabber server stopped taking in new connections at some point because of
our iptables rules (we even lost some packets). We have an IP routing scheme
where ports 25, 80, 5223, and 10873 are routed to 5222. Whenever this happens,
we get this error in our syslogs:

"ip_conntrack: table full, dropping packet"

http://support.imagestream.com/Resolving_ip_conntrack_table_full_Errors.html

It might be worth checking dmesg to see if you have something like this. What
we did was disable iptables and remove its kernel modules.
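
If disabling iptables is not an option, another route is to check whether the
conntrack table really is the bottleneck and raise its size instead; a rough
sketch (the sysctl and /proc names depend on whether the kernel uses the older
ip_conntrack or the newer nf_conntrack module):

# look for the "table full" messages
dmesg | grep -i conntrack
# current limit and usage (older kernels; newer ones use net.netfilter.nf_conntrack_*)
sysctl net.ipv4.netfilter.ip_conntrack_max
wc -l /proc/net/ip_conntrack
# raise the limit instead of dropping packets (the value here is only an example)
sysctl -w net.ipv4.netfilter.ip_conntrack_max=131072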
On Fri, Sep 12, 2008 at 3:57 AM, Alec Matusis <matusis at yahoo.com> wrote:
> How does ulimit -a compare between old and new machine?
The ulimits are only higher on the new machine; I raised them. Also, in my
experience, hitting a ulimit produces log messages.

alecm at serv2 ~> cat /proc/23678/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            ms
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          1073741824           1073741824           bytes
Max processes             126976               126976               processes
Max open files            25000                25000                files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       126976               126976               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0


> If you need to accept more than about
> 64k connections (not necessarily concurrent) in less than TIME-WAIT
> seconds, you might run out of ports.
> What does netstat look like?
I have to wait for another outage tonight to get more definite results on
this.
So far I have had only one very brief outage, at 7:46am. I was asleep, but I
have some new data logged.
I have two servers, one on port 5222 (the problematic one, let's call it
serv1) and one on port 5228 (which has no problems, call it serv2).
I made a script that measures the number of ESTABLISHED connections to each
server like this:

while [ 1 ]
do
 sleep 5
 # count ESTABLISHED connections to each listening port and print a timestamped summary
 netstat -ant | grep ESTABLISHED | \
  awk '/:5228 /{a += 1} /:5222 /{b += 1} END {print strftime(" %d %b %Y %H:%M:%S", systime())"   serv2: "a"   serv1: "b" total: "a+b}'
done

This is what happened around 7:46am (lines with a "+" in front indicate the
anomaly):

 11 Sep 2008 07:41:29   serv2: 2756   serv1: 1064 total: 3820
 11 Sep 2008 07:41:36   serv2: 2764   serv1: 1062 total: 3826
 11 Sep 2008 07:41:44   serv2: 2784   serv1: 1049 total: 3833
 11 Sep 2008 07:41:51   serv2: 2756   serv1: 1055 total: 3811
 11 Sep 2008 07:41:59   serv2: 2760   serv1: 1066 total: 3826
 11 Sep 2008 07:42:06   serv2: 2777   serv1: 1050 total: 3827
 11 Sep 2008 07:42:14   serv2: 2769   serv1: 1054 total: 3823
 11 Sep 2008 07:42:21   serv2: 2751   serv1: 1055 total: 3806
 11 Sep 2008 07:42:29   serv2: 2760   serv1: 1050 total: 3810
 11 Sep 2008 07:42:36   serv2: 2747   serv1: 1040 total: 3787
 11 Sep 2008 07:42:44   serv2: 2737   serv1: 1046 total: 3783
 11 Sep 2008 07:42:51   serv2: 2739   serv1: 1046 total: 3785
 11 Sep 2008 07:42:59   serv2: 2743   serv1: 1037 total: 3780
 11 Sep 2008 07:43:07   serv2: 2720   serv1: 1041 total: 3761
 11 Sep 2008 07:43:14   serv2: 2714   serv1: 1047 total: 3761
 11 Sep 2008 07:43:22   serv2: 2721   serv1: 1045 total: 3766
 11 Sep 2008 07:43:29   serv2: 2697   serv1: 1056 total: 3753
 11 Sep 2008 07:43:37   serv2: 2710   serv1: 1059 total: 3769
+11 Sep 2008 07:43:44   serv2: 2765   serv1: 1304 total: 4069
+11 Sep 2008 07:43:53   serv2: 2854   serv1: 1904 total: 4758
+11 Sep 2008 07:44:01   serv2: 2714   serv1: 2190 total: 4904
+11 Sep 2008 07:44:09   serv2: 2715   serv1: 2202 total: 4917
+11 Sep 2008 07:44:17   serv2: 2779   serv1: 1891 total: 4670
+11 Sep 2008 07:44:26   serv2: 2812   serv1: 2284 total: 5096
+11 Sep 2008 07:44:36   serv2: 2828   serv1: 3496 total: 6324
+11 Sep 2008 07:44:46   serv2: 2715   serv1: 4327 total: 7042
+11 Sep 2008 07:44:56   serv2: 2638   serv1: 3499 total: 6137
+11 Sep 2008 07:45:05   serv2: 2714   serv1: 2396 total: 5110
+11 Sep 2008 07:45:15   serv2: 2776   serv1: 1464 total: 4240
+11 Sep 2008 07:45:25   serv2: 2728   serv1: 1604 total: 4332
+11 Sep 2008 07:45:35   serv2: 2708   serv1: 1566 total: 4274
+11 Sep 2008 07:45:45   serv2: 2750   serv1: 1680 total: 4430
+11 Sep 2008 07:45:54   serv2: 2755   serv1: 1311 total: 4066
+11 Sep 2008 07:46:04   serv2: 2704   serv1: 1178 total: 3882
 11 Sep 2008 07:46:13   serv2: 2644   serv1: 1024 total: 3668
 11 Sep 2008 07:46:22   serv2: 2573   serv1: 981 total: 3554
 11 Sep 2008 07:46:30   serv2: 2739   serv1: 1051 total: 3790
 11 Sep 2008 07:46:39   serv2: 2773   serv1: 1043 total: 3816
 11 Sep 2008 07:46:46   serv2: 2548   serv1: 959 total: 3507
 11 Sep 2008 07:46:54   serv2: 2773   serv1: 1044 total: 3817
 11 Sep 2008 07:47:02   serv2: 2745   serv1: 1027 total: 3772
 11 Sep 2008 07:47:10   serv2: 2591   serv1: 960 total: 3551
 11 Sep 2008 07:47:17   serv2: 2721   serv1: 1002 total: 3723
 11 Sep 2008 07:47:25   serv2: 2753   serv1: 1030 total: 3783
 11 Sep 2008 07:47:32   serv2: 2773   serv1: 1029 total: 3802
 11 Sep 2008 07:47:40   serv2: 2768   serv1: 1025 total: 3793
 11 Sep 2008 07:47:47   serv2: 2764   serv1: 1021 total: 3785
 11 Sep 2008 07:47:55   serv2: 2750   serv1: 1021 total: 3771


Each server also reports the maximum number of authenticated connected
clients to the database.
Even though serv1 peaked at 07:44:46 at 4327 "ESTABLISHED" connections, the
number of normally authenticated clients in the database for this time
period is only 1100, which does not reflect this weird connection spike.
Since I was asleep at 7:46am, I did not have a chance to investigate what
those connections actually were.

> Anyone know what happens to new
> connection attempts to a server in this condition?
One thing I noticed yesterday is that when the server was in this condition,
telnet simply hung, as if no SYN/ACK packet was received back:

alecm at serv2 ~> telnet localhost 5222
Trying 127.0.0.1...
... and then nothing.

This behavior of telnet seems to be unique to the case where no SYN/ACK is
received during the handshake; I verified that with tcpdump. By contrast, if
nothing is listening on the destination port, telnet gives:

alecm at serv2 ~> telnet localhost 50007
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host: Connection refused

which corresponds to the RST packet that the kernel returns in that case.
I tried to write a five-line mucked-up server with accept() not sending any
packets back, and failed (the kernel completes the three-way handshake before
accept() is ever called, so a server that simply never accepts does not by
itself stop the SYN/ACK, at least until its backlog fills up).
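
One way to fake that hang on a scratch port (a sketch only, run as root;
iptables and netcat flags vary a little between distributions) is to leave a
listener running but drop its outgoing SYN/ACKs, so the client keeps
retransmitting SYNs just like against serv1:

# start any listener on a throwaway port (netcat-traditional syntax; openbsd nc drops the -p)
nc -l -p 50007 &
# drop the listener's SYN/ACK replies so the client never completes the handshake
iptables -A OUTPUT -p tcp --sport 50007 --tcp-flags SYN,ACK SYN,ACK -j DROP
# this now hangs at "Trying 127.0.0.1...", same as serv1 during the outage
telnet localhost 50007
# clean up
iptables -D OUTPUT -p tcp --sport 50007 --tcp-flags SYN,ACK SYN,ACK -j DROP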

> -----Original Message-----
> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> bounces at twistedmatrix.com] On Behalf Of Jean-Paul Calderone
> Sent: Thursday, September 11, 2008 6:47 AM
> To: Twisted general discussion
> Subject: Re: [Twisted-Python] intermittent problem: not accepting new
> connections
>
> On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick"
> <paul.nendick at gmail.com> wrote:
> >Not necessarily related to what you've described, but I'll share
> >something that's helped a good deal on my most-heavily hit twisted
> >servers. Presuming you're using Linux:
> >
> > echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
> >
> >
> >from http://lartc.org/howto/lartc.kernel.obscure.html :
> >
> >"Enable fast recycling TIME-WAIT sockets. Default value is 1. It
> >should not be changed without advice/request of technical experts"
> >
> >My expert advice: only use this on machines connected on a low-latency
> >LAN. It *will* break internet-facing interfaces. It halves the
> >constant used by the Nagle algorithm:
> >
> >http://en.wikipedia.org/wiki/Nagle's_algorithm
> >
>
> This is somewhat interesting.  It suggests a potential problem which
> I hadn't thought about before.  If you need to accept more than about
> 64k connections (not necessarily concurrent) in less than TIME-WAIT
> seconds, you might run out of ports.  Anyone know what happens to new
> connection attempts to a server in this condition?
>
> Alec, any idea if your server could be getting into this state every
> once in a while?  This is an appealing hypothesis, since it wouldn't
> necessarily happen at peak connection time (but potentially shortly
> after a peak), would resolve itself given a short period of time,
> wouldn't necessarily prevent all new connection attempts, since old
> TIME-WAIT sockets would be gradually timing out (so your other low-
> volume servers might still appear to be working normally), wouldn't
> interfere with already established connections, and might not change
> the userspace-visible syscall behavior (depending on what Linux does
> in this case, but I wouldn't be surprised if connection failures due
> to this never showed up in an accept(2) result).
>
> Jean-Paul
>





-- 
http://www.alvinatorsplayground.blogspot.com/




