[Twisted-Python] intermittent problem: not accepting new connections

Thu Sep 11 19:17:13 MDT 2008

We ran into a similar problem, so I might as well share it with you guys:

Our Jabber server stopped accepting new connections at some point because of
our iptables setup (it even dropped some packets). We have an IP routing
scheme where ports 25, 80, 5223 and 10873 are redirected to 5222. Whenever
this happens, we get this error in our syslogs:

"ip_conntrack: table full, dropping packet"

http://support.imagestream.com/Resolving_ip_conntrack_table_full_Errors.html

It is worth checking dmesg to see whether you have something like this. What
we did was disable iptables and remove its kernel modules.
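
If you would rather keep iptables loaded and just watch how full the table
gets, here is a minimal Python sketch. It is not what we ran. The /proc paths
are assumptions: newer kernels expose nf_conntrack_* counters, older ones use
ip_conntrack_* names, and neither exists if the module is not loaded. The
function names are made up for illustration.

```python
from pathlib import Path

# Assumed counter locations; they vary with kernel version and loaded modules.
CANDIDATES = [
    ("/proc/sys/net/netfilter/nf_conntrack_count",
     "/proc/sys/net/netfilter/nf_conntrack_max"),
    ("/proc/sys/net/ipv4/netfilter/ip_conntrack_count",
     "/proc/sys/net/ipv4/netfilter/ip_conntrack_max"),
]


def conntrack_usage(count_path, max_path):
    """Return (entries in use, table size, fraction used) from two /proc files."""
    count = int(Path(count_path).read_text().strip())
    maximum = int(Path(max_path).read_text().strip())
    fraction = count / maximum if maximum else 0.0
    return count, maximum, fraction


def first_available(candidates=CANDIDATES):
    """Return the first (count_path, max_path) pair that exists, or None."""
    for count_path, max_path in candidates:
        if Path(count_path).exists() and Path(max_path).exists():
            return count_path, max_path
    return None


if __name__ == "__main__":
    paths = first_available()
    if paths is None:
        print("no conntrack counters found (module not loaded?)")
    else:
        count, maximum, used = conntrack_usage(*paths)
        print(f"conntrack entries: {count}/{maximum} ({used:.0%})")
```

A fraction that creeps towards 1.0 shortly before an outage would point at the
same "table full" condition we hit.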

On Fri, Sep 12, 2008 at 3:57 AM, Alec Matusis <matusis at yahoo.com> wrote:

> > How does ulimit -a compare between old and new machine?
>
> The ulimits are only higher on the new machine; I raised them. Also, in my
> experience, running into ulimits produces log messages.
>
> alecm at serv2 ~> cat /proc/23678/limits
>
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            ms
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            8388608              unlimited            bytes
> Max core file size        0                    unlimited            bytes
> Max resident set          1073741824           1073741824           bytes
> Max processes             126976               126976               processes
> Max open files            25000                25000                files
> Max locked memory         32768                32768                bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       126976               126976               signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
>
>
> > If you need to accept more than about
> > 64k connections (not necessarily concurrent) in less than TIME-WAIT
> > seconds, you might run out of ports.
>
> > What does netstat look like?
>
> I have to wait for another outage tonight to get more definite results on
> this.
> So far, I have had only one very brief outage, at 7:46am.
> I was asleep, but I have some new data logged.
> I have two servers, one on port 5222 (the problematic one, let's call it
> serv1), and one on 5228 (which has no problems, call it serv2).
> I made a script that counts the connections on each server like this:
>
> while [ 1 ]
> do
>  sleep 5
>  netstat -ant | grep ESTABLISHED | awk '/:5228 /{a += 1} /:5222 /{b += 1} END {print strftime(" %d %b %Y %H:%M:%S", systime())"   serv2: "a"   serv1: "b" total: "a+b }'
> done
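
A rough Python counterpart to the loop above, in case it helps during the next
outage. This is only a sketch: it assumes the usual Linux `netstat -ant` column
layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the
function names are made up. Unlike the awk version it matches the local port
only and keeps every TCP state, so SYN_RECV and TIME_WAIT sockets can be told
apart from ESTABLISHED ones.

```python
import subprocess
from collections import Counter


def count_by_port_and_state(netstat_output, ports):
    """Count TCP sockets per (local port, state) in `netstat -ant` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner and header lines, and anything that is not a TCP socket.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port.isdigit() and int(local_port) in ports:
            counts[(int(local_port), fields[5])] += 1
    return counts


def live_counts(ports):
    """Run netstat once and count sockets on the given local ports."""
    result = subprocess.run(["netstat", "-ant"],
                            capture_output=True, text=True, check=True)
    return count_by_port_and_state(result.stdout, ports)
```

Calling `live_counts({5222, 5228})` every few seconds and logging the result
would give the same time series as the shell loop, split by state.
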
>
> This is what happened around 7:46am (lines with "+" in front mark the
> anomaly):
>
>  11 Sep 2008 07:41:29   serv2: 2756   serv1: 1064 total: 3820
>  11 Sep 2008 07:41:36   serv2: 2764   serv1: 1062 total: 3826
>  11 Sep 2008 07:41:44   serv2: 2784   serv1: 1049 total: 3833
>  11 Sep 2008 07:41:51   serv2: 2756   serv1: 1055 total: 3811
>  11 Sep 2008 07:41:59   serv2: 2760   serv1: 1066 total: 3826
>  11 Sep 2008 07:42:06   serv2: 2777   serv1: 1050 total: 3827
>  11 Sep 2008 07:42:14   serv2: 2769   serv1: 1054 total: 3823
>  11 Sep 2008 07:42:21   serv2: 2751   serv1: 1055 total: 3806
>  11 Sep 2008 07:42:29   serv2: 2760   serv1: 1050 total: 3810
>  11 Sep 2008 07:42:36   serv2: 2747   serv1: 1040 total: 3787
>  11 Sep 2008 07:42:44   serv2: 2737   serv1: 1046 total: 3783
>  11 Sep 2008 07:42:51   serv2: 2739   serv1: 1046 total: 3785
>  11 Sep 2008 07:42:59   serv2: 2743   serv1: 1037 total: 3780
>  11 Sep 2008 07:43:07   serv2: 2720   serv1: 1041 total: 3761
>  11 Sep 2008 07:43:14   serv2: 2714   serv1: 1047 total: 3761
>  11 Sep 2008 07:43:22   serv2: 2721   serv1: 1045 total: 3766
>  11 Sep 2008 07:43:29   serv2: 2697   serv1: 1056 total: 3753
>  11 Sep 2008 07:43:37   serv2: 2710   serv1: 1059 total: 3769
> +11 Sep 2008 07:43:44   serv2: 2765   serv1: 1304 total: 4069
> +11 Sep 2008 07:43:53   serv2: 2854   serv1: 1904 total: 4758
> +11 Sep 2008 07:44:01   serv2: 2714   serv1: 2190 total: 4904
> +11 Sep 2008 07:44:09   serv2: 2715   serv1: 2202 total: 4917
> +11 Sep 2008 07:44:17   serv2: 2779   serv1: 1891 total: 4670
> +11 Sep 2008 07:44:26   serv2: 2812   serv1: 2284 total: 5096
> +11 Sep 2008 07:44:36   serv2: 2828   serv1: 3496 total: 6324
> +11 Sep 2008 07:44:46   serv2: 2715   serv1: 4327 total: 7042
> +11 Sep 2008 07:44:56   serv2: 2638   serv1: 3499 total: 6137
> +11 Sep 2008 07:45:05   serv2: 2714   serv1: 2396 total: 5110
> +11 Sep 2008 07:45:15   serv2: 2776   serv1: 1464 total: 4240
> +11 Sep 2008 07:45:25   serv2: 2728   serv1: 1604 total: 4332
> +11 Sep 2008 07:45:35   serv2: 2708   serv1: 1566 total: 4274
> +11 Sep 2008 07:45:45   serv2: 2750   serv1: 1680 total: 4430
> +11 Sep 2008 07:45:54   serv2: 2755   serv1: 1311 total: 4066
> +11 Sep 2008 07:46:04   serv2: 2704   serv1: 1178 total: 3882
>  11 Sep 2008 07:46:13   serv2: 2644   serv1: 1024 total: 3668
>  11 Sep 2008 07:46:22   serv2: 2573   serv1: 981 total: 3554
>  11 Sep 2008 07:46:30   serv2: 2739   serv1: 1051 total: 3790
>  11 Sep 2008 07:46:39   serv2: 2773   serv1: 1043 total: 3816
>  11 Sep 2008 07:46:46   serv2: 2548   serv1: 959 total: 3507
>  11 Sep 2008 07:46:54   serv2: 2773   serv1: 1044 total: 3817
>  11 Sep 2008 07:47:02   serv2: 2745   serv1: 1027 total: 3772
>  11 Sep 2008 07:47:10   serv2: 2591   serv1: 960 total: 3551
>  11 Sep 2008 07:47:17   serv2: 2721   serv1: 1002 total: 3723
>  11 Sep 2008 07:47:25   serv2: 2753   serv1: 1030 total: 3783
>  11 Sep 2008 07:47:32   serv2: 2773   serv1: 1029 total: 3802
>  11 Sep 2008 07:47:40   serv2: 2768   serv1: 1025 total: 3793
>  11 Sep 2008 07:47:47   serv2: 2764   serv1: 1021 total: 3785
>  11 Sep 2008 07:47:55   serv2: 2750   serv1: 1021 total: 3771
>
>
> Each server also reports the maximum number of authenticated connected
> clients to the database.
> Even though serv1 peaked at 4327 "ESTABLISHED" connections at 07:44:46, the
> number of normally authenticated clients in the database for this time
> period is only 1100, which does not reflect this weird connection spike.
> Since I was asleep at 7:46am, I did not have a chance to investigate what
> those connections actually were.
>
> > Anyone know what happens to new
> > connection attempts to a server in this condition?
>
> One thing that I noticed yesterday is that when the server was in this
> condition, telnet simply hung, as if no ACK packet was received back:
>
> alecm at serv2 ~> telnet localhost 5222
> Trying 127.0.0.1...
>
> and nothing
>
> This behavior of telnet seems to be unique to the case when the ACK is not
> received in the handshake; I verified that with tcpdump. In contrast, if
> nothing listens on the destination port, for example, telnet gives:
>
> alecm at serv2 ~> telnet localhost 50007
> Trying 127.0.0.1...
> telnet: connect to address 127.0.0.1: Connection refused
> telnet: Unable to connect to remote host: Connection refused
>
> which corresponds to the RST packet that the kernel returns in that case.
> I tried to write a 5 line mucked up server with accept() not returning any
> packets, and failed.
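
The two telnet outcomes above can be told apart from a script as well. A small
sketch, assuming a plain IPv4 TCP connect with a timeout stands in for the
"hang"; `probe` is a hypothetical helper name, and this only classifies the
result, it does not reproduce the broken server.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify one TCP connect attempt: 'connected', 'refused' or 'timeout'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "connected"
    except ConnectionRefusedError:
        # The kernel answered with RST: nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No reply to the handshake within `timeout` seconds.
        return "timeout"
    finally:
        sock.close()
```

Run from cron or a loop against port 5222, a "timeout" result would correspond
to the hanging telnet, while "refused" would mean the listener is gone.
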
>
> > -----Original Message-----
> > From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> > bounces at twistedmatrix.com] On Behalf Of Jean-Paul Calderone
> > Sent: Thursday, September 11, 2008 6:47 AM
> > To: Twisted general discussion
> > Subject: Re: [Twisted-Python] intermittent problem: not accepting new
> > connections
> >
> > On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick"
> > <paul.nendick at gmail.com> wrote:
> > >Not necessarily related to what you've described, but I'll share
> > >something that's helped a good deal on my most-heavily hit twisted
> > >servers. Presuming you're using Linux:
> > >
> > > echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
> > >
> > >
> > >from http://lartc.org/howto/lartc.kernel.obscure.html :
> > >
> > >"Enable fast recycling TIME-WAIT sockets. Default value is 1. It
> > >should not be changed without advice/request of technical experts"
> > >
> > >My expert advice: only use this on machines connected on a low-latency
> > >LAN. It *will* break internet-facing interfaces. It halves the
> > >constant used by the Nagle algorithm:
> > >
> > >http://en.wikipedia.org/wiki/Nagle%27s_algorithm
> > >
> >
> > This is somewhat interesting.  It suggests a potential problem which
> > I hadn't thought about before.  If you need to accept more than about
> > 64k connections (not necessarily concurrent) in less than TIME-WAIT
> > seconds, you might run out of ports.  Anyone know what happens to new
> > connection attempts to a server in this condition?
> >
> > Alec, any idea if your server could be getting into this state every
> > once in a while?  This is an appealing hypothesis, since it wouldn't
> > necessarily happen at peak connection time (but potentially shortly
> > after a peak), would resolve itself given a short period of time,
> > wouldn't necessarily prevent all new connection attempts, since old
> > TIME-WAIT sockets would be gradually timing out (so your other low-
> > volume servers might still appear to be working normally), wouldn't
> > interfere with already established connections, and might not change
> > the userspace-visible syscall behavior (depending on what Linux does
> > in this case, but I wouldn't be surprised if connection failures due
> > to this never showed up in an accept(2) result).
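
To put a rough number on that hypothesis: taking the "about 64k" figure above
at face value and assuming a TIME-WAIT period of 60 seconds (the fixed value on
Linux; other systems differ), the connection rate needed to exhaust the ports
works out as below. This is back-of-the-envelope arithmetic under those
assumptions, not a measurement.

```python
def max_sustained_rate(ports, time_wait_seconds):
    """Connections per second that keep `ports` sockets in TIME-WAIT at once."""
    return ports / time_wait_seconds


PORTS = 64_000          # "about 64k", as stated in the message
TIME_WAIT_SECONDS = 60  # assumption: Linux TIME-WAIT length

rate = max_sustained_rate(PORTS, TIME_WAIT_SECONDS)
print(f"about {rate:.0f} connections per second")
```

So the condition would need on the order of a thousand new connections per
second, sustained for a full minute.
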
> >
> > Jean-Paul
> >
> > _______________________________________________
> > Twisted-Python mailing list
> > Twisted-Python at twistedmatrix.com
> > http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
>

-- 
http://www.alvinatorsplayground.blogspot.com/