[Twisted-Python] intermittent problem: not accepting new connections
Alvin Delagon
adelagon at gmail.com
Thu Sep 11 19:17:13 MDT 2008
We had a similar problem like this, might as well share it with you guys:
Our Jabber server stopped taking in new connections due to our iptables at
some point in time (even lost some packets). We have an ip routing scheme
where ports 25,80,5223,10873 routed to 5222. Whenever this happens, we get
error error from our syslogs:
""ip_conntrack: table full, dropping packet"
http://support.imagestream.com/Resolving_ip_conntrack_table_full_Errors.html
Might as well check dmesg if you have something like this. What we did is
disable iptables and removed its kernel modules.
On Fri, Sep 12, 2008 at 3:57 AM, Alec Matusis <matusis at yahoo.com> wrote:
> > How does ulimit -a compare between old and new machine?
>
> ulimit is only higher on the new machine, I upped it. Also, from my
> experience, running into ulimits produces log messages.
>
> alecm at serv2 ~> cat /proc/23678/limits
>
> Limit Soft Limit Hard Limit Units
>
> Max cpu time unlimited unlimited ms
>
> Max file size unlimited unlimited bytes
>
> Max data size unlimited unlimited bytes
>
> Max stack size 8388608 unlimited bytes
>
> Max core file size 0 unlimited bytes
>
> Max resident set 1073741824 1073741824 bytes
>
> Max processes 126976 126976
> processes
> Max open files 25000 25000 files
>
> Max locked memory 32768 32768 bytes
>
> Max address space unlimited unlimited bytes
>
> Max file locks unlimited unlimited locks
>
> Max pending signals 126976 126976 signals
>
> Max msgqueue size 819200 819200 bytes
>
> Max nice priority 0 0
> Max realtime priority 0 0
>
>
> > If you need to accept more than about
> > 64k connections (not necessarily concurrent) in less than TIME-WAIT
> > seconds, you might run out of ports.
>
> > How does netstat look like?
>
> I have to wait for another outage tonight, to get more definite results on
> this.
> So far, I have had only one very brief outage at 7:46am
> I was asleep, but I have some new data logged.
> I have two servers, one on port 5222 (problematic one, let's call it serv1
> ), and one on 5228 (that has no problems, call it serv2 ).
> I made a script that measures the number of connections on each type of
> server respectively like this:
>
> while [ 1 ]
> do
> sleep 5
> netstat -ant | grep ESTABLISHED | awk '/:5228 /{a += 1}/:5222 /{b += 1}
> END
> {print strftime(" %d %b %Y %H:%M:%S", systime(
> ))" serv2: "a" serv1: "b" total: "a+b }'
> done
>
> This is what happened around 7:46am (lines that have "+" in front indicate
> anomaly):
>
> 11 Sep 2008 07:41:29 serv2: 2756 serv1: 1064 total: 3820
> 11 Sep 2008 07:41:36 serv2: 2764 serv1: 1062 total: 3826
> 11 Sep 2008 07:41:44 serv2: 2784 serv1: 1049 total: 3833
> 11 Sep 2008 07:41:51 serv2: 2756 serv1: 1055 total: 3811
> 11 Sep 2008 07:41:59 serv2: 2760 serv1: 1066 total: 3826
> 11 Sep 2008 07:42:06 serv2: 2777 serv1: 1050 total: 3827
> 11 Sep 2008 07:42:14 serv2: 2769 serv1: 1054 total: 3823
> 11 Sep 2008 07:42:21 serv2: 2751 serv1: 1055 total: 3806
> 11 Sep 2008 07:42:29 serv2: 2760 serv1: 1050 total: 3810
> 11 Sep 2008 07:42:36 serv2: 2747 serv1: 1040 total: 3787
> 11 Sep 2008 07:42:44 serv2: 2737 serv1: 1046 total: 3783
> 11 Sep 2008 07:42:51 serv2: 2739 serv1: 1046 total: 3785
> 11 Sep 2008 07:42:59 serv2: 2743 serv1: 1037 total: 3780
> 11 Sep 2008 07:43:07 serv2: 2720 serv1: 1041 total: 3761
> 11 Sep 2008 07:43:14 serv2: 2714 serv1: 1047 total: 3761
> 11 Sep 2008 07:43:22 serv2: 2721 serv1: 1045 total: 3766
> 11 Sep 2008 07:43:29 serv2: 2697 serv1: 1056 total: 3753
> 11 Sep 2008 07:43:37 serv2: 2710 serv1: 1059 total: 3769
> +11 Sep 2008 07:43:44 serv2: 2765 serv1: 1304 total: 4069
> +11 Sep 2008 07:43:53 serv2: 2854 serv1: 1904 total: 4758
> +11 Sep 2008 07:44:01 serv2: 2714 serv1: 2190 total: 4904
> +11 Sep 2008 07:44:09 serv2: 2715 serv1: 2202 total: 4917
> +11 Sep 2008 07:44:17 serv2: 2779 serv1: 1891 total: 4670
> +11 Sep 2008 07:44:26 serv2: 2812 serv1: 2284 total: 5096
> +11 Sep 2008 07:44:36 serv2: 2828 serv1: 3496 total: 6324
> +11 Sep 2008 07:44:46 serv2: 2715 serv1: 4327 total: 7042
> +11 Sep 2008 07:44:56 serv2: 2638 serv1: 3499 total: 6137
> +11 Sep 2008 07:45:05 serv2: 2714 serv1: 2396 total: 5110
> +11 Sep 2008 07:45:15 serv2: 2776 serv1: 1464 total: 4240
> +11 Sep 2008 07:45:25 serv2: 2728 serv1: 1604 total: 4332
> +11 Sep 2008 07:45:35 serv2: 2708 serv1: 1566 total: 4274
> +11 Sep 2008 07:45:45 serv2: 2750 serv1: 1680 total: 4430
> +11 Sep 2008 07:45:54 serv2: 2755 serv1: 1311 total: 4066
> +11 Sep 2008 07:46:04 serv2: 2704 serv1: 1178 total: 3882
> 11 Sep 2008 07:46:13 serv2: 2644 serv1: 1024 total: 3668
> 11 Sep 2008 07:46:22 serv2: 2573 serv1: 981 total: 3554
> 11 Sep 2008 07:46:30 serv2: 2739 serv1: 1051 total: 3790
> 11 Sep 2008 07:46:39 serv2: 2773 serv1: 1043 total: 3816
> 11 Sep 2008 07:46:46 serv2: 2548 serv1: 959 total: 3507
> 11 Sep 2008 07:46:54 serv2: 2773 serv1: 1044 total: 3817
> 11 Sep 2008 07:47:02 serv2: 2745 serv1: 1027 total: 3772
> 11 Sep 2008 07:47:10 serv2: 2591 serv1: 960 total: 3551
> 11 Sep 2008 07:47:17 serv2: 2721 serv1: 1002 total: 3723
> 11 Sep 2008 07:47:25 serv2: 2753 serv1: 1030 total: 3783
> 11 Sep 2008 07:47:32 serv2: 2773 serv1: 1029 total: 3802
> 11 Sep 2008 07:47:40 serv2: 2768 serv1: 1025 total: 3793
> 11 Sep 2008 07:47:47 serv2: 2764 serv1: 1021 total: 3785
> 11 Sep 2008 07:47:55 serv2: 2750 serv1: 1021 total: 3771
>
>
> Each server also reports the maximum numbers of authenticated connected
> clients to the database.
> Even though serv1 peaked at 07:44:46 at 4327 "ESTABLISHED" connections,
> the
> number of normally authenticated clients in the database for this time
> period is only 1100, which does not reflect this weird connections spike.
> Since I was asleep at 7:46am, I did not have a chance to investigate what
> those connections actually were.
>
> > Anyone know what happens to new
> > connection attempts to a server in this condition?
>
> One thing that I noticed yesterday, that when server was in this condition,
> telnet simply hang, as if no ACK packet was received back:
>
> alecm at serv2 ~> telnet localhost 5222
> Trying 127.0.0.1...
>
> and nothing
>
> This behavior or telnet seems to be unique to the case when ACK is not
> received in the handshake, I verified that with tcpdump. To contrast that,
> if nothing listens on the destination port for example, telnet gives:
>
> alecm at serv2 ~> telnet localhost 50007
> Trying 127.0.0.1...
> telnet: connect to address 127.0.0.1: Connection refused
> telnet: Unable to connect to remote host: Connection refused
>
> which corresponds to RST packet that kernel returns in that case.
> I tried to write a 5 line mucked up server with accept() not returning any
> packets, and failed.
>
> > -----Original Message-----
> > From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
> > bounces at twistedmatrix.com] On Behalf Of Jean-Paul Calderone
> > Sent: Thursday, September 11, 2008 6:47 AM
> > To: Twisted general discussion
> > Subject: Re: [Twisted-Python] intermittent problem: not accepting new
> > connections
> >
> > On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick"
> > <paul.nendick at gmail.com> wrote:
> > >Not necessarily related to what you've described, but I'll share
> > >something that's helped a good deal on my most-heavily hit twisted
> > >servers. Presuming you're using Linux:
> > >
> > > echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
> > >
> > >
> > >from http://lartc.org/howto/lartc.kernel.obscure.html :
> > >
> > >"Enable fast recycling TIME-WAIT sockets. Default value is 1. It
> > >should not be changed without advice/request of technical experts"
> > >
> > >My expert advice: only use this on machines connected on a low-latency
> > >LAN. It *will* break internet-facing interfaces. It halves the
> > >constant used by the Nagle algorithm:
> > >
> > >http://en.wikipedia.org/wiki/Nagle's_algorithm<http://en.wikipedia.org/wiki/Nagle%27s_algorithm>
> > >
> >
> > This is somewhat interesting. It suggests a potential problem which
> > I hadn't thought about before. If you need to accept more than about
> > 64k connections (not necessarily concurrent) in less than TIME-WAIT
> > seconds, you might run out of ports. Anyone know what happens to new
> > connection attempts to a server in this condition?
> >
> > Alec, any idea if your server could be getting into this state every
> > once in a while? This is an appealing hypothesis, since it wouldn't
> > necessarily happen at peak connection time (but potentially shortly
> > after a peak), would resolve itself given a short period of time,
> > wouldn't necessarily prevent all new connection attempts, since old
> > TIME-WAIT sockets would be gradually timing out (so your other low-
> > volume servers might still appear to be working normally), wouldn't
> > interfere with already established connections, and might not change
> > the userspace-visible syscall behavior (depending on what Linux does
> > in this case, but I wouldn't be surprised if connection failures due
> > to this never showed up in an accept(2) result).
> >
> > Jean-Paul
> >
> > _______________________________________________
> > Twisted-Python mailing list
> > Twisted-Python at twistedmatrix.com
> > http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
>
>
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python at twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
>
--
http://www.alvinatorsplayground.blogspot.com/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: </pipermail/twisted-python/attachments/20080912/f608c826/attachment.html>
More information about the Twisted-Python
mailing list