[Twisted-Python] intermittent problem: not accepting new connections

Werner Thie wthie at thiengineering.ch
Thu Sep 11 04:23:10 MDT 2008


Hi Alec

just wondering...

How does netstat look like?

We ran into problems resource wise with the number of open connections 
aka file descriptors per process and observed similar problems to yours, 
also by switching to a new server.

How does ulimit -a compare between old and new machine?

HTH, Werner

Alec Matusis wrote:
>> You could try running strace against the server while it's not
>> accepting
>> connections.  
> 
> 
> It went into this regime again tonight, and I captured strace.
> 
> A few remarks first:
> 
> The outage was brief, it lasted about 3 min. It did not correspond to the peak number of connections.
> The application has several protocols (like manhole and various control ports), that are listening on different ports.
> When the server stopped accepting connections on the main port (where external internet clients connect), telnet connection to manhole and control ports are still possible.
> Also, I think when it is in this bad state, it does accept some external connections, maybe 50%.
> 
> Anyway, here's the strace in the bad state:
> 
> accept(9, {sa_family=AF_INET, sin_port=htons(1474), sin_addr=inet_addr("83.153.87.198")}, [13980580626831507472]) = 491
> accept(13, {sa_family=AF_INET, sin_port=htons(40953), sin_addr=inet_addr("10.18.2.10")}, [17987095445330984976]) = 491
> --- SIGCHLD (Child exited) @ 0 (0) ---
> accept(9, {sa_family=AF_INET, sin_port=htons(50679), sin_addr=inet_addr("24.166.18.207")}, [17853676306370134032]) = 491
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(52873), sin_addr=inet_addr("67.87.61.193")}, [9929874236988456976]) = 493
> accept(9, {sa_family=AF_INET, sin_port=htons(54305), sin_addr=inet_addr("200.114.45.212")}, [2437573306904215568]) = 491
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2246), sin_addr=inet_addr("76.11.66.253")}, [14269655427913351184]) = 491
> accept(9, {sa_family=AF_INET, sin_port=htons(32824), sin_addr=inet_addr("75.62.105.81")}, [4071254071732862992]) = 491
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2262), sin_addr=inet_addr("76.11.66.253")}, [15422576932520198160]) = 491
> accept(9, {sa_family=AF_INET, sin_port=htons(3316), sin_addr=inet_addr("24.127.249.90")}, [17585430653564878864]) = 491
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(4466), sin_addr=inet_addr("72.241.229.233")}, [8219350803517800464]) = 491
> accept(9, {sa_family=AF_INET, sin_port=htons(1651), sin_addr=inet_addr("67.82.141.97")}, [8288312172811911184]) = 491
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2669), sin_addr=inet_addr("24.94.231.135")}, [7857092508491186192]) = 398
> accept(9, {sa_family=AF_INET, sin_port=htons(50252), sin_addr=inet_addr("97.119.52.164")}, [5531546250907746320]) = 398
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2675), sin_addr=inet_addr("70.4.222.71")}, [8289438072718753808]) = 398
> accept(9, {sa_family=AF_INET, sin_port=htons(63671), sin_addr=inet_addr("69.22.255.174")}, [13256345511754989584]) = 491
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(54520), sin_addr=inet_addr("92.130.148.171")}, [17929956025058721808]) = 491
> accept(13, {sa_family=AF_INET, sin_port=htons(41056), sin_addr=inet_addr("10.18.2.10")}, [6962565032504721424]) = 491
> accept(13, 0x7fffc3d5ca30, [16])        = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(49296), sin_addr=inet_addr("66.214.238.42")}, [10430336745580003344]) = 161
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> accept(13, {sa_family=AF_INET, sin_port=htons(41199), sin_addr=inet_addr("10.18.2.10")}, [17266800979928416272]) = 161
> accept(9, {sa_family=AF_INET, sin_port=htons(49376), sin_addr=inet_addr("68.34.251.86")}, [16194944268614238224]) = 161
> --- SIGCHLD (Child exited) @ 0 (0) ---
> accept(13, {sa_family=AF_INET, sin_port=htons(41497), sin_addr=inet_addr("10.18.2.10")}, [1847038805765259280]) = 161
> accept(13, 0x7fffc3d5ca30, [16])        = -1 EAGAIN (Resource temporarily unavailable)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> accept(9, {sa_family=AF_INET, sin_port=htons(50013), sin_addr=inet_addr("71.241.76.86")}, [6756243874575810576]) = 161
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> --- SIGCHLD (Child exited) @ 0 (0) ---
> accept(9, {sa_family=AF_INET, sin_port=htons(2594), sin_addr=inet_addr("24.27.123.223")}, [2452772955646590992]) = 161
> accept(9, {sa_family=AF_INET, sin_port=htons(1690), sin_addr=inet_addr("67.242.75.46")}, [11098558340291100688]) = 276
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(13, {sa_family=AF_INET, sin_port=htons(41825), sin_addr=inet_addr("10.18.2.10")}, [7035467051472781328]) = 276
> accept(13, {sa_family=AF_INET, sin_port=htons(41910), sin_addr=inet_addr("10.18.2.10")}, [13160362544696655888]) = 276
> 
> 
> 
> here's the strace in the normal state:
> 
> 
> accept(9, {sa_family=AF_INET, sin_port=htons(40442), sin_addr=inet_addr("74.140.139.210")}, [18058590089415491600]) = 2218
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(58477), sin_addr=inet_addr("99.194.115.132")}, [7918454053414109200]) = 431
> accept(13, {sa_family=AF_INET, sin_port=htons(50878), sin_addr=inet_addr("10.18.2.10")}, [13746674921184952336]) = 356
> accept(9, {sa_family=AF_INET, sin_port=htons(1154), sin_addr=inet_addr("190.37.68.45")}, [9368613133427408912]) = 356
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> accept(9, {sa_family=AF_INET, sin_port=htons(1212), sin_addr=inet_addr("125.60.240.211")}, [13547953587627229200]) = 356
> accept(9, {sa_family=AF_INET, sin_port=htons(1158), sin_addr=inet_addr("190.37.68.45")}, [9656843509579120656]) = 2223
> accept(9, {sa_family=AF_INET, sin_port=htons(52008), sin_addr=inet_addr("24.189.205.204")}, [2939443190379315216]) = 2224
> accept(9, {sa_family=AF_INET, sin_port=htons(50808), sin_addr=inet_addr("72.155.68.155")}, [8702643338529996816]) = 2253
> accept(9, {sa_family=AF_INET, sin_port=htons(2545), sin_addr=inet_addr("24.10.61.223")}, [17368413446520963088]) = 2263
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2547), sin_addr=inet_addr("24.10.61.223")}, [17512528634596818960]) = 356
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(52012), sin_addr=inet_addr("24.189.205.204")}, [3227673566531026960]) = 180
> accept(9, {sa_family=AF_INET, sin_port=htons(50809), sin_addr=inet_addr("72.155.68.155")}, [8774700932567924752]) = 2224
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(1223), sin_addr=inet_addr("125.60.240.211")}, [14340587122044436496]) = 2253
> accept(9, {sa_family=AF_INET, sin_port=htons(2462), sin_addr=inet_addr("206.74.170.35")}, [11387633141372944400]) = 2263
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2464), sin_addr=inet_addr("206.74.170.35")}, [11531748329448800272]) = 2263
> accept(9, {sa_family=AF_INET, sin_port=htons(1645), sin_addr=inet_addr("61.145.215.253")}, [7855966608584343568]) = 2224
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(4172), sin_addr=inet_addr("68.188.149.222")}, [5480880755099828240]) = 1283
> accept(9, {sa_family=AF_INET, sin_port=htons(50931), sin_addr=inet_addr("70.129.57.218")}, [17565727405195132944]) = 1283
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(4173), sin_addr=inet_addr("68.188.149.222")}, [5552938349137756176]) = 2224
> accept(9, {sa_family=AF_INET, sin_port=htons(1122), sin_addr=inet_addr("76.127.204.185")}, [7062770124213714960]) = 1283
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(13, {sa_family=AF_INET, sin_port=htons(51062), sin_addr=inet_addr("10.18.2.10")}, [8558809625430851600]) = 2292
> accept(13, 0x7fffc3d5ca30, [16])        = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(1646), sin_addr=inet_addr("61.145.215.253")}, [7928024202622271504]) = 2292
> accept(9, {sa_family=AF_INET, sin_port=htons(50932), sin_addr=inet_addr("70.129.57.218")}, [17637784999233060880]) = 2310
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("99.13.103.221")}, [11605494773346992144]) = 507
> accept(9, {sa_family=AF_INET, sin_port=htons(4004), sin_addr=inet_addr("99.13.103.221")}, [11821667555460775952]) = 507
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(1131), sin_addr=inet_addr("76.127.204.185")}, [7711288470555066384]) = 1283
> accept(9, {sa_family=AF_INET, sin_port=htons(2143), sin_addr=inet_addr("202.156.50.129")}, [6847723242006773776]) = 1894
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(2146), sin_addr=inet_addr("202.156.50.129")}, [7063896024120557584]) = 1894
> accept(9, {sa_family=AF_INET, sin_port=htons(1779), sin_addr=inet_addr("222.165.66.136")}, [17511684209666686992]) = 1278
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(61674), sin_addr=inet_addr("99.230.169.52")}, [16929031007875629072]) = 1278
> accept(9, {sa_family=AF_INET, sin_port=htons(61679), sin_addr=inet_addr("99.230.169.52")}, [17289318978065268752]) = 1278
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(1780), sin_addr=inet_addr("222.165.66.136")}, [17583741803704614928]) = 1913
> accept(9, {sa_family=AF_INET, sin_port=htons(52495), sin_addr=inet_addr("190.40.52.74")}, [1138566289384538128]) = 2310
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(13, {sa_family=AF_INET, sin_port=htons(51181), sin_addr=inet_addr("10.18.2.10")}, [17133663315944275984]) = 2310
> accept(9, {sa_family=AF_INET, sin_port=htons(60995), sin_addr=inet_addr("88.105.44.33")}, [4894849853588242448]) = 2310
> accept(9, {sa_family=AF_INET, sin_port=htons(3636), sin_addr=inet_addr("71.64.15.79")}, [3750935548236136464]) = 2310
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(13, {sa_family=AF_INET, sin_port=htons(51273), sin_addr=inet_addr("10.18.2.10")}, [5316499368700805136]) = 2334
> accept(13, 0x7fffc3d5ca30, [16])        = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(52498), sin_addr=inet_addr("190.40.52.74")}, [1354739071498321936]) = 2334
> accept(9, {sa_family=AF_INET, sin_port=htons(1440), sin_addr=inet_addr("24.21.117.196")}, [11530622429541957648]) = 2334
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(9, {sa_family=AF_INET, sin_port=htons(1442), sin_addr=inet_addr("24.21.117.196")}, [11674737617617813520]) = 2334
> accept(9, {sa_family=AF_INET, sin_port=htons(60996), sin_addr=inet_addr("88.105.44.33")}, [4966907447626170384]) = 2336
> accept(9, 0x7fffc3d5ca30, [16])         = -1 EAGAIN (Resource temporarily unavailable)
> accept(13, {sa_family=AF_INET, sin_port=htons(51345), sin_addr=inet_addr("10.18.2.10")}, [10504646139431616528]) = 2337
> accept(13, {sa_family=AF_INET, sin_port=htons(51348), sin_addr=inet_addr("10.18.2.10")}, [10720818921545400336]) = 2337
> 
> The only noticeable difference is the frequency of 
> --- SIGCHLD (Child exited) @ 0 (0) ---
> but it occurred in the normal state as well.
> The server sends emails via child subprocesses, so I suspect Child exited may be due to that.
> 
> If that's the case, the strace output is identical, and I do not understand what's causing the problem at all.
> 
> 
>> -----Original Message-----
>> From: twisted-python-bounces at twistedmatrix.com [mailto:twisted-python-
>> bounces at twistedmatrix.com] On Behalf Of Jean-Paul Calderone
>> Sent: Wednesday, September 10, 2008 2:47 PM
>> To: Twisted general discussion
>> Subject: RE: [Twisted-Python] intermittent problem: not accepting new
>> connections
>>
>> On Wed, 10 Sep 2008 14:01:27 -0700, Alec Matusis <matusis at yahoo.com>
>> wrote:
>>>> Do non-error log messages continue to appear in the Twisted log?
>> ie,
>>>> is
>>>> it clear that the logging system is still working, or could it have
>>>> failed
>>>> in some way, obscuring any exception reports?
>>> Yes, I print the results of garbage collection every 10 min like this:
>>>
>>> print "number of objects tracked by the garbage collector is:",
>> len(gc.get_objects())
>>> and they appear in the main twisted log.
>>>
>>> 2008/09/10 13:21 -0700 [-] number of objects tracked by the garbage
>> collector is: 860021
>>> 2008/09/10 13:31 -0700 [-] number of objects tracked by the garbage
>> collector is: 864316
>>>> Any new unhandled errno values should definitely result in an
>> exception
>>>> being logged (notice that the `raiseĀ“ which follows the checks for
>>>> various errno values is inside a try/except which logs any
>> exception).
>>> I noticed that raise too... Could it then be EWOULDBLOCK, EAGAIN or
>> EPERM?
>>>                except socket.error, e:
>>>                    if e.args[0] in (EWOULDBLOCK, EAGAIN):
>>>                        self.numberAccepts = i
>>>                        break
>>>                    elif e.args[0] == EPERM:
>>>                        # Netfilter on Linux may have rejected the
>>>                        # connection, but we get told to try to
>> accept()
>>>                        # anyway.
>>>                        continue
>>>
>>>
>>> I am not sure how to debug this problem- I have another twisted server
>> of a different type on that machine, and while the problematic server
>> stops accepting connections, the second one works just fine, so this is
>> not a machine-wide issue.
>>> What could it be?
>> You could try running strace against the server while it's not
>> accepting
>> connections.  You can limit the output to just accept calls, too:
>>
>>    strace -p <pid> -etrace=accept
>>
>> Maybe this will show you that accept calls are happening but failing in
>> some way.  If it produces no output, you can widen it and see if you
>> can
>> tell what the server is doing if it isn't calling accept().
>>
>> Jean-Paul
>>
>> _______________________________________________
>> Twisted-Python mailing list
>> Twisted-Python at twistedmatrix.com
>> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
> 
> 
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python at twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python




More information about the Twisted-Python mailing list