Opened 9 years ago

Last modified 8 years ago

#3290 defect new

Epoll reactor pauses for 10-15 seconds during high connection volume

Reported by: kvogt
Owned by:
Priority: normal
Milestone:
Component: core
Keywords: epoll
Cc: exarkun
Branch:


I've included sample code below.

Basically, when you flood the reactor with a couple thousand connections, the strace shows that the process blocks on epoll for 10-15 seconds. Some connections eventually time out instead of being accept()'d by the reactor.

Make sure you have lots of fd's handy:

# ulimit -n 65536
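For reference, the same limit can be checked (and, within the hard limit, raised) from inside a Python process. This is only a minimal sketch using the standard `resource` module on Unix; the helper names `fd_limits` and `raise_soft_fd_limit` are made up for illustration and are not part of Twisted or of the attached snippet.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_fd_limit(wanted):
    """Raise the soft fd limit to `wanted`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = fd_limits()
    # RLIM_INFINITY means "no limit"; it cannot be compared numerically.
    if soft == resource.RLIM_INFINITY:
        return soft
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft >= wanted:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted


if __name__ == "__main__":
    print("fd limits before:", fd_limits())
    print("soft limit now:", raise_soft_fd_limit(65536))
```

Raising the hard limit itself still needs root (or `ulimit -n` in a privileged shell, as above); this sketch only moves the soft limit.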

Run the attached snippet with:

# twistd --python --reactor epoll

Then test with:

# ab -c10000 -n10000

And tail the log:

# tail -f twistd.log

Attachments (3)

(688 bytes) - added by kvogt 9 years ago.
Test file for demonstrating reactor pausing
server.tac (1.2 KB) - added by exarkun 8 years ago.
modified version of which does more reporting
(1.5 KB) - added by exarkun 8 years ago.
ab replacement which does reporting and probably has fewer bugs


Change History (10)

Changed 9 years ago by kvogt

Test file for demonstrating reactor pausing

comment:1 Changed 8 years ago by exarkun

  • Cc exarkun added

I did a bit of playing around with this. First I tried ab, as you suggested, and reproduced the problem fairly well. I also tried using selectreactor in the server and saw some similar pauses, although much shorter ones since the total connection limit is much lower.

Then I wrote a custom client to do the same thing as ab was supposed to be doing, and the pauses were drastically diminished. With my custom client, up to 20k connections, the longest pause is about 0.15 seconds. Between 20k and 21k, there seems to be a reproducible delay of about 2 seconds, but above 21k that goes away.

The custom client I wrote only ends up keeping about 50 - 75 outstanding connection attempts (that is, sockets which are connecting but not yet connected) in parallel. I don't know of any way to make ab tell me how many concurrent connection attempts it has.

What do you think about the possibility that this is a bug in ab, not epollreactor?

Changed 8 years ago by exarkun

modified version of which does more reporting

Changed 8 years ago by exarkun

ab replacement which does reporting and probably has fewer bugs

comment:2 Changed 8 years ago by exarkun

  • Owner changed from glyph to kvogt

Can you provide any further information? A non-ab based reproduction would be great, to rule out the possibility that this is just an ab bug.

comment:3 Changed 8 years ago by kvogt

  • Owner changed from kvogt to exarkun

Actually, the point of this test is to demonstrate the reactor problems I've seen with high concurrency. That -c flag on ab specifies the number of connections to do in parallel. I'm trying to closely simulate the scenario where a twisted app with 10-20K clients crashes or restarts. In many cases, such an event would trigger reconnect logic in all of those clients, so you actually see several thousand concurrent socket connections being created.

ab is the best tool I know of for this job, but I'd be open to other suggestions. A test client based on twisted might suffer from the same limitations, so I'd rather not go down that road.

comment:4 Changed 8 years ago by exarkun

Take a look at the attachments. I realize that a Twisted client could have issues which prevent it from properly testing the problem, so I included some extra output so that a visual inspection can confirm that the problem is not showing up. For example, one of the things the server will now emit is the amount of time which passes between a connection being made and data being received. Additionally, the client will report how many connection attempts it has made and not yet seen satisfied. We can add more diagnostic information if you think there might be some other case it is missing.
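To make those two measurements concrete, here is a rough, framework-neutral sketch of the bookkeeping involved. It is not the code from the attachments: the class and method names (`ConnectionTimer`, `AttemptCounter`, and so on) are invented for illustration, and in a Twisted program the calls would presumably be made from hooks such as `connectionMade` and `dataReceived`.

```python
import time


class ConnectionTimer:
    """Server side: delay between a connection being made and its first data."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._connected_at = {}  # connection id -> time the connection was made
        self.delays = []         # seconds from connection to first data

    def connection_made(self, conn_id):
        self._connected_at[conn_id] = self._clock()

    def data_received(self, conn_id):
        # Only the first data on a connection is timed; later calls are ignored.
        started = self._connected_at.pop(conn_id, None)
        if started is not None:
            self.delays.append(self._clock() - started)

    def longest_delay(self):
        return max(self.delays, default=0.0)


class AttemptCounter:
    """Client side: connection attempts made and not yet seen satisfied."""

    def __init__(self):
        self.started = 0
        self.finished = 0

    def attempt_started(self):
        self.started += 1

    def attempt_finished(self):
        # Call this on success or failure; either way the attempt is over.
        self.finished += 1

    @property
    def outstanding(self):
        return self.started - self.finished
```

A long pause in the reactor would show up as a large `longest_delay()` on the server and a growing `outstanding` count on the client.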

There may well be a problem here, but I'm suspicious of the quality of the results ab produces, since I've seen it produce clearly wrong values (e.g., negatives for things which cannot physically be negative) in the past.

comment:5 Changed 8 years ago by exarkun

Ah, of course, another possibility is a simple client based on just the socket module.
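A minimal sketch of what such a client might look like, using only the standard `socket` and `selectors` modules. Everything here is an assumption for illustration: the function name `flood`, its parameters, and the choice to close each socket as soon as it connects are not taken from the attachments. It is IPv4 only and expects an IP address rather than a hostname.

```python
import errno
import selectors
import socket
import time


def flood(host, port, total, max_pending=1000, timeout=30.0):
    """Attempt `total` TCP connections with at most `max_pending` in flight.

    Returns (connected, failed, timed_out). `max_pending` must be >= 1.
    """
    sel = selectors.DefaultSelector()
    started = connected = failed = pending = 0
    deadline = time.monotonic() + timeout
    try:
        while connected + failed < total:
            # Start new non-blocking connects until the in-flight cap is hit.
            while started < total and pending < max_pending:
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                sock.setblocking(False)
                started += 1
                err = sock.connect_ex((host, port))
                if err in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
                    sel.register(sock, selectors.EVENT_WRITE)
                    pending += 1
                else:
                    sock.close()
                    failed += 1
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            if pending == 0:
                continue
            # A socket becomes writable once its connect has finished;
            # SO_ERROR then says whether it succeeded.
            for key, _ in sel.select(remaining):
                sock = key.fileobj
                sel.unregister(sock)
                pending -= 1
                if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0:
                    connected += 1
                else:
                    failed += 1
                sock.close()
    finally:
        # Anything still registered never completed before the deadline.
        for key in list(sel.get_map().values()):
            key.fileobj.close()
        sel.close()
    return connected, failed, pending


if __name__ == "__main__":
    print(flood("127.0.0.1", 8080, total=10000, max_pending=10000))
```

The port in the usage line is a placeholder. Setting `max_pending` equal to `total` approximates every client reconnecting at once, while a small value (say 50 to 75) approximates the custom client described in comment:1.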

comment:6 Changed 8 years ago by exarkun

I wonder if latencytop would help narrow down the problem.

comment:7 Changed 6 years ago by <automation>

  • Owner exarkun deleted