Opened 8 years ago

Closed 19 months ago

#5368 defect closed fixed (fixed)

epoll (and perhaps other reactors) busy loops when file descriptor limit is hit

Reported by: Itamar Turner-Trauring Owned by: mark williams
Priority: high Milestone:
Component: core Keywords:
Cc: Thijs Triemstra Branch: 5368-break-on-emfile
branch-diff, diff-cov, branch-cov, buildbot
Author: itamarst

Description (last modified by Thijs Triemstra)

Lifted from the comments in #816 - note that #816 covers the select()-specific issue of FD_SETSIZE limits, whereas this is related to running out of file descriptors when doing accept():

If you run out of FDs in twisted and you've got a connection that you're about to accept twisted will get stuck in an infinite busy loop. epoll will signal that theres data pending on a certain fd, tcp.py will attempt to open the connection but will fail with EMFILE/ENFILE ... then epoll will immediately signal on that fd again.

Allowing a user policy to do something useful in this case (closing idle connections, say) is covered by #2122. This ticket is focused merely on preventing the 100% CPU usage.

Change History (22)

comment:1 Changed 8 years ago by Jean-Paul Calderone

Description: modified (diff)

comment:2 Changed 8 years ago by Thijs Triemstra

Cc: Thijs Triemstra added
Description: modified (diff)
Priority: normalhigh

comment:3 Changed 8 years ago by itamarst

Author: itamarst
Branch: branches/EMFILE-5368

(In [33078]) Branching to 'EMFILE-5368'

comment:4 Changed 8 years ago by Itamar Turner-Trauring

Keywords: review added

comment:5 Changed 8 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to Itamar Turner-Trauring

Thanks. It'll be pretty cool when reactors are more robust in resource exhaustion conditions.

  1. That said, a new implicit state for tcp Ports doesn't sound like a step in the right direction. What happens if application code decides to pause the port in the one second disabled interval? This code will silently unpause it, probably surprising the application.
  2. The timed call outliving the port's listening lifetime is also potentially troublesome. What if the port is stopped and then started again across the one second disabled interval? Even more fun, what if it runs into EMFILE a second time before the restore delayed call fires?
  3. I suppose you can address these points by just being more careful about cleaning up state across state transitions. Maybe that's even okay, since we're going to move on to get rid of this code and replace it with application hooks for deciding what to do. I hope we do that soon. Maybe the literal 1 hard coded in several places can be replaced by a symbolic constant of some sort, though.
  4. Thanks for cleaning up the use of the global reactor in SelectReactorTestCase/ResourceExhaustionTestCase. The tests are still posixbase-specific though, since they directly instantiate twisted.internet.tcp.Port. Maybe not now, but at some point I think the module should be renamed/relocated to reflect this.
  5. The ResourceExhaustionTestCase.ports list is unused now and can be removed.
  6. ResourceExhaustionTestCase._acceptFailureTest doesn't need to clean up the port it creates.
  7. The shouldStopListening argument to that method needs some documentation.
  8. Those FakeSocket and FakeReactor should probably be generalized, tested, and re-used elsewhere in our test suite.

comment:6 Changed 8 years ago by Itamar Turner-Trauring

I don't think we're going replace this code with user hooks; the user hooks are an additional, different part of the solution.

comment:7 Changed 8 years ago by Itamar Turner-Trauring

Another potential solution is to close the socket and immediately reopen it, thus (presumably) clearing any notifications. Unfortunately some other thread might grab this last available file descriptor in between the two steps, so probably not doable in practice.

comment:8 Changed 8 years ago by Jean-Paul Calderone

I don't think we're going replace this code with user hooks; the user hooks are an additional, different part of the solution.

Why aren't the user hooks enough? This feature doesn't need access to any nasty tcp.Port guts. Implemented on the factory or another application-specified object, it's simpler and more general (consider UNIX sockets, or any other source of new file descriptors), as well as customizable.

Another potential solution is to close the socket and immediately reopen it, thus (presumably) clearing any notifications.

That doesn't sound very desirable. There are all kinds of other cases where it's not possible to re-open the socket properly (maybe we lost the privilege to bind the address it's listening too, maybe we're sharing it with another process, etc).

comment:9 Changed 8 years ago by Itamar Turner-Trauring

As far as user hooks, I was thinking more in terms of optional user hooks, which 99% users won't use and therefore we're back to square one in terms of vulnerability. But, if there's a default that does the same thing as this, only using the user hook, then yes, that seems better.

In which case, maybe I should write the user hook first.

comment:10 Changed 8 years ago by Itamar Turner-Trauring

Once #5400 is merged, an additional ticket should be opened for moving TCP ports to an explicit state machine. Once *that* is done, the concerns raised in the review should be much easier to address.

comment:11 Changed 8 years ago by Itamar Turner-Trauring

Moving to state machine is going to take a while. How about this approach? On startup, grab extra file descriptor. If we get out-of-fds error in accept(), close the extra fd, do accept() as usual, except closing the sockets we get back from accept(), and then re-grab an extra file descriptor. Doing accept() will clear the read flag on the port socket.

Obviously this privileges existing connections, but improving that is covered by that other ticket for having better, user-defined policies.

comment:12 Changed 20 months ago by mark williams

Branch: branches/EMFILE-53685368-break-on-emfile
Owner: changed from Itamar Turner-Trauring to Jean-Paul Calderone

comment:13 Changed 20 months ago by Jean-Paul Calderone

Keywords: review added

Should have been marked for review.

comment:14 Changed 20 months ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to mark williams

Reviewed.

comment:15 Changed 20 months ago by mark williams

Keywords: review added
Owner: changed from mark williams to Jean-Paul Calderone

comment:16 Changed 20 months ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to mark williams

comment:18 Changed 20 months ago by Jean-Paul Calderone

Owner: changed from mark williams to Jean-Paul Calderone
Status: newassigned

comment:19 Changed 19 months ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to mark williams
Status: assignednew

comment:20 Changed 19 months ago by mark williams

Keywords: review added
Owner: changed from mark williams to Jean-Paul Calderone

comment:21 Changed 19 months ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to mark williams

comment:22 Changed 19 months ago by Mark Williams <mrw@…>

Resolution: fixed
Status: newclosed

In f0b04e45:

Merge pull request #996 from twisted/5368-break-on-emfile

Author: markrwilliams

Reviewers: exarkun

Fixes: ticket:5368

On UNIX-like platforms, Twisted attempts to recover from EMFILE
when accepting connections on TCP and UNIX ports by shedding
incoming clients.

Note: See TracTickets for help on using tickets.