Ticket #816 (new defect )

Opened 4 years ago

Last modified 5 months ago

select fails with too many open file descriptiors -> kills server

Reported by: ndecker Assigned to: itamarst
Type: defect Priority: high
Milestone: Component: core
Keywords: Cc: glyph, exarkun, itamarst, anthony, jknight, ndecker
Branch: Author:
Launchpad Bug:

Attachments

Change History

  2004-12-08 21:11:05+00:00 changed by ndecker

When too many connections are opened using the default reactor with windows,
this message is printed repeatedly:
Traceback (most recent call last):
  File "x.py", line 244, in ?
    main()
  File "x.py", line 240, in main
    reactor.run()
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 126, in run
    self.mainLoop()
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 137, in mainLoop
    self.doIteration(t)
--- <exception caught here> ---
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 490, in doSelect
    [], timeout)
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 441, in
win32select
    r, w, e = select.select(r, w, w, timeout)
exceptions.ValueError: too many file descriptors in select()
The limit on my system seems to be 252 open connections. Maybe the reactor
should refuse to add connections above this limit.

  2004-12-08 21:42:06+00:00 changed by glyph

This isn't a win32-specific issue.  You're right that Twisted should refuse to
open connections beyond this limit (although given that the limit is impossible
to determine beforehand, really the correct thing to do is to handle the
exception that you are seeing being logged, and remove the
last-most-recently-added FD and cause *it* to error out with a traceback rather
than the whole reactor.)

  2004-12-08 21:48:21+00:00 changed by ndecker

Good. This issue is no urgent problem to me. It was just a syntetic load test.
It just worries me, that twisted is completely stopped by this with no hope for
recovery.

  2004-12-12 05:46:15+00:00 changed by jknight

Retitled. Is there really no way to know the limit ahead of time? Killing
the most-recently-added seems somewhat poor. (note: not necessarily most
recently _connected_, because it may be that you had a bunch of idle
connections that just got re-added to the select set to write some data).

  2004-12-14 00:36:45+00:00 changed by jknight

Of course, FD_SETSIZE is what you need to look at, but that isn't exposed
to python. Not directly at least. But you *can* get at it via something
like the following binomial search code. So, getting at the number is not
a problem. Might even want to do min(findMaxFDs(), getrlimit(RLIMIT_NOFILE)
- 20), or something like that.
def tryMaxFDs(num):
    """Returns True if select can handle that num fds."""
    try:
        print "Trying",num
        select.select(range(num), [], [], 0)
    except ValueError:
        return False
    except select.error:
        return True
    return True
def findMaxFDs():
    """Find the maximum number of fds that can be passed to select"""
    left = 0
    right = 512
    # First search upwards
    while tryMaxFDs(right):
        left = right
        right *= 2
    # Now binary search through that range
    while left <= right:
        mid = int((left+right)//2)
        if tryMaxFDs(mid):
            left = mid + 1
        else:
            right = mid - 1
    return left - 1

  2004-12-14 04:11:17+00:00 changed by itamarst

We probably should have pluggable hooks so users can determine policy. E.g., you
might want to stop accepting TCP connections on your webserver, and kill a
couple of connections so you can still connect using manhole, stuff like that.
Any automated policy is bound to be broken in some cases.

  2004-12-14 04:46:23+00:00 changed by jknight

That'd be a nice enhancement. This, however, is a denial of service
and therefore much more serious.

  2005-01-12 18:47:33+00:00 changed by anthony

Well, using the pollreactor, say, should make that problem go away. select is
just a broken API.

  2006-06-04 00:56:12+00:00 changed by glyph

select is limited not only in the quantity of file descriptors that can be passed to it, but also in the literal value of the FDs passed to it. e.g. you cannot do select([1025], [], []) on most platforms although that is only one file descriptor, 1025 > FD_SETSIZE and so you're boned.

This is only true on UNIX, however; on Windows, it appears that select can somehow handle large fileno()s but only 512 at a time.

We should add checks to addReader and addWriter in select-based reactors. On UNIX those checks can just look at the FD to see if it's greater than FD_SETSIZE, on Windows and for reactor like pollreactor, it will need to check the current size of the readers and writers lists. The first phase of the solution, just to eliminate the most obvious failure, is to disallow the particular connection that exceeds the limit. This still allows a DoS, since if you open lots of connections, no new connections will be able to be made, but you could monkey around with iptables and eventually restore the server to working order without killing it. Longer-term we will want some sort of policy hook so that factories can shut down protocols that they don't think are important if new connections are higher-priority.

  2006-06-04 19:13:40+00:00 changed by jknight

To fix this properly for windows, we ought to have the reactor tracking associated fds separately from read/write, as per #365 and #1662. Without that you'll end up with random connections aborting in the middle from being unable to restart reading/writing after having stopped. That's somewhat poor. It'd be better to just reject any new sockets when the reactor already has too many.

  2006-07-16 00:53:24+00:00 changed by ghazel

The pre-process search for max # of file-descriptors is not a complete way of finding the limit, at least on Windows. It's true that it will prevent this error:

exceptions.ValueError: too many file descriptors in select()

But not this one (#1228):

select.error: (10055, 'An operation on a socket could not be performed because
the system lacked sufficient buffer space or because a queue was full')

The method I use is to detect this error at runtime, and postpone sockets until select can chew through them. This might cause timeouts, but it also might not. I preferred this over closing a few of them.

  2008-06-26 22:35:03+00:00 changed by lukatmyshu

  • branch deleted
  • author deleted

If you run out of FDs in twisted and you've got a connection that you're about to accept twisted will get stuck in an infinite busy loop. epoll will signal that theres data pending on a certain fd, tcp.py will attempt to open the connection but will fail with EMFILE/ENFILE ... then epoll will immediately signal on that fd again.

There are a couple different ways to handle this situation. 1) When we get to 90% of open FDs stop accepting connections and wait until the number falls below some acceptable threshold. 2) When accept fails w/ EMFILE/ENFILE remove the accept fd from the epoll set and add it back when the number of FDs has fallen (add it to 'loseConnection?'). 3) Tell the app and let it decide .... if it doesn't have any good ideas, do (1) or (2).

Note: See TracTickets for help on using tickets.