Opened 14 years ago

Last modified 4 years ago

#816 defect new

select fails with too many open file descriptors -> kills server

Reported by: ndecker Owned by:
Priority: high Milestone:
Component: core Keywords:
Cc: Glyph, Jean-Paul Calderone, itamarst, anthony, jknight, ndecker Branch:
Author:

Change History (15)

comment:1 Changed 14 years ago by ndecker

When too many connections are opened using the default reactor with windows, 
this message is printed repeatedly:

Traceback (most recent call last):
  File "x.py", line 244, in ?
    main()
  File "x.py", line 240, in main
    reactor.run()
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 126, in run
    self.mainLoop()
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 137, in mainLoop
    self.doIteration(t)
--- <exception caught here> ---
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 490, in doSelect
    [], timeout)
  File ".\lib\Twisted-1.3.0\twisted\internet\default.py", line 441, in 
win32select
    r, w, e = select.select(r, w, w, timeout)
exceptions.ValueError: too many file descriptors in select()

The limit on my system seems to be 252 open connections. Maybe the reactor 
should refuse to add connections above this limit.

comment:2 Changed 14 years ago by Glyph

This isn't a win32-specific issue.  You're right that Twisted should refuse to
open connections beyond this limit (although, given that the limit is impossible
to determine beforehand, the correct thing to do is really to handle the
exception you are seeing logged, remove the most-recently-added FD, and cause
*it* to error out with a traceback rather than taking down the whole reactor).
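
A minimal standalone sketch of that idea (the safeSelect helper and its
eviction policy are hypothetical, not actual reactor code):

import select

def safeSelect(readers, writers, timeout):
    """Retry select(), evicting the most recently added reader whenever
    the set is too large, instead of letting the whole loop die."""
    evicted = []
    while True:
        try:
            r, w, e = select.select(readers, writers, [], timeout)
            return r, w, evicted  # caller errors out the evicted fds
        except ValueError:
            # "too many file descriptors in select()" -- drop one fd.
            evicted.append(readers.pop() if readers else writers.pop())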

comment:3 Changed 14 years ago by ndecker

Good. This issue is not an urgent problem for me; it was just a synthetic load test.

It just worries me that Twisted is completely stopped by this, with no hope of
recovery.

comment:4 Changed 14 years ago by jknight

Retitled. Is there really no way to know the limit ahead of time? Killing 
the most-recently-added seems somewhat poor. (note: not necessarily most 
recently _connected_, because it may be that you had a bunch of idle 
connections that just got re-added to the select set to write some data).

comment:5 Changed 14 years ago by jknight

Of course, FD_SETSIZE is what you need to look at, but that isn't exposed 
to Python, at least not directly. But you *can* get at it via something 
like the following binary search code. So, getting at the number is not
a problem. Might even want to do min(findMaxFDs(), getrlimit(RLIMIT_NOFILE) 
- 20), or something like that.

import select

def tryMaxFDs(num):
    """Return True if select() accepts a set of num fds."""
    try:
        print "Trying", num
        select.select(range(num), [], [], 0)
    except ValueError:
        # Too many fds for select() itself.
        return False
    except select.error:
        # Bad fds in the set, but select() accepted the count.
        return True
    return True

def findMaxFDs():
    """Find the maximum number of fds that can be passed to select()."""
    left = 0
    right = 512
    # First search upwards for an upper bound.
    while tryMaxFDs(right):
        left = right
        right *= 2
    # Now binary search within that range.
    while left <= right:
        mid = (left + right) // 2
        if tryMaxFDs(mid):
            left = mid + 1
        else:
            right = mid - 1
    return left - 1
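
Used along the lines the comment suggests (a sketch; the 20-fd headroom is an
arbitrary margin, and the resource module is UNIX-only):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
maxFDs = min(findMaxFDs(), soft - 20)  # leave headroom for logs, etc.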

comment:6 Changed 14 years ago by itamarst

We probably should have pluggable hooks so users can determine policy. E.g., you
might want to stop accepting TCP connections on your webserver, and kill a
couple of connections so you can still connect using manhole, stuff like that. 
Any automated policy is bound to be broken in some cases.
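
For instance, a hypothetical pluggable policy (the hook name and the idea of
the reactor invoking it are assumptions, not existing Twisted API):

def makeFDLimitPolicy(expendablePorts):
    """When fds run out, pause the listed listening ports so that a
    higher-priority service (e.g. manhole) can still accept."""
    def fdLimitReached():
        for port in expendablePorts:
            port.stopReading()  # stop accepting, without dropping the port
    return fdLimitReached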

comment:7 Changed 14 years ago by jknight

That'd be a nice enhancement. This, however, is a denial of service 
and therefore much more serious.

comment:8 Changed 13 years ago by anthony

Well, using the pollreactor, say, should make that problem go away. select is
just a broken API.

comment:9 Changed 12 years ago by Glyph

select is limited not only in the quantity of file descriptors that can be passed to it, but also in the literal value of the FDs passed to it. E.g. you cannot do select([1025], [], []) on most platforms: although that is only one file descriptor, 1025 > FD_SETSIZE, so you're boned.
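
A quick demonstration of that on Linux (a sketch; it assumes the hard
RLIMIT_NOFILE allows raising the soft limit, and the fd numbers are arbitrary):

import os, resource, select, socket

# Raise the soft fd limit so a descriptor above FD_SETSIZE can exist.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(2048, hard), hard))

s = socket.socket()
os.dup2(s.fileno(), 1300)  # one socket, but its fd value is > FD_SETSIZE
try:
    select.select([1300], [], [], 0)
except ValueError, e:
    print "rejected:", e  # the fd's *value*, not the count, is the problem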

This is only true on UNIX, however; on Windows, it appears that select can somehow handle large fileno()s but only 512 at a time.

We should add checks to addReader and addWriter in select-based reactors. On UNIX those checks can just look at the FD to see if it's greater than FD_SETSIZE; on Windows, and for reactors like pollreactor, they will need to check the current size of the readers and writers lists.

The first phase of the solution, just to eliminate the most obvious failure, is to disallow the particular connection that exceeds the limit. This still allows a DoS, since once enough connections are open no new ones can be made, but you could monkey around with iptables and eventually restore the server to working order without killing it. Longer-term we will want some sort of policy hook so that factories can shut down protocols they don't think are important when new connections are higher-priority.
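
A hedged sketch of such a guard (the function name, the 512 Windows cap, and
the 1024 FD_SETSIZE value are assumptions; CPython does not expose FD_SETSIZE
directly):

import sys

FD_SETSIZE = 1024  # typical UNIX default; not exposed by the select module

def checkSelectable(fd, numReaders, numWriters):
    """Guard that addReader/addWriter could run before accepting an fd."""
    if sys.platform == "win32":
        # Windows select() caps how *many* sockets may be passed.
        if numReaders + numWriters >= 512:
            raise ValueError("too many sockets for select()")
    elif fd >= FD_SETSIZE:
        # UNIX select() caps the *value* of every fd at FD_SETSIZE.
        raise ValueError("fd %d does not fit in an fd_set" % (fd,))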

comment:10 Changed 12 years ago by jknight

To fix this properly for windows, we ought to have the reactor tracking associated fds separately from read/write, as per #365 and #1662. Without that you'll end up with random connections aborting in the middle from being unable to restart reading/writing after having stopped. That's somewhat poor. It'd be better to just reject any new sockets when the reactor already has too many.

comment:11 Changed 12 years ago by ghazel

Probing for the maximum number of file descriptors at process startup is not a complete way of finding the limit, at least on Windows. It's true that it will prevent this error:

exceptions.ValueError: too many file descriptors in select()

But not this one (#1228):

select.error: (10055, 'An operation on a socket could not be performed because 
the system lacked sufficient buffer space or because a queue was full')

The method I use is to detect this error at runtime, and postpone sockets until select can chew through them. This might cause timeouts, but it also might not. I preferred this over closing a few of them.
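
Roughly like this (a sketch of the postponement idea; the helper and the
halving policy are hypothetical):

import select

WSAENOBUFS = 10055  # Windows: insufficient buffer space / queue full

def selectWithPostpone(readers, writers, timeout, postponed):
    """On WSAENOBUFS, park some readers and retry later instead of dying;
    the caller re-adds the postponed fds on a later iteration."""
    while True:
        try:
            return select.select(readers, writers, [], timeout)
        except select.error, e:
            if e.args[0] != WSAENOBUFS or not readers:
                raise
            # Set aside half the readers (or the last one) so select can
            # chew through the rest now.
            half = len(readers) // 2
            postponed.extend(readers[half:])
            del readers[half:]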

comment:12 Changed 10 years ago by lukatmyshu

If you run out of FDs in Twisted and you've got a connection that you're about to accept, Twisted will get stuck in an infinite busy loop: epoll will signal that there's data pending on a certain fd, tcp.py will attempt to accept the connection but will fail with EMFILE/ENFILE, and then epoll will immediately signal on that fd again.

There are a couple of different ways to handle this situation (option 2 is sketched below):

  1. When we get to 90% of open FDs, stop accepting connections and wait until the number falls below some acceptable threshold.
  2. When accept fails with EMFILE/ENFILE, remove the accepting fd from the epoll set and add it back when the number of FDs has fallen (add it to 'loseConnection?').
  3. Tell the app and let it decide; if it doesn't have any good ideas, do (1) or (2).
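
A sketch of option 2 against a Twisted-style listening port (stopReading,
startReading, and callLater are real APIs, but the one-second backoff and the
helper itself are assumptions):

import errno, socket

def acceptWithBackoff(port, reactor):
    """On EMFILE/ENFILE, take the listening fd out of the reactor and
    retry later, instead of busy-looping on a permanently-ready fd."""
    try:
        skt, addr = port.socket.accept()
    except socket.error, e:
        if e.args[0] in (errno.EMFILE, errno.ENFILE):
            port.stopReading()                         # leave the epoll set
            reactor.callLater(1.0, port.startReading)  # come back later
            return None
        raise
    return skt, addr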

comment:13 Changed 7 years ago by <automation>

Owner: itamarst deleted

comment:14 Changed 7 years ago by Itamar Turner-Trauring

The last comment talks about issues that are not select()-related. To clarify, there are a number of distinct issues:

  1. Resource exhaustion policies - notifying the user that e.g. fds have run out. This is covered by #2122.
  2. select() has a FD_SETSIZE limit, where file descriptors greater than FD_SETSIZE are rejected. This is covered by this ticket.
  3. A Windows-specific select() issue - #1228.
  4. Running out of file descriptors causes a busy loop in epoll and maybe elsewhere. This is covered by the new ticket #5368.

comment:15 Changed 4 years ago by Jean-Paul Calderone

> We should add checks to addReader and addWriter in select-based reactors. On UNIX those checks can just look at the FD to see if it's greater than FD_SETSIZE; on Windows, and for reactors like pollreactor, they will need to check the current size of the readers and writers lists.

Hm. Check and then do what, I wonder? Raising an exception is the obvious thing but that seems quite scary to me!

On the other hand, perhaps getting an exception from addReader is not as scary as going into an infinite loop in doSelect.

Let's see... The way TCP ports are currently implemented, this will cause an exception when trying to create the new transport object (transport = self.transport(skt, protocol, addr, self, s, self.reactor)). This is handled and logged. The socket is not explicitly closed, instead the garbage collector is left to deal with it. That means there's still potentially a DoS (particularly on PyPy; I expect reference counting will make quick work of the socket on CPython).

That's fixable, though.
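
Presumably something like this (a standalone sketch; the helper name and
structure only approximate the accept path in tcp.Port):

def acceptOrClose(listeningSocket, buildTransport):
    """If transport construction fails, close the accepted socket
    immediately rather than leaving it to the garbage collector
    (which may run late on PyPy)."""
    skt, addr = listeningSocket.accept()
    try:
        return buildTransport(skt, addr)
    except Exception:
        skt.close()  # release the fd deterministically
        raise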
