Opened 5 years ago

Closed 5 years ago

#6259 defect closed fixed (fixed)

epoll reactor continually errors when given a delayed call with an excessive timeout

Reported by: Tom Cocagne Owned by: Jean-Paul Calderone
Priority: normal Milestone:
Component: core Keywords: epoll reactor timeout
Cc: Branch: branches/epoll-distant-delayed-call-6259
branch-diff, diff-cov, branch-cov, buildbot
Author: exarkun

Description (last modified by Glyph)

Found in 12.1/python2.6 & 12.3/python2.7. Occurs when using the epoll function from the stdlib select module.

To repeat:

#echo > t.tap <<< EOF
from twisted.application import service
from twisted.internet import reactor

application = service.Application('blah')

reactor.callLater(1000000000000, lambda : None)
EOF
#twistd -noy t

This generates a continual stream of the following message:

2013-01-17 14:40:11-0600 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib64/python2.7/site-packages/twisted/scripts/_twistd_unix.py", line 231, in postApplication
	    self.startReactor(None, self.oldstdout, self.oldstderr)
	  File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 402, in startReactor
	    self.config, oldstdout, oldstderr, self.profiler, reactor)
	  File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 323, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1173, in run
	    self.mainLoop()
	--- <exception caught here> ---
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1185, in mainLoop
	    self.doIteration(t)
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 364, in doPoll
	    l = self._poller.poll(timeout, len(self._selectables))
	exceptions.OverflowError: timeout is too large

The problem does not occur if the "select" reactor is used instead:

#twistd --reactor=select -noy t

Change History (14)

comment:1 Changed 5 years ago by Tom Cocagne

Found in 12.1/python2.6 & 12.3/python2.7. Occurs when using the epoll function from the stdlib select module.

To repeat:

#echo > t.tap <<< EOF 
from twisted.application import service from twisted.internet import reactor

application = service.Application('blah')

reactor.callLater(1000000000000, lambda : None) 
EOF 

#twistd -noy t.tap

This generates a continual stream of the following message:

2013-01-17 14:40:11-0600 [-] Unhandled Error

Traceback (most recent call last):

File "/usr/lib64/python2.7/site-packages/twisted/scripts/_twistd_unix.py", line 231, in postApplication

self.startReactor(None, self.oldstdout, self.oldstderr)

File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 402, in startReactor

self.config, oldstdout, oldstderr, self.profiler, reactor)

File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 323, in runReactorWithLogging

reactor.run()

File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1173, in run

self.mainLoop()

--- <exception caught here> ---

File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1185, in mainLoop

self.doIteration(t)

File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 364, in doPoll

l = self._poller.poll(timeout, len(self._selectables))

exceptions.OverflowError: timeout is too large

The problem does not occur if the "select" reactor is used instead:

#twistd --reactor=select -noy t.tap

comment:2 Changed 5 years ago by Jean-Paul Calderone

Type: defectregression

Regression introduced during the switch to the stdlib epoll bindings for #3114.

comment:3 Changed 5 years ago by Glyph

Description: modified (diff)

Formatting fix.

comment:4 in reply to:  2 Changed 5 years ago by Glyph

Replying to exarkun:

Regression introduced during the switch to the stdlib epoll bindings for #3114.

Is this a bug in the underlying bindings, or a bug in the way we're using the bindings?

comment:5 Changed 5 years ago by Jean-Paul Calderone

Owner: set to Jean-Paul Calderone
Status: newassigned

comment:6 Changed 5 years ago by Jean-Paul Calderone

Author: exarkun
Branch: branches/epoll-distant-delayed-call-6259

(In [36913]) Branching to 'epoll-distant-delayed-call-6259'

comment:7 Changed 5 years ago by Jean-Paul Calderone

Keywords: review added
Owner: Jean-Paul Calderone deleted
Status: assignednew

Is this a bug in the underlying bindings, or a bug in the way we're using the bindings?

The latter, I think. Our own bindings silently accepted too-large values (wrapping them around, which wasn't terribly harmful and masked this issue).

It looks like epollreactor against our bindings may have been the only reactor that didn't suffer this problem, though. select() and poll() both also have limits, albeit higher ones than epoll_wait(). I didn't check the Windows APIs, but they probably do as well.

Added a new test and a general fix in some code that is re-used by all our reactors. Here are the build results, I hope they are green.

comment:8 Changed 5 years ago by therve

Several things are confusing to me. I'll put down some notes, maybe someone can answer them:

  • I've tested 11.1 on 64 bits with our own bindings, and got the same problem. So it doesn't look like a regression.
  • On 32 bits, before or after the branch, the example fails with "exceptions.AssertionError: 1000000000000 is not greater than or equal to 0 seconds". Forgetting that the error message doesn't make sense, I guess we could remove that check against sys.maxsize now?
  • On 32 bits, if I revert the change, the new test still pass. I suppose it's due to use sys.maxsize as a callLater value.

Thanks, and sorry for the non-review.

comment:9 Changed 5 years ago by Tom Prince

Keywords: review removed
Owner: set to Tom Prince
Status: newassigned
  1. twisedchecker complains about the missing docstring for eventSource in test_distantDelayedCall.
  2. On gentoo x86 and x86_64, the tests fail before the change on poll/epoll (and select on x86_64) and succeed after.
  3. Running the example similarly fails before and works afterwards (after making the timeout small enough, for x86).
  4. One could use a process instead of threads, if a reactor supports one but not the other. But, it isn't worth it to handle this case. Is there such a reactor, anyway?
    • I was going to say that any ReactorBase based reactor has callFromThread, but it appears that that only works from signal handlers, as it appears that it depends on the signal interrupting the polling syscall. (Should that be documented somewhere?)
  5. Please add a comment specifying where 2 ** 31 / 1000 comes from.
  6. In doing some research, I just noticed (from epoll_wait(2))

BUGS In kernels before 2.6.37, a timeout value larger than approximately LONG_MAX / HZ milliseconds is treated as -1 (i.e., infinity). Thus, for example, on a system where the sizeof(long) is 4 and the kernel HZ value is 1000, this means that timeouts greater than 35.79 minutes are treated as infinity.

I'm not sure if we care.

Please merge after addressing points 1+5 above.

comment:10 Changed 5 years ago by Tom Prince

Owner: changed from Tom Prince to Jean-Paul Calderone
Status: assignednew

#5472 is a duplicate of this ticket.

comment:11 Changed 5 years ago by Jean-Paul Calderone

twisedchecker complains about the missing docstring for eventSource in test_distantDelayedCall.

Seems like a bug in twistedchecker. Docstrings on nested objects that don't escape aren't any more useful than comments or self-documenting code (in other words, you cannot even know this class exists unless you are reading the source, so you do not need documentation to be available outside of the source).

On gentoo x86 and x86_64, the tests fail before the change on poll/epoll (and select on x86_64) and succeed after. Running the example similarly fails before and works afterwards (after making the timeout small enough, for x86).

is that a good thing or a bad thing? Gentoo isn't a supported platform or even a platform with an unsupported builder...

One could use a process instead of threads, if a reactor supports one but not the other. But, it isn't worth it to handle this case. Is there such a reactor, anyway?

Not in Twisted. The current check works for all reactors in Twisted, as far as I know.

Please add a comment specifying where 2 31 / 1000 comes from

Fixed in r37254.

In doing some research, I just noticed (from epoll_wait(2))

If that matters, it's fodder for a separate ticket. Since no one has yet complained, and the affected version of Linux is now somewhat old, I'm predisposed not to care.

comment:12 Changed 5 years ago by Jean-Paul Calderone

Type: regressiondefect

I've tested 11.1 on 64 bits with our own bindings, and got the same problem. So it doesn't look like a regression.

Cool. Looking at the code, this makes sense. Not sure what I did before to come to the opposite conclusion. Changing this to a defect instead of a regression.

On 32 bits, if I revert the change, the new test still pass. I suppose it's due to use sys.maxsize as a callLater value.

Changed the test to use an integer that needs even more than 64 bits for storage, regardless of sys.maxsize.

On 32 bits, before or after the branch, the example fails with "exceptions.AssertionError: 1000000000000 is not greater than or equal to 0 seconds". Forgetting that the error message doesn't make sense, I guess we could remove that check against sys.maxsize now?

And deleted this part of the assertion to make the test pass on all platforms.

comment:13 Changed 5 years ago by Tom Prince

On gentoo x86 and x86_64, the tests fail before the change on poll/epoll (and select on x86_64) and succeed after. Running the example similarly fails before and works afterwards (after making the timeout small enough, for x86).

is that a good thing or a bad thing? Gentoo isn't a supported platform or even a platform with an unsupported builder...

It is a good thing. The tests detect the problem, and the code changes fix it.

twisedchecker complains about the missing docstring for eventSource in test_distantDelayedCall.

Seems like a bug in twistedchecker. Docstrings on nested objects that don't escape aren't any more useful than comments or self-documenting code (in other words, you cannot even know this class exists unless you are reading the source, so you do not need documentation to be available outside of the source).

That seams reasonable response. File a bug?

In doing some research, I just noticed (from epoll_wait(2))

If that matters, it's fodder for a separate ticket. Since no one has yet complained, and the affected version of Linux is now somewhat old, I'm predisposed not to care.

Sure.

comment:14 Changed 5 years ago by Jean-Paul Calderone

Resolution: fixed
Status: newclosed

(In [37301]) Merge epoll-distant-delayed-call-6259

Author: exarkun Reviewer: therve, tom.prince Fixes: #6259

Avoid asking platform APIs for a timeout longer than they support in the IReactorTime implementation.

Note: See TracTickets for help on using tickets.