Opened 9 years ago
Closed 9 years ago
#6259 defect closed fixed (fixed)
epoll reactor continually errors when given a delayed call with an excessive timeout
Reported by: | Tom Cocagne | Owned by: | Jean-Paul Calderone |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | core | Keywords: | epoll reactor timeout |
Cc: | Branch: |
branches/epoll-distant-delayed-call-6259
branch-diff, diff-cov, branch-cov, buildbot |
|
Author: | exarkun |
Description (last modified by )
Found in 12.1/python2.6 & 12.3/python2.7. Occurs when using the epoll function from the stdlib select module.
To repeat:
#echo > t.tap <<< EOF from twisted.application import service from twisted.internet import reactor application = service.Application('blah') reactor.callLater(1000000000000, lambda : None) EOF #twistd -noy t
This generates a continual stream of the following message:
2013-01-17 14:40:11-0600 [-] Unhandled Error Traceback (most recent call last): File "/usr/lib64/python2.7/site-packages/twisted/scripts/_twistd_unix.py", line 231, in postApplication self.startReactor(None, self.oldstdout, self.oldstderr) File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 402, in startReactor self.config, oldstdout, oldstderr, self.profiler, reactor) File "/usr/lib64/python2.7/site-packages/twisted/application/app.py", line 323, in runReactorWithLogging reactor.run() File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1173, in run self.mainLoop() --- <exception caught here> --- File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1185, in mainLoop self.doIteration(t) File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 364, in doPoll l = self._poller.poll(timeout, len(self._selectables)) exceptions.OverflowError: timeout is too large
The problem does not occur if the "select" reactor is used instead:
#twistd --reactor=select -noy t
Change History (14)
comment:1 Changed 9 years ago by
comment:2 follow-up: 4 Changed 9 years ago by
Type: | defect → regression |
---|
Regression introduced during the switch to the stdlib epoll bindings for #3114.
comment:4 Changed 9 years ago by
comment:5 Changed 9 years ago by
Owner: | set to Jean-Paul Calderone |
---|---|
Status: | new → assigned |
comment:6 Changed 9 years ago by
Author: | → exarkun |
---|---|
Branch: | → branches/epoll-distant-delayed-call-6259 |
(In [36913]) Branching to 'epoll-distant-delayed-call-6259'
comment:7 Changed 9 years ago by
Keywords: | review added |
---|---|
Owner: | Jean-Paul Calderone deleted |
Status: | assigned → new |
Is this a bug in the underlying bindings, or a bug in the way we're using the bindings?
The latter, I think. Our own bindings silently accepted too-large values (wrapping them around, which wasn't terribly harmful and masked this issue).
It looks like epollreactor against our bindings may have been the only reactor that didn't suffer this problem, though. select()
and poll()
both also have limits, albeit higher ones than epoll_wait()
. I didn't check the Windows APIs, but they probably do as well.
Added a new test and a general fix in some code that is re-used by all our reactors. Here are the build results, I hope they are green.
comment:8 Changed 9 years ago by
Several things are confusing to me. I'll put down some notes, maybe someone can answer them:
- I've tested 11.1 on 64 bits with our own bindings, and got the same problem. So it doesn't look like a regression.
- On 32 bits, before or after the branch, the example fails with "exceptions.AssertionError: 1000000000000 is not greater than or equal to 0 seconds". Forgetting that the error message doesn't make sense, I guess we could remove that check against sys.maxsize now?
- On 32 bits, if I revert the change, the new test still pass. I suppose it's due to use sys.maxsize as a callLater value.
Thanks, and sorry for the non-review.
comment:9 Changed 9 years ago by
Keywords: | review removed |
---|---|
Owner: | set to Tom Prince |
Status: | new → assigned |
- twisedchecker complains about the missing docstring for
eventSource
intest_distantDelayedCall
. - On gentoo x86 and x86_64, the tests fail before the change on poll/epoll (and select on x86_64) and succeed after.
- Running the example similarly fails before and works afterwards (after making the timeout small enough, for x86).
- One could use a process instead of threads, if a reactor supports one but not the other. But, it isn't worth it to handle this case. Is there such a reactor, anyway?
- I was going to say that any
ReactorBase
based reactor hascallFromThread
, but it appears that that only works from signal handlers, as it appears that it depends on the signal interrupting the polling syscall. (Should that be documented somewhere?)
- I was going to say that any
- Please add a comment specifying where
2 ** 31 / 1000
comes from. - In doing some research, I just noticed (from epoll_wait(2))
BUGS In kernels before 2.6.37, a timeout value larger than approximately LONG_MAX / HZ milliseconds is treated as -1 (i.e., infinity). Thus, for example, on a system where the sizeof(long) is 4 and the kernel HZ value is 1000, this means that timeouts greater than 35.79 minutes are treated as infinity.
I'm not sure if we care.
Please merge after addressing points 1+5 above.
comment:10 Changed 9 years ago by
Owner: | changed from Tom Prince to Jean-Paul Calderone |
---|---|
Status: | assigned → new |
#5472 is a duplicate of this ticket.
comment:11 Changed 9 years ago by
twisedchecker complains about the missing docstring for eventSource in test_distantDelayedCall.
Seems like a bug in twistedchecker. Docstrings on nested objects that don't escape aren't any more useful than comments or self-documenting code (in other words, you cannot even know this class exists unless you are reading the source, so you do not need documentation to be available outside of the source).
On gentoo x86 and x86_64, the tests fail before the change on poll/epoll (and select on x86_64) and succeed after. Running the example similarly fails before and works afterwards (after making the timeout small enough, for x86).
is that a good thing or a bad thing? Gentoo isn't a supported platform or even a platform with an unsupported builder...
One could use a process instead of threads, if a reactor supports one but not the other. But, it isn't worth it to handle this case. Is there such a reactor, anyway?
Not in Twisted. The current check works for all reactors in Twisted, as far as I know.
Please add a comment specifying where 2 31 / 1000 comes from
Fixed in r37254.
In doing some research, I just noticed (from epoll_wait(2))
If that matters, it's fodder for a separate ticket. Since no one has yet complained, and the affected version of Linux is now somewhat old, I'm predisposed not to care.
comment:12 Changed 9 years ago by
Type: | regression → defect |
---|
I've tested 11.1 on 64 bits with our own bindings, and got the same problem. So it doesn't look like a regression.
Cool. Looking at the code, this makes sense. Not sure what I did before to come to the opposite conclusion. Changing this to a defect instead of a regression.
On 32 bits, if I revert the change, the new test still pass. I suppose it's due to use sys.maxsize as a callLater value.
Changed the test to use an integer that needs even more than 64 bits for storage, regardless of sys.maxsize
.
On 32 bits, before or after the branch, the example fails with "exceptions.AssertionError: 1000000000000 is not greater than or equal to 0 seconds". Forgetting that the error message doesn't make sense, I guess we could remove that check against sys.maxsize now?
And deleted this part of the assertion to make the test pass on all platforms.
comment:13 Changed 9 years ago by
On gentoo x86 and x86_64, the tests fail before the change on poll/epoll (and select on x86_64) and succeed after. Running the example similarly fails before and works afterwards (after making the timeout small enough, for x86).
is that a good thing or a bad thing? Gentoo isn't a supported platform or even a platform with an unsupported builder...
It is a good thing. The tests detect the problem, and the code changes fix it.
twisedchecker complains about the missing docstring for eventSource in test_distantDelayedCall.
Seems like a bug in twistedchecker. Docstrings on nested objects that don't escape aren't any more useful than comments or self-documenting code (in other words, you cannot even know this class exists unless you are reading the source, so you do not need documentation to be available outside of the source).
That seams reasonable response. File a bug?
In doing some research, I just noticed (from epoll_wait(2))
If that matters, it's fodder for a separate ticket. Since no one has yet complained, and the affected version of Linux is now somewhat old, I'm predisposed not to care.
Sure.
comment:14 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Found in 12.1/python2.6 & 12.3/python2.7. Occurs when using the epoll function from the stdlib select module.
To repeat:
This generates a continual stream of the following message:
The problem does not occur if the "select" reactor is used instead: