Opened 9 years ago

Last modified 7 years ago

#1396 enhancement new

callLater(0) does not guarantee relative ordering of calls

Reported by: jknight Owned by:
Priority: low Milestone:
Component: core Keywords:
Cc: glyph, exarkun, jknight, zooko Branch: branches/callLater-1396-3
(diff, github, buildbot, log)
Author: therve Launchpad Bug:

Description


Attachments (2)

callLater.patch.txt (11.0 KB) - added by zooko 7 years ago.
callLater.patch.2.txt (10.9 KB) - added by zooko 7 years ago.

Download all attachments as: .zip

Change History (32)

comment:1 Changed 9 years ago by jknight

Or more generally, callLater(N, A); callLater(N, B) should *always* execute A 
before B.

comment:2 Changed 9 years ago by glyph

I disagree.  callLater is not callAtExactly - call queueing should be done with
other mechanisms, and perhaps triggered by callLater.  As I mentioned
previously, we don't deal with the user setting the system clock so at worst
calling this a "guarantee" is incorrect.  Also, even providing the naive version
of this guarantee may preclude implementing more a more efficient callLater at
some point.

comment:3 Changed 9 years ago by glyph

See #1398 for more information.

comment:4 Changed 9 years ago by jknight

Ideally, we should use a timing source that isn't adjustable, but I don't think 
we have one. callLater is certainly not callAtTime. It is, however, a call 
queuing function. Ordering should most definitely be deterministic.

Having the user set system time should not cause all timeouts to fire 
immediately. However, that's a separate issue. The only part I'm concerned with 
is that the clock moving backwards should also not violate ordering.

comment:5 Changed 9 years ago by glyph

I think that the bug here is that the documentation does not include the title
of this issue.  Guaranteeing relative ordering of calls is not realistically
possible for a timed-call mechanism.  Doing it more often is probably good, so
I'm not saying that changing the win32er implementation to work around its
super-low-precision timer is a bad idea, but saying we provide a guarantee is a
ton of work.

Consider a system with a lot of clock wobble that syncs the time every hour from
NTP; there are LOTS of opportunities for time() to go backwards in time, and
it's not clear what the right thing to do is, unless time is always measured
from the beginning of the last reactor tick - and that's bad because events
often do take non trivial amounts of wall-clock time to process.

People who need deterministic ordering should NOT use callLater.  That's simply
not what it's for.

comment:6 Changed 9 years ago by exarkun

clock_gettime(CLOCK_MONOTONIC) is supposed to be non-adjustable.

I don't think we should use it though.  I think the current behavior of
callLater() is perfectly reasonable.  The documentation should be improved, but
the code should be left alone for now.

comment:7 Changed 9 years ago by jknight

Your "consider" example is flawed. That is not how NTP is used. It instead adjusts 
a slew factor, causing time to go faster or slower than the hardware clock. That's 
not to say that the clock will never be changed, of course.

Guaranteeing relative ordering *is* realistically possible, unless you see a 
reason why the code on my branch should not work.

comment:8 Changed 9 years ago by exarkun

It's relatively common to have NTP used incorrectly, where ntpdate is run once
an hour or once a day, causing the system clock to jump.

comment:9 Changed 9 years ago by glyph

That's why I said "a lot of clock wobble".  From the NTP documentation: "If
ntpdate determines the clock is in error more than 0.5 second it will simply
step the time by calling the system settimeofday() routine. If the error is less
than 0.5 seconds, it will slew the time by calling the system adjtime() routine.
The latter technique is less disruptive and more accurate when the error is
small, and works quite well when ntpdate is run by cron every hour or two."

The code on your branch doesn't work because in order to achieve a relatively
subtle change in semantics for callLater(N,...) callLater(N,...) which is of
dubious utility (the issue that spawned this discussion has since been resolved
by other means, and I think, to warner's satisfaction) you want to change the
semantics of Platform.seconds() to mean something almost completely different
than what it currently does.  Currently, if a Twisted process is started and
then the clock is changed, callLater calls started before the time is adjusted
might be delayed for an unreasonable amount of time, or run sooner than they
expected to be.  With your proposed change, if a Twisted process is started and
the clock is changed, some of those behaviors will likely be fixed (although I
note that in the absence of an external point of reference, your fix is only
heuristically correct), but application code has no opportunity to make the time
correct on its own, since platform.seconds now be reporting a random value. (In
other words, "you have no chance to survive, or make your time (correct)."  Hee
hee.)

The application code might be calling time.time() itself in this case, but why
should it?  Twisted's Platform.seconds is advertised as more portable, and is
(was) supposed to return a semantically equivalent value.

Perhaps the right thing to do here is to apply the time-going-backwards
detection from this patch and simply to provide a hook for the reactor to alert
application code that it may have gone backwards in time?  Being set forwards is
something that application code can detect on its own, if it cares about
real-time scheduling.

comment:10 Changed 9 years ago by glyph

Ugh.  Please forget I used the word "real-time" there.  I mean "accurate
wall-clock".

comment:11 Changed 9 years ago by jknight

Yes, ntpdate does that, but that generally runs once at system boot time. The 
ntp daemon does not do that. I've never seen a system that installs a cron job 
with ntpdate rather than ntp, but maybe I've not looked hard enough.

Okay, so instead, I'd like to introduce platform.realSeconds(), which returns a 
value which may or may not have anything to do with the epoch, and which is 
guaranteed monotonic and strictly increasing. The reactor timers should use 
this. It should ideally use clock_gettime if available, otherwise use the 
heuristics on top of gettimeofday. 

callLater *should not* be affected by real time changes, if it's possible to 
avoid, as it has nothing to do with real time, but rather elapsed time 
intervals. That it may be affected by real time is an implementation deficiency, 
not a feature.

And finally: before I broke it, callLater *did* guarantee ordering. There's 
really no reason to think it should not behave the way warner expected it to, 
and if both that and the misbehavior upon time-setting can be fixed with one 
change, all the better.

comment:12 Changed 9 years ago by glyph

> I'd like to introduce platform.realSeconds()

Hmm.  OK, that doesn't sound like a bad plan.  Naming's important though, and
I'd like to be careful to avoid putting "real" anywhere near the word "time",
especially since it's got a heuristic that makes it not-real in many cases.  Can
we call it "monotonicSeconds" or something?  (Perhaps we should also rename the
other one 'secondsFromEpoch' and deprecate just 'seconds', to stress the
distinction.)

Re-jigger your branch and I'll re-review?

That reminds me; if we're going to use clock_gettime and add our own time
interface, we might want to add a primitive interface to the reactor that uses
integer milliseconds.  I still occasionally get grief from DirectX developers
that they have to force the FPU back into double precision mode so that Twisted
will work :)

comment:13 Changed 9 years ago by exarkun

<http://scienceworld.wolfram.com/physics/ProperTime.html>

Regarding milliseconds, clock_gettime returns a timespec, which consists of a
time_t representing seconds and a long representing nanoseconds.  Resolution is
variable, but queryable.  On my system (x86, P4, Linux 2.6.12) it is 0.999848
milliseconds.

comment:14 Changed 9 years ago by jknight

Branch seems good, please review. 

I changed the reactor-internal APIs to use integer nanoseconds rather than 
floats, as well. One somewhat suspect bit of API is DelayedCall.getTime(), which 
cannot return a truly accurate trigger time, since DelayedCalls are defined in 
terms of an increment of time, rather than an absolute time.

Of additional note is that I changed distutils.build_extension to allow failures 
to build C modules, and not abort the build process.

comment:15 Changed 9 years ago by jknight

Oh, I guess I should mention:

svn diff -r15485:HEAD svn://svn.twistedmatrix.com/svn/Twisted/branches/
callLater-1396

comment:16 Changed 9 years ago by exarkun

A few trivial points to start off (this does not represent a comprehensive review):

twisted/python/runtime.py: _posix_clock is being imported by a relative name. 
It should be imported as "twisted.python._posix_clock" instead.

staticmethod() was being used on Platform to prevent the creation of bound
methods.  While time.time is a builtin function, and builtin functions aren't
descriptors, I am pretty sure these are both implementation details.  It'd be
nice to at least preserve the level of safety that was in this code before.

1000000000 and 10000000 could stand to be lifted out and replaced with symbolic
constants.  Was 0.01s selected for any particular reason (ie, does some common
operation result in adjustments by increments slightly smaller or greater than
that), or does it just work well in practice?

twisted/internet/{base,task}.py, twisted/test/{test_task,test_internet}.py: 
Again, 10 ** 9 and  should be a symbolic constant, preferably one imported from
someplace else.

twisted/python/_posix_clock.c: We don't have a coding standard for C written
down anyplace, but if we did I imagine it would say things about trailing
whitespace and tabs. ;)  Also, I would hope it would also recommend {} around
all blocks, even where they are not necessary.

Branch mostly looks good, but I doubt my head has fully absorbed the various
implications yet, particularly wrt DelayedCall.

comment:17 Changed 9 years ago by jknight

As I understand it, best practice is for C modules to be imported with a 
relative name, so that they can be moved to the toplevel when building a zip or 
what have you without any code changes.

1000000000 could be replaced with a variable like "TENTOTHENINTH" or 
"NANOSECONDSPERSECOND" but that's pretty much just what it means..the interface 
is defined to return nanoseconds, not 'symbolic-constant's-per-second.

0.01s is a value I just made up. The constraints are that it must be larger than 
the most number of times you can call monotonicTicks without the system clock 
increasing, and must be smaller than you "care" to lose if the clock moves 
backwards.

comment:18 Changed 8 years ago by zooko

  • Cc zooko added

So what needs to be done to make this branch ready to be reviewed to be merged into trunk?

comment:19 Changed 7 years ago by exarkun

Merged this forward to callLater-1396-2 and resolved the copious conflicts (most of which made little sense, someone should write a better merge algorithm).

_posix_clock.c also no longer compiles for me :/ _XOPEN_REALTIME doesn't seem to be defined. The man page suggests _POSIX_TIMERS and _POSIX_MONOTONIC_CLOCK might be more appropriate to check for, but I haven't dug very deeply and I'm sure there are plenty of subtle differences between these (doing a probe is probably a better idea though).

comment:20 Changed 7 years ago by jknight

I think unistd.h might need to be included first.

If not, can you do a:
getconf _XOPEN_REALTIME
in a shell?

And if that returns 1, grep all your header files for _XOPEN_REALTIME being defined?

comment:21 Changed 7 years ago by exarkun

getconf returns 1, _XOPEN_REALTIME is defined as 1 in posix_opt.h, which is included by unistd.h.

comment:22 Changed 7 years ago by therve

  • Owner changed from jknight to therve

comment:23 Changed 7 years ago by therve

  • author set to therve
  • Branch set to branches/callLater-1396-3

(In [22511]) Branching to 'callLater-1396-3'

comment:24 Changed 7 years ago by zooko

Here's a version of this patch against current trunk.

Changed 7 years ago by zooko

comment:25 Changed 7 years ago by zooko

Ah, and here is a version that actually builds on my Mac OS X system.

Changed 7 years ago by zooko

comment:26 follow-up: Changed 7 years ago by zooko

I have a fellow who is running a VirtualBox vm, and his system clock jumps around crazily, so that buildbot "times out" his jobs, thinking that more than 1200 seconds have passed:

http://buildbot.darcs.net/buildbot-darcs/builders/karel%20opensolaris%20OpenSolaris-2008.05%20nevada-86-rc3%20i386/builds/5/steps/compile/logs/stdio

So I asked him to install this patch, and now buildbot no longer does that:

http://buildbot.darcs.net/buildbot-darcs/waterfall

comment:27 in reply to: ↑ 26 Changed 7 years ago by glyph

Replying to zooko:

I have a fellow who is running a VirtualBox vm, and his system clock jumps around crazily, so that buildbot "times out" his jobs, thinking that more than 1200 seconds have passed:

So I asked him to install this patch, and now buildbot no longer does that:

This sets off alarm bells for me. Why would this patch stop timing out his jobs? This doesn't sound like "hooray we fixed a bug", it sounds like "oh no, we made certain timed events stop firing when they should".

comment:28 Changed 7 years ago by zooko

The jobs were timing out after only a few seconds, because the gettimeofday() had jumped more than 1200 seconds. That's my theory anyway. Make sense?

I think this might be hooray we fixed the bug.

By the way, along the way I found some discussions on the libevent mailing list about situations which called for CLOCK_MONOTONIC:

http://monkeymail.org/archives/libevent-users/2006-October.txt

http://monkeymail.org/archives/libevent-users/2007-July.txt

comment:29 Changed 7 years ago by zooko

Oh, my informant informs me that the 1200-second timeouts are still going off spuriously (long before 1200 actual seconds have passed). He is checking to see whether the build of Twisted actually built the posix_clock module...

comment:30 Changed 4 years ago by <automation>

  • Owner therve deleted
Note: See TracTickets for help on using tickets.