Opened 3 years ago

Closed 3 weeks ago

Last modified 3 weeks ago

#6027 defect closed fixed (fixed)

twisted.trial._dist.test.test_disttrial.DistTrialRunnerTestCase.test_runUntilFailure intermittent failure

Reported by: exarkun Owned by: adiroiban
Priority: normal Milestone:
Component: trial Keywords:
Cc: jml, therve Branch: branches/runUntilFailure-6027
(diff, github, buildbot, log)
Author: adiroiban Launchpad Bug:

Description

From http://buildbot.twistedmatrix.com/builders/winxp32-py2.7/builds/1448/steps/select/logs/problems:

[FAIL]
Traceback (most recent call last):
  File "C:\buildslave\twisted\winxp32-py2.7\Twisted\twisted\trial\_dist\test\test_disttrial.py", line 368, in test_runUntilFailure
    self.assertEqual(5, len(called))
  File "C:\buildslave\twisted\winxp32-py2.7\Twisted\twisted\trial\_synctest.py", line 356, in assertEqual
    % (msg, pformat(first), pformat(second)))
twisted.trial.unittest.FailTest: not equal:
a = 5
b = 1

Change History (17)

comment:1 Changed 3 years ago by DefaultCC Plugin

  • Cc jml added

comment:2 Changed 2 years ago by exarkun

Fails on Linux too:

[FAIL]
Traceback (most recent call last):
  File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/defer.py", line 138, in maybeDeferred
    result = f(*args, **kw)
  File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/_utilspy3.py", line 41, in runWithWarningsSuppressed
    reraise(exc_info[1], exc_info[2])
  File "/home/exarkun/Projects/Twisted/trunk/twisted/internet/_utilspy3.py", line 37, in runWithWarningsSuppressed
    result = f(*a, **kw)
  File "/home/exarkun/Projects/Twisted/trunk/twisted/trial/_dist/test/test_disttrial.py", line 368, in test_runUntilFailure
    self.assertEqual(5, len(called))
  File "/home/exarkun/Projects/Twisted/trunk/twisted/trial/_synctest.py", line 356, in assertEqual
    % (msg, pformat(first), pformat(second)))
twisted.trial.unittest.FailTest: not equal:
a = 5
b = 1


twisted.trial._dist.test.test_disttrial.DistTrialRunnerTestCase.test_runUntilFailure

comment:3 Changed 6 months ago by glyph

On #2673, this is now the remaining test which is still failing intermittently often enough to show up in my sample of 6 builds. Mostly on Windows, but occasionally on Linux. However, on my OS X development machine:

PASSED (successes=13)
Test Pass 18796
twisted.trial._dist.test.test_disttrial

I'm having a hard time getting it to fail on demand.

Also, reading through the test it appears to be a well-behaved test that does everything deterministically and doesn't do any wacky I/O. I am wondering if anyone has any ideas.

comment:4 Changed 6 months ago by glyph

  • Cc therve added

I am cc:ing therve in the hopes that he has some inkling we could investigate.

comment:5 Changed 6 months ago by glyph

Sorry, not #2673; I meant #3404.

comment:6 Changed 6 months ago by glyph

Just to keep track of how often this is causing problems... this failed when the tests were run for #6686.

comment:7 Changed 2 months ago by adiroiban

  • Owner set to adiroiban

I manage to reproduce it on a WinXP machine.... if I reduce the CPU power (forcing powersave mode) it greatly increase the failure rate.

Now, to simulate a slow test on my Linux machine, I have changed this... maybe it a stupid thing to do...
but if I decrease the sleep (ex 0.00001) I can see that the test will pass.

diff --git twisted/trial/_dist/disttrial.py twisted/trial/_dist/disttrial.py
index 78a73ba..4c1f919 100644
--- twisted/trial/_dist/disttrial.py
+++ twisted/trial/_dist/disttrial.py
@@ -218,6 +218,8 @@ class DistTrialRunner(object):
             self.writeResults(result)
             if not untilFailure:
                 return
+            import time
+            time.sleep(0.1)
             if not result.wasSuccessful():
                 return
             d = runTests()

now, digging deeper it looks like due to the way cooperator work (I have no idea what is a cooperator :), it only allows the test to run for 0.01 seconds.

If I change

diff --git twisted/internet/task.py twisted/internet/task.py
index e32f6da..a46d1db 100644
--- twisted/internet/task.py
+++ twisted/internet/task.py
@@ -316,7 +316,7 @@ class NotPaused(SchedulerError):
 
 
 class _Timer(object):
-    MAX_SLICE = 0.01
+    MAX_SLICE = 1
     def __init__(self):
         self.end = time.time() + self.MAX_SLICE
 

I can see that the test will pass on XP, even with slow CPU.
If I set this value really low, I see that test will always fail. On XP and Linux.

terminationPredicateFactory name and docstring is very academic :) ... I was searching the text for "timeout" to see what can cause the timeout behaviour

I will create a branch and push my current changes

comment:8 Changed 2 months ago by adiroiban

  • Author set to adiroiban
  • Branch set to branches/runUntilFailure-6027

(In [43764]) Branching to runUntilFailure-6027.

comment:9 Changed 2 months ago by adiroiban

  • Keywords review added
  • Owner adiroiban deleted

Can someone with more knowledge about trial and dist runner take a look at the current code?

I have never used this part of the code.

I think think that public method run only has the cooperate arguments to allow injecting it in tests... but my patch change the code to inject the dependency directly on the instance... I am not sure why someone would want to run a test with one Cooperator and another one with other cooperator.

All test pass , beside the spurious #2673 :( and pyflakes and twistedchecker

It code make sense I can continue in this direction.

Thanks!

comment:10 Changed 2 months ago by glyph

  • Keywords review removed
  • Owner set to adiroiban

Hi adiroiban,

Thanks for trying to track this down. It has been baffling me.

  1. There are some twistedchecker issues. If you think that they are bugs in twistedchecker, please report them and link to them.
  2. The _timeout that you're setting is pretty arbitrary. It could still time out if run on a slow machine.
  3. Pretty much anything calling time.time is a code smell in Twisted; it should be using clock.seconds(). We are even properly passing a fake reactor around here, so use that one.
  4. When run "for real", things like this should use the global Cooperator, and not magic up their own; the global cooperator will interleave many tasks cooperatively and can schedule them fairly, which is the reason it's there. So you should probably just pass in a cooperate argument and call it directly, especially since nothing needs any other methods beyond cooperate. Which is... what the tests were already doing.
  5. Most of all, I don't think this actually addresses the problem. The tests already seem to be passing a fake scheduler to all the tests that you changed. You've changed it to be mutation-after-the-fact rather than pass-as-argument, which is a slightly worse way to inject parameters (if we want to put it in the constructor it should be a constructor arg like everything else) but I don't see how this particular bit of code could be what's causing the non-determinism. Were you able to get a local test case that indicates it failed before and succeeds now?

I will feel much better once we have solid evidence as to what is causing this :).

Thanks,

-g

comment:11 Changed 2 months ago by adiroiban

Thanks for the review. Hope to have this fixed soon.

  1. they are not bugs, but I was sure that current code is not right so I did not wanted to merge it as it is.

2.3. yes.. but we can increase it in the test. I will use clock.seconds() for that as I fully agree with time.time... the existing code was already using time.time() so I went with that.

  1. If in real world this should be run with global cooperator, then why allow passing an arbitrary cooperate method in the public API ? I understand that my patch is wrong, but to enforce that global cooperator is uses maybe DistTrialRunner should not expose the cooperate method as a public API but rather use something like {{{self._cooperate = cooperate}}
  1. While the test was passing a fake Scheduler, it was still using the default _terminationPredicateFactory, which is timer _Timer. Cooperator._tasksWhileNotStopped will use the default timer.

try to change value of class _Timer(object) and you should see tests fail/pass.

or on trunk try to apply this patch

diff --git twisted/internet/task.py twisted/internet/task.py
index e32f6da..155b230 100644
--- twisted/internet/task.py
+++ twisted/internet/task.py
@@ -322,6 +322,7 @@ class _Timer(object):
 
 
     def __call__(self):
+        import pdb; import sys; sys.stdout = sys.__stdout__; pdb.set_trace()
         return time.time() >= self.end
 
 

and run

./bin/trial twisted.trial._dist.test.test_disttrial.DistTrialRunnerTestCase.test_runUntilFailure

you should see that default timer is called and will instruct the Cooperator to terminate in _tasksWhileNotStopped.

I don't know what other strong evidence you need :)

Cheers

comment:12 Changed 8 weeks ago by adiroiban

  • Keywords review added
  • Owner adiroiban deleted

Thanks for the review. Here is a new try... this time I tried to keep it to a minimum.

It only touches the tests and only the cooperator created inside the tests.

Instead of using a clock, I used a fake termination factory which will always return False... this means that a cooperative task can run forever.

Please take a look at the changes!

Thanks!

comment:13 Changed 3 weeks ago by glyph

  • Keywords review removed
  • Owner set to adiroiban

Hi adiroiban,

Please land this :-D. It's short, to the point, fixes the issue.

That said, I have some abstract concerns about this code:

  1. The naming of terminationPredicateFactory itself is pretty bad. It says "called at the beginning of each step" but it never really defines what a step is. However, the new docstring for getFakeSchedulerAndEternalCooperator now says "the task needs to be terminated", which is wrong; a task would be a CooperativeTask, and CooperativeTask is not terminated (i.e. "stop() is not called on it) when terminationPredicateFactory returns True; it's just delayed until the next scheduled iteration. If you could clear up the docstring of getFakeSchedulerAndEternalCooperator a bit helpful - perhaps just to talk about "steps" instead of "tasks", that would be helpful.
  2. If a bug were to be introduced which caused the cooperator to be of infinite length, the failure mode here would be a hang rather than an error. A better implementation of this fake termination predicate would eventually raise an exception if it were called an order of magnitude more times than expected (say, >1000, in this case).
  3. This distinction, between "one unit of work completed" and "nothing left to do" is why twisted.test.iosim.IOPump has both pump() (like FakeScheduler.pump) and flush() which calls pump() until there is nothing left to do or it has iterated too many times. It would be nice, but not strictly necessary, to have a follow-up making FakeScheduler have both of those methods so there's some consistency there. Then the fake termination predicate can just always return True so a minimal amount of work is done per step, and flush can be responsible for exhausting the predicate to the end.

Thanks a bunch, it will be great to kill another intermittent test failure!

comment:14 Changed 3 weeks ago by adiroiban

Thanks for the review.

I am confused :)

Are you concern about my patch or the code in general?

I have no idea about what are you talking about in 1,2 and 3

If they are not about the patch can you (glyph) please create a new ticket based on your comments?

Can you please use github and attach your comments to lines in a file?

Thanks!

comment:15 Changed 3 weeks ago by glyph

The first two words were "please land", so please do that :-).

We can talk about the non-blocking concerns elsewhere if they're not clear here.

I have no idea how to use github to put comments on lines in a file, since there is no PR here. Also, they're general structural comments which span multiple classes; if I were using github for this review they would have been a comment on the PR anyway and not on a line of code :).

comment:16 Changed 3 weeks ago by adiroiban

  • Resolution set to fixed
  • Status changed from new to closed

(In [44009]) Merging runUntilFailure-6027: Fix intermittent failure in test_runUntilFailure.

Author: adiroiban
Reviewer: glyph
Fixes: #6027

comment:17 Changed 3 weeks ago by adiroiban

I have created a new ticket as a follow up for your comment #7808

Note: See TracTickets for help on using tickets.