Opened 11 years ago

Closed 8 years ago

#343 enhancement closed duplicate (duplicate)

reactor fd cleanup between tests

Reported by: warner Owned by:
Priority: normal Milestone:
Component: trial Keywords: superseded
Cc: exarkun, spiv, itamarst, warner, jml, moshez Branch:
Author: Launchpad Bug:

Description


Change History (18)

comment:1 Changed 11 years ago by warner

This one could be conch, or something in the way the
gtk2 reactor handles timers or processes.

===============================================================================
FAILURE: test_addKeyByString (twisted.test.test_cred.IdentityTestCase)
-------------------------------------------------------------------------------
Traceback (most recent call last):
  File
"/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/trial/unittest.py", line
249, in _runPhase
    stage(*args, **kwargs)
  File "/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/trial/util.py",
line 43, in reactorCleanUp
    raise unittest.FailTest, msg
FailTest: 
pendingTimedCalls still pending:
 <DelayedCall 144058188 [35999.999711s] called=0 cancelled=0
SSHUserAuthServer.timeoutAuthentication()>

===============================================================================
ERROR: test_setPassword_invalid (twisted.test.test_cred.IdentityTestCase)
-------------------------------------------------------------------------------
Traceback: exceptions.IOError, [Errno 2] No such file or directory: 'dsa_test.pub'
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/internet/gtk2reactor.py:164:_doReadOrWrite
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/internet/tcp.py:242:doRead
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/conch/ssh/transport.py:204:dataReceived
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/conch/ssh/service.py:51:packetReceived
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/conch/ssh/userauth.py:233:ssh_USERAUTH_FAILURE
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/conch/ssh/userauth.py:220:tryAuth
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/conch/ssh/userauth.py:289:auth_publickey
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/test/test_conch.py:214:getPublicKey
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/conch/ssh/keys.py:61:getPublicKeyString
===============================================================================
ERROR: test_verifyPassword (twisted.test.test_cred.IdentityTestCase)
-------------------------------------------------------------------------------
Traceback: twisted.trial.unittest.FailTest, unexpectedly lost connection
<twisted.test.test_conch.SSHTestServer instance at 0x8c9d004>
[Failure instance: Traceback: twisted.internet.error.ConnectionDone, Connection
was closed cleanly.
]
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/python/log.py:65:callWithLogger
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/python/log.py:52:callWithContext
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/python/context.py:53:callWithContext
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/python/context.py:32:callWithContext
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/internet/gtk2reactor.py:181:_doReadOrWrite
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/internet/tcp.py:281:connectionLost
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/test/test_conch.py:174:connectionLost
/home/buildbot/BuildBot/slave/reactors/Twisted/twisted/trial/unittest.py:98:fail
-------------------------------------------------------------------------------

comment:2 Changed 11 years ago by moshez

It is incredibly strange that while the name says "test_cred", the traceback
shows conch-related stuff. Can anyone reproduce this bug outside the
build infrastructure and see what happens? I suspect just a stupid interaction :(

comment:3 Changed 11 years ago by moshez

I'm starting to believe that this bug is because at ~line 496 of
twisted/test/test_conch.py, the connection isn't explicitly closed.
Someone who can run conch and would like to try this and see if it helps
would be appreciated.

comment:4 Changed 11 years ago by warner

I've investigated this a bit more, and the issue is deeper. It relates to the
earlier "clean the reactor between tests" patches (the ones which canceled
leftover timers which could expire during a later test). In this case, many tests
are finishing while their fds are still being watched by the reactor. I added some
code to trial's util.reactorCleanUp to get a feel for the scope of the problem,
and it reported that there are *81* fds still sitting in the reactor when the
full test-suite finishes!

In most of those cases the fds are benign: they are loopback connections to
other pieces of code that have also finished running. Everybody but the
reactor has forgotten about them, so their state will never change, and the
various Transports and Protocols still hanging off them will never be touched.

In the test_conch case, I think the fd is connected to an SSH child that has
been spawned off for interoperability testing. For some reason, this child
is active after test_conch has finished. The conch Protocol wakes up, looks
for the key files it was told to use, but they've been deleted by tearDown,
and the test fails. It raises unittest.FailTest from somewhere inside the
reactor, and that gets blamed on whatever poor reactor-using test happens to be
running at the moment (usually test_cred when gtk2 is driving everything; it
appears that the default reactor is quick enough to let the child do its work
while test_conch is still running).

One question is why test_conch is passing even though the child is still
doing stuff. It suggests that there is a protocol step or something that isn't
being verified.

But the deeper issue is how to prevent this sort of thing in general. Having
distinct Reactor instances would be nice, because that would make it easier to
run each test with a new one. Failing that we need a way (again) to make sure
the reactor is cleaned between tests. We already have the timer-scrubbing code
in trial, but we need to clear out fds as well.

reactor.disconnectAll() purports to be a cross-reactor method to accomplish
this, and my first inclination is simply to add it to util.reactorCleanUp.
The main problem with this approach is that .disconnectAll also hangs up on
the _Waker socket, and doesn't do anything (like set self.waker=None) to
let later code know that the waker needs to be re-installed. Fixing this would
currently involve adding self.waker=None to the removeAll() method in every
reactor which inherits from PosixReactorBase (i.e. almost all of them).

The other problem is that it doesn't provide a means to solve the root cause:
tests which add fds to the reactor (via listenTCP or connectTCP, etc) and do
not remove them when they shut down. One issue is whether we want to do this
or not: when we went through it with timers, we made leftover timers grounds
for failing the test, and the problems were cleaned up fairly quickly. But
there are 81 places that need to be fixed here.

In addition, there is no reactor-portable way to get a list of outstanding fds.
We would have to add one, and it would have to be able to distinguish "internal"
fds (like the _Waker socket) from ones that were added by the application or
the test cases.

I've attached a hackish patch which displays the problematic fds and cleans them
up, but a) only works on reactors that inherit from t.i.default (which includes
the default and gtk2), and b) clobbers _Waker, which might break the second and
later thread-using test cases. The _Waker problem can be hackishly addressed
by adding a self.waker=None to SelectReactor.removeAll.

*** twisted/trial/util.py.~1.9.~	Wed Oct 22 10:02:18 2003
--- util.py	Wed Oct 22 11:38:49 2003
***************
*** 47,52 ****
--- 47,64 ----
              reactor.threadpool.stop()
              reactor.threadpool = None
  
+     from twisted.internet import default
+     fds = default.reads.keys() + default.writes.keys()
+     reactor.disconnectAll()
+     if fds:
+         msg = "\nreactor has %d fds still open:\n" % len(fds)
+         for fd in fds:
+             msg += " %s\n" % fd
+         from warnings import warn
+         warn(msg) # to get it on the screen immediately after the test
+         raise unittest.FailTest, msg
+ 
+ 
  def isTestClass(testClass):
      return issubclass(testClass, unittest.TestCase)

comment:5 Changed 11 years ago by itamarst

It wouldn't bother me if the testing framework complained about fds being left
over, especially if we can write framework code to help test writers automate
cleanup. We just need to add a special case for the waker.

comment:6 Changed 11 years ago by warner

in chatting on IRC, it sounds like we could use two new methods:

one would be reactor.areYouQuiescent or somesuch, and would return None if
the reactor is in the same state as it was at startup time (no timers,
no non-internal fds, maybe even no extra system event triggers). If it
were in any way unclean, it would return a string that describes exactly
what has been added or changed. We would then add code to trial to
check this method after each test case and flunk the test if it were
not None.

The other method would be a twisted-specific trial extension to make
life easier for test authors. TestCase.registerPort or somesuch would
take the return value from reactor.listenTCP and friends and stash it
in an array. When the test shuts down (just after tearDown is run),
everything on this list would have its .stopListening method invoked.
This method would be documented in howto/test-standard along with the
other Twisted-specific helper methods (like .runReactor). Of course,
listenTCP vs. connectTCP and such would need to be handled, or at least
documented (it might be usable on listening sockets but not on connections).
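
The registerPort idea might look something like this minimal sketch. All the
names here (FakePort, ReactorTestCaseMixin, tearDownPorts) are invented for
illustration and are not actual Twisted or trial API; FakePort merely stands
in for the object returned by reactor.listenTCP.

```python
# Hypothetical sketch of the proposed TestCase.registerPort helper.
# None of these names are real Twisted/trial API.

class FakePort:
    """Stand-in for a listening port with a stopListening() method."""
    def __init__(self):
        self.listening = True

    def stopListening(self):
        self.listening = False


class ReactorTestCaseMixin:
    """Mixin that stashes ports and stops them after tearDown."""
    def setUp(self):
        self._registeredPorts = []

    def registerPort(self, port):
        # Stash the return value of reactor.listenTCP and friends.
        self._registeredPorts.append(port)
        return port

    def tearDownPorts(self):
        # Trial would invoke this just after tearDown is run, so a test
        # cannot leave its listening fds sitting in the reactor.
        while self._registeredPorts:
            self._registeredPorts.pop().stopListening()
```

A test would wrap its listen calls in self.registerPort(...) and trial would
call tearDownPorts after tearDown, closing everything that was stashed.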

comment:7 Changed 11 years ago by warner

actually, I think we may need a reactor.clean method too. The
.areYouQuiescent would let us spot the places that make it
dirty, but it can't do any cleanup. Every test after the one
that left junk in the reactor would also fail. Once we fix
all the tests that don't clean up after themselves, this will
be moot, but it would be annoying for one mis-written test
to cause failures to be logged against half the test suite.

This may just involve fixing .disconnectAll (to handle _Waker
properly) and calling it from trial between tests.
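
The quiescence-check-plus-clean pair could be sketched on a toy reactor like
this. QuiescentReactor and everything inside it are invented for illustration
and are not actual Twisted API; the point is only that the check reports what
is dirty while clean() scrubs it without clobbering the internal waker fd.

```python
# Hypothetical sketch of the proposed areYouQuiescent/clean pair on a
# toy reactor; not actual Twisted API.

class QuiescentReactor:
    WAKER_FD = 0  # internal fd, exempt from the cleanliness check

    def __init__(self):
        self.timers = []
        self.fds = {self.WAKER_FD}

    def areYouQuiescent(self):
        """Return None if the reactor is in its startup state, else a
        string describing exactly what has been added or changed."""
        leftoverFds = self.fds - {self.WAKER_FD}
        problems = []
        if self.timers:
            problems.append("pending timers: %r" % (self.timers,))
        if leftoverFds:
            problems.append("open fds: %r" % (sorted(leftoverFds),))
        return "; ".join(problems) or None

    def clean(self):
        """Scrub leftovers so one dirty test cannot flunk every test
        that runs after it, preserving the internal waker fd."""
        self.timers.clear()
        self.fds = {self.WAKER_FD}
```

Trial would call areYouQuiescent after each test, flunk the test if the
result is not None, and then call clean so later tests start fresh.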

comment:8 Changed 11 years ago by moshez

warner, can you open a new issue for reactor cleanups
and see if just closing conch's connection fixes this one?
thanks

comment:9 Changed 11 years ago by spiv

I've fixed the fd leaks in test_ftp.py (see revision 1.32). That should account
for some of the leaks.

comment:10 Changed 11 years ago by warner

I added cleanup code to test_conch and it appears to resolve
the original problem. There is a lot of good discussion in
this issue, so I'm renaming it to cover inter-test reactor
cleanup efforts.

 twisted/test/test_conch.py 1.50

comment:11 Changed 10 years ago by radix

slyphon, does newtrial do this? if not, can you evaluate this issue and decide
whether it should?

comment:12 Changed 9 years ago by jml

New maintainer.

comment:13 Changed 9 years ago by jml

AFAICT, a fix for #1334 will fix this.

comment:14 Changed 9 years ago by jml

  • Resolution set to duplicate
  • Status changed from new to closed

Fixing #1334 will fix this, I think.
Resolving to remove clutter.

comment:15 Changed 9 years ago by jml

  • Keywords superceded added
  • Resolution duplicate deleted
  • Status changed from closed to reopened

Re-opening because I don't like the number being crossed out in x-refs.

comment:16 Changed 9 years ago by jml

  • Keywords superseded added; superceded removed

comment:17 Changed 8 years ago by jml

  • Resolution set to duplicate
  • Status changed from reopened to closed

Nothing to see here. #1334

comment:18 Changed 4 years ago by <automation>
