Opened 8 years ago

Closed 23 months ago

#1784 enhancement closed fixed (fixed)

Parallelize execution of Trial unit tests into multiple processes to make more efficient use of available CPU cores

Reported by: exarkun Owned by: therve
Priority: highest Milestone:
Component: trial Keywords:
Cc: mithrandi, therve, iratsu, jesstess, jml Branch: branches/disttrial-1784-17
(diff, github, buildbot, log)
Author: exarkun, therve, glyph Launchpad Bug:

Description

This would be awesome.

Change History (111)

comment:1 Changed 8 years ago by exarkun

Some detail:

  1. distribution mechanisms
    1. child processes
    2. ssh to remote hosts
  2. output handling
    1. stdout
    2. stderr
    3. realtime tracebacks
    4. test.log
    5. _trial_temp
  3. debugger integration
  4. test suite division
    1. streaming test assignments
    2. some visibility into what's run where

comment:2 Changed 7 years ago by mithrandi

  • Cc mithrandi added

comment:3 Changed 7 years ago by iratsu

  • Keywords review added
  • Priority changed from low to highest

comment:4 Changed 7 years ago by iratsu

branch disttrial-1784

comment:5 Changed 7 years ago by iratsu

  • Owner exarkun deleted

comment:6 Changed 7 years ago by exarkun

  • Keywords review removed
  • Owner set to iratsu
  • twisted/test/test_protocols.py
    • since this test appears to pass without any changes to the implementation, it probably shouldn't be added in this branch. If it's really covering a code path that the other unit tests don't cover, you should create a new ticket for adding it and move it there and out of this branch.
  • twisted/trial/dist/__init__.py should probably have a test-case-name of twisted.trial.dist.test
  • twisted/trial/dist/disttrial.py
    • #! should be removed, the contents of the __main__ part of the module should be moved into a function, and a file should be added to bin/ which imports this function and calls it (similar to the other files in bin/)
    • should have a module docstring
    • use absolute imports, not relative ones - from twisted.trial.dist import distvisitor, slave
    • DistTrialRunner's class docstring should be more descriptive:
      • explain how it is specialized (use american spelling, too, sorry :P) - what does it do that's different from other runners?
      • also document all of the instance variables
    • beginTimer and endTimer should use a parameterize time function and the unit tests should supply an alternate implementation which is deterministic. this is beneficial because:
      • then the test won't take a second to run
      • changes to the system clock won't break the test
      • load on the machine while the tests are running won't cause them to fail spuriously
      • the test can be more precise, using assertEqual instead of assertApproximates
    • document the type and meaning of:
      • the result parameter to writeResults
      • the quantity parameter to createLocalSlaves
      • the return value of createLocalSlaves
      • the spawner and protocols parameters to launchSlaveProcesses
      • the slaves and names parameters to loadSlaveTests
      • the test parameter to run
    • the list comps which have their result ignored should just be for loops
    • the type and meaning of the return value of getConfig should be documented
    • the exception getConfig can raise should be documented
    • use the raise Foo(bar) syntax instead of the raise Foo, bar syntax
    • getTrialRunner docstring should be """\ntext\n""" - also document its return value :)
    • run should have a docstring, including docs for its parameters and return value
  • twisted/trial/dist/test/test_disttrial.py
    • copyright statement missing
    • module docstring missing
    • DistTrialRunnerTestCase
      • missing class docstring
      • setUp and tearDown missing docstrings
      • tearDown looks useless, actually - any reason to explicitly drop this object here?
      • test_createLocalSlaves might want to assert something about the length of the object it gets back - as it stands, if [] is returned, the test still passes
      • Erroneous apostrophe in the docstring for test_createLocalSlaves. If you want to be fancy, you might also toss in some epytext markup - L{Name} will create a link to the named Python object, and is a nice thing to use in test method docstrings.
      • test_launchSlaveProcesses should probably invert its testing logic: if launchSlaveProcesses never invokes the function passed in (eg, if the implementation were empty), the test will pass. Instead of doing the asserts in fakeSpawnProcess, save the arguments (eg, in a list), then assert things about the saved data after launchSlaveProcesses returns.
      • test_launchSlaveProcesses's docstring says "acts as expected", but it doesn't explain what is expected. try to avoid this style of docstring, instead talk about specific things, eg that the given function is called with the path to the python interpreter and arguments which will run a disttrial slave, etc.
      • test_loadSlaveTests does well by having the Slave object keep track of the arguments passed to its load method; this is a good example of how test_launchSlaveProcesses should work.
      • similar "as expected" comment for test_loadSlaveTests's docstring.
  • twisted/trial/dist/distvisitor.py
    • give it a test-case-name
    • and a copyright statement
    • and a module docstring
    • DistributionVisitor.__init__ - half cool! document slaves parameter, though
    • document testCase parameter to distribute
    • document DistributionVisitor.readify, including its parameter
    • Hmmm. Most likely, the code which starts the reactor and the code which stops the reactor should be lexically adjacent - you should probably make DistributionVisitor publish an event (ie, by calling back a Deferred) at the same time as it is currently calling reactor.stop, then have disttrial.py add a callback to that Deferred which actually stops the reactor.
    • Although - what the heck, callWhenRunning(stop)? That looks totally insane. We might want to talk about that. :)
  • twisted/trial/dist/test/test_distvisitor.py
    • copyright statement
    • module docstring
    • docstrings for everything else
    • map -> list comp would probably help a bunch of these tests read more clearly, too
  • twisted/trial/dist/slave.py
    • test-case-name
    • copyright statement
    • module docstring
    • SelfSlave - should it even exist? Does anything use it? Does it have tests?
    • Commented out stuff - probably should be deleted
    • LocalSlave probably doesn't gain anything by subclassing Slave - so maybe Slave isn't necessary at all.
  • twisted/trial/dist/slavetrial.py
    • instead of passing resultStream into __init__, it'd probably be better to do all the attribute set up in connectionMade and use pass sys.__stdout__ to .makeConnection.
    • The while loop will run forever, since read(1) returns '' at eof, and nothing else seems to stop it.
  • twisted/trial/unittest.py - good changes, probably belong in a different branch though
  • twisted/trial/reporter.py
    • not sure why this class is defined here instead of in dist/ with everything else.
    • /tmp/tmp0 code should probably go :)
    • does the int32 logic need to be duplicated here again?

In addition to those specific issues, I also noticed that near the beginning of a test run, several test methods would be printed to the same line before any of them finish, and I wasn't able to get a run of the whole Twisted suite to finish properly.

Good first submission. Test coverage isn't perfect, but it's a good start. Likewise for documentation. We should probably talk about int32string.py at some point and what the future direction for the protocol will be. Don't be discouraged by the size of the comment. I think you're on the right track. :)

comment:7 Changed 7 years ago by iratsu

  • Keywords review added

disttrial-1784

comment:8 Changed 7 years ago by iratsu

  • Owner iratsu deleted

comment:9 Changed 7 years ago by therve

  • Cc therve added
  • Keywords review removed
  • Owner set to iratsu

OK it's a huge beast, to be afraid be the amount of comments:

General comments:

  • pyflakes errors:
    twisted/trial/dist/test/test_slave.py:6: 'TrialSuite' imported but unused
    twisted/trial/dist/test/test_disttrial.py:9: 'time' imported but unused
    twisted/trial/dist/test/test_distvisitor.py:12: 'sets' imported but unused
    twisted/trial/dist/test/test_slavetrial.py:6: 'reactor' imported but unused
    twisted/trial/dist/test/test_slavetrial.py:7: 'StringIO' imported but unused
    twisted/trial/dist/slave.py:10: 'protocol' imported but unused
    twisted/trial/dist/slave.py:10: 'reactor' imported but unused
    twisted/trial/dist/slave.py:12: 'os' imported but unused
    twisted/trial/dist/slave.py:12: 'struct' imported but unused
    twisted/trial/dist/slavereporter.py:6: 'sys' imported but unused
    twisted/trial/dist/slavereporter.py:6: 'struct' imported but unused
    twisted/trial/dist/disttrial.py:12: 'implements' imported but unused
    twisted/trial/dist/distvisitor.py:11: 'reactor' imported but unused
    twisted/trial/dist/slavetrial.py:3: 'Protocol' imported but unused
    twisted/trial/dist/slavetrial.py:7: 'fcntl' imported but unused
    twisted/trial/dist/slavetrial.py:7: 'struct' imported but unused
    twisted/trial/dist/slavetrial.py:7: 'sets' imported but unused
    twisted/trial/dist/slavetrial.py:7: 'time' imported but unused
    twisted/trial/dist/slavetrial.py:9: 'reactor' imported but unused
    twisted/trial/reporter.py:16: 'struct' imported but unused
    
  • trailing whitespaces
  • twisted/trial/unittest.py and twisted/trial/reporter.py should probably be reverted.
  • the docstring format is :
    """
    My docstring.
    """
    
    Thanks to comply to this.

distrial.py

  • beginTimer and endTimer should have a default for timeFunction to time.time. beginTimer docstring could be a little more explanatory
  • launchSlaveProcesses should use sys.executable instead of 'python' as program.
  • the '4' settings for createLocalSlaves should at least be a class variable. It would be even better if it could be set on the command line

distslave.py:

  • missing docstrings for getSlave and makeReady
  • distribute should probably return the deferred it creates.
  • there is a debug print statement to remove

reportwrapper.py:

  • commented statements to remove
  • missing docstrings everywhere
  • it is not tested

slavetrial.py:

  • create a separate main function
  • missing docstrings
  • you're using trial.Options for now, I think a specific DistOptions would offer more flexibility. Also, it seems that you don't call parseOptions? I don't know what are the effects. It's pretty unfortunate too that you have to override sys.argv (that would need an improvement to usage.Options though). It seems also there is no options pass to the slave. There is at least one that makes sense, it's the reactor option.

slave.py:

  • the Slave class looks like a mixin, it may be better named SlaveMixin.
  • I don't see the queue method used everywhere, is it used? It's at least not tested.
  • LocalSlaveAMP.connectionMade has a useless pass statement
  • in the same method, the directory used is really bad. It should probably use some tempfile.mkdtemp.
  • you use print directly; this causes problem in tests, and can't be tested. You probably have to define a stream variable to sys,stdout, use stream.write to print data, and configure it differently in the tests.
  • getHost and getPeer on ProcessProtocol make me angry, but it has nothing to do with this branch :).
  • the log file management in LocalSlave is a bit bad: it should be open once and closed once, and located in configurable place (_trial_temp for example).
  • in local slave, you don't override errReceived, so you lost some information that has to be logged.
  • lots of docstrings missing

slavereporter.py:

  • methods shouldn't use '*args'. Arguments are well defined for these methods, or should be.
  • it is not tested, at least not directly


test_slave.py:

  • test_errWrite and test_outWrite produce spurious data:
        test_errWrite ... error[0.404724526196]: testError
                                                         [OK]
        test_outWrite ... output[0.312514138113]: testOutput
                                                         [OK]
    
    This has to be removed (see above comment on slave.py).
  • writeSequence and loseConnection of LocalSlave are not tested

test_disttrial.py:

  • in test_getConfig, the sys.argv restore should be done in a finally clause, to be restored if the test fails
  • writeResults is not tested
  • run is not tested. This method would be a bit difficult to test, which may show that you have to split it in smaller chunks. It may also just need some parametrization (TrialSuite, and reactor).

test_distvisitor.py:

  • some docstrings missing
  • the case where makeReady is called with something in self.waiting is not tested

test_slavetrial.py:

  • these are really bad unit tests :/. The test_load is too dependent on external code, and fairly long, and the other depends on an outside unit test (remember that conch may not be installed with twisted), and a horrible one unit test :). Using a string transport is a first good step, but you have to instrument trial to load local suites and test method. This is done a lot in trial test suite, so it should that hard.
  • SlaveProtocol.chdir is not tested

That's it for now. I don't think there are lots of hard things, testing slavetrial being probably the hardest. There are some rough edges, but the base is very good. My first tests using local processes show already a great improvement on performance.

Don't hesitate to come back to me if I wasn't clear. Thanks!

comment:10 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

up for review again, still in disttrial-1794

comment:11 Changed 7 years ago by jml

I would very much like to review this before it lands.

comment:12 Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu

General:

  • try to fit the code to 80 columns
  • choosing a strict blank line policy would be good (like 2 between classes and 1 between methods or 3/2). That's a shame there's still no rule for Twisted...

slave.py:

  • there are some trailing whitespaces

distrial.py:

  • you've changed beginTimer/endTimer, but not the calls to them
  • the class variable slaveNumber is cool, but it would be even better with some documentation :). That's also true for most instance variables of the class.

distvisitor.py:

  • thanks for finding my mistake above :). The new docstring of getSlave is cool, but that makes think that it should always return a Deferred (using defer.succeed). This way, you can remove the maybeDeferred call in distribute.

reportwrapper.py has been removed, I didn't spot it was unused :);

slavetrial.py:

  • you're still using trial.Options() and not calling parse. Can you explain me why?

slave.py:

  • the new log management is definitely better, but I think you can still improve it by making the directory configurable (first, maybe a class variable, and then an option of disttrial).
  • to format string, even if there is only on argument, always pass it as a list:
    a = "foo %s" % (bar,)
    
    That can prevent problem later.

test_slave.py:

  • connectionLost and errReceived of LocalSlave are not tested.

test_disttrial.py:

  • run is still not tested :/

test_slavetrial.py:

  • test_load docstring is outdated.

Thanks!

comment:13 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

up for review again! disttrial-1784

comment:14 follow-up: Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu

disttrial.py:

  • you have to remove the top-level reactor import, to be able to use a custom reactor.

distvisitor.py:

  • in __init__ self.ready initialization would be better spelled self.ready = slaves[:]
  • in getSlave, I think you should call pop(0) instead of pop()

test_distvisitor.py

  • the change in getSlave broke test_getSlave

test_slavereporter.py:

  • create Failure with a real exception (like RuntimeError). You should have a warning emitted with trial.

slave.py:

  • you've defined a logDirectory variable, but don't use it

There are other problems. disttrial doesn't use _trial_temp anymore, it puts file everywhere. It prints error[0.186871352856] in the middle of the tests.

disttrial must use its own options. Most trial options don't make sense with disttrial, you have to remove it. Furthermore, we need to customize some properties. Options needs to be forwarded to slave (like the reactor chosen). The number of processes is also very important.

Also, I did't see anything yet about the --hosts option. Has it been left out volontary for now?

comment:15 in reply to: ↑ 14 Changed 7 years ago by iratsu

Replying to therve:

slave.py:

  • you've defined a logDirectory variable, but don't use it

There are other problems. disttrial doesn't use _trial_temp anymore, it puts file everywhere. It prints error[0.186871352856] in the middle of the tests.

Hmm, afaict, I am using the logDirectory variable and all my log files are going in _trial_temp

Also, I did't see anything yet about the --hosts option. Has it been left out volontary for now?

Yes, I should have mentioned this, the --hosts ability has been left out and will be for another branch.

comment:16 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

Fixed everything up, except what was mentioned in my last reply

comment:17 follow-up: Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu

test_disttrial is broken (Small note: it's very hard to review a ticket with tests failing. It would be cool if you could check this out before asking for review).

distrial.py:

  • the way you pass is a good first step, but I think it should be done a more generic way: what if I want to pass another option to trial (for example gc, or random)? You'll end up adding all options one by one. Also, the default behavior should not be to pass the select reactor, but to pass nothing.

options:

  • you've mostly copy it from trial Options. A better way is to create a BaseOptions class in trial, and to inherit it in both trial and disttrial.

slave.py:

  • I was talking about the os.mkdir("_trial_temp"). I think that could removed, or use os.mkdir(self.logDirectory).

I'll try to explain the problems with a run. For example, I do that:

PYTHONPATH=. ./bin/disttrial twisted.web
  • I have error[0.663072777035] printed in the middle of the tests. That must be removed.
  • more annoying, it doesn't use _trial_temp directory: I have all the temporary files directly in the current path.

Thanks!

comment:18 in reply to: ↑ 17 Changed 7 years ago by iratsu

Replying to therve:

distrial.py:

  • the way you pass is a good first step, but I think it should be done a more generic way: what if I want to pass another option to trial (for example gc, or random)? You'll end up adding all options one by one. Also, the default behavior should not be to pass the select reactor, but to pass nothing.

The problem i have with this, is that a lot of options don't need to be passed down to the slaves. For example, random, since the randomizing occurs in the master.

options:

  • you've mostly copy it from trial Options. A better way is to create a BaseOptions class in trial, and to inherit it in both trial and disttrial.

Hmm, shouldn't this go in another branch though?

comment:19 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

Sorry I forgot to check the tests last time, btw.
Fixed the rest of it now.

comment:20 Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu

Still failing, but probably new failures :).

comment:21 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

My bad, forgot to commit =\

comment:22 Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu

Alright, we're not so far:

  • the way you pass options to subscript doesn't work: you're overriding sys.argv just before calling parseOptions. There are different ways to do this, but maybe the best would be to add an amp method configure called once to add options (this way, it'll work naturally when distributed).
  • _trial_temp must be deleted at the beginning of the tests
  • I have some intermittent failures, for example running twisted.mail suite. I think it's linked with your management of _trial_temp: each slave should get its own directory inside the global _trial_temp. Also you could put the out and err files inside, so you don't need to have to put the id in the file name

comment:23 Changed 7 years ago by therve

Some comments on the current options:

  • until-failure doesn't work
  • reporter doesn't work
  • temp-directory doesn't work

I haven't tested the others, but it would be great to make a check.

comment:24 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

up for review again. Merged forward to disttrial-1784-2

comment:25 Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu

This looks awesome. A few comments:

  • due to #2275 merge, your tests are broken sorry :/. You have to merge forward.
  • in options.py _loadReporterByName is defined 2 times
  • the copyright must only mention 2007, in both modules and tests.
  • distreporter is a bit problematic. The more I think about it, the more I'm convinced you have to use a __getattr__. You could remove stream, separator, tbformat and realtime variables, and only define the methods you need. This way, when the problems will be resolved in trial about the IReporter interface, nothing will need to be done here. DistReporter must be new style class too.
  • You've postpone the configuration management in slavetrial. I guess that's OK, but doing it is not that hard: parseOptions takes a list, you just have to add a configure method to the amp protocol, which takes this list and pass it to the config object (non blocking).
  • I have some overlap in text running twisted.mail suite (see twisted.mail.test.test_pop3):
    twisted.mail.test.test_imap
      Timeout
        testLongFetchDoesntTimeout ... twisted.mail.test.test_pop3
      IndexErrorCommandTestCase
        testRetrieveUpdatesHighest ...                                         [OK]
    
    I think this comes from warnings printed or something like that. That's not critical though, just wanted to report it.

Thanks!

comment:26 Changed 7 years ago by iratsu

  • Keywords review added
  • Owner iratsu deleted

disttrial-1784-3.

passing of Options is still not optimal, but I want to get this branch in first.

comment:27 Changed 7 years ago by therve

  • Keywords review removed
  • Owner set to iratsu
  • test_opt_testmodule in test_options is broken in a very strange way. That may be a bug in theSystemPath, but I think you could test it another way. (To reproduce the problem I have, remove twisted/__init__.pyc, then run the test). test_options would need some doc, too.
  • in distreport, you can now remove the separator, stream, tbformat and realtime attributes. __getattr__ may benefit from a docstring ("Lookup for attributes in corresponding reporter" or something like that). The reporter attribute needs to be documented too. It's a bit a shame that default values for reporter are now in 3 places (disttrial, distreporter, and reporter itself).

That's it for now.

comment:28 Changed 7 years ago by therve

iratsu, some news? Do you have some time to work on that?

comment:29 Changed 7 years ago by therve

(In [21481]) Fix the tests

Refs #1784

comment:30 Changed 7 years ago by therve

(In [21482]) Remove uused attributes

Refs #1784

comment:31 Changed 7 years ago by therve

(In [21498]) Change the way to pass options to slave

Refs #1784

comment:32 Changed 7 years ago by therve

(In [21676]) Merge forward with some move.

Refs #1784

comment:33 Changed 7 years ago by therve

(In [21678]) Add a manpage, some fixes.

Refs #1784.

comment:34 Changed 7 years ago by therve

  • Branch set to disttrial-1784-5
  • Cc iratsu added
  • Owner iratsu deleted

OK, I think this is ready to review. There are probably still a few problems, but I'm stuck on the decisions to make.

comment:35 Changed 7 years ago by therve

  • Keywords review added

comment:36 follow-up: Changed 7 years ago by exarkun

Possibly the largest remaining issue with this code is that sometimes disttrial hangs indefinitely. This is quite a difficult problem to address. I'm not sure what the right thing to do now is. We could just spend the time necessary to track down the bugs, but races are scary: how will we decide when we have fixed them all? Or do we just decide at some point that it is "good enough" and fix whatever problems we encounter as we encounter them?

For what it's worth, the two times I ran the tests, they hung after test_brokenSystem__str__.

comment:37 Changed 7 years ago by therve

(In [21724]) Fix stdout overloading

Refs #1784

comment:38 Changed 7 years ago by therve

(In [21725]) That'll do the trick for now.

Refs #1784

comment:39 in reply to: ↑ 36 ; follow-up: Changed 7 years ago by therve

Replying to exarkun:

Possibly the largest remaining issue with this code is that sometimes disttrial hangs indefinitely. This is quite a difficult problem to address. I'm not sure what the right thing to do now is. We could just spend the time necessary to track down the bugs, but races are scary: how will we decide when we have fixed them all? Or do we just decide at some point that it is "good enough" and fix whatever problems we encounter as we encounter them?

I don't understand your problem: how is this different from any piece of code? What are the other options except resolving the bugs? Deleting all the code?

Anyway I fixed the problem, it came from the reload(sys) in test_log. I tracked down another problem with test_ssl, which may be a problem in disttrial though.

comment:40 in reply to: ↑ 39 Changed 7 years ago by exarkun

  • Keywords review removed
  • Owner set to therve

Replying to therve:

I don't understand your problem: how is this different from any piece of code? What are the other options except resolving the bugs? Deleting all the code?

I dunno what we should do, I'm just saying the code is scary. :)

Anyway I fixed the problem, it came from the reload(sys) in test_log. I tracked down another problem with test_ssl, which may be a problem in disttrial though.

Okay, cool. There should probably be a unit test for whatever behavior r21724 changed. A comment explaining the intent of the del WriteDataTestCase would be handy.

comment:41 Changed 7 years ago by therve

  • Owner changed from therve to iratsu

comment:42 Changed 7 years ago by therve

  • Priority changed from highest to normal

comment:43 Changed 6 years ago by therve

  • author set to therve
  • Branch changed from disttrial-1784-5 to branches/disttrial-1784-6

(In [23641]) Branching to 'disttrial-1784-6'

comment:44 Changed 5 years ago by exarkun

  • Author changed from therve to exarkun, therve
  • Branch changed from branches/disttrial-1784-6 to branches/disttrial-1784-7

(In [27529]) Branching to 'disttrial-1784-7'

comment:45 Changed 5 years ago by exarkun

Okay, so here's a new thought. The way this implementation convinces child processes of which tests they are to run is silly. First it tells them to redo discovery and then it tells them, one by one, about tests they should not run.

Instead, the parent should do discovery and tell each child which tests it is to run. Possibly it should do this incrementally to make maximum use of all processes.

Aside from that, visitors in trial are deprecated (though I don't really understand why). So the implementation should use the private _iterateTests helper instead of the visitor pattern.

comment:46 Changed 5 years ago by exarkun

  • Owner changed from iratsu to exarkun
  • Status changed from new to assigned

comment:47 Changed 5 years ago by exarkun

  • Branch changed from branches/disttrial-1784-7 to branches/disttrial-1784-8

(In [27750]) Branching to 'disttrial-1784-8'

comment:48 Changed 5 years ago by exarkun

  • Branch changed from branches/disttrial-1784-8 to branches/disttrial-1784-9

(In [27807]) Branching to 'disttrial-1784-9'

comment:49 Changed 5 years ago by exarkun

A couple remaining issues:

  • doctests can't be loaded by the slaves - their id isn't something TrialLoader can deal with.
  • using stdio to communicate is unreliable. Any test that puts something onto stdout corrupts the communication channel (which results in a silent failure and an indefinite hang by the parent! fix that too)

comment:50 Changed 5 years ago by exarkun

  • Branch changed from branches/disttrial-1784-9 to branches/disttrial-1784-10

(In [27969]) Branching to 'disttrial-1784-10'

comment:51 Changed 4 years ago by exarkun

  • Branch changed from branches/disttrial-1784-10 to branches/disttrial-1784-11

(In [29735]) Branching to 'disttrial-1784-11'

comment:51 Changed 4 years ago by exarkun

  • Branch changed from branches/disttrial-1784-11 to branches/disttrial-1784-10

Split out the generator_failure_tests fix into #4562.

comment:52 Changed 4 years ago by exarkun

  • Branch changed from branches/disttrial-1784-10 to branches/disttrial-1784-11

comment:53 Changed 4 years ago by exarkun

Split out the ReactorSelectionMixin change into #4563.

comment:54 Changed 4 years ago by exarkun

I moved the refactoring of the _trial_temp handling code into #4564

comment:56 Changed 4 years ago by exarkun

  • Branch changed from branches/disttrial-1784-11 to branches/disttrial-1784-12

(In [30413]) Branching to 'disttrial-1784-12'

comment:57 Changed 3 years ago by <automation>

  • Owner exarkun deleted

comment:58 Changed 3 years ago by exarkun

  • Branch changed from branches/disttrial-1784-12 to branches/disttrial-1784-13

(In [31519]) Branching to 'disttrial-1784-13'

comment:59 follow-up: Changed 3 years ago by free.ekanayaka

This great.

Running 8 slaves on 8 cores reduced the running time of the test suite of one of the projects I'm working on from about 19 seconds to abut 4.5 seconds.

It would be nice if each slave could have an identifier assigned, so for example it could set test resource like databases, that need to be different for each slave. Something as simple as:

        for i, s in enumerate(protocols):
            env = os.environ.copy()
            env["SLAVE_TRIAL"] = str(i)
            spawner(s, sys.executable,
                    args=[sys.executable, slavetrialPath],
                    env=env)

in disttrial.DistTrialRunner.launchSlaveProcesses would work for that.

comment:60 Changed 3 years ago by glyph

  • Author changed from exarkun, therve to exarkun, therve, glyph
  • Branch changed from branches/disttrial-1784-13 to branches/disttrial-1784-14

(In [32935]) Branching to 'disttrial-1784-14'

comment:61 Changed 3 years ago by glyph

I just merged forward again. One test fails, but only when run under 'disttrial'. (This is the same when run under the whole suite or in isolation.)

$ disttrial twisted.test.test_log.FileObserverTestCase.test_startLogging
Running 1 tests.
twisted.test.test_log
  FileObserverTestCase
    test_startLogging ...                                               [ERROR]

===============================================================================
[ERROR]
'cStringIO.StringO' object has no attribute 'encoding'

twisted.test.test_log.FileObserverTestCase.test_startLogging
-------------------------------------------------------------------------------
Ran 1 tests in 0.002s

comment:62 follow-up: Changed 3 years ago by glyph

Some random comments if someone would like to work on this:

  • I'd rather we switch to some other terminology instead of "master/slave". "manager/worker" maybe?
  • Do we really need another command-line tool? I'd rather say 'trial -j' or something than 'disttrial'.
  • The 'localnumber' option's documentation is very confusing, and '4' seems like a lame default. It took me like 20 minutes of reading the implementation to realize that it meant 'process count'. I take it that it's named this way in anticipation of a 'remote number'?
  • The 'disttrial' man page appears to be missing at least 'localnumber', and probably quite a few other options. (Good reason to avoid having a new tool, I'd say.)
  • NEWS file.

Not a review, I know, but nevertheless, thanks to everyone who has worked on this thus far... it would be great to get this merged; it would be great to cut our test-run times in half or better.

comment:63 in reply to: ↑ 59 ; follow-up: Changed 3 years ago by glyph

Replying to free.ekanayaka:

It would be nice if each slave could have an identifier assigned, so for example it could set test resource like databases, that need to be different for each slave.

Why not just use os.getpid()?

comment:64 in reply to: ↑ 63 ; follow-up: Changed 3 years ago by free.ekanayaka

Replying to glyph:

Why not just use os.getpid()?

Because I'd like to re-use existing resources.

For example let's say you issue a first disttrial run with 2 slaves or (workers), each of them creates a database and applies an identical schema to it in order to run the tests in parallel but in isolation. Now you want to issue a second run of disttrial still with 2 slaves, but it'd be nice to re-use the databases and schemas that were created in the first run. Assigning an index (or some other identifier) to a slave lets the slave re-use a previously setup resource without re-creating it and with the confidence that no other slave its going to use it.

Maybe I'm missing something and this can be don with os.getpid().

comment:65 in reply to: ↑ 64 ; follow-up: Changed 3 years ago by glyph

Replying to free.ekanayaka:

Replying to glyph:

Why not just use os.getpid()?

Because I'd like to re-use existing resources.

For example let's say you issue a first disttrial run with 2 slaves or (workers), each of them creates a database and applies an identical schema to it in order to run the tests in parallel but in isolation. Now you want to issue a second run of disttrial still with 2 slaves, but it'd be nice to re-use the databases and schemas that were created in the first run. Assigning an index (or some other identifier) to a slave lets the slave re-use a previously setup resource without re-creating it and with the confidence that no other slave its going to use it.

Maybe I'm missing something and this can be don with os.getpid().

Ah, you want state that can be set up in advance of a test run for each process that will be run, and then re-use it. If you actually have code that can set up a fresh database in each process, you should just use that, and avoid bleeding state between test runs - but there are definitely situations where those external resources need to be pre-allocated; for example, databases might need to be set up by DBAs in advance.

comment:66 in reply to: ↑ 65 Changed 3 years ago by free.ekanayaka

Replying to glyph:

Ah, you want state that can be set up in advance of a test run for each process that will be run, and then re-use it. If you actually have code that can set up a fresh database in each process, you should just use that, and avoid bleeding state between test runs - but there are definitely situations where those external resources need to be pre-allocated; for example, databases might need to be set up by DBAs in advance.

Right, one other reason for not setting up a fresh database in each process each time is TDD. You might want to iterate between writing code and run just a few tests, and the quicker you can run the tests the better it is for your velocity.

Note that if the DB setup code of your tests simply realizes that the database is there and has the correct schema, for most unit-testing scenarios it's sufficient for the test setup to issue DELETE queries to make sure the tables are empty. Dropping the database entirely and re-creating the tables doesn't really add anything to the tests (unless they are special tests asserting the database and schema creation itself).

comment:67 follow-up: Changed 2 years ago by therve

Is there an agreement that the current implementation is good enough? Can we move forward with latest review comments?

comment:68 in reply to: ↑ 67 Changed 2 years ago by glyph

Replying to therve:

Is there an agreement that the current implementation is good enough? Can we move forward with latest review comments?

+1 from me. It seems like we can always add better stuff later. I think we could add the common-across-processes identifier stuff in a separate branch, since that's a separate feature.

(I still think we should get rid of the separate command-line tool, but I suspect that making it into a command-line option is substantially less work than cleaning everything up so we could merge the new tool, like writing a new manpage and so on.)

comment:69 Changed 2 years ago by therve

  • Branch changed from branches/disttrial-1784-14 to branches/disttrial-1784-15

(In [34066]) Branching to 'disttrial-1784-15'

comment:70 in reply to: ↑ 62 Changed 2 years ago by therve

  • Owner set to therve
  • Status changed from assigned to new

Replying to glyph:

Some random comments if someone would like to work on this:

  • I'd rather we switch to some other terminology instead of "master/slave". "manager/worker" maybe?
  • Do we really need another command-line tool? I'd rather say 'trial -j' or something than 'disttrial'.

So, some comments here. I'd be happy to remove the specific tool, but it has some consequences, mainly that there are options that disttrial doesn't support (debug, profile, etc). Should we raise an error when -j is passed along those?

  • The 'localnumber' option's documentation is very confusing, and '4' seems like a lame default. It took me like 20 minutes of reading the implementation to realize that it meant 'process count'. I take it that it's named this way in anticipation of a 'remote number'?

Yes. Do you have suggestions for a better name? processes? So we would have -j/--processes? What would be a better default than 4? I guess it doesn't matter if the option is activated on trial?

Thanks.

comment:71 Changed 2 years ago by exarkun

I suggest not trying to fix the separate toolness of this feature before merging this ticket. Smashing it into trial doesn't really improve things much - what would improve things is if it were a well integrated part of trial: that means being invoked by the trial command, but it also means proper support for debugging, profiling, etc. The way the functionality is exposed now makes it obvious these things won't work, because there aren't even any options for them.

Moving the code into trial mostly seems like makework at this point.

comment:72 Changed 2 years ago by therve

OK, I think I see your point. So what remains:

  • How do we call the localnumber option?
  • What would be a good default for it?

comment:73 Changed 2 years ago by exarkun

How about "processes" instead of "localnumber"? I think 4 is an okay default. At least, as okay as any other arbitrary constant. Perhaps file a ticket for adding CPU count detection code, and the default can be NCPU +/- 1 or something. PyPy has some code for this that maybe we can borrow or at least be inspired by.

comment:74 Changed 2 years ago by therve

  • Keywords review added
  • Owner therve deleted

Alright, I think the branch has improved a bit. I fixed stdio handling, did a bunch of renaming, cleaned up the man page, improved error reporting, added the NEWS fragment. I'll open the bug and add a TODO about detecting number of cores. Please review!

comment:75 Changed 2 years ago by thijs

  • Keywords review removed
  • Owner set to therve
  1. It seems the buildfarm is currently down but the following tests fail when running bin/trial twisted or bin/disttrial twisted:
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/thijs/workspaces/opensource/Twisted/branches/disttrial-1784-15/twisted/trial/dist/test/test_disttrial.py", line 309, in test_getSuite
    trialRunner = DistTrialRunner.getTrialRunner(config)
  File "/home/thijs/workspaces/opensource/Twisted/branches/disttrial-1784-15/twisted/trial/dist/disttrial.py", line 241, in getTrialRunner
    tracebackFormat=config['tbformat'],
exceptions.KeyError: 'tbformat'

twisted.trial.dist.test.test_disttrial.DistTrialTestCase.test_getSuite
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/thijs/workspaces/opensource/Twisted/branches/disttrial-1784-15/twisted/trial/dist/test/test_disttrial.py", line 283, in test_getTrialRunner
    trialRunner = DistTrialRunner.getTrialRunner(config)
  File "/home/thijs/workspaces/opensource/Twisted/branches/disttrial-1784-15/twisted/trial/dist/disttrial.py", line 241, in getTrialRunner
    tracebackFormat=config['tbformat'],
exceptions.KeyError: 'tbformat'

twisted.trial.dist.test.test_disttrial.DistTrialTestCase.test_getTrialRunner
-------------------------------------------------------------------------------
  1. Also, the program description that is printed when running bin/disttrial --help and bin/trial --help is exactly the same and doesn't explain what the difference is between the 2:
trial loads and executes a suite of unit tests, obtained from modules, packages
and files listed on the command line.
  1. the new classes/modules/packages need @since markers

comment:76 Changed 2 years ago by therve

  • Keywords review added
  • Owner therve deleted

Thanks for the review!

  1. Fixed.
  1. I added a bit of text.
  1. I added them to the new package and the new module. It sounds redundant to do all the classes.

comment:77 Changed 2 years ago by jesstess

  • Owner set to jesstess

comment:78 Changed 2 years ago by jesstess

  • Cc jesstess jml added
  • Keywords review removed
  • Owner changed from jesstess to jml

Thank you therve for sticking with this and driving it home.

This ticket has a huge backstory, so this is an honest-to-god incremental review following from thijs' review.

  • Build results look good: http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/disttrial-1784-15
  • When I run bin/trial twisted on this branch on an Ubuntu 11.04 Natty machine, 2 tests reliable time out:
    [ERROR]
    Traceback (most recent call last):
    Failure: twisted.internet.defer.TimeoutError: <twisted.test.test_process.PosixProcessTestCase testMethod=test_errorInProcessEnded> (test_errorInProcessEnded) still running at 120.0 secs
    
    twisted.test.test_process.PosixProcessTestCase.test_errorInProcessEnded
    ===============================================================================
    [ERROR]
    Traceback (most recent call last):
    Failure: twisted.internet.defer.TimeoutError: <twisted.test.test_process.PosixProcessTestCasePTY testMethod=test_errorInProcessEnded> (test_errorInProcessEnded) still running at 120.0 secs
    
    twisted.test.test_process.PosixProcessTestCasePTY.test_errorInProcessEnded
    -------------------------------------------------------------------------------
    Ran 7947 tests in 326.083s
    
    FAILED (skips=669, expectedFailures=11, errors=2, successes=7265)
    

I can't reproduce this elsewhere, and this doesn't manifest on the build slaves (we don't have a Natty build slave, though), so this is just an fyi; I don't think there's anything actionable here unless other people say they have this problem too.

After the API doc changes and an ack from jml, I think this is good to merge! Let's get people really bashing on it before the next release (perhaps with a mailing list announcement).

comment:79 Changed 2 years ago by therve

  • Keywords review added

Thanks for the review! I fixed the epydoc errors (only 2 were relevant afaict). The test errors you're having may be because the branch is slightly out of date, I can merge forward if you want. Putting it up for review for jml.

comment:80 Changed 2 years ago by jml

Hello,

Thank you for asking me to review this patch. I very much appreciate the work
and sheer persistence that's gone into it so far.

I have to confess I haven't read all of the comments so far, sorry. I hope to
read them before I publish part 2 of this review.

Yes, this is an incomplete review. Sorry. I figured that there's enough for
you to go on with.

Hope this helps,

jml

1. Must fix

  1. Rename dist to _dist. Despite the age of this ticket, this code still hasn't been used in anger, afaict. As such, it should be in a private package.
  2. distreporter.py
    1. "Concurent" should read "Concurrent"
  3. options.py
    1. coverdir should probably be spelled coverDirectory or even getCoverDirectory.
  4. workerreporter.py
    1. Rename amprotocol instance variable to ampProtocol
    2. WorkerReporter.__init__ is missing a docstring. Needs one especially for amprotocol.
    3. addSuccess should lower-case "success"
    4. printSummary should say I{Don't} rather than _Don't_.
  5. disttrial.py
    1. Unless there's a good reason otherwise, the module-level function run() should call a method called DistTrialRunner.run() and that should call DistTrialRunner._run(), rather than the way around it is now.
    2. mode is unused. Please delete it.
    3. workerNumber should be _workerNumber, as it's not used outside the class or tests.
    4. Ditto stream, distReporterFactory, reporterFactory, tbformat, rterrors, uncleanWarnings, workingDirectory.
    5. Delete the call to _unusedTestDirectory or justify its existence.
  6. workercommands.py
    1. ChDir seems an oddly low-level worker command. Can you please rename it to something that communicates what it is for, rather than what it does?

2. Questions

  1. How much flexibility is there to change the management commands if we discover we want to send more information back? In similar projects, there's a facility to attach arbitrary MIME-econded blobs (see subunit protocol and details)
  2. The arguments of TestWrite lead me to believe that only stdout from the worker is being captured. Looking further though, it appears to be used to catch log emissions. What happens to stdout & stderr emitted by the workers?

3. Please consider

  1. I'm not 100% sure that this should go into Twisted proper. It could work just as well as a separate project, I think.
  2. DistReporter looks a lot like ThreadSafeForwardingResult
  3. I think it's a bad idea for DistTrialRunner to have a default number of workers. In the code as stands, it will always be passed through. If tests need to communicate that they don't care, they can define ARBITRARY_WORKER_NUMBER = 4 or similar.

4. Points of interest

  1. dist/__init__.py is a thing of beauty. Thank you.
  2. For better or worse, the testr command in testrepository does much of this already

5. Not yet reviewed

  • DistTrialRunner.run()
  • worker.py
  • workertrial.py
  • tests
  • Are any objects leaking into public API use?
  • Evaluate all of this in the light of past comments.

comment:81 Changed 2 years ago by therve

  1. Everything done, except 3.1 renaming coverdir: it's the name of the command line option and the name in trial, so I shouldn't change it.

2.1. It's fairly flexible, we can customize managercommands as we need.

2.2. It's captured by the process protocol, see LocalWorker.outReceived/errReceived.

3.1. Yeah we can discuss it. It's nice to have it readily available for Twisted development.

3.2. Agreed, though the solution used here is much simpler.

3.3. Good idea, fixed.

4.2. One difference I seemed to get with a quick look is that the tests are split beforehand, whereas disttrial feeds them on demand. Technically disttrial should be more efficient if it's the case.

Thanks a lot for the detailed review. Among the comments 1.5.5 helped fix a nice bug!

comment:82 Changed 2 years ago by jml

Thanks!

Most recent changes:

  1. Start is a much better name than ChDir, thanks. Please change the docstring to say Set up (verb) rather than Setup (noun).

Regarding 3.2, ThreadSafeForwardingReporter is more complicated, but it's also had actual use driving some of that.

Regarding 4.2, correct.

Now, on to review the rest of it.

comment:83 Changed 2 years ago by jml

  • Keywords review removed
  • Owner changed from jml to therve

OK. Wow. Done. At least, this review is done.

Thank you again for your persistence and perseverance. This is the last slog,
I hope.

I'm sorry if the comments get a bit ratty toward the end. It's a Saturday and
I have a rare night in at the computer able to program, and I'm wondering why
I'm doing code reviews instead of learning things or making something cool. I
have to keep reminding myself that Twisted is better than something cool, it's
something good.

1. Must fix

  1. DistTrialRunner._run is too big and too hard to understand. Here are some things that I think might help.
    1. launchWorkerProcesses could return ampProtocol for each of the workers given to it.
    2. driveWorker could be split out to a top-level private function and given a docstring.
    3. driveWorker could be completely replaced with either:
      1. Something that uses inlineCallbacks
      2. A use of DeferredSemaphore(1) and DeferredList(..., fireOnOneErrback=True)
      3. cooperate((f() for f in tasks)).whenDone() (thanks Itamar)
    4. I'd probably shift the reactor start / stop business out to run()
  2. worker.py
    1. The id attribute is undocumented and unused. Please remove it.
    2. _buildFailure should document its parameters. It's not immediately clear why it needs to exist at all until you learn that frames is a list of strings.
    3. If possible, rename testCase, result, makeReady and testStream to start with underscores
    4. LocalWorker implements ITCPTransport but the docstring doesn't say why, and it's certainly not obvious to me. Please update the docstring to say why.
    5. LocalWorker.logDirectory isn't a cvar, it's an ivar
    6. LocalWorker.logDirectory isn't used externally. Please rename to _logDirectory.
    7. LocalWorker.ampFactory isn't used externally. Please rename to _ampFactory.
    8. Likewise, ampProtocol, outLog, errLog, and testLog.
    9. There should be a comment before d.addErrback(lambda x: None) explaining why we are swallowing errors here.
    10. getHost returning "string" seems wrong. Please either fix it, or explain why we're returning something so bogus.
    11. getPeer returning "string", "string" seems wrong. Please either fix it, or explain why we're returning something so bogus.
  3. workertrial.py
    1. protocolIn.read(1) and surrounding error handling largely duplicate twisted.internet.fdesc.readFromFD. Where they differ, it's not clear to me that the differences are correct. Please try to re-use readFromFD, parametrizing it where appropriate. Note that:
      1. Reading 1 byte at a time is comment worthy, at least
      2. I can delete the call to sys.exc_clear() without causing any tests to fail. Thus, it is both comment and test worthy
    2. The '3' and '4' get mentioned more than once in these modules. Please make constants that refer to them and use those constants throughout.
  4. test_disttrial
    1. DistTrialTestCase has docstrings that don't communicate very much. Please change them to describe the expected behaviour, rather than say "as expected" or "appropriate" or nothing at all.
    2. DistTrialRunnerTestCase.test_run says "rector", should say "reactor".
  5. test_worker
    1. LocalWorkerAMPTestCase docstring should begin with capitalized "Test" rather than "test".
    2. Each of these tests methods is largely the same. Please refactor. A good place to start might be to take each of the internal comments in test_runSuccess and turn those into three separate methods.
    3. ResultHolder is just plain weird. Please delete it and replace it with test-local lists that are appended to in callbacks, or something better than that.
  6. workertrial
    1. start doesn't say why it changes directory. Sorry, I should have noticed this before.

2. Please consider

  1. I'll wager a drachma that DistTrialRunner._run wasn't written with TDD.
  2. Putting "Check that" and "Test that" in test docstrings always strikes me as silly.
  3. In test_distreporter asserting against self.stream.getvalue() will get better failure messages than self.stream.tell().
  4. In general, the tests could be tightened up. I don't think code review is a great way of achieving that.

comment:84 Changed 2 years ago by therve

  • Branch changed from branches/disttrial-1784-15 to branches/disttrial-1784-16

(In [34924]) Branching to 'disttrial-1784-16'

comment:85 Changed 2 years ago by therve

  • Keywords review added
  • Owner therve deleted

Well that was intense, thanks for the careful review again. It leads to some nice cleanups. I handled most of points, except:

1.4 Moving start/stop would break lots of tests, and I think the logic is managed correctly now.

3.1 I didn't reuse readFromFD, mostly because it's hard to use and to customize. I don't know exactly why we need to read one byte at a time, unfortunately. It's due to some limitations of the pipe afair.

This bug is painful, mostly because trying to fit all of this in one branch is insane. Still, here we are, and I think we're close to have something useful. One last round?

comment:86 Changed 2 years ago by glyph

My battery's going to run out before I have time for a full review, but for starters, the behavior of this branch in the face of hung tests and control-C is kinda weird. For example, if I run this python file:

from twisted.trial.unittest import TestCase
from twisted.internet.defer import Deferred
class Forever(TestCase):
    def test_fast(self):
        pass
    def test_forever(self):
        return Deferred()

With regular trial, I get:

$ trial test_forever.py
test_forever
  Forever
    test_fast ...                                                          [OK]
    test_forever ... ^C                                                   [ERROR]
Traceback (most recent call last):
  File ".../bin/trial", line 18, in <module>
    run()
  File ".../twisted/scripts/trial.py", line 398, in run
    test_result = trialRunner.run(suite)
  File ".../twisted/trial/runner.py", line 797, in run
    return self._runWithoutDecoration(test, self._forceGarbageCollection)
  File ".../twisted/trial/runner.py", line 826, in _runWithoutDecoration
    run()
  File ".../twisted/trial/runner.py", line 821, in <lambda>
    run = lambda: suite.run(result)
  File ".../twisted/trial/runner.py", line 275, in run
    TestSuite.run(self, result)
  File ".../twisted/trial/unittest.py", line 1407, in run
    test(result)
  File ".../twisted/trial/unittest.py", line 1395, in __call__
    return self.run(result)
  File ".../twisted/trial/runner.py", line 179, in run
    super(LoggedSuite, self).run(result)
  File ".../twisted/trial/unittest.py", line 1407, in run
    test(result)
  File ".../twisted/trial/unittest.py", line 1395, in __call__
    return self.run(result)
  File ".../twisted/trial/runner.py", line 152, in run
    test(result)
  File ".../twisted/trial/unittest.py", line 1395, in __call__
    return self.run(result)
  File ".../twisted/trial/runner.py", line 152, in run
    test(result)
  File ".../twisted/trial/unittest.py", line 1395, in __call__
    return self.run(result)
  File ".../twisted/trial/runner.py", line 152, in run
    test(result)
  File ".../twisted/trial/unittest.py", line 731, in __call__
    return self.run(*args, **kwargs)
  File ".../twisted/trial/unittest.py", line 1120, in run
    _collectWarnings(self._warnings.append, runThunk)
  File ".../twisted/trial/unittest.py", line 178, in _collectWarnings
    result = f(*args, **kwargs)
  File ".../twisted/trial/unittest.py", line 1112, in runThunk
    self._wait(d)
  File ".../twisted/trial/unittest.py", line 1287, in _wait
    raise KeyboardInterrupt()
KeyboardInterrupt
[Error: 1]

But with disttrial, I get:

$ disttrial test_forever.py
Running 2 tests.
test_forever
  Forever
    test_fast ...                                                          [OK]
^C
-------------------------------------------------------------------------------
Ran 1 tests in 7.619s

PASSED (successes=1)

It seems odd that the test does not fail or show up in any way as interrupted (or even that it start to run).

comment:87 follow-up: Changed 2 years ago by therve

It should be better with r34934.

comment:88 Changed 2 years ago by glyph

  • Summary changed from disttrial --hosts=kunai,takkun,muon twisted to Parallelize execution of Trial unit tests into multiple processes to make more efficient use of available CPU cores

The subject line of the ticket here is basically describing the use-case that we have explicitly decided not to support, so I'm changing it to be more descriptive of the work that's actually been done in this branch.

comment:89 in reply to: ↑ 87 Changed 2 years ago by glyph

Replying to therve:

It should be better with r34934.

Indeed, it was. I still find the behavior somewhat weird - I don't know what test I'm canceling until I hit ^C, and the error is far from descriptive, but at least it indicates that not all the tests were run.

comment:90 Changed 2 years ago by glyph

For what it's worth, I'm starting to lean in the direction of accepting this. It's not perfect but it looks like it meets all of our requirements for new code, and to the extent that parts are sub-optimal they have been properly hidden behind an intimidating array of underscores.

However, the disttrial command-line still bothers me, for a bunch of reasons:

  1. It's a new script, which means it increases the "API surface" of Twisted somewhat more than a new command-line flag. This will require packaging updates, fixes to everybody's distutils workarounds (and we all have our own). Now, obviously this cost is worth incurring if the functionality is a really a separate tool with distinct functionality, but...
  2. ... disttrial actually duplicates tons of code from trial. Now, the private-ness of all of its internal APIs is great, because we can refactor and eliminate this duplication without further delaying the merge of this ticket, but the separate entry-point is problematic because it creates an unfortunate legacy we have to hold on to.
  3. trial is a word. disttrial is a silly, easy to misspell ("I forget, was it disttrial or distrial?") abbreviation tacked on to the front of a word.
  4. trial -j follows a long and venerable tradition for local parallelism. disttrial evokes distcc, which provides exactly the functionality that this tool does not (yet), i.e. utilization of multiple hosts, as I just noted in my comment about the summary change. You still have to use make -j to run distcc! So it's somewhat misleading.
  5. The documentation is all duplicated, and the man page (which is already duplicating information) is also duplicated. We are already bad enough about just updating the narrative documentation; we don't need more things that our users will be confused by being out of date.

exarkun raised two significant counterpoints to this criticism:

  1. Shuffling around the command name is makework, let's get it merged already.
  2. The separate command-line and separate documentation makes it more obvious which features are not supported by disttrial as opposed to trial.

To the first point: I feel somewhat strongly about this and I will volunteer to do all the necessary work to make it happen. In fact, in order to make sure that we don't hold anything up unnecessarily, I will volunteer to do it after this is merged in its current state, presuming that at least the release manager for 12.2 will at least give me a week's heads-up before cutting a pre-release. I'd rather do it before commit, but I'd more rather make sure that this does make it in time for 12.2.

To reduce makework, especially if we are going to do this pre-landing, then I would say that the only thing to do would be to cause the -j command-line to short-circuit all processing of the trial command-line, and jump straight into the existing code in _dist. This would make even trial -j 2 --help function differently, which I think is OK.

To the second point, I think it would be better to have a ticket to explicitly document which options aren't supported. Right now if they want to get a list, they have to manually diff the options between the two commands. A brief note saying "this option will not work in combination with the -j option" would be better. Also, as I mentioned above, my suggested implementation technique will cause -j / --help in combination to print help for only options supported with -j ;-).

comment:91 Changed 2 years ago by exarkun

To the second point, I think it would be better to have a ticket to explicitly document which options aren't supported. Right now if they want to get a list, they have to manually diff the options between the two commands. A brief note saying "this option will not work in combination with the -j option" would be better. Also, as I mentioned above, my suggested implementation technique will cause -j / --help in combination to print help for only options supported with -j ;-).

This approach fails the usability test to a greater extent than disttrial fails it (to be sure, disttrial fails it). It's also an additional documentation burden across all of trial's interface, instead of a burden segregated to the disttrial corner.

By all means, the functionality should be integrated into trial, but not in a half-assed way (ie, we shouldn't document -j as not working with 95% of the rest of the features, we should make it work with them). I guess with your plan, doing that would follow incrementally? But I guess I'm skeptical that it would ever really follow. I suppose we can pretend that someone will do the work though. Who knows, maybe someone actually will.

comment:92 follow-up: Changed 2 years ago by therve

I see 2 suggestions, and none which makes this branch moving forward. I agreed with Glyph that we should remove the disttrial command, but let's do it a bit better than bootstrapping the current command when a flag is passed. I disagree with Exarkun that documenting some options don't work with -j are a problem.

First, -j doesn't exist right now. It's not like we're breaking a contract here. Then it's not 95% of the of the features, there are 6 options which don't work with disttrial: debug, debug-stacktraces, nopm, profile, dry-run, until-failure. Some of them are already not working between each other (dry-run and debug?). nopm only works when debug is set, and it's not an usability problem. If merged with trial, dry-run would work so we go back to 4. Same for until-failure I think, we get to 3.

So my proposal is to change the branch to merge the changes into a -j option, and change the 3 options to raises an error if passed concurrently with -j. We get to open 3 bugs to fix the behavior of those options. What do you think?

comment:93 in reply to: ↑ 92 Changed 2 years ago by glyph

  • Keywords review removed
  • Owner set to therve

Replying to therve:

I see 2 suggestions, and none which makes this branch moving forward. I agreed with Glyph that we should remove the disttrial command, but let's do it a bit better than bootstrapping the current command when a flag is passed. I disagree with Exarkun that documenting some options don't work with -j are a problem.

First, -j doesn't exist right now. It's not like we're breaking a contract here. Then it's not 95% of the of the features, there are 6 options which don't work with disttrial: debug, debug-stacktraces, nopm, profile, dry-run, until-failure. Some of them are already not working between each other (dry-run and debug?). nopm only works when debug is set, and it's not an usability problem. If merged with trial, dry-run would work so we go back to 4. Same for until-failure I think, we get to 3.

So my proposal is to change the branch to merge the changes into a -j option, and change the 3 options to raises an error if passed concurrently with -j. We get to open 3 bugs to fix the behavior of those options. What do you think?

This sounds perfect to me.

In fact, with this modification... I would say that the branch is ready to merge, albeit with some hesitation. It may touch a lot of code, so it may require a re-review just to make sure everything is still OK. But, I am just about out of objections, and I am willing to commit to doing the re-review once you're done. Let me know when it's ready.

And don't forget to file a separate ticket for the actual multi-host parallelism that this one was originally going to be about.

comment:94 Changed 2 years ago by teratorn

FYI, we have the meta-data to express mutually-exclusive command-line options now. See t.p.usage.Completions.mutuallyExclusive

It's oriented for the completion system, but I see no why we can't use this metadata for documentation, verification, or other purposes.

comment:95 Changed 2 years ago by therve

  • Keywords review added
  • Owner therve deleted

Alright, the changes are made. This is a pretty big change, so it definitely needs an additional look. I didn't use the mutuallyExclusive system, but I can if it's worth it.

comment:96 Changed 2 years ago by glyph

  • Owner set to glyph
  • Status changed from new to assigned

Reviewing.

comment:97 Changed 2 years ago by glyph

Cooking up a new batch of build results just to be sure.

comment:98 Changed 2 years ago by glyph

  • Keywords review removed
  • Owner changed from glyph to therve
  • Status changed from assigned to new

"Thank you" doesn't quite cover it. Thomas, this is a heroic effort.

There's basically one must-handle point:

  1. The unit tests are failing on a bunch of different platforms, in a big way. Please force a build when submitting anything for review in the future, so we will discover these problems faster.

Other than that I'd say it's ready to land, except for these minor cleanup issues.

  1. The word distrial appears in 3 places. This should probably say trial -j, although disttrial would be okay too, depending on context.
  2. When running the tests on Mountain Lion, I noticed that several worker processes were left hanging around. This might be a bug in the CoreFoundation reactor tests. We don't have a buildbot for this OS version yet, so I don't expect it will necessarily work immediately, but one thing I noticed was that it was not entirely obvious to map an individual worker process to a temporary directory, so that I could read its logs. This is a separate feature though, and can easily be a separate ticket.
  3. As a result of getting rid of disttrial, you've also got rid of the automatic detection of the number of CPU cores / ideal jobs to run. It would be good to put this back as --jobs=auto, or something like that.
  4. LocalWorkerTransport is a class that I am sure that every Twisted developer has written at least twice. It is too bad that this branch has yet another copy of it, although I guess this is the first version to actually show up in Twisted. Please try to coordinate a bit with ashfall to remove this duplication, perhaps after the process endpoint ticket, which should need roughly the same infrastructure, has been merged. (I guess I know what I'm reviewing next!)
  5. test_jobsConflictWithDebugStackTraces, test_jobsConflictWithDebug, and test_jobsConflictWithProfile all say "not supported yet", which is dangerously close to a "TODO" comment; please file and link to tickets. Bonus points for the use of the "U{}" epytext feature.
  6. As long as you're fixing up the man page (and even this exact line!), --temp-directory should have plural agreement. i.e. "Do not use this option unless you know what you are doing."
  7. twisted.scripts.trial.Options has no docstring. (To be fair, it didn't before, but the whole thing has been modified so I think I feel justified in asking for one ;-)).
    1. _makeRunner should have a @return/@rtype.
    2. new attributes _workerFlags and _workerParameters should have @ivars.
  8. LocalWorkerAMPTestCase.pumpTransports - should be merged with something else. There's implementations of this already in twisted.protocols.loopback, twisted.test.iosim, and twisted.test.test_pb, and there needs to be a better one with more granular control in twisted.test.proto_helpers. A ticket / docstring link to this effect would be fine.
  9. pydoctor has some complaints:
    twisted.trial._dist.test.test_options.WorkerOptionsTestCase.test_coverage:41
    invalid ref to WorkerOptions.covedir
    twisted.trial._dist.test.test_worker.LocalWorkerTestCase.test_childDataReceived:260
    invalid ref to ProcessProtocol.childDataReceived
    twisted.trial._dist.test.test_worker.LocalWorkerTestCase.test_outReceived:276
    invalid ref to _outLog
    twisted.trial._dist.test.test_workerreporter.WorkerReporterTestCase.test_addExpectedFailure:100
    invalid ref to managercommands.addExpectedFailure
    

comment:99 Changed 2 years ago by therve

  • Resolution set to fixed
  • Status changed from new to closed

(In [35034]) Merge disttrial-1784-16

Authors: iratsu, exarkun, glyph, therve
Reviewers: exarkun, jml, glyph, thijs, jesstess, therve
Fixes: #1784

Add a -j option to trial allowing it to run the tests using multiple local processes.

comment:100 Changed 2 years ago by therve

  • Resolution fixed deleted
  • Status changed from closed to reopened

(In [35035]) Revert r35034: failures on the py-without-modules buildslave

Reopens: #1784

comment:101 Changed 2 years ago by therve

  • Keywords review added

comment:102 Changed 2 years ago by therve

  • Owner therve deleted
  • Status changed from reopened to new

comment:104 Changed 23 months ago by therve

  • Branch changed from branches/disttrial-1784-16 to branches/disttrial-1784-17

(In [35690]) Branching to 'disttrial-1784-17'

comment:105 Changed 23 months ago by therve

(In [35692]) Branching to 'disttrial-1784-17'

comment:106 Changed 23 months ago by glyph

Forcing another build to test the forward-merged version.

comment:107 Changed 23 months ago by glyph

There are a couple of legitimate pydoctor reference errors which it would be great to clean up before landing.

comment:108 Changed 23 months ago by glyph

Otherwise the forced builds look good.

comment:109 Changed 23 months ago by glyph

  • Keywords review removed
  • Owner set to therve

It passes on the py-without-modules slave now.

A few things to clean up before landing, since every review of a branch this big is bound to yield something:

  1. There are a bunch of @since flags which need to be updated to be 12.3.
  2. Fix up those two pydoctor reference errors.
  3. launchWorkerProcess's documentation of its spawner argument is wrong; it needs to take more stuff than a "command" (type unspecified) and a process protocol.
  4. _driveWorker should have a @return and an @rtype since it does in fact return something.

Things to consider for the future, and file (or link existing) tickets for before landing:

  1. The narrative documentation needs to be updated to mention trial -j. Also, possibly, be updated to mention things about constraints that running with -j might impose on your code, like mutating state that might be shared between those testing processes.
  2. LocalWorkerProcess et. al. (launchWorkerProcess, workertrial.main) have kernels of generally useful functionality and we should keep in mind that this should be refactored into a general process pool (something which Twisted needs generally).

So... go crank up some music and commit this to trunk.

Thanks!

comment:110 Changed 23 months ago by glyph

  • Priority changed from normal to highest

(merge it now merge it RIGHT NOW)

comment:111 Changed 23 months ago by therve

  • Resolution set to fixed
  • Status changed from new to closed

(In [35718]) Merge disttrial-1784-17

Authors: iratsu, exarkun, glyph, therve
Reviewers: exarkun, jml, glyph, thijs, jesstess, therve
Fixes: #1784

Add a -j option to trial allowing it to run the tests using multiple local
processes.

Note: See TracTickets for help on using tickets.