Opened 15 years ago

Closed 6 years ago

#2673 defect: closed (fixed)

test_synchronization fails sometimes

Reported by: Jean-Paul Calderone
Owned by: Glyph
Priority: normal
Milestone:
Component: core
Keywords:
Cc: Adi Roiban, David Petar Novakovic, radix
Branch: branches/desynchronize-2673-4
Author: glyph

Description

Test Pass 3935
Running 7 tests.
twisted.test.test_threadpool
  RaceConditionTestCase
    test_singleThread ...                                                  [OK]
    test_synchronization ...                                             [FAIL]
  ThreadPoolTestCase
    testCallInThread ...                                                   [OK]
    testDispatch ...                                                       [OK]
    testExistingWork ...                                                   [OK]
    testPersistence ...                                                    [OK]
    test_threadCreationArguments ...                                       [OK]

===============================================================================
[FAIL]: twisted.test.test_threadpool.RaceConditionTestCase.test_synchronization

Traceback (most recent call last):
  File "/home/exarkun/Projects/Twisted/trunk/twisted/test/test_threadpool.py", line 216, in test_synchronization
    self.fail("Actions not synchronized")
twisted.trial.unittest.FailTest: Actions not synchronized
-------------------------------------------------------------------------------
Ran 7 tests in 120.308s

FAILED (failures=1, successes=6)

Change History (58)

comment:1 Changed 11 years ago by <automation>

Owner: Glyph deleted

comment:2 Changed 9 years ago by Jean-Paul Calderone

#6214 was a duplicate of this.

comment:3 Changed 7 years ago by Glyph

Author: glyph
Branch: branches/desynchronize-2673

(In [43158]) Branching to desynchronize-2673.

comment:4 Changed 7 years ago by Glyph

Author: glyph
Branch: branches/desynchronize-2673
Keywords: easy review added

This test is nonsense and it should be deleted. It should have been deleted a long time ago.

comment:5 Changed 7 years ago by Glyph

Author: glyph
Branch: branches/desynchronize-2673

comment:6 Changed 7 years ago by Glyph

Test failures are all due to #7651 and some weird multicast thing that is unrelated.

comment:7 Changed 7 years ago by Jean-Paul Calderone

This test is nonsense and it should be deleted. It should have been deleted a long time ago.

A little bit more detail than this might be nice.

comment:8 in reply to:  7 Changed 7 years ago by Glyph

Replying to exarkun:

This test is nonsense and it should be deleted. It should have been deleted a long time ago.

A little bit more detail than this might be nice.

OK, I can expound a little bit on how it is specifically nonsense :).

The test takes the following steps:

  1. Set an event, in a pool thread.
  2. Wait for the event, on the IO thread.
  3. Clear the event.
  4. Now we wait on the event in 3 pool threads. Or, I should say, we tell the pool to do this; the 'wait' may happen at any time later, of course, since it's happening concurrently.
  5. In a 4th pool thread, we set the event.
  6. On the I/O thread, we wait on the event again to allow the 'set' to complete.
  7. Then we assert the event is set.
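The seven steps read more clearly as code. A minimal stdlib sketch of the same sequence (`run_once` is a hypothetical name, and `ThreadPoolExecutor` stands in for Twisted's ThreadPool; the stand-in always has enough workers, so it does not reproduce the bug):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_once(timeout=5.0):
    event = threading.Event()
    with ThreadPoolExecutor(max_workers=10) as pool:
        pool.submit(event.set)         # step 1: set, in a pool thread
        assert event.wait(timeout)     # step 2: wait, on the calling thread
        event.clear()                  # step 3: clear
        for _ in range(3):             # step 4: three tasks that block in wait
            pool.submit(event.wait)
        pool.submit(event.set)         # step 5: set, in a fourth task
        assert event.wait(timeout)     # step 6: wait for the set to complete
        return event.is_set()          # step 7: the flaky assertion
```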

Let's say that Python is having a slow day, and the wait call in step 2 times out. Step 1 doesn't necessarily happen before step 3.

The same thing can happen if the wait in step 6 times out. There's no guarantee that threads get scheduled to run within 5 seconds, and on a heavily loaded system, they may indeed not be.

More importantly, though: why should it? Nothing about any of the code we've written for Twisted really controls any of this stuff, and the test is mistaken about the semantics of threads in general. We could mess around with the assertions, but it makes more sense to just delete the test and maybe write new tests later if we can think of interesting properties of twisted.python.threadpool worth verifying.

comment:9 Changed 7 years ago by Jean-Paul Calderone

Let's say that Python is having a slow day, and the wait call in step 2 times out.

This problem generalizes to practically all tests that return an unfired Deferred (I know this test doesn't but the problem is the same: timing of events). Most tests like that in Twisted don't intermittently fail, though. Why is this test so much more likely to fail because of timing going wrong?

There's no guarantee that threads get scheduled to run within 5 seconds, and on a heavily loaded system, they may indeed not be.

This part is a little bit funny. I thought that getTimeout returned 120 because that's what TestCase.getTimeout returns. Then I noticed that someone decided to override it to return 5 for this test case (presumably because this is a SynchronousTestCase instead of a TestCase and thus doesn't inherit the normal getTimeout implementation).

Does the test just fail because 5 seconds is too short? Would most of the rest of the test suite have intermittent failures if TestCase.getTimeout returned 5 instead of 120?

More importantly, though: why should it? Nothing about any of the code we've written for Twisted really controls any of this stuff, and the test is mistaken about the semantics of threads in general. We could mess around with the assertions, but it makes more sense to just delete the test and maybe write new tests later if we can think of interesting properties of twisted.python.threadpool worth verifying.

The primary criticism I can raise against this test is that it's totally unclear what it's trying to test. I can certainly imagine this exercising some important codepath through twisted.python.threadpool - a code path which must be correct or the thread pool will be broken, despite the fact that the actual thread implementation just comes from the Python runtime. The difficulty is that the test docstring makes no sense ("actions run in the pool synchronize with actions run in the main thread"? Huh?) and since it's thread-related code, whatever behavior it's trying to exercise is extremely difficult to divine from the implementation.

For what it's worth, ignoring load considerations (in other words, problems which would prevent one of the threads involved in this test from being scheduled within any 5 second window), I don't see why the test fails.

My understanding of threading.Event is that step 1 actually does complete before step 3 begins because step 2 cannot complete until the action initiated by step 1 has been taken.

The I/O thread will remain blocked until the event is set. The I/O thread therefore cannot proceed to clearing the event until the thread pool thread has completed the set operation on the event. 1 -> 2 -> 3 is the guaranteed order - again, according to my understanding of threading.Event.

Oh, except for the part where the return value of Event.wait is ignored. Perhaps the author thought this would raise an exception if the timeout was hit. If you imagine every call to wait(timeout) as assert wait(timeout) then at least the test would fail in a more coherent way when it is failing because of thread scheduling importunities.
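The assert-on-wait idiom suggested above, sketched with the stdlib (nothing here is Twisted code):

```python
import threading

event = threading.Event()
threading.Thread(target=event.set).start()

# Event.wait(timeout) returns True if the flag was set and False if the
# timeout elapsed; asserting on the return value turns a scheduling hiccup
# into a clear failure instead of silently continuing with stale state.
flag_was_set = event.wait(5.0)
assert flag_was_set, "timed out waiting for the event"
```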

comment:10 in reply to:  9 Changed 7 years ago by Glyph

Replying to exarkun:

Let's say that Python is having a slow day, and the wait call in step 2 times out.

This problem generalizes to practically all tests that return an unfired Deferred

Sorry, focusing on this part of the explanation was the wrong thing to put first. Really the thing to focus on is:

The primary criticism I can raise against this test is that it's totally unclear what it's trying to test.

I have a hard time choosing between "primary" and "secondary", but the top three criticisms I have are:

  1. it makes our test suite intermittently fail,
  2. it is totally unclear what it is trying to test, and
  3. if it is testing something worthwhile (which is also not entirely clear), the worthwhile thing is almost certainly in Python or POSIX somewhere, not in our code.

I can certainly imagine this exercising some important codepath through twisted.python.threadpool - a code path which must be correct or the thread pool will be broken

While I can imagine this, and I would not put it beyond the realm of possibility, I don't believe that it is so. First of all, since it fails intermittently, it can't be correctly covering whatever cases need covering, and its failure certainly isn't going to tell us anything useful. I have an existence proof here: its (ongoing, intermittent) failure hasn't told us anything useful ;). It might error in some condition that none of the other tests do, but... consider this: the only API invoked by "test_synchronization" is callInThread. We have 7 other tests for that in test_threadpool alone, let alone all the reactor tests which also cover it.

callInThread is just an alias (no branches) for callInThreadWithCallback. callInThreadWithCallback has two 'if' statements in it, one of which is just guarding "joined" state (which this test definitely isn't covering, as it doesn't join the threadpool).

Of course, this indirectly invokes _worker and _workerState and _startSomeWorkers, but it doesn't directly test any boundary conditions of those. _startSomeWorkers' only branch is a loop, and we don't get anywhere close to its liminal conditions (we don't get close to maxthreads units of work here). _workerState is incredibly simple and doesn't have any branches beyond a "yield", the onResult and try/except and WorkerStop branches in _worker are all covered explicitly, if somewhat imperfectly (due to threading being awful in all the usual ways) by other tests.

Finally, it's worth noting that a coverage report with and without this test doesn't look appreciably different. I know that line coverage isn't everything, but once you move beyond line coverage, you move into the realm where the intent and the system properties being verified by the tests become important, and as covered above, we don't even know what the intent is or what properties are being verified.

So, regardless of how the explanation of why the test is failing goes (I'll cover that in another comment) I think this should be deleted. We shouldn't keep test code around because it covers some conditions that we don't understand in some way we can't predict. If the code is exercised and the properties we do understand are covered by other tests, we should prune dead test code just as we'd prune dead implementation code.

This code isn't even dead - it's worse than dead, it's an anti-feature. By intermittently failing, it decreases our confidence in our test suite, bogs down reviews, and complicates diagnosing real bugs.

comment:11 in reply to:  9 Changed 7 years ago by Glyph

Oh my goodness there is an actual bug here. Wow. Wow.

comment:12 Changed 7 years ago by Glyph

So, on my very very fast desktop computer, I typically get a failure after only 200 to 300 iterations, which is the opposite of what I would have expected if the issue were just that things weren't running fast enough and wait timed out.

comment:13 Changed 7 years ago by Glyph

Keywords: easy removed

Yeah, not "easy" by any stretch of the imagination. What was I thinking, putting that on a threading ticket.

comment:14 Changed 7 years ago by Glyph

The problem is that not enough worker threads are started. Sometimes, when we go to run step 5, above, the 3 existing worker threads are blocked in self.event.wait, and _startSomeWorkers decides that it doesn't need an additional worker to call set.

The first part of the test (before clear) is entirely a red herring. The for loop is a red herring. Pretend for a moment that it just does a callInThread(wait), callInThread(set), wait().

First, we put wait into the queue, ensuring that we will have just one thread and it will be blocked.

self.workers goes up by one.

self.q.qsize goes up by one, since there's something waiting in the queue.

If we were to call _startSomeWorkers now, it would indeed notice that the only worker is blocked; so far, so good.

Now here's the problem:

In the thread, _worker gets to self.q.get(), gets the value - self.q.qsize goes down by one. Then it's just about to enter that _workerState context manager to update working, but, before it does...

(qsize is 0, self.workers is 1, self.working is [])

Context switch!

Back to the main thread. We put set into the queue. self.q.qsize goes up by one. Time to start some workers.

(qsize is 1, self.workers is 1, self.working is [])

In _startSomeWorkers:

  • neededSize is qsize (1) - len(working) (0); so, 1
  • self.workers is 1.

1 is not less than 1.

Therefore, no need to start a worker.

Our only worker thread now proceeds to block in wait. set sits on the queue, forever.

The main thread calls wait, and times out.

~ fin ~
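The bookkeeping walked through above can be condensed into a toy model (a simplified sketch of the heuristic this comment describes, with hypothetical names; not the actual twisted.python.threadpool source):

```python
import queue

class Model:
    """Simplified model of the old thread pool's bookkeeping."""
    def __init__(self):
        self.q = queue.Queue()
        self.workers = 1     # one worker thread already exists
        self.working = []    # workers currently marked as working

    def start_some_workers(self):
        # The racy heuristic: queue size minus busy workers.
        needed = self.q.qsize() - len(self.working)
        if self.workers < needed:
            self.workers += 1   # would start a new thread here
            return True
        return False

m = Model()
# 'wait' was already q.get()-ed by the worker, which hasn't yet marked
# itself as working: qsize == 1 (just 'set'), workers == 1, working == [].
m.q.put("set")
started = m.start_some_workers()
# needed = 1 - 0 = 1; 1 is not less than 1, so no worker starts, and the
# 'set' task is stranded behind the worker about to block in wait.
assert started is False
```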

comment:15 Changed 7 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: set to Glyph

Thanks for digging further into that.

I think it's now safe to say we should clean up this test and then fix the implementation so it passes reliably?

Or if that's impossible (it may be, I seem to recall coming to that conclusion in the past) we should deprecate IReactorThreads and twisted.python.threadpool after replacing them with an API that it's actually possible to implement correctly using the primitives available in Python?

Either way, it sounds like there's a bit more to do here.

comment:17 Changed 7 years ago by Glyph

Oh, apparently you can't use Queue or RLock either. A safer mechanism would be to just use pipes everywhere, which is really quite a lot like using a reactor in every thread.

Last edited 7 years ago by Glyph

comment:18 Changed 7 years ago by Glyph

The problems with Queue and RLock are with signal safety, and we handle signal safety by using callFromThread for processing signals, which properly uses a file descriptor. While it might be nice to move away from those as concurrency primitives, we don't need to do that in this branch (and everything in the new package added by this branch is so ridiculously flexible that we can easily replace Queue, we just need an object with put and get.)
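For illustration, the put/get duck type mentioned above is small enough to sketch (`TinyQueue` is a hypothetical name, not part of Twisted):

```python
import collections
import threading

class TinyQueue:
    """Minimal object satisfying the put/get contract described above;
    an illustrative stand-in for Queue.Queue, not Twisted code."""
    def __init__(self):
        self._items = collections.deque()
        self._available = threading.Semaphore(0)

    def put(self, item):
        self._items.append(item)
        self._available.release()

    def get(self):
        self._available.acquire()       # block until something is available
        return self._items.popleft()

q = TinyQueue()
q.put("task")
assert q.get() == "task"
```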

comment:19 Changed 7 years ago by Glyph

Branch: branches/desynchronize-2673 → branches/desynchronize-2673-2

(In [43318]) Branching to desynchronize-2673-2.

comment:20 Changed 7 years ago by Glyph

Since sometimes per-module/per-package stuff can be a little hard to see on the buildbots, FYI reviewers:

$ twistedchecker -f parseable twisted/python/threadpool.py twisted/threads
$ coverage report
Name                                     Stmts   Miss Branch BrMiss  Cover
--------------------------------------------------------------------------
twisted/threads/__init__                     6      0      0      0   100%
twisted/threads/_convenience                11      0      2      0   100%
twisted/threads/_ithreads                    4      0      0      0   100%
twisted/threads/_memory                     26      0      6      0   100%
twisted/threads/_team                       75      0     29      0   100%
twisted/threads/_threadworker               45      0     10      0   100%
twisted/threads/test/__init__                1      0      0      0   100%
twisted/threads/test/test_convenience       23      0      0      0   100%
twisted/threads/test/test_memory            31      0      4      0   100%
twisted/threads/test/test_team             147      0     30      0   100%
twisted/threads/test/test_threadworker     124      0     12      0   100%
--------------------------------------------------------------------------
TOTAL                                      493      0     93      0   100%
$ 

comment:21 Changed 7 years ago by Glyph

OK, I think this is ready for review now.

It's more code than I'd like, but the composition of objects here is way more straightforward than the all-in-one spaghetti that was previously in ThreadPool.

Most importantly, this seems to conclusively address the issue; with trial -u I was able to get failures, or hangs, within a dozen or so runs. Now I can easily get to hundreds of successful runs reliably. And when I was doing development and getting failures along the way, the failures - even in the existing, integration-y threadpool tests, as opposed to the new ones which are more unit-y - were almost always immediate tracebacks rather than timeouts or hangs.

comment:22 Changed 7 years ago by Glyph

Keywords: review added
Owner: Glyph deleted

comment:23 Changed 7 years ago by Jean-Paul Calderone

Thanks for working on this. Fixing ThreadPool has been on my todo list for a long time but I never got over the hurdle of having to learn exactly how the existing implementation was broken or what alternatives were available and would actually work. Providing Twisted with a more reliable thread pool will make it even easier to recommend Twisted to a wider audience of people (mostly those who aren't able to completely buy in to non-blocking/async).

comment:24 in reply to:  23 Changed 7 years ago by Glyph

Replying to exarkun:

Thanks for working on this.

Thanks for saying this. It was a bit of a slog and it's nice to know it's appreciated :-).

Fixing ThreadPool has been on my todo list for a long time but I never got over the hurdle of having to learn exactly how the existing implementation was broken or what alternatives were available and would actually work.

It ... was a mess. I thought I did understand it! But I think the one way I figured out that it was broken indicated by the ticket here was just the tip of the iceberg; the one-queue-multiple-consumers model makes it really, really hard to think about coordinating access to the relevant shared resources.

I've been hearing these very occasional second- or third-hand anecdotes about twistd hanging at shutdown or refusing to exit, and occasionally getting stuck on hostname resolution with the default resolver. It's really bugged me because this is all "should not be possible" type behavior. Having observed the behavior of the threadpool under lots of stress testing while doing this, both with the old code and the new, I think that this is going to really improve reliability. The only other bug like this I think is the one about spawnProcess hanging.

Providing Twisted with a more reliable thread pool will make it even easier to recommend Twisted to a wider audience of people (mostly those who aren't able to completely buy in to non-blocking/async).

On that note, I should point out that the new interface in the twisted.threads package in this branch is amenable to dispatching work using other concurrency models as well.

For one thing, almost nothing actually imports anything, you get to inject your dependencies at construction time, so there's no need to monkeypatch or inject anything if you have different objects.

The new core interface, IWorker, has only 2 methods: do and quit. A Team is a composite implementation of that interface that has 2 types of worker: 1 coordinator where tasks related to the shared state on the Team itself are performed, and N workers where the work is actually done.

In the configuration that the new ThreadPool implementation uses, the pooling is done with a coordinator-worker factory that acquires a lock and unwinds reentrancy for the coordinator, and a worker-worker factory that puts things into a queue and performs them in a thread. It uses threading.Thread for the thread and Queue.Queue for the queue. But you can replace the coordinator with a thread-based worker, and the logic all still works (the tests still pass) - it just consumes an additional thread. So conceptually, you can replace the workers with things based on green threads, or maybe even something based on subprocesses.
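The two-method contract can be sketched in plain Python (an illustrative reduction with hypothetical names, not the actual twisted._threads code):

```python
import queue
import threading

class SimpleThreadWorker:
    """Illustrative IWorker-style object: the whole interface is do()
    and quit(), backed by one thread reading tasks from a queue."""
    _STOP = object()

    def __init__(self):
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            task = self._q.get()
            if task is self._STOP:
                break
            task()

    def do(self, task):
        self._q.put(task)

    def quit(self):
        self._q.put(self._STOP)
        self._thread.join()

results = []
w = SimpleThreadWorker()
w.do(lambda: results.append(1))
w.quit()
assert results == [1]
```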

The fact that these are decoupled and exposed also makes it easier to implement different coordination patterns. For example, if you want to have a thread pool (echoes of adbapi...) where particular resources are bound to particular threads, and you want to give one specific thread something to do, the implementation of one-thread-reading-from-a-queue-to-do-work is separately exposed, so you don't need to hack an extra layer of queueing in.

Although I've removed the reactor-based implementation of this for brevity, you can also have an implementation that wraps a callFromThread and then put a reactor on each of the threads if we isolate them :-).

But most of all the fact that it's only two methods decouples things that want to do scheduling from a lot of extraneous details, and makes critical shared-state threaded code a lot easier to reason about.

comment:25 Changed 7 years ago by Jean-Paul Calderone

Owner: set to Jean-Paul Calderone
Status: new → assigned

comment:26 Changed 7 years ago by Jean-Paul Calderone

Keywords: review removed
Owner: changed from Jean-Paul Calderone to Glyph
Status: assigned → new

Here's a start. Sorry it's incomplete but I'd like to look at some user-facing documentation before I think a lot more about the implementation.

  1. (required) Missing prose-style documentation or examples demonstrating the new public APIs

In _ithreads.py:

  1. (required)

Interfaces related to therads.

therads

  2. (required)

def do(task): # Pragma: nocover

I don't think # Pragma: nocover is something we do. I'd prefer we not do it, in any case. Please raise this in a general Twisted discussion forum and try to get some agreement there if you want these to stay.

  3. (required)

Perform the given task.

Should this say anything about thread-related pre- or post-conditions? Is this interface method thread-safe? Maybe this information doesn't belong in the interface? If you think so, just explain why here I guess.

In _convenience.py:

  1. (optional)

Private; not exposed in a public interface anywhere.

It looks like this documentation idiom isn't used uniformly in this changeset. What's the goal?

In _threadworker.py:

  1. (required)

callablea

typo

in threadpool.py:

  1. (required)

# Called only from the workforce's coordinator.

It's not totally clear what this comment applies to or why it's interesting information.

  2. (required)

self._team = Team(lambda: LockWorker(threading.Lock(),
                                     threading.local()),
                  createWorker=limitedWorkerCreator, logException=log.err)

This log.err would be better if it incorporated some information for the reason somehow (like a worker identifier or some other contextual information). Also... shouldn't this be something in the new logging system, not the old logging system?

in _team.py:

  1. (optional)

@self._coordinator.do

This is a cute idiom but I don't think it helps maintainability at all. I'd rather see normal function calls.

  2. (required)

# (Explicitly do _not_ honor _quit.)

Why not?

  3. (required)

Team class docstring is missing docs for several private attributes.

comment:27 in reply to:  26 Changed 7 years ago by Glyph

Replying to exarkun:

Here's a start. Sorry it's incomplete but I'd like to look at some user-facing documentation before I think a lot more about the implementation.

  1. (required) Missing prose-style documentation or examples demonstrating the new public APIs

Would it be possible to skip this (and have you think hard about the implementation in the absence of such docs) if twisted._threads were a private package at first? Coming up with documentation that adequately explains its purpose within Twisted, especially in the context of IReactorThreads and twisted.internet.threads, is going to be very challenging, and it is a lot less interesting to me than fixing the reliability issue that this new code addresses. I'd be happy to write those docs later once this has landed and do other preparatory work to get it ready for a larger audience.

comment:28 Changed 7 years ago by Jean-Paul Calderone

Would it be possible to skip this (and have you think hard about the implementation in the absence of such docs) if twisted._threads were a private package at first? Coming up with documentation that adequately explains its purpose within Twisted, especially in the context of IReactorThreads and twisted.internet.threads, is going to be very challenging, and it is a lot less interesting to me than fixing the reliability issue that this new code addresses. I'd be happy to write those docs later once this has landed and do other preparatory work to get it ready for a larger audience.

There are three sections of documentation I can imagine existing for this code:

  1. How are users meant to use this new library?
  2. How do the pieces of this new library fit together?
  3. How does this library benefit the implementations of other parts of Twisted - IReactorThreads, twisted.internet.threads, and twisted.python.threadpool?

(1) strikes me as necessary before this library becomes a public API in Twisted. I think that ideally, this documentation would be written now as the code is fresh in your mind and because "ticket not approved" is pretty much the only leverage I have to get you to write it. ;) If we introduce this library as a private API then the lack of this documentation isn't a blocking concern, I suppose, since there won't be any users outside of Twisted (so only Twisted maintainers will get frustrated).

Even if the new API is private to Twisted, the addition of (2) is still more likely to get me to (in practice) re-review this sooner. To some extent (2) probably implies (1) - since some of the pieces of this new library are the user-facing pieces - but the point of (2) is really to make it so I can think a little bit less hard but still understand what's going on. Lacking this, I might still do a re-review but since it will probably take harder thinking it will probably be longer before I actually manage to do it (simply as a practical consequence).

(3) doesn't concern me all that much. There is already documentation for the existing threading APIs. If there is documentation for the new threading APIs then I feel like it should be possible to figure out what's going on, even without additional documentation that ties the two together. It would probably be a nice addition but I'd much rather have (and would be satisfied with) (1) or (2).

comment:32 Changed 7 years ago by David Petar Novakovic

Cc: David Petar Novakovic added

comment:36 Changed 7 years ago by radix

It would be nice if someone updated the ticket title and description to describe what the actual bug is, instead of just saying that the test fails.

comment:37 Changed 7 years ago by radix

Cc: radix added

comment:38 in reply to:  36 Changed 7 years ago by Glyph

Replying to radix:

It would be nice if someone updated the ticket title and description to describe what the actual bug is, instead of just saying that the test fails.

It is … super hard to describe the problem here succinctly.

The unreliable test is in fact a real problem, though, and should be fixed :). I could probably open 5 or 6 more bugs, which would all be resolved at once, to describe the broken behavior of the current thread pool implementation.

Hmm. Possibly a good idea.

comment:39 Changed 7 years ago by Glyph

Branch: branches/desynchronize-2673-2 → branches/desynchronize-2673-3

(In [43977]) Branching to desynchronize-2673-3.

comment:41 Changed 7 years ago by hawkowl

T GLYPH I AM REVIEWING A TICKET BUT THIS ONE IS NOT ACTUALLY IN REVIEW BUT I'M DOING IT ANYWAY

  • docs/core/howto/threading.rst:15 -- twisted -> Twisted
  • docs/core/howto/threading.rst:73 -- this sentence seems a bit awkward
  • docs/core/howto/threading.rst:125 -- remove one "do this"
  • twisted/python/threadpool.py:82 -- should ThreadPool be a new-style class?
  • Could this all be ported to Python 3 (if this is too much to ask for now, could you please file a ticket so I can do it?)

Otherwise, I love it. Lemme know when the docs are fully formed and I'll do a proper review.

comment:42 in reply to:  41 Changed 7 years ago by Glyph

Replying to hawkowl:

T GLYPH

Really hoping nobody else gets that reference.

I AM REVIEWING A TICKET BUT THIS ONE IS NOT ACTUALLY IN REVIEW BUT I'M DOING IT ANYWAY

Thanks for following up on this, I need to be prodded on this one.

  • docs/core/howto/threading.rst:15 -- twisted -> Twisted
  • docs/core/howto/threading.rst:73 -- this sentence seems a bit awkward
  • docs/core/howto/threading.rst:125 -- remove one "do this"
  • twisted/python/threadpool.py:82 -- should ThreadPool be a new-style class?
  • Could this all be ported to Python 3 (if this is too much to ask for now, could you please file a ticket so I can do it?)

Oops, did I not put it in dist3? I should do that.

Otherwise, I love it. Lemme know when the docs are fully formed and I'll do a proper review.

OK. I'm about halfway there - I restructured the existing docs so I could say something sensible about this stuff, the next step is to extract the parts in twisted.python.threadpool without which twisted.threads doesn't make much sense. Also I need to describe the properties of the coordinator versus the worker IWorker providers in Team...

comment:47 Changed 6 years ago by Glyph

Branch: branches/desynchronize-2673-3 → branches/desynchronize-2673-4

(In [45353]) Branching to desynchronize-2673-4.

comment:48 Changed 6 years ago by Glyph

twisted/python/threadpool.py:82 -- should ThreadPool be a new-style class?

No, it's already changing incompatibly in a few ways, and there's no need for it to be new-style for it to be correct. So I am not changing that.

Could this all be ported to Python 3 (if this is too much to ask for now, could you please file a ticket so I can do it?)

I think that's taken care of now; I wish we had more visibility.

comment:49 Changed 6 years ago by Glyph

Keywords: review added
Owner: Glyph deleted

Otherwise, I love it. Lemme know when the docs are fully formed and I'll do a proper review.

After 9 months of struggling to turn this into a complete overhaul of everything related to threading, including the documentation, I have concluded that it is just too much work to cram into one branch, and that we should land this as a bug-fix; I should do a follow-up that adds documentation. The documentation may well change some of the structure (and some of the refactorings I thought of while trying to write the docs are present in the branch right now).

The documentation has still been overhauled somewhat, and hopefully some of the caveats now present in the docs indicate why we need a better thread pool API. I also started splitting out the reactor interfaces, because we will not want to deprecate callInThread and callFromThread but we will definitely want to deprecate and get rid of the other junk on IReactorThreads.

comment:50 Changed 6 years ago by hawkowl

Keywords: review removed
Owner: set to Glyph

Hi Glyph,

  • Please add a license header+etc to twisted/_threads/_pool.py
  • Add __future__ imports
  • Twisted isn't capitalised in the docs in a few places
  • "We can do this do this ::" in the docs

I like. Pls fix those and land. It's time to put this one to bed.

comment:51 Changed 6 years ago by Glyph

As part of running the tests repeatedly on the builders in order to ensure we really have nailed the intermittent bug with the version of the code to be merged, I ran across this:

[ERROR]
Traceback (most recent call last):
  File "/home/buildslave/run/fedora17-x86_64-py2.7/Twisted/twisted/internet/base.py", line 430, in _continueFiring
    callable(*args, **kwargs)
  File "/home/buildslave/run/fedora17-x86_64-py2.7/Twisted/twisted/internet/base.py", line 982, in _stopThreadPool
    self.threadpool.stop()
  File "/home/buildslave/run/fedora17-x86_64-py2.7/Twisted/twisted/python/threadpool.py", line 269, in stop
    self._team.quit()
  File "/home/buildslave/run/fedora17-x86_64-py2.7/Twisted/twisted/_threads/_team.py", line 204, in quit
    self._coordinator.do(self._quitIdlers)
  File "/home/buildslave/run/fedora17-x86_64-py2.7/Twisted/twisted/_threads/_threadworker.py", line 99, in do
    self._quit.check()
  File "/home/buildslave/run/fedora17-x86_64-py2.7/Twisted/twisted/_threads/_convenience.py", line 46, in check
    raise AlreadyQuit()
twisted._threads._ithreads.AlreadyQuit: 

twisted.internet.test.test_threads.ThreadTestsBuilder_EPollReactorTests.test_isNotInIOThread

So it looks like there's still a little work to do...

comment:52 Changed 6 years ago by Glyph

Of course, on my local machine:

twisted.internet.test.test_threads
  ThreadTestsBuilder_SelectReactorTests
    test_isNotInIOThread ...                                               [OK]

-------------------------------------------------------------------------------
Ran 1 tests in 0.001s

PASSED (successes=1)
Test Pass 60642

comment:53 Changed 6 years ago by Glyph

Finally:

[ERROR]
Traceback (most recent call last):
  File "/Users/glyph/Projects/Twisted/twisted/internet/base.py", line 430, in _continueFiring
    callable(*args, **kwargs)
  File "/Users/glyph/Projects/Twisted/twisted/internet/base.py", line 982, in _stopThreadPool
    self.threadpool.stop()
  File "/Users/glyph/Projects/Twisted/twisted/python/threadpool.py", line 269, in stop
    self._team.quit()
  File "/Users/glyph/Projects/Twisted/twisted/_threads/_team.py", line 204, in quit
    self._coordinator.do(self._quitIdlers)
  File "/Users/glyph/Projects/Twisted/twisted/_threads/_threadworker.py", line 99, in do
    self._quit.check()
  File "/Users/glyph/Projects/Twisted/twisted/_threads/_convenience.py", line 46, in check
    raise AlreadyQuit()
twisted._threads._ithreads.AlreadyQuit: 

twisted.internet.test.test_threads.ThreadTestsBuilder_SelectReactorTests.test_isNotInIOThread

comment:54 Changed 6 years ago by Glyph

The problem is that Team._recycleWorker, running in a worker thread, can run concurrently with Team.quit.

self._quit.set() sets the flag, which causes the already-running _recycleWorker to call _quitIdlers, and that invocation of _quitIdlers calls self._coordinator.quit().

Then Team.quit calls self._coordinator.do, but self._coordinator has already quit and therefore raises AlreadyQuit.

comment:55 Changed 6 years ago by Glyph

Fixed that one - wrote a test for it even - and then saw this one.

[ERROR]
Traceback (most recent call last):
  File "/Users/buildbot/BuildSlave/osx10.10-py2.7/Twisted/twisted/python/threadpool.py", line 241, in inContext
    result = inContext.theWork()
  File "/Users/buildbot/BuildSlave/osx10.10-py2.7/Twisted/twisted/python/threadpool.py", line 257, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/Users/buildbot/BuildSlave/osx10.10-py2.7/Twisted/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/Users/buildbot/BuildSlave/osx10.10-py2.7/Twisted/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/Users/buildbot/BuildSlave/osx10.10-py2.7/Twisted/twisted/test/test_threadpool.py", line 297, in raiseError
    raise NewError()
twisted.test.test_threadpool.NewError: 

twisted.test.test_threadpool.ThreadPoolTests.test_callInThreadWithCallback

comment:56 Changed 6 years ago by Glyph

Keywords: review added
Owner: Glyph deleted

OK, figured that one out too.

I figure since we had a couple of bugs review didn't catch on to, this deserves another trip around the merry-go-round. To help out the reviewer, since this is true shared-state preemptive concurrency code and therefore even theoretically impossible to review, I have taken the unusual step of forcing ten builds rather than one, and you can view them all here: https://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/desynchronize-2673-4&num_builds=10

There are still other intermittent failures, especially on osx10.10-py2.7, but if we could get through a full 10 builds on all supported builders without seeing any thread-related failures, I'd be pretty happy.

comment:57 Changed 6 years ago by hawkowl

Keywords: review removed
Owner: set to Glyph

Hi Glyph,

The tests pass, the tests (ran 10 times, and I checked each) seem to be sane, the code seems to be as correct as you can get, there's (code) documentation, and it really seems to be fine. Minor points:

  • Extra newlines at the end of twisted/test/test_threadpool.py
  • Newsfile needs correct capitalisation

Otherwise it's fine… I can't really find any problems with it. The tests pass. I kinda understand the API. Fix those and merge.

comment:58 Changed 6 years ago by Glyph

Resolution: fixed
Status: newclosed

(In [45376]) Merge desynchronize-2673-4: Brief description

Author: glyph

Reviewer: exarkun, hawkowl

Fixes: #2673

Fix an intermittent test failure, rewriting nearly everything about multithreading in Twisted in the process.

Note: See TracTickets for help on using tickets.