[Twisted-Python] Why deferToThread is so slow?

Mark Williams markrwilliams at gmail.com
Fri Jun 3 16:22:57 MDT 2016


On Fri, Jun 03, 2016 at 10:06:43AM +0200, Nagy, Attila wrote:
> Hi,
>
> I have a thread safe synchronous library, which I would like to use
> in a threadpool using deferToThread.
>
> Without using (deferTo)threads I get consistent 1-3 ms response
> times, with deferring to threadpool, I get 30-300, varying wildly.
> [snip...]
> # python /tmp/deft.py 1
> deferToThread: avg 316.17 us, sync: avg 1.38 us, 228.71x increase
> ...
> # python /tmp/deft.py 8
> deferToThread: avg 2542.90 us, sync: avg 1.37 us, 1854.14x increase
> ...
> # python /tmp/deft.py 16
> deferToThread: avg 4745.76 us, sync: avg 1.26 us, 3770.11x increase
>  ...
> On 16 threads it takes nearly 5 ms(!) to call a local function...
>
> The test program is here:
> https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41

There's a lot going on here!  I'm going to break my responses down into three
categories:

1) those about the way your test program is written;
2) those about Queue-based thread pools;
3) those about threads in Python.


1) The way your test program is written.

There are two puzzling lines in your test program.

The first is the use of reactor.callWhenRunning(dtt):
https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert-py-L22

I'm guessing the goal here was to call dtt over and over again,
indefinitely.  Why not replace this with a while True: loop?  The
reactor won't begin scheduling threads until it's begun running, so
you can rely on the yield blocking the coroutine until the reactor has
started.  I'll use a while True: loop in my version of your test
program.
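
For concreteness, here's a minimal sketch of that structure (the
function f and the details here are placeholders, not your exact
code):

    from twisted.internet import defer, task, threads

    def f(x):
        return x  # stand-in for the real work

    @defer.inlineCallbacks
    def main(reactor):
        while True:
            # react() only calls main() once the reactor is running,
            # so it's safe to defer to a thread right away.
            yield threads.deferToThread(f, 1)

    task.react(main)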

...the second, and more interesting, line:
https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert-py-L26

Why run this code in a separate thread?  Besides the fact that doing
so requires you to write thread-safe code (which is hard!), this also
means your dtt function will compete with deferToThread for access to
the reactor's thread pool.

In general, if you're writing performance-sensitive code, you should
not use deferToThread.  Instead, you should use deferToThreadPool:

https://twistedmatrix.com/documents/current/api/twisted.internet.threads.html#deferToThreadPool

That's because the reactor's threadpool, which deferToThread uses,
also handles things like DNS lookups.  That means other parts of your
Twisted program will compete with your performance-sensitive code.  To
ensure that your performance-sensitive code gets the best chance to
run, it should be given its own thread pool, which *only* runs that
code.  I'll use such a thread pool in my version of your test program.
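
Setting one up is short.  A minimal sketch (the pool size and names
here are mine, not fixed by the API):

    from twisted.internet import reactor
    from twisted.internet.threads import deferToThreadPool
    from twisted.python.threadpool import ThreadPool

    def f(x):
        return x  # stand-in for the real work

    pool = ThreadPool(minthreads=1, maxthreads=64, name="hot-path")
    pool.start()
    # Tear the pool down along with the reactor.
    reactor.addSystemEventTrigger("during", "shutdown", pool.stop)

    d = deferToThreadPool(reactor, pool, f, 1)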


2) The way Queue-based thread pools work.

There's another source of scheduling contention in your program.
sys.argv[1] does *not* increase the number of threads -- instead, it
increases the *demand* on threads.  The number of threads will be no
more than 64, per:

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert-py-L24

With a value of 16, you'll have 16 * 10000 scheduled function calls
that can only run across 64 threads.

To make this behavior clear, consider this minimal implementation of a
Queue-based thread pool:

https://gist.github.com/markrwilliams/2e40ed0fa06a9e653609dd61fd80ca95#file-simple_threadpool-py
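
Condensed, the idea looks something like this (a sketch in Python 3
spelling, with illustrative names rather than the gist's exact code):

    import threading
    import queue  # spelled Queue on Python 2

    NUM_WORKERS = 64
    tasks = queue.Queue()
    results = queue.Queue()

    def f(x):
        return x  # stand-in for the real work

    def worker():
        # Workers pull (func, args) pairs off the shared task queue
        # and push return values onto a shared results queue.
        while True:
            func, args = tasks.get()
            results.put(func(*args))

    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    def testRun():
        # Schedules a call, then blocks waiting for *a* result --
        # see the caveat below.
        tasks.put((f, (1,)))
        return results.get()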

Note that each testRun needs its own thread, so that waiting for a
result doesn't prevent more function calls from being scheduled.  This
isn't necessary with Twisted's deferToThread(Pool), because that's
exactly what Deferreds are for!

Also note that this implementation cheats -- the result of a function
call like f(1) isn't routed back to the place where it was requested.
Twisted's thread pool, by contrast, does route results back to their
callers.

With those caveats, let's try this with just 1 * 10000 scheduled
function calls across 64 threads:

$ python /tmp/tp.py 1
threadpool: avg 46.067005 us, sync: avg 0.416845 us, 110.51x increase

We see a pretty significant difference between a simple synchronous
function call and dispatching to a thread via a Queue and obtaining
its result.

What happens if we bump things up to 16 * 10000?

$ python simple_threadpool.py 16
threadpool: avg 667.252771 us, sync: avg 1.551497 us, 430.07x increase
threadpool: avg 668.004651 us, sync: avg 1.498873 us, 445.67x increase
threadpool: avg 667.386251 us, sync: avg 1.686789 us, 395.65x increase
threadpool: avg 668.936246 us, sync: avg 1.448275 us, 461.88x increase
threadpool: avg 666.702565 us, sync: avg 1.459888 us, 456.68x increase
threadpool: avg 669.740488 us, sync: avg 1.620622 us, 413.26x increase
threadpool: avg 666.529074 us, sync: avg 1.657008 us, 402.25x increase
threadpool: avg 666.151787 us, sync: avg 1.583210 us, 420.76x increase
threadpool: avg 666.713700 us, sync: avg 1.509841 us, 441.58x increase
threadpool: avg 666.670518 us, sync: avg 1.533161 us, 434.83x increase
threadpool: avg 666.880252 us, sync: avg 1.536666 us, 433.98x increase
threadpool: avg 666.819521 us, sync: avg 1.507123 us, 442.45x increase
threadpool: avg 664.132656 us, sync: avg 1.624818 us, 408.74x increase
threadpool: avg 667.016451 us, sync: avg 1.490790 us, 447.42x increase
threadpool: avg 668.779542 us, sync: avg 1.488286 us, 449.36x increase
threadpool: avg 667.080139 us, sync: avg 1.524386 us, 437.61x increase

Not only did the difference between the two increase, it also took
longer to run our synchronous function in a given thread!  The growing
difference between the synchronous and threadpool-based calls is due
to the fact that as the amount of work increases, the number of
workers to do it doesn't.  That means the queue keeps getting longer
and longer.
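
As a rough sanity check on that explanation: demand went up 16x
(160000 scheduled calls instead of 10000) while the worker count
stayed fixed at 64, and the average threadpool latency went from
~46 us to ~667 us -- a ~14.5x increase, right in line with the queue
getting roughly 16x longer.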


3) The way threads work in Python.

As for our synchronous function, remember that it's running in its own
thread; that means it's very likely that it took longer to run because
of contention for the GIL.

So two things to remember about using thread pools in Python:

a) Increasing the amount of work without increasing the number of
workers always makes latency worse.  This is true of queues and
threads in any environment.

b) The GIL *will* become a bottleneck with large amounts of work and
an increasing number of threads.  That means you can't just keep
adding threads to address the workload problem described in a) -- see
the sketch just below.
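
A tiny, illustrative benchmark (not part of the original test program)
that runs the same CPU-bound work serially and then split across
threads:

    import threading
    import time

    def spin(n):
        # Pure CPU-bound bytecode; under the GIL, only one thread
        # can execute this at a time.
        while n:
            n -= 1

    N = 10000000

    start = time.time()
    spin(N)
    print("serial:    %.2fs" % (time.time() - start))

    start = time.time()
    workers = [threading.Thread(target=spin, args=(N // 4,))
               for _ in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print("4 threads: %.2fs" % (time.time() - start))

On CPython, the threaded version takes about as long as the serial
one -- often longer, because of lock contention -- despite four
threads sharing the same total amount of work.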

Now, let's take what we've learned and apply it to a different version
of your test program.
>
> BTW, I've run this on pypy 5.1.1 with even worse results:
> https://mail.python.org/pipermail/pypy-dev/2016-June/014477.html
>
...one that also has an option to avoid inlineCallbacks, as these
likely interfere with PyPy's JIT:
https://twistedmatrix.com/trac/ticket/6278

Here's the program:

https://gist.github.com/markrwilliams/2e40ed0fa06a9e653609dd61fd80ca95#file-twisted_threadpool-py
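
Its shape is roughly this (a condensed sketch; the names and timing
details are mine, and the real gist also supports the 'deferred' mode
used in the PyPy run further down):

    import sys
    import time

    from twisted.internet import defer, task, threads
    from twisted.python.threadpool import ThreadPool

    def f(x):
        return x  # stand-in for the real work

    @defer.inlineCallbacks
    def schedule(reactor, pool, n):
        # One concurrent "scheduler"; sys.argv[1] of these run at
        # once, multiplying the demand on the 64 pool threads.
        total = 0.0
        for _ in range(n):
            start = time.time()
            yield threads.deferToThreadPool(reactor, pool, f, 1)
            total += time.time() - start
        print("deferToThread: avg %f us" % (total / n * 1e6,))

    def main(reactor):
        # A dedicated pool, so we don't compete with the reactor's
        # own pool (DNS lookups and the like).
        pool = ThreadPool(minthreads=1, maxthreads=64)
        pool.start()
        reactor.addSystemEventTrigger("during", "shutdown", pool.stop)
        return defer.gatherResults(
            [schedule(reactor, pool, 10000)
             for _ in range(int(sys.argv[1]))])

    task.react(main)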

Let's try it with 1 * 10000 scheduled functions:

$ python twisted_threadpool.py 1
deferToThread: avg 121.014680 us, sync: avg 0.463771 us, 260.94x increase
deferToThread: avg 130.383369 us, sync: avg 0.475454 us, 274.23x increase
deferToThread: avg 125.587505 us, sync: avg 0.466942 us, 268.96x increase
deferToThread: avg 124.141280 us, sync: avg 0.469970 us, 264.15x increase
deferToThread: avg 135.672952 us, sync: avg 0.505450 us, 268.42x increase
deferToThread: avg 130.711776 us, sync: avg 0.483895 us, 270.12x increase
deferToThread: avg 118.456074 us, sync: avg 0.438543 us, 270.11x increase
deferToThread: avg 137.559747 us, sync: avg 0.504330 us, 272.76x increase
deferToThread: avg 121.825337 us, sync: avg 0.459836 us, 264.93x increase
deferToThread: avg 141.214092 us, sync: avg 0.539142 us, 261.92x increase
...

So Twisted's thread pool is about twice as slow as our toy one.  Not too bad!

With 16 * 10000 scheduled functions:

deferToThread: avg 1691.969863 us, sync: avg 0.444171 us, 3809.28x increase
deferToThread: avg 1693.141545 us, sync: avg 0.434609 us, 3895.78x increase
deferToThread: avg 1693.619694 us, sync: avg 0.443431 us, 3819.35x increase
deferToThread: avg 1693.614949 us, sync: avg 0.439235 us, 3855.83x increase
deferToThread: avg 1694.422696 us, sync: avg 0.435658 us, 3889.34x increase
deferToThread: avg 1694.554698 us, sync: avg 0.431748 us, 3924.87x increase
deferToThread: avg 1694.615262 us, sync: avg 0.430913 us, 3932.61x increase
deferToThread: avg 1694.747073 us, sync: avg 0.428314 us, 3956.79x increase
deferToThread: avg 1694.701245 us, sync: avg 0.427885 us, 3960.65x increase
deferToThread: avg 1695.052327 us, sync: avg 0.419444 us, 4041.19x increase
deferToThread: avg 1695.149564 us, sync: avg 0.432749 us, 3917.16x increase
deferToThread: avg 1695.340127 us, sync: avg 0.432845 us, 3916.74x increase
deferToThread: avg 1695.489345 us, sync: avg 0.433822 us, 3908.26x increase
deferToThread: avg 1695.886709 us, sync: avg 0.436207 us, 3887.81x increase
deferToThread: avg 1696.334289 us, sync: avg 0.438710 us, 3866.64x increase
deferToThread: avg 1696.634512 us, sync: avg 0.434704 us, 3902.96x increase

Our synchronous function hasn't gotten worse, because there are fewer
threads at play (thanks, Twisted!), and the difference between the two
has gone up by about 16x (3809.28 / 16 = 238.08 -- and we were seeing
~260x at 1 * 10000).

For fun, here are the PyPy results for the code that *doesn't* use
inlineCallbacks:

$ pypy twisted_threadpool.py 1 deferred
deferToThread: avg 248.463297 us, sync: avg 0.979066 us, 253.78x increase
deferToThread: avg 89.544964 us, sync: avg 0.260448 us, 343.81x increase
deferToThread: avg 63.347292 us, sync: avg 0.204158 us, 310.29x increase
deferToThread: avg 63.825631 us, sync: avg 0.205946 us, 309.91x increase
deferToThread: avg 55.110717 us, sync: avg 0.193667 us, 284.56x increase
deferToThread: avg 51.918244 us, sync: avg 0.212955 us, 243.80x increase
...
deferToThread: avg 62.353158 us, sync: avg 0.239372 us, 260.49x increase
deferToThread: avg 40.433884 us, sync: avg 0.148630 us, 272.04x increase
deferToThread: avg 65.630174 us, sync: avg 0.165820 us, 395.79x increase
deferToThread: avg 85.674763 us, sync: avg 0.259972 us, 329.55x increase
deferToThread: avg 65.085721 us, sync: avg 0.198150 us, 328.47x increase
deferToThread: avg 44.353342 us, sync: avg 0.153947 us, 288.11x increase
...

Looks like PyPy is doing a pretty good job!

(I won't show it here, but there's a comment in that gist that
contains links to vmprof PyPy runs for both the inlineCallbacks and
inlineCallbacks-free versions.  It looks like inlineCallbacks *does*
get optimized pretty well...)

Things to consider:

1) Use your own thread pool - see the gist I wrote for how to do that.

2) The behavior you were observing was due to increasing the amount of
work but not the workers to do it, and occurs with any thread pool
implementation.

3) PyPy can be really fast!

Hope this helps.  Let me know if anything's unclear!

Best,

Mark



