Opened 3 years ago

Last modified 3 years ago

#5298 enhancement new

ensmartenify threadpool resource limits

Reported by: glyph Owned by:
Priority: normal Milestone:
Component: core Keywords:
Cc: Branch:
Author: Launchpad Bug:

Description

Our current reactor threadpool design leaves something to be desired.

Sometimes developers run into the threadpool maximum in difficult-to-diagnose ways. For example, there may be a deadlock between a deferToThread that needs a thread to free up in order to run and a blockingCallFromThread (taking up a free slot in the threadpool)

This is of course exacerbated by the following:

  • the threadpool is necessary for name resolution if clients want platform-defined hostname resolution behavior
  • the pool is frequently used for database interaction too (adbapi)
  • setting up an external threadpool (as twisted's built-in WSGI support encourages you to do) is error-prone and confusing, so applications sometimes use the reactor threadpool for that too
  • the default maximum number of threads is relatively small (20 threads)
  • there is no configuration file for Twisted, so applications need to call a method to increase the threadpool maximum size

Most of these problems would go away if we just did away with the thread pool maximum size entirely. So, why not do that? What's the purpose of the threadpool max size? Given that this code has been with it for so long, it's probably a good idea to re-consider this.

Threads consume some resources to exist. Each one consumes a couple of megabytes of stack space, at least, so they're relatively heavyweight objects to create, and we don't want to allocate a zillion of them. Still, it's not like we attempt to limit allocation of other objects, however large; we just let users catch MemoryError if they care. So what's special about threads?

The thing about threads is that you might reasonably queue up a thousand or so callInThread items at once; each one's just a function and an entry in a Queue. A couple dozen bytes at most, not something to get worried about. But if you actually attempted to start a thousand threads in response to that, that's 2 gigabytes of memory, which definitely is something to worry about. So one reason is just to rate-limit this potentially significant memory consumption, especially in the case where each job may complete very quickly, making room for the next in the queue, making it pointless to start so many parallel threads at once.

Another reason for providing a simple maximum is that on some systems (those which have been carefully tuned to a particular workload) it's actually known in advance how many threads the system needs to do its job. This is a much more advanced use-case though, and given that the only way to tune your operating environment is to edit your code, I suspect that it is not as frequently used.

There are two problems with this strategy. Either over-allocates (once, at startup, your server needs to do a couple hundred threaded tasks: then, it holds on to these useless thread stacks for its whole lifetime) or under-allocates (leading to poor performance or possibly deadlocks in the worst case).

My suggestion would be to provide a hard maximum, but to make it very large, and instead of spawning new threads immediately, providing a hook which spawns new threads in the main reactor thread when a certain amount of time has passed without making progress, attempting to break deadlocks or provide additional thread resources on demand when progress is slowed by a slow blocking resource or heavy computation. Then, when threads are idle for a certain amount of time, spinning them down (stopping/joining them) to release those resources associated with those threads.

Change History (4)

comment:1 follow-up: Changed 3 years ago by exarkun

the threadpool is necessary for name resolution if clients want platform-defined hostname resolution behavior

no it's not, use twisted.names

the pool is frequently used for database interaction too (adbapi)

no it's not, ConnectionPool creates its own thread pool

setting up an external threadpool (as twisted's built-in WSGI support encourages you to do) is error-prone and confusing, so applications sometimes use the reactor threadpool for that too

there's a much nicer ticket for this already

comment:2 follow-up: Changed 3 years ago by itamar

It's possible most of the *actual* problems with the thread pool involve DNS lookups, and so a DNS-specific solution might be better:

  1. Make the timeout smarter, so it only starts from when lookup actually starts, rather than from addition to thread pool.
  1. Figuring out that twisted.names is a possibility is not easy... and it's not a complete replacement (e.g. mDNS, the equivalent stuff Windows does, etc.). Maybe make a new default, e.g. "try twisted.names, fall back to other nss plugins or whatever if that returns nothing"? Or maybe not, that has its own problems. We'd also have to be certain default twisted.names implementation is good enough.

2MB per concurrent DNS lookup is pretty harsh...

comment:3 in reply to: ↑ 1 Changed 3 years ago by glyph

Replying to exarkun:

the threadpool is necessary for name resolution if clients want platform-defined hostname resolution behavior

no it's not, use twisted.names

Yes it is. twisted.names doesn't provide platform-defined hostname resolution behavior, it provides DNS-defined hostname resolution behavior. In particular it will not resolve mDNS or WINS names, nor will it use any local name resolution techniques that might be defined, like those provided by libnss-ldap.

the pool is frequently used for database interaction too (adbapi)

no it's not, ConnectionPool creates its own thread pool

Oops. I use the reactor threadpool in adbapi2 so I was confused. (Arguably that threadpool should be pluggable, but it doesn't actually confer any benefits to make it separate and it complicates startup and shutdown, so I just adjust the reactor threadpool's size to accommodate the extra capacity.) I was actually contemplating a similar mechanism for reducing the resource utilization of adbapi2's pooling logic which is what lead me to file this ticket.

setting up an external threadpool (as twisted's built-in WSGI support encourages you to do) is error-prone and confusing, so applications sometimes use the reactor threadpool for that too

there's a much nicer ticket for this already

#4347 addresses a bunch of important issues, and there is some overlap, but I don't think it looks at the problem of persistent resource utilization or the deadlock issue I described here.

comment:4 in reply to: ↑ 2 Changed 3 years ago by glyph

Replying to itamar:

It's possible most of the *actual* problems with the thread pool involve DNS lookups, and so a DNS-specific solution might be better:

This is purely conjecture, but I suspect as we see more applications that throw Twisted into the blender with something else (Django leaps most readily to mind), there will be more non-DNS-related issues. But your ideas might be good by themselves...

  1. Make the timeout smarter, so it only starts from when lookup actually starts, rather than from addition to thread pool.

If the timeout is specified for some technical reason, then, maybe ... but if the timeout is specified because that's as long as the user (or some external system) is willing to wait, then it may not be useful to specify it that way.

  1. Figuring out that twisted.names is a possibility is not easy... and it's not a complete replacement (e.g. mDNS, the equivalent stuff Windows does, etc.). Maybe make a new default, e.g. "try twisted.names, fall back to other nss plugins or whatever if that returns nothing"? Or maybe not, that has its own problems.

One big problem with this approach is that it violates local policy and thereby RFC3484. We might be able to actually parse gai.conf in twisted.names and do something sensible though, so this might not be as intractable as, e.g. resolving names out of LDAP.

We'd also have to be certain default twisted.names implementation is good enough.

2MB per concurrent DNS lookup is pretty harsh...

Considerably less harsh if you can free the 2MB after you're done!

Note: See TracTickets for help on using tickets.