[Twisted-Python] Should I use asynchronous programming in my own modules?

Thu Oct 18 09:31:51 EDT 2007

On Thu, 18 Oct 2007 14:41:38 +0200, Jürgen Strass <jrg718 at gmx.net> wrote:
>Hello,
>
>I'm rather new to twisted and asynchronous programming in general. Overall, 
>I think I've understood the asynchronous programming model and its 
>implications quite well. Nevertheless, there are some remaining questions.
>
>To give some example, I'd like to develop my own simplified document format 
>in XML and a corresponding parser. The output of the parser (a specialized 
>document object model) will be traversed and translated into HTML 
>afterwards. This module could be useful outside any twisted application, of 
>course. Instead of generating HTML one could develop a generator that 
>produces LaTeX, for example. But it could also be used to render HTML pages 
>in a twisted web application.

Have you seen Lore?

>The question is this: since parsing and 
>generating large documents could block the reactor in a twisted app, should 
>I use any of twisted's asynchronous programming features in this module (for 
>better integration with twisted) or should I rather develop it in a 
>traditional way and run it in a thread?

Incremental parsing is often useful and simpler than the alternative.  If
you are accepting a document over the network, why buffer it yourself and
then parse it when you could just be giving each piece directly to the
parser?  Done this way, even large documents can often be parsed without
blocking for an unreasonable amount of time.
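
As a rough sketch of what that looks like, here is the standard library's
expat binding being handed each piece as it arrives.  The list of chunks
stands in for data delivered by the network, and count_elements is a
made-up name for illustration, not part of any library:

```python
import xml.parsers.expat


def count_elements(chunks):
    """Feed each chunk to expat as it arrives; return the number of start tags."""
    counts = {"elements": 0}

    def on_start(name, attrs):
        counts["elements"] += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    for chunk in chunks:
        parser.Parse(chunk, False)  # False: more data may follow
    parser.Parse(b"", True)         # True: end of document
    return counts["elements"]


# Simulated network delivery; the chunk boundaries fall mid-tag on purpose.
arrived = [b"<doc><sec", b"tion><para>hi</pa", b"ra></section></doc>"]
total = count_elements(arrived)
```

Nothing is buffered by the caller, and no single call has to handle more
than one chunk's worth of input.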

>
>The question came to my mind, because somewhere I read that long lasting 
>operations in third party modules should be called in a thread. This is 
>clear. I also read that if one has the opportunity to develop an application 
>from scratch, one should rather go for using twisted's asynchronous 
>programming features and divide long lasting operations into small chunks. 

The CPU differs from the network.  There are rarely points in a CPU-bound
task where suspending to work on something else would not be an arbitrary
decision.  When dealing with the network, these points are obvious and
not at all arbitrary.  So, when dealing with the network, it's almost
unarguable that you should use Twisted's APIs instead of using blocking
APIs.  However, Twisted doesn't provide any functionality specifically
for breaking up CPU-bound tasks, primarily because any such functionality
would be arbitrary.

>In principle, this approach is clear to me, but does it also apply to 
>modules which are entirely independent from twisted networking code? And if 
>so, is there any way to decouple them from the twisted library for reuse in 
>other applications?

It's typically trivial to drive code written to be used asynchronously in
a synchronous manner.  The opposite is rarely, if ever, true.  Consider a
parser API which consists of a "feed" method taking a string giving some
more bytes from the input document.  You can use this by passing in small
chunks repeatedly until the entire document has been passed in, or you can
pass in the entire document at once.  Now consider an API where the entire
document must be supplied at once: how do you use that without blocking?
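
To make that concrete, here is a minimal sketch of such a "feed" API.
TitleCollector and parse_whole are hypothetical names invented for this
example; only XMLPullParser comes from the standard library.  The
synchronous wrapper is a few lines on top of the incremental interface:

```python
from xml.etree.ElementTree import XMLPullParser


class TitleCollector:
    """Hypothetical feed-style parser: collects the text of <title> elements."""

    def __init__(self):
        self._parser = XMLPullParser(events=("end",))
        self.titles = []

    def _drain(self):
        # Pick up every element that has been completed so far.
        for _event, element in self._parser.read_events():
            if element.tag == "title":
                self.titles.append(element.text)

    def feed(self, data):
        # Accepts any amount of input, from one byte to the whole document.
        self._parser.feed(data)
        self._drain()

    def close(self):
        self._parser.close()
        self._drain()
        return self.titles


def parse_whole(document):
    """Synchronous convenience wrapper built on the incremental API."""
    collector = TitleCollector()
    collector.feed(document)
    return collector.close()


document = b"<doc><title>Intro</title><p>x</p><title>Usage</title></doc>"

# All at once:
all_at_once = parse_whole(document)

# Or in small pieces, as an event-driven caller would:
collector = TitleCollector()
for start in range(0, len(document), 7):
    collector.feed(document[start:start + 7])
piecewise = collector.close()
```

Both calling styles produce the same list of titles.  Going the other way,
from a whole-document-only function to an incremental one, would mean
rewriting the parser.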

>
>The last question is what criteria I could use to divide long lasting 
>operations into chunks. In almost all books about asynchronous programming I 
>only read that if they're too big, they could block the event loop. Of 
>course, but how big is too big? And what's the measure for it? Milliseconds, 
>number of operations, number of code lines - or what? Doesn't it depend 
>entirely on the application at hand and how reactive it has to be?

Yes.

>Moreover, 
>depending on the hardware used, on a Pentium II fewer chunks can be processed 
>in the same time than on an Athlon 64, for example.

True as well.  However, is your primary goal to provide ideal scheduling
behavior both on a CPU released this year and a CPU released ten years ago?

>And couldn't chunks also 
>be too small, spending more time than necessary in putting them into the 
>reactor's queue, then maybe sorting them and then calling them? In case the 
>overhead involved in scheduling some chunk is bigger than the processing 
>time of the chunk itself, the chunks are too small, aren't they?

Correct again.

These problems can all be mitigated, at least partially, by letting the
application decide how much work is done at once.  Parsing one byte from
an input document should take less time than parsing one megabyte, so the
size of the input is one way to control this, though not the only one.  You
could support explicit tuning of these parameters with a dedicated API, or
you could support stepwise processing and let the application explicitly
step it as far as it wants to at a time.  In this direction, there are some
extremely primitive tools in twisted.internet.task.  They will not solve
the problem for you, but they may give you some ideas or save you a bit of
typing.
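
One possible shape for stepwise processing, as a sketch only: a generator
that does a tunable amount of work and then yields.  render_steps and its
batch_size parameter are invented for this example, and the "rendering" is
deliberately trivial:

```python
from html import escape
from itertools import islice


def render_steps(paragraphs, output, batch_size=2):
    """Render batch_size paragraphs per step, then yield control.

    Each yielded value is the number of paragraphs rendered so far.
    """
    remaining = iter(paragraphs)
    while True:
        batch = list(islice(remaining, batch_size))
        if not batch:
            return
        for text in batch:
            output.append("<p>%s</p>" % escape(text))
        yield len(output)


output = []
steps = render_steps(["a", "b", "c", "d", "e"], output, batch_size=2)

first = next(steps)  # the application takes one step...
# ...is free to do other work here...
rest = list(steps)   # ...and later runs the remainder to completion
```

The module itself has no dependency on Twisted here.  A synchronous caller
just exhausts the generator, while a Twisted application could hand an
iterator like this to something such as twisted.internet.task.coiterate,
which advances it between other events.  The application picks batch_size
to suit its own latency needs and hardware.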

>
>Thanks in advance for any answers,
>Jürgen
>

Jean-Paul