Opened 5 years ago

Last modified 7 months ago

#3901 enhancement new

deferToProcess

Reported by: benliles Owned by: benliles
Priority: normal Milestone:
Component: core Keywords: multiprocessing
Cc: exarkun, truekonrads, jason.heeris@…, drednout.by@…, karabacakcengiz@…, julian@… Branch:
Author: Launchpad Bug:

Description

Create an implementation of the threadpool and deferToThread using the multiprocessing library instead of the threading library.

There is a backport of the multiprocessing library currently available from http://pypi.python.org/pypi/multiprocessing

Change History (16)

comment:1 Changed 5 years ago by exarkun

  • Cc exarkun added
  • Owner changed from glyph to benliles

This ticket could be improved in several ways:

  1. Re-assign the ticket to yourself and commit to do the work. :) This ticket proposes a significant feature addition to Twisted, but as you left the ticket assigned to glyph, I assume that you do not intend to implement this feature yourself. While it's possible that someone else will think this is such a good idea that they will undertake the task themselves, it's not very likely. A more likely outcome is for the ticket to remain open for many years with no activity.
  2. Specify the proposed feature more thoroughly. For something as complicated as what's being proposed, more than a couple sentences are required. For example, what objects will be acceptable as arguments to deferToProcess? How will these objects be transmitted to worker processes? What access to the configuration of the process pool will be available? How will jobs be distributed? et cetera.
  3. Remove proposed implementation details until you have a working implementation (or at least couch them as general ideas for development direction - do not make them a mandatory part of the specification). As it happens, the multiprocessing library cannot be used with Twisted (at least on POSIX), due to conflicting SIGCHLD handling requirements.

Taking advantage of multiple processors is an oft requested feature, so the spirit of this ticket is not at all misplaced! However, it is not really actionable as-is. Please clarify it with respect to the above mentioned points.

Thanks!

comment:2 Changed 5 years ago by glyph

Someone interested in this ticket should also be aware of Ampoule.

In "normal" python programs, multiprocessing often appears to work, but in Twisted's case, the undocumented serialization algorithm and poor error-reporting behavior are likely to be highly confusing, since un-serializable things like sockets are lying around all over the place. Consider the output of this program:

from multiprocessing import Process, Queue
from socket import socket
def f(q):
    print 'QUEUE PUT'
    skt = socket()
    q.put(skt)
    print 'PUTTED!'
def go():
    q = Queue()
    proc = Process(target=f, args=(q,))
    proc.start()
    skt = q.get()
    print 'skt', skt
go()

So, personally I believe that something like Ampoule which has very clear, explicit serialization rules would be a better basis for deferToProcess than multiprocessing, even if it's a bit more work to define the interface between processes.

The other thing about Ampoule is that it opens the door to the possibility of spreading the process pool around multiple hosts. But thanks for filing the ticket!

comment:3 Changed 5 years ago by benliles

  • Author set to benliles

Ampoule is a bit more powerful and requires more code changes than we were willing to do. The idea was to have something that was as close to be directly replaceable as possible.

Along those lines, I've been working on http://pypi.python.org/pypi/twisted.internet.processes/. It provides a deferToProcess function.

I haven't written tests for it yet, but I will as soon as I have time and will close the ticket at that point.

comment:4 Changed 5 years ago by exarkun

  • Author benliles deleted

(having a branch author when there is no branch doesn't make sense, so removing that field)

Ampoule is a bit more powerful and requires more code changes than we were willing to do. The idea was to have something that was as close to be directly replaceable as possible.

Can you elaborate on this a bit? What API are you porting from? What differences in Ampoule make using it too much work?

I haven't written tests for it yet, but I will as soon as I have time and will close the ticket at that point.

If you want to propose this code for inclusion in Twisted, I cannot overemphasize the need for comprehensive test coverage. Ideally, you would be developing all of this code in a test-driven manner so that you never introduce any logic without first writing tests for it. Dealing with processes is not entirely trivial, so this is all the more important.

(Also, somewhat less interestingly, but unfortunately just as importantly, it would be best if any code you intend to contribute to Twisted were MIT/X licensed.)

If you're not intending to contribute this code for inclusion in Twisted, then renaming it would be a good idea. Also, in that case, the code, even when complete and perfect, probably cannot resolve this ticket.

Having only skimmed the implementation, I can't say I have a complete grasp of the code in question, so my next question may have an obvious answer, which I hope you will feel free to point out to me. If the API of Ampoule is what causes problems for you, then perhaps you can create the API you desire but implement it in terms of Ampoule? Hopefully this will remove most of the non-trivial implementation work from your code while still resolving your issue.

comment:5 Changed 5 years ago by truekonrads

  • Cc truekonrads added

I did some work on a similar topic with pb and reactor.spawnProcess a while ago. Approach I used was to execute a stand-alone twisted application and communicate with (can't remember now exactly) pb or plain LineReceiver. This brutish approach worked and didn't have serialization problems. The expected usage would be

# myapp.py
from twisted.internet.defer import succeed
from twisted.someplace.processhandler import otherEnd
def doSomething()
   a=complexComputation()
   return succeed(a)
if __name__=="main":
  otherEnd(doSomething)

#
# and on the spawning end
d=reactor.deferToProcess('my.project.myfile.py')

comment:6 Changed 5 years ago by exarkun

This probably shouldn't be exposed as a method of the reactor. Instead, there should be some other object which wraps the reactor and provides the feature. There may be multiple implementations, too. For example, a pool-based implementation which re-uses processes for multiple calls; or an implementation that runs processes on another host via SSH. I think Ampoule is more like this.

comment:7 Changed 5 years ago by truekonrads

Pondering about this, a realistic approach would be to go for limited data passing between process.
Simple serialization and separation:

  • subprocess can't access data/methods via this interface on parent.
  • If it can't be passed via AMP/pb then it can't. No sockets, C objects, etc. Pure Python objects.

I see this as a 100 line thing. Probably not part of reactor.deferToProcess but rather twisted.multiprocessing

If it will use pb, then I think spawning on different hosts shouldn't be much of a problem.
I could try to tackle this if we'd have a definitive design.

comment:8 Changed 5 years ago by exarkun

Please look at Ampoule.

comment:9 Changed 5 years ago by darkporter

All processors today are multicore, why have this functionality in a separate project (Ampoule)? I'm envisioning something simple:

# In same process
reactor.listenTCP(1234, MyFactory())

# Forwarding to a pool of processes
reactor.listenTCP(1234, MyFactory(), loadBalanceOverAllCores=True) or
reactor.listenTCPAndLoadBalance(1234, MyFactory(), stickyByIP=True, distribution=RoundRobin)

Or this information could be in the factory:

class MyFactory(protocol.Factory):
  def __init__( self ):
    self.protocol = MyProtocol
    self.multiprocess = OneForEachCore

Or maybe there's a subclass of Factory called MultiprocessFactory.

I don't really understand the discussion about non-picklable data being a problem. The only data being sent would be arguments to dataReceived, connectionMade, etc. The only thing they need are bytes from the socket and integer socket descriptors right?

comment:10 Changed 5 years ago by exarkun

Hi darkporter,

I'm glad you're interested in this. The functionality is currently in a separate project because this was a convenient way to develop it. The primary author has expressed interest in moving the functionality into Twisted, though, and there is no fundamental reason why this shouldn't happen. It's mainly a question of effort. I suggest you get in contact with him (dialtone) to discuss how you can help with the issue. I'm not intimately familiar with the Ampoule code, so I can't say it is ready for inclusion as-is. Identifying what remains to be done to make it suitable for inclusion is probably the first task to knock down (unless someone has done it already - the place to check for that would be the Ampoule issue tracker, though). There are also some known bugs in Ampoule that it would make sense to fix before trying to include the code in Twisted.

As for your listenTCP API suggestions, I don't think these are really related to the feature this ticket is describing. The deferToProcess API serves a distinct use case. Changes to listenTCP serve a different set of use cases and should probably be addressed separately. Feel free to raise these ideas on the mailing list (I think you'll get some push back, but the discussion should hopefully result in something that satisfies your requirements and ours :).

Thanks again for getting involved! :)

comment:11 Changed 5 years ago by darkporter

You're right -- deferToProcess is a general solution for moving work to another process. What I'm thinking of rather is to make the main reactor loop "core aware" and smart enough to spawn subprocesses on each core (which are then load balanced over). I'll raise the issue on the mailing list.

comment:12 Changed 4 years ago by detly

  • Cc jason.heeris@… added

comment:13 Changed 3 years ago by drednout

  • Cc drednout.by@… added

comment:14 Changed 2 years ago by therve

I'll mention here the new multi process capability of trial, in twisted.trial._dist, which can be a based for this new mechanism, and/or use this new mechanism once in place.

comment:15 Changed 9 months ago by cengizkrbck

  • Cc karabacakcengiz@… added

comment:16 Changed 7 months ago by jmehnle

  • Cc julian@… added
Note: See TracTickets for help on using tickets.