[Twisted-Python] Scheduling of modules

Moof moof at metamoof.net
Mon Jul 4 10:53:58 EDT 2005


I've more or less had twisted sold to me as the be-all and end-all in terms
of avoiding having to deal with threads, amongst other things, so I thought
I'd ask if twisted is the right thing to use for my problem.

I have a number of modules which use John J Lee's mechanize module to do
some web scraping of different web sites. Right now mechanize uses standard
urllib calls to do its thing, though it may be possible after much
refactoring to get it to use something more asynchronous. So for short-term
rapid development, I'm more or less stuck with threads if I want to get
anything vaguely concurrent running out of these modules.

Specifically, I'm going to need to poll a database every minute or so (or
write some form of SQL Server 2000 trigger which will call me when things
change) and if there are any changes in the database, potentially fire off
thirty of these web scraping modules at once. The modules are
self-contained, and don't communicate with anything other than the thing
that calls them by returning values, so I'm vaguely certain they're more or
less thread-safe, inasmuch as I ever can be.
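
As a rough sketch of the polling side, assuming plain standard-library
threads rather than anything Twisted-specific: fetch_changes, scrapers and
the 60-second default are hypothetical placeholders, not existing code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def poll(fetch_changes, scrapers, pool, stop, interval=60.0):
    """Check for changes every `interval` seconds until `stop` is set.

    fetch_changes() is a hypothetical callable returning a list of
    (module_name, request) pairs; scrapers maps module_name to a callable.
    """
    futures = []
    while not stop.is_set():
        for name, request in fetch_changes():
            # Each scraper runs in a pool thread and just returns a value.
            futures.append(pool.submit(scrapers[name], request))
        # Sleep until the next poll, waking early if asked to stop.
        stop.wait(interval)
    return futures
```

A pool created with ThreadPoolExecutor(max_workers=30) would cover the
thirty-at-once case. Note that this sketch on its own does nothing to stop
two requests for the same module from running at the same time.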

My beef is that I can't have more than one scraping module modifying the
same site concurrently. This introduces race conditions on the site, rather
than in my code. Each module touches only one site, so I need to basically
have a module-level lock either in the module or in the thread scheduler to
ensure that I'm not running the same module more than once.
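
A minimal sketch of the module-level lock idea, assuming a standard-library
thread pool; run_exclusive and the module names are made up for
illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_locks = {}                      # one lock per module name, created lazily
_locks_guard = threading.Lock()  # protects the _locks dict itself


def run_exclusive(name, func, *args):
    """Call func(*args) while holding the lock belonging to module `name`."""
    with _locks_guard:
        lock = _locks.setdefault(name, threading.Lock())
    with lock:
        # Only one call per module name gets past this point at a time.
        return func(*args)
```

Jobs would be submitted as pool.submit(run_exclusive, "site_a", scrape,
request). The drawback is that a job waiting for its module's lock still
occupies a pool thread while it waits.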

This makes me think of some sort of queue structure. I either need to have
one queue that just works through its requests ignoring any that are
currently running, or one queue per module with some sort of central
dispatcher that will place a request in the appropriate queue.
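
The one-queue-per-module variant might look something like the following
sketch, again using only the standard library. ModuleDispatcher and its
method names are hypothetical, and it assumes each scraper is a callable
taking a single request.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


class ModuleDispatcher:
    """One queue and one worker thread per module: requests for the same
    module run strictly one after another, different modules run in parallel."""

    def __init__(self, scrapers):
        # scrapers: dict mapping a module name to a callable(request)
        self.results = queue.Queue()
        self._queues = {}
        self._threads = []
        for name, func in scrapers.items():
            q = queue.Queue()
            t = threading.Thread(target=self._worker, args=(name, func, q),
                                 daemon=True)
            t.start()
            self._queues[name] = q
            self._threads.append(t)

    def _worker(self, name, func, q):
        while True:
            request = q.get()
            if request is _STOP:
                break
            try:
                outcome = func(request)
            except Exception as exc:
                # Report the failure instead of killing the worker.
                outcome = exc
            self.results.put((name, request, outcome))

    def submit(self, name, request):
        """Place a request in the queue of the module that handles it."""
        self._queues[name].put(request)

    def shutdown(self):
        """Let every worker finish its queued requests, then stop it."""
        for q in self._queues.values():
            q.put(_STOP)
        for t in self._threads:
            t.join()
```

With this layout the next request for a module starts as soon as the
current one finishes, since the worker simply pulls the next item off its
own queue.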

In real terms, these modules may take up to three minutes to complete the
web scraping they are required to do, though most take 20 seconds or so. I'd
rather not just have them called one after the other in a blocking manner,
as I'd sort of like to have a five or six minute response time whenever a
request is placed in the database to fire off a bunch of updates, rather
than the close to 20 minute response time I'm currently getting when I fire
a complete unittest suite off. These requests may come in several times a
day, most commonly hours apart, but I need to be able to react if I get two
or three different requests within a five minute period, which would mean
firing off the next request to the module as soon as it has completed the
current request.

Is this something Twisted can help me with? If so, what are my options
within Twisted, and what should I be reading up on? I have a
vague idea that Twisted has a thread pool, but I'm not sure if it has an
event queue that would be suitable for this sort of control, or how I'd go
about modifying whatever's there to be useful for this sort of thing.

If not, any pointers to patterns that might help me code such a thing up?

I'm running on Windows. I need no GUI integration as such, though it will
need to run as an NT service. Any input other than through the database
could potentially be triggered by a client programme going through something
like perspective broker, or the Windows NT service controller telling me to
start up or shut down.

Thanks for your help,

Giles Antonio Radford, alias Moof
"Too old to be a chicken and too young to be a dirty old man"
Serving up my ego over at <http://metamoof.net/>
