[Twisted-web] Efficient server-side web-scraping?

Mon Apr 18 04:54:03 MDT 2005

Hi,

I've developed a small Python CGI app which I'm porting to Twisted Web 
in order to add some in-memory caching.

The app (you can see the current version at etrays.net) puts a bunch 
of hits from the eBay advanced search into a box for folks to embed on 
their site. The server makes an HTTP request to the eBay search form, 
and scrapes the result using Beautiful Soup.

Right now, I simply do all the work in a .rpy script. If I've understood 
Twisted correctly, the whole server blocks while my render_GET method 
runs, right? (Twisted is single-threaded.)

So the search on eBay blocks Twisted (I just call urllib.urlopen), which 
is bad because it's pretty slow. Could anyone suggest a setup where the 
eBay search can take place in the background, leaving Twisted free to 
process other incoming requests? When the eBay results come back, the 
corresponding Twisted request would wake up, scrape the HTML and complete.

I guess I need to use threads here? And have a Twisted callback 
triggered when the thread completes?

Thanks

Tom.