[Twisted-Python] Scrapy spiders waiting in reactor thread when callFromThread gets called repeatedly

Tue Dec 23 08:49:47 MST 2014

What *is* happening?  Underneath, callFromThread is basically just
queueing the call and writing a byte to a file descriptor (or some
similar thing) to wake the reactor from its polling sleep.  Even at
very high load, the reactor should be multiplexing reads from that
file descriptor (which can act as a form of batching) with the actual
scraping I/O.
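
To make the wake-up idea concrete, here is a toy loop using only the
standard library.  It is a sketch of the general technique, not Twisted's
actual code; ToyReactor, call_from_thread and run_once are made-up names
for illustration.  The point is that many queued calls can be drained in
a single wake-up.

```python
import select
import socket
import threading
from collections import deque


class ToyReactor:
    """Toy event loop; illustrates the wake-up idea only, not Twisted's code."""

    def __init__(self):
        self._calls = deque()                 # pending (func, args) pairs
        self._lock = threading.Lock()
        # The "waker": one end is written by other threads, the other end
        # is watched by the loop alongside whatever else it polls.
        self._wake_r, self._wake_w = socket.socketpair()
        self.wakeups = 0                      # how many times the loop woke up

    def call_from_thread(self, func, *args):
        # Callable from any thread: queue the call, then poke the loop.
        with self._lock:
            self._calls.append((func, args))
        self._wake_w.send(b"x")

    def run_once(self, timeout=1.0):
        # Sleep until the waker is readable (a real reactor would also be
        # watching its network sockets in this same select call).
        readable, _, _ = select.select([self._wake_r], [], [], timeout)
        if not readable:
            return
        self.wakeups += 1
        self._wake_r.recv(4096)               # drain pending wake-up bytes
        with self._lock:
            pending = list(self._calls)
            self._calls.clear()
        for func, args in pending:            # run everything queued so far
            func(*args)


if __name__ == "__main__":
    reactor = ToyReactor()
    seen = []
    worker = threading.Thread(
        target=lambda: [reactor.call_from_thread(seen.append, i) for i in range(100)]
    )
    worker.start()
    worker.join()
    reactor.run_once()
    print(len(seen), "calls run in", reactor.wakeups, "wake-up(s)")
```

In this toy version, all calls queued before the loop wakes are run in
one pass, which is the batching effect mentioned above.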

Dustin

On Sun, Dec 21, 2014 at 6:47 AM, Adi Lavi <adi.lavi at cortica.com> wrote:
> Hi,
> I am using Pika's asynchronous consumer implementation with Scrapy and
> Twisted. I have the Twisted reactor running on the main thread and the
> RabbitMQ consumer running on a background thread. When I get a message and
> want to start my spider, I use 'callFromThread' to wake the reactor thread,
> initialize the spider, and start crawling.
>
> Alas, under a high load of queue messages, I find that because
> 'callFromThread' is called all the time, Scrapy does not start downloading
> until there is a break in these calls.
>
> I am wondering what the best approach is for running Scrapy, Twisted, and
> RabbitMQ at high scale. Should I keep the current design and simply
> do some buffering or batching to reduce the 'callFromThread' frequency?
> Or should I perhaps use a synchronous design?
>
> Thanks
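
On the buffering/batching idea raised in the quoted message, a minimal
sketch of what that could look like on the consumer thread follows.  It
assumes nothing about Pika or Scrapy: CallBatcher, schedule, handler and
max_batch are hypothetical names, and `schedule` stands for whatever
thread-safe scheduling function is available (with Twisted, that would be
reactor.callFromThread, which accepts a callable plus its arguments).

```python
import threading


class CallBatcher:
    """Collects messages and hands them to the event loop in batches.

    Hypothetical helper for illustration; `schedule(handler, batch)` is
    expected to arrange for handler(batch) to run in the reactor thread.
    """

    def __init__(self, schedule, handler, max_batch=50):
        self._schedule = schedule
        self._handler = handler
        self._max_batch = max_batch
        self._buffer = []
        self._lock = threading.Lock()

    def add(self, message):
        # Buffer the message; schedule one call per full batch instead of
        # one call per message.
        with self._lock:
            self._buffer.append(message)
            if len(self._buffer) < self._max_batch:
                return
            batch, self._buffer = self._buffer, []
        self._schedule(self._handler, batch)

    def flush(self):
        # Hand over a partial batch, if there is one.
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._schedule(self._handler, batch)
```

With a design like this, flush() would need to be called periodically
from the consumer thread so that a partial batch is not held back
indefinitely when traffic is light.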