[Twisted-Python] High throughput database logger
exarkun at twistedmatrix.com
Thu Sep 4 07:20:15 MDT 2014
On 12:51 pm, adi.libotean at proatria.com wrote:
>Hi,
>
>I'm looking at various options for implementing a high throughput
>database logger that will work with Twisted.
>
>My requirements, listed by importance:
>
>1) small memory footprint
>2) high speed
>3) low garbage generation
>
>The application I'm working on runs continuously (24/7). I've
>experimented a bit with pysqlite and Twisted to see which approach is
>better suited (see attached example).
>
>----
>
>Question 1: I noticed that all of the Twisted based versions are very
>slow compared to the plain sqlite3 test. This seems to be caused by
>atomic transaction management, namely a commit after each insert.
Not only this, but in some of the Twisted versions you've introduced a
round-trip communication from the reactor thread to a worker thread
between each operation. This will greatly impact throughput by adding
lots of latency to each insert.
>Would be interested to know if there is a simple way to avoid this and
>do my own transaction management (aka batch commit).
Using twisted.enterprise.adbapi? You could probably hack something
horrible together but it would definitely be a hack. I suggest you take
a look at adbapi2 instead - <http://trac.calendarserver.org/wiki/twext>.
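To illustrate what "do my own transaction management" could look like, here is a minimal sketch using only the stdlib sqlite3 module (no Twisted, and the `log` table and `insert_with_batch_commit` helper are hypothetical names): each row still gets its own INSERT, but the commit happens once per batch rather than once per row.

```python
import sqlite3

def insert_with_batch_commit(conn, rows, batch_size=100):
    """Execute one INSERT per row, but commit only once per batch."""
    cur = conn.cursor()
    for i, row in enumerate(rows, 1):
        cur.execute("INSERT INTO log (msg) VALUES (?)", row)
        if i % batch_size == 0:
            conn.commit()  # amortize the commit cost over batch_size rows
    conn.commit()          # pick up any trailing partial batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (msg TEXT)")
insert_with_batch_commit(conn, [("event %d" % i,) for i in range(250)])
print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # 250
```

The same commit-per-batch idea is what you'd want from whatever database layer you end up using; the sketch just shows the transaction boundary, not how to wire it into a reactor.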
>One other thing is the greatly varying amounts of garbage generated
>(peak memory) and memory usage between the Twisted variants.
"Garbage" and "peak memory" are different things. The Twisted-using
version does a lot more - and some of your Twisted-using versions put
the *entire* data set into memory (in a vastly expanded form, where each
insert is represented by multiple large objects including Deferreds).
So it's not too surprising the memory usage is greater.
>----
>
>Question 2: I would have expected B (Twisted ADBAPI) to behave very
>similar to C/E since I'm using a connection pool of size 1 and all
>requests are queued and handled sequentially.
>
>Could any of you please give me some pointers as to why this is
>happening?
You didn't actually label the code with these letters. :) I'm guessing
B is the `adbapi` function, C is `inline_callbacks`, and E is
`cooperator`.
Also you didn't say in what respect you expected them to behave
similarly. You expected their memory usage to be the same? You
expected their runtime to be the same? You expected them to put the
same data into the database?
As far as memory usage goes, B uses lots of memory for the same reason
`semaphore` (D?) uses lots of memory. You queue up the entire dataset
in memory as piles of tuples, lists, Deferreds, etc.
adbapi might be executing the operations one at a time, but the *loop*
inside `adbapi` runs all the way to the end all in one go. It starts
every one of those `runOperation`s before any of them (probably) has a
chance to execute.
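A tiny pure-Python model of that behavior (no Twisted; `run_operation` here is just a stand-in for an async runOperation-style call that queues work and returns immediately): the loop runs to completion before any queued work executes, so every pending operation coexists in memory at once.

```python
from collections import deque

pending = deque()

def run_operation(sql, params):
    # Stand-in for an async database call: queue the work, return at once.
    pending.append((sql, params))

# The loop finishes queuing everything before any "worker" runs:
for i in range(1000):
    run_operation("INSERT INTO log VALUES (?)", (i,))

print(len(pending))  # 1000 operations queued before the first executes
```

With Deferreds, argument tuples, and result bookkeeping attached to each entry, that queue is far heavier than this model suggests, which is where the peak memory goes.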
>----
>
>Question 3: Even though objgraph lists the exact same reference count
>once the code is ran, the amount of used memory greatly differs. Any
>ideas what might be causing this?
Hopefully the above helps explain this.
Something else you might consider is batching up your inserts (not just
committing after a batch of inserts). Since SQLite3 can only write
from a single thread at a time, you're effectively limited to serialized
inserts - so it doesn't make sense to try to start a second insert
before the first has finished.
When the first finishes, if 50 more data points have arrived, you should
do one insert for all 50 of those - not 50 inserts each for one piece of
data. This cuts off a bunch of your overhead - Python objects, round-
trip latency for inter-thread communication, function calls, etc.
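For example, the "insert everything that has arrived" step might look like this with the stdlib sqlite3 module (the `samples` table and `flush_pending` helper are hypothetical names): all queued points go into the database with one executemany call and one commit.

```python
import sqlite3

def flush_pending(conn, pending):
    """Insert every queued data point in one statement and one commit."""
    if not pending:
        return
    conn.executemany("INSERT INTO samples (value) VALUES (?)", pending)
    conn.commit()
    pending.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value REAL)")
pending = [(float(i),) for i in range(50)]  # 50 points arrived while busy
flush_pending(conn, pending)
print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])  # 50
```

In the Twisted version you'd call something like this from a single worker thread each time the previous flush completes, so there is at most one in-flight batch and one round-trip per batch instead of per row.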
Jean-Paul
>Any suggestions and/or pointers on how to improve/do this are more than
>welcome.
>
>Thank you for your time,
>Adrian