[Twisted-Python] High throughput database logger
exarkun at twistedmatrix.com
Thu Sep 4 07:20:15 MDT 2014
On 12:51 pm, adi.libotean at proatria.com wrote:
>Hi,
>
>I'm looking at various options for implementing a high throughput
>database logger that will work with Twisted.
>
>My requirements, listed by importance:
>
>1) small memory footprint
>2) high speed
>3) low garbage generation
>
>The application I'm working on runs continuously (24/7). I've
>experimented a bit with pysqlite and Twisted to see which approach is
>better suited (see attached example).
>
>----
>
>Question 1: I noticed that all of the Twisted based versions are very
>slow compared to the plain sqlite3 test. This seems to be caused by
>atomic transaction management, namely a commit after each insert.
Not only this, but in some of the Twisted versions you've introduced a
round-trip communication from the reactor thread to a worker thread
between each operation. This will greatly impact throughput by adding
lots of latency to each insert.
>Would be interested to know if there is a simple way to avoid this and
>do my own transaction management (aka batch commit).
Using twisted.enterprise.adbapi? You could probably hack something
horrible together but it would definitely be a hack. I suggest you take
a look at adbapi2 instead - <http://trac.calendarserver.org/wiki/twext>.
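To illustrate what "do my own transaction management" could look like, here is a minimal sketch using only the stdlib sqlite3 module (no Twisted, and the `log` table and `insert_with_batch_commit` helper are hypothetical names): each row still gets its own INSERT, but the commit happens once per batch rather than once per row.

```python
import sqlite3

def insert_with_batch_commit(conn, rows, batch_size=100):
    """Execute one INSERT per row, but commit only once per batch."""
    cur = conn.cursor()
    for i, row in enumerate(rows, 1):
        cur.execute("INSERT INTO log (msg) VALUES (?)", row)
        if i % batch_size == 0:
            conn.commit()  # amortize the commit cost over batch_size rows
    conn.commit()          # pick up any trailing partial batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (msg TEXT)")
insert_with_batch_commit(conn, [("event %d" % i,) for i in range(250)])
print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # 250
```

The same commit-per-batch idea is what you'd want from whatever database layer you end up using; the sketch just shows the transaction boundary, not how to wire it into a reactor.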
>One other thing is the greatly varying amounts of garbage generated
>(peak memory) and memory usage between the Twisted variants.
"Garbage" and "peak memory" are different things. The Twisted-using
version does a lot more - and some of your Twisted-using versions put
the *entire* data set into memory (in a vastly expanded form, where each
insert is represented by multiple large objects including Deferreds).
So it's not too surprising the memory usage is greater.
>----
>
>Question 2: I would have expected B (Twisted ADBAPI) to behave very
>similar to C/E since I'm using a connection pool of size 1 and all
>requests are queued and handled sequentially.
>
>Could any of you please give me some pointers as to why this is
>happening?
You didn't actually label the code with these letters. :) I'm guessing
B is the `adbapi` function, C is `inline_callbacks`, and E is
`cooperator`.
Also you didn't say in what respect you expected them to behave
similarly. You expected their memory usage to be the same? You
expected their runtime to be the same? You expected them to put the
same data into the database?
As far as memory usage goes, B uses lots of memory for the same reason
`semaphore` (D?) uses lots of memory. You queue up the entire dataset
in memory as piles of tuples, lists, Deferreds, etc.
adbapi might be executing the operations one at a time, but the *loop*
inside `adbapi` runs all the way to the end all in one go. It starts
every one of those `runOperation`s before any of them (probably) has a
chance to execute.
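A tiny pure-Python model of that behavior (no Twisted; `run_operation` here is just a stand-in for an async runOperation-style call that queues work and returns immediately): the loop runs to completion before any queued work executes, so every pending operation coexists in memory at once.

```python
from collections import deque

pending = deque()

def run_operation(sql, params):
    # Stand-in for an async database call: queue the work, return at once.
    pending.append((sql, params))

# The loop finishes queuing everything before any "worker" runs:
for i in range(1000):
    run_operation("INSERT INTO log VALUES (?)", (i,))

print(len(pending))  # 1000 operations queued before the first executes
```

With Deferreds, argument tuples, and result bookkeeping attached to each entry, that queue is far heavier than this model suggests, which is where the peak memory goes.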
>----
>
>Question 3: Even though objgraph lists the exact same reference count
>once the code is ran, the amount of used memory greatly differs. Any
>ideas what might be causing this?
Hopefully the above helps explain this.
Something else you might consider is batching up your inserts (not just
committing after a batch of inserts). Since SQLite3 can only write
from a single thread at a time, you're effectively limited to serialized
inserts - so it doesn't make sense to try to start a second insert
before the first has finished.
When the first finishes, if 50 more data points have arrived, you should
do one insert for all 50 of those - not 50 inserts each for one piece of
data. This cuts off a bunch of your overhead - Python objects, round-
trip latency for inter-thread communication, function calls, etc.
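For example, the "insert everything that has arrived" step might look like this with the stdlib sqlite3 module (the `samples` table and `flush_pending` helper are hypothetical names): all queued points go into the database with one executemany call and one commit.

```python
import sqlite3

def flush_pending(conn, pending):
    """Insert every queued data point in one statement and one commit."""
    if not pending:
        return
    conn.executemany("INSERT INTO samples (value) VALUES (?)", pending)
    conn.commit()
    pending.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value REAL)")
pending = [(float(i),) for i in range(50)]  # 50 points arrived while busy
flush_pending(conn, pending)
print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])  # 50
```

In the Twisted version you'd call something like this from a single worker thread each time the previous flush completes, so there is at most one in-flight batch and one round-trip per batch instead of per row.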
Jean-Paul
>Any suggestions and/or pointers on how to improve/do this are more than
>welcome.
>
>Thank you for your time,
>Adrian